Archive for the ‘EnronData.org’ Category.

The Mailbox PST Dataset

Although much of the original Enron email was produced in PST format, the dataset is most commonly distributed today in MIME format (through CALO / CMU) and as a MySQL database. To recreate the email in PST format, Pete Warden performed an earlier PST conversion of the CALO dataset. Pete’s PST resembles journal email in that the per-user delineation and folder structure of the user email stores have been removed.

To preserve the user information associated with the email, EnronData.org is now offering the CALO Enron Email Dataset as 148 PST files with the folder structure intact, preserving the information in the CALO dataset. Email for each of the 148 identified custodians is available as an individual per-custodian PST file. A few minor changes were made to correct names and to merge duplicate users where both correct and incorrect names existed. Custom X-headers, including unique IDs to facilitate testing, have also been added.
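As a rough illustration of the custom X-header idea, the sketch below stamps a message with a unique ID using Python's standard `email` and `uuid` modules. The header name `X-EnronData-UID` is an assumption for illustration; the actual header names used by EnronData.org are not specified here.

```python
import uuid
from email.message import EmailMessage

def add_unique_id_header(msg: EmailMessage) -> EmailMessage:
    # "X-EnronData-UID" is a hypothetical header name; uuid4().hex
    # yields a 32-character hex identifier unique to this message.
    msg["X-EnronData-UID"] = uuid.uuid4().hex
    return msg

msg = EmailMessage()
msg["From"] = "alice@enron.com"
msg["To"] = "bob@enron.com"
msg["Subject"] = "Forecast update"
msg.set_content("See attached.")
add_unique_id_header(msg)
print(len(msg["X-EnronData-UID"]))  # 32
```

Because the ID lives in an X-header, downstream tools that compare or track messages can key off it without parsing the body.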

The files are currently available for download from the EnronData.org homepage as a 734MB 7z archive. 7z is an archival format similar to ZIP, BZIP2 and RAR that generally achieves higher compression rates. The uncompressed size for this dataset is roughly 8.6 gigabytes.

This dataset is licensed under the Creative Commons Attribution 3.0 United States license. To provide attribution, please cite “EnronData.org.”

Update 1: If you are experiencing difficulties downloading the file, try using wxDFast, a free open source download manager.

Update 2: Bandwidth management has been implemented. This is likely set too conservatively right now and will be adjusted up soon.

Custodian Names and Titles

To reconstruct the Enron email dataset accurately, it is important to identify the correct number of custodians for which email exists. From this canonical list, we can build out user information including actual names, rank, title, and so on.

Various datasets have used a string consisting generally of lastname and first initial to identify custodians. This ID appears in the FERC dataset as the ORIGIN field and in the CALO dataset as a directory name.
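The lastname-plus-first-initial convention described above can be sketched in a few lines. This is a hypothetical reconstruction of the convention for illustration, not the exact normalization either FERC or CALO applied:

```python
def custodian_id(first_name: str, last_name: str) -> str:
    # Hypothetical reconstruction: lowercase last name, hyphen,
    # lowercase first initial, matching IDs like "whalley-g".
    return f"{last_name.lower()}-{first_name[0].lower()}"

print(custodian_id("Greg", "Whalley"))   # whalley-g
print(custodian_id("Kenneth", "Lay"))    # lay-k
```

Because the ID discards most of the first name, two distinct IDs for the same person (such as the whalley-l / whalley-g pair discussed below) can arise whenever an alternate first name was recorded.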

Using this ID, different versions of the Enron email dataset identify different numbers of users:

  1. 158 users: The FERC dataset identifies 158 unique users using the iCONECT ORIGIN database column.
  2. 150 users: The CALO dataset identifies 150 unique users using the maildir user directory structure.
  3. 149 users: Andrés Corrada-Emmanuel has identified 149 unique users, noting that phanis-s is a misspelling of panus-s.
  4. 148 users: EnronData.org has identified 148 unique users noting that whalley-l is a duplicate of whalley-g, both representing Lawrence “Greg” Whalley.
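The corrections above amount to mapping each raw custodian ID to a canonical one. A minimal sketch, using an alias map populated only with the two corrections named in this post (the full EnronData.org list contains more entries):

```python
# Hypothetical alias map built from the corrections described above.
ALIASES = {
    "phanis-s": "panus-s",    # misspelling noted by Corrada-Emmanuel
    "whalley-l": "whalley-g", # duplicate: Lawrence "Greg" Whalley
}

def canonical_id(custodian_id: str) -> str:
    # IDs without a known alias map to themselves.
    return ALIASES.get(custodian_id, custodian_id)

raw_ids = {"whalley-l", "whalley-g", "phanis-s", "lay-k"}
merged = {canonical_id(i) for i in raw_ids}
print(sorted(merged))  # ['lay-k', 'panus-s', 'whalley-g']
```

Deduplicating through a table like this keeps the merge decisions explicit and auditable, which matters when different researchers arrive at different custodian counts.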

EnronData.org has verified the duplicates identified by CALO, identified two more duplicates, and corrected two misspellings as shown in this Enron custodian list. This list was created before analyzing Corrada-Emmanuel’s custodian list, which correctly identified one of the additional duplicates.

After identifying custodians by ID, additional information can be associated with the custodians, including names, ranks, titles, email addresses, etc. A large part of this work has been performed by Jitesh Shetty and Jafar Adibi in their Ex-Employee Status Report. Combining that data with some additional data allows us to associate this information directly with the custodians in the dataset. The final results are available in this custodian information report. Corrada-Emmanuel has also created a custodian ID to email address mapping. Email addresses will be incorporated by EnronData.org at a future date.

Deduplication and Attachment Stripping – Reducing the Dataset

While this project’s initial goal is to create original, non-deduplicated datasets, the full dataset is often not needed: sometimes duplicates are not desired, and sometimes attachments are not. The challenge is to meet these requirements while maintaining a realistic dataset.

One of the challenges with deduplication is deciding which duplicates to remove and whether to leave a link behind. For example, if Alice sends a message to Bob, it will typically exist in at least three places: Alice’s Sent folder, Alice’s Inbox, and Bob’s Inbox. If you were to remove two of those, which two would you remove, and how representative would the resulting dataset be?

There seem to be three solutions to this depending on which problem you are solving.

  1. Single-Instance Storage (Lossless Storage Reduction): If storage is the problem, email archiving solutions address it through SIS, or Single-Instance Storage, where multiple copies of an email are stored only once. Emails on the mail server can be replaced by stubs: emails whose body consists of a pointer to the full email. This way all records are accounted for while the storage cost of duplicates is dramatically reduced.
  2. Attachment Elimination (Lossy Storage Reduction): Dataset size can be reduced even further by eliminating attachments entirely while recording attachment information in the header or body. This simulates email where attachments have been stubbed out and are no longer available.
  3. Journal Email (Duplicate Elimination): If the goal is to eliminate duplicates entirely, maintaining user mailbox folders becomes problematic because certain folders will be missing email while others will not. One way to address this is to eliminate folders entirely and move all the email into one or a few global folders, organized by date instead of by user. This is similar to how email archiving via journaling works: with journaling, a copy of each email sent or received is retained in a central journal mailbox, often eliminating duplicates, making it seem like an ideal approach for this requirement.
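The journal-email approach can be sketched as collapsing all copies of a message into a single journal copy keyed by its Message-ID header. This is a simplified illustration using Python's standard `email` parser; real journaling systems capture messages at the server rather than post hoc, and the sample messages below are invented:

```python
from email import message_from_string

def journal_dedupe(raw_messages):
    # Keep one copy per Message-ID, discarding folder placement entirely,
    # as in the journal-email approach described above.
    journal = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        mid = msg.get("Message-ID")
        if mid and mid not in journal:
            journal[mid] = msg
    return journal

copies = [
    # Alice's Sent copy and Bob's Inbox copy share one Message-ID.
    "Message-ID: <1@enron.com>\nFrom: alice@enron.com\nTo: bob@enron.com\n\nhello",
    "Message-ID: <1@enron.com>\nFrom: alice@enron.com\nTo: bob@enron.com\n\nhello",
    "Message-ID: <2@enron.com>\nFrom: bob@enron.com\nTo: alice@enron.com\n\nreply",
]
journal = journal_dedupe(copies)
print(len(journal))  # 2
```

Note that this trades away per-user folder context, which is exactly the tension the three options above try to balance.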

By using the above approaches, we can reduce the number of duplicates and the storage requirements while maintaining the characteristics of real-world email datasets. I’ve added these to the proposed dataset list on the projects page for consideration. Please let me know if you think these or other datasets would be of use.