I’m pleased to announce that an initial version of the EDRM Enron Email Data Set consisting of 40GB of PST files with attachments and folder structure is now available within the EDRM project as of the EDRM 2009-2010 Kick-Off Meeting. The EDRM Data Set Project is now working to make this data set publicly available.
This initial data set was created by me and a team at ZL Technologies; however, more work remains, and I think the EDRM Data Set Project is an ideal group to lead the effort to publish some industry-standard data sets.
Some of the issues that the EDRM Data Set Project will be looking at include addressing privacy concerns, the publishing of smaller data set slices, and distribution methods for large data sets. If you would like to participate in this process, please join EDRM.
EDRM Data Set Project Lead
A number of people have contacted me about getting the current PST corpus by an alternative method. This is partly due to the bandwidth restrictions that have been in place for the HTTP download. I planned to add some other download methods but haven’t had time yet. Until then, if you will be at the EDRM Kick-Off meeting and you would still like a copy, bring a 1+ GB USB key and find me at the meeting. If you are interested, please let me know beforehand so I can plan ahead.
Although much of the original Enron email came in PST files, the most common forms in which this email is available today are MIME format (through CALO / CMU) and a MySQL database. To recreate the email in PST format, Pete Warden performed an earlier PST conversion of the CALO dataset. Pete’s PST is similar to journal email in that the per-user delineation and folder structure of the user email stores have been removed.
To preserve the user information associated with the email, EnronData.org is now offering the CALO Enron Email Dataset in the form of 148 PST files with folder structure, preserving the information in the CALO dataset. Email for each of the 148 identified custodians is available as individual per-custodian PST files. A few minor changes were made to correct names and merge duplicate users where both correct and incorrect names existed. Custom X-headers have also been added, including unique IDs, to facilitate testing.
The files are currently available for download from the EnronData.org homepage as a 734MB 7z archive. 7z is an archival format similar to ZIP, BZIP2 and RAR that generally achieves higher compression rates. The uncompressed size for this dataset is roughly 8.6 gigabytes.
This dataset is licensed under the Creative Commons Attribution 3.0 United States license. To provide attribution, please cite “EnronData.org.”
Update 1: If you are experiencing difficulties downloading the file, try using wxDFast, a free open source download manager.
Update 2: Bandwidth management has been implemented. This is likely set too conservatively right now and will be adjusted up soon.
In order to reconstruct the Enron email dataset accurately it is important to identify the correct number of custodians for which email exists. From this canonical list, we can build out user information including actual names, rank, title, etc.
Various datasets have used a string consisting generally of lastname and first initial to identify custodians. This ID appears in the FERC dataset as the ORIGIN field and in the CALO dataset as a directory name.
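As a minimal sketch of that ID convention, assuming the ID is simply the lowercased last name, a hyphen, and the first initial (the helper name `custodian_id` is hypothetical, and real IDs in the datasets may deviate from this rule):

```python
def custodian_id(first_name: str, last_name: str) -> str:
    """Build a 'lastname-f' style ID (assumed convention, not a dataset API)."""
    first = first_name.strip().lower()
    last = last_name.strip().lower()
    if not first or not last:
        raise ValueError("both first and last name are required")
    return f"{last}-{first[0]}"


print(custodian_id("Greg", "Whalley"))      # whalley-g
print(custodian_id("Lawrence", "Whalley"))  # whalley-l
```

Under this rule, a person known by two first names yields two different IDs, which is one way duplicate custodians can arise.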
Different versions of the Enron email dataset identify different numbers of users by this ID. Notably:
- 158 users: The FERC dataset identifies 158 unique users using the iCONECT ORIGIN database column.
- 150 users: The CALO dataset identifies 150 unique users using the maildir user directory structure.
- 149 users: Andrés Corrada-Emmanuel has identified 149 unique users noting that phanis-s is a misspelling of panus-s.
- 148 users: EnronData.org has identified 148 unique users noting that whalley-l is a duplicate of whalley-g, both representing Lawrence “Greg” Whalley.
EnronData.org has verified the duplicates identified by CALO, identified two more duplicates, and corrected two misspellings as shown in this Enron custodian list. This list was created before analyzing Corrada-Emmanuel’s custodian list, which correctly identified one of the additional duplicates.
After identifying custodians by ID, additional information can be associated with the custodians, including names, ranks, titles, email addresses, etc. A large part of the work has been performed by Jitesh Shetty and Jafar Adibi in their Ex-Employee Status Report. Combining this data with some additional data allows us to associate this information directly with the custodians in the dataset. The final results are available in this custodian information report. Corrada-Emmanuel has also created a custodian ID to email address mapping. Email addresses will be incorporated by EnronData.org at a future date.
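The merging of duplicate and misspelled IDs can be sketched as a simple alias lookup. This is only an illustration: the `ALIASES` table lists just the two corrections named above, the helper `canonical_ids` is hypothetical, and the sample input is made up rather than taken from the full custodian list.

```python
# Alias -> canonical ID. Only the two corrections named in the text are listed;
# the full EnronData.org custodian list is not reproduced here.
ALIASES = {
    "phanis-s": "panus-s",     # misspelling noted by Corrada-Emmanuel
    "whalley-l": "whalley-g",  # duplicate noted by EnronData.org
}


def canonical_ids(raw_ids):
    """Return the sorted list of unique canonical custodian IDs."""
    return sorted({ALIASES.get(i.lower(), i.lower()) for i in raw_ids})


# Illustrative input only.
sample = ["panus-s", "phanis-s", "whalley-g", "whalley-l", "Allen-P"]
print(canonical_ids(sample))  # ['allen-p', 'panus-s', 'whalley-g']
```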
While this project’s initial goal is to create original, non-deduped datasets, oftentimes the full dataset is not needed. Sometimes duplicates are not desired and sometimes attachments are not desired. The challenge is to meet these requirements while maintaining a realistic dataset.
One of the challenges with deduping is deciding which duplicate to remove and whether to leave a link behind. For example, if Alice sends a message to Bob, it will typically exist in at least 3 places: in Alice’s Sent folder, in Alice’s Inbox and in Bob’s Inbox. If you were to remove two of those, which two would you remove, and how representative would the resulting dataset be?
There seem to be three solutions to this depending on which problem you are solving.
- Single-Instance Storage (Loss-less Storage Reduction): If storage is a problem, email archiving solutions solve this through SIS, or Single-Instance Storage where multiple copies of an email are stored only once. Emails on the mail server can be replaced by stubs which are emails where the body consists of a pointer to the full email. This way all records are accounted for but the storage costs are dramatically reduced for duplicates.
- Attachment Elimination (Lossy Storage Reduction): Dataset size can be reduced even more by eliminating attachments entirely while including attachment information either in the header or body. This can be used to simulate email with Attachment Stubbing, where the attachments are no longer available.
- Journal Email (Duplicate Elimination): If the goal is to eliminate duplicates entirely, maintaining user mailbox folders becomes problematic because certain folders will be missing email while others will not be. One way to address this problem is to eliminate folders entirely and move all the email into one or a few global folders, say organized by date instead of user. This is similar to how email archiving via journaling works. With journaling, a copy of each email sent or received is captured, often eliminating duplicates, which makes it seem like an ideal approach for this requirement.
By using the above approaches, we can reduce the number of duplicates and the storage requirements while maintaining the characteristics of real world email datasets. I’ve added these to the proposed dataset list for consideration. Please let me know if you think these or other datasets would be of use. For now, I’ve added these to the projects page.
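The single-instance storage idea above can be sketched in a few lines. This is a rough illustration, not how any particular archiving product works: the choice of fields hashed in `message_key`, the dictionary layout of the mailboxes, and the sample addresses are all assumptions made for the example.

```python
import hashlib


def message_key(msg: dict) -> str:
    """Hash the fields that define 'the same email' (an assumption of this sketch)."""
    parts = [msg["from"], ",".join(sorted(msg["to"])), msg["subject"], msg["body"]]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


def single_instance(mailboxes: dict):
    """Store each unique message once; mailboxes keep only stub keys pointing at it."""
    store = {}    # key -> full message, stored a single time
    stubbed = {}  # (user, folder) -> list of stub keys
    for location, messages in mailboxes.items():
        stubs = []
        for msg in messages:
            key = message_key(msg)
            store.setdefault(key, msg)  # keep the first copy only
            stubs.append(key)
        stubbed[location] = stubs
    return store, stubbed


msg = {"from": "alice@example.com", "to": ["bob@example.com"],
       "subject": "Hi", "body": "Lunch?"}
mailboxes = {("alice", "Sent"): [msg], ("bob", "Inbox"): [dict(msg)]}
store, stubbed = single_instance(mailboxes)
print(len(store))                             # 1 stored copy
print(sum(len(v) for v in stubbed.values()))  # 2 stubs, one per mailbox copy
```

Every mailbox entry is still accounted for through its stub, while the full message is held only once.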
The UC Berkeley ANLP has performed user categorization of about 1700 emails from the CALO email data set. The information provided in the ANLP derivative data set is a subset of the CALO data set and has been reorganized.
This UCB-ANLP to CALO mapping file provides the information to associate the ANLP data with emails in the larger CALO dataset.
The CALO dataset is perhaps the most widely used data set and is available for download at http://www.cs.cmu.edu/~enron/. This dataset is a derivative of the FERC dataset and has been referenced in many email research studies and is also used by many commercial E-Discovery organizations. The CMU page describes this dataset as follows:
- This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes).
- It contains data from 150 custodians, mostly senior management of Enron, organized into folders.
- The corpus contains a total of about 0.5M messages.
- This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
- The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available.
- The dataset here does not include attachments, and some messages have been deleted “as part of a redaction effort due to requests from affected employees”.
- Invalid email addresses were converted to something of the form email@example.com whenever possible (i.e., recipient is specified in some parse-able format like “Doe, John” or “Mary K. Smith”) and to firstname.lastname@example.org when no recipient was specified.
CALO correctly identified 8 duplicate or misspelled custodians in the FERC dataset, resulting in 150 CALO custodians vs. 158 FERC custodians.
In addition to the above, the CALO dataset has a number of optimizations:
- Message-ID: New Message-IDs have been created and used in place of existing Message-IDs
- Date: Dates have been canonicalized replacing the raw dates
- Headers: Some other headers are missing from the email
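To see these header differences in a given message, the standard library `email` package can read the Message-ID and Date and report which headers are absent. The sketch below is only illustrative: the sample message and the `expected` header list are invented for the example and do not describe which headers the CALO dataset actually lacks.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Invented sample message, not taken from the dataset.
RAW_MESSAGE = (
    "Message-ID: <1234.5678.example@host>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Example\n"
    "\n"
    "Body text.\n"
)


def inspect_headers(raw, expected=("Message-ID", "Date", "From", "To",
                                   "Subject", "X-Mailer")):
    """Return the Message-ID, the parsed Date, and any expected headers not present."""
    msg = message_from_string(raw)
    present = {name.lower() for name in msg.keys()}
    return {
        "message_id": msg["Message-ID"],
        "date": parsedate_to_datetime(msg["Date"]),
        "missing": [h for h in expected if h.lower() not in present],
    }


report = inspect_headers(RAW_MESSAGE)
print(report["message_id"])
print(report["date"].isoformat())
print(report["missing"])
```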
Removing the attachments makes the dataset much more manageable in size. Mark Dredze has created a version of the CALO dataset with attachment information brought over from the FERC dataset.
K. Krasnow Waterman discusses how these changes affect the email in Knowledge Discovery in Corporate Email: The Compliance Bot Meets Enron, 2006.
The FERC Enron Email Data Set may be the second data set users typically find if they look for a more comprehensive data set than the CALO Enron Email Data Set. This is because googling “enron email” will bring up the CMU hosting page for the CALO email data set which refers to the FERC data set.
Using the FERC data set has a few challenges, namely:
- Large size: The large size of the dataset (100+GB) means that it isn’t readily downloadable. An online iCONECT interface is available for browsing with attachments. The site is hosted by Lockheed Martin.
- iCONECT format: The data comes as static images and in a flat file database format. The latter are “iCONECT24/7 / Concordance databases in delimited record format, with attachments,” not a standard email form such as MIME, PST, or NSF. The format is described in this WMCU0356_UMD_Transmittal.pdf document.
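Reading a delimited record file of this kind can be sketched with the standard `csv` module. The delimiter and quote characters used as defaults below are the ones commonly associated with Concordance exports and are an assumption here; the actual characters for the FERC files are described in the transmittal document and are not reproduced in this post. The `SUBJECT` field and the sample values are also made up for illustration.

```python
import csv
import io


def read_delimited(text, delimiter="\x14", quotechar="\u00fe"):
    """Parse delimited records into dicts keyed by the first (header) row."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar=quotechar)
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]


# Tiny invented sample using the assumed quote (Q) and delimiter (D) characters.
Q, D = "\u00fe", "\x14"
sample = (
    f"{Q}ORIGIN{Q}{D}{Q}SUBJECT{Q}\n"
    f"{Q}Allen-P{Q}{D}{Q}Example subject{Q}\n"
)
records = read_delimited(sample)
print(records[0]["ORIGIN"])   # Allen-P
print(records[0]["SUBJECT"])  # Example subject
```

If the real files use different characters, only the two keyword arguments need to change.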
The dataset is made available in the following formats which are described in the Aspen Systems document.
- Enron Email database
- Enron Email (re-released) database
- Enron Email (.pst) database
- Enron Email (.pst) (re-released) database
- Scanned Documents database
- Scanned Documents (re-released) database
One of the EnronData Project’s goals is to take the FERC email and convert it into properly formatted PST and NSF formats, similar to their original states. A few software vendors have been contacted to see if iCONECT / Concordance databases can be reconstituted into PST / NSF files with attachments, without success to date. Without an established solution, the EnronData Project is working on its own conversion utilities.
A lot of work has already been performed on the Enron Email Dataset. K. Krasnow Waterman identifies the following datasets in a 2006 report:
- FERC / Aspen
Waterman notes that different datasets identify different numbers of users. EDRP has identified 158 FERC custodians and 150 CALO users. The FERC list was generated by taking a case-insensitive list of the iCONECT ORIGIN column, and the CALO list was compiled using a directory listing of the CMU-hosted tar file. Looking at the comparison quickly, it appears likely that some of the users that were eliminated from the CALO dataset were misspellings.
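The two list-building steps and the comparison can be sketched as follows. The function names are hypothetical, the `maildir/<user>/...` path layout is an assumption about the tar file, and the IDs in the sample are invented rather than real custodians.

```python
def ferc_custodians(origin_values):
    """Case-insensitive set of non-empty ORIGIN column values."""
    return {v.strip().lower() for v in origin_values if v.strip()}


def calo_custodians(member_paths, root="maildir"):
    """Set of user directory names found directly under the assumed root folder."""
    users = set()
    for path in member_paths:
        parts = path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] == root:
            users.add(parts[1].lower())
    return users


# Invented sample data for illustration only.
origins = ["Doe-J", "doe-j", "SMITH-M", "smyth-m", ""]
paths = ["maildir/doe-j/inbox/1.", "maildir/smith-m/sent/2."]

ferc = ferc_custodians(origins)
calo = calo_custodians(paths)
print(len(ferc), len(calo))  # 3 2
print(sorted(ferc - calo))   # ['smyth-m']
```

With the real archive, the member paths could come from `tarfile.open(...).getnames()`, and the IDs left in `ferc - calo` would be the candidates to review as possible misspellings.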
In follow up posts, individual datasets will be discussed regarding their applicability for different purposes.
Welcome to EnronData.org (EDO), the Enron Data Reconstruction Project. The collapse of Enron and subsequent public release of Enron data by the FERC has resulted in one of the largest and richest publicly available data sets for email research. This data has been widely and successfully used to support many academic research projects and commercial organizations that require email data; however, much more can be done.
The goals of EnronData.org are to provide some alternative derivative data sets and to explain some of the more esoteric aspects of the datasets. This project was inspired by examining the current state of this rich dataset, including: (a) examining the data itself, (b) listening to requirements from the community, and (c) observing questions people had on existing data sets. If you’ve ever wondered why the Enron email is the way it is, EDRP may be able to explain it for you.
Projects actively being considered by EDO include:
- Native PST and NSF Files: reconstituting PST and NSF email in the most original state possible, including attachments
- Modified Datasets: creating modified datasets for research purposes, e.g. MIME / Maildir with restored headers and attachments if a need is identified
- Directory Load Files: creating files for LDAP servers, Active Directory, and Domino Directory
- Metadata Organization: creating EDRM files to associate metadata with the email files
EDO is actively seeking individuals and organizations that wish to contribute to this effort. If you or your organization would like to assist, please contact John Wang at email@example.com.