To reconstruct the Enron email dataset accurately, it is important to identify the correct number of custodians for which email exists. From this canonical list, we can build out user information including actual names, ranks, titles, etc.
Various datasets have used a string consisting generally of lastname and first initial to identify custodians. This ID appears in the FERC dataset as the ORIGIN field and in the CALO dataset as a directory name.
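As a rough illustration of reading these IDs from the CALO directory layout, the sketch below lists the immediate subdirectories of a maildir root. The function name `custodian_ids` and the temporary sample tree are assumptions made for the example; only the idea that each custodian ID is a directory name comes from the text above.

```python
import tempfile
from pathlib import Path


def custodian_ids(maildir_root):
    """Return the sorted names of the immediate subdirectories of maildir_root."""
    root = Path(maildir_root)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Build a tiny stand-in tree so the example runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("lay-k", "allen-p"):      # sample lastname-initial IDs
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "README").write_text("not a custodian directory")
    print(custodian_ids(tmp))              # ['allen-p', 'lay-k']
```

Pointing `custodian_ids` at an unpacked copy of the CALO maildir directory should return one entry per custodian directory.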
Using this ID, different versions of the Enron email dataset arrive at different numbers of users. The most notable counts are:
- 158 users: The FERC dataset identifies 158 unique users using the iCONECT ORIGIN database column.
- 150 users: The CALO dataset identifies 150 unique users using the maildir user directory structure.
- 149 users: Andrés Corrada-Emmanuel has identified 149 unique users noting that phanis-s is a misspelling of panus-s.
- 148 users: EnronData.org has identified 148 unique users noting that whalley-l is a duplicate of whalley-g, both representing Lawrence “Greg” Whalley.
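The last two reductions in the list above can be sketched as an alias map applied before counting unique IDs. This is a minimal sketch, not the method used by any of the groups named; `ALIASES`, `canonical_id`, and `unique_custodians` are hypothetical names, and the sample input is a small made-up list rather than the full custodian set.

```python
# Alias map built only from the two corrections named in the list above.
ALIASES = {
    "phanis-s": "panus-s",     # misspelling noted by Corrada-Emmanuel
    "whalley-l": "whalley-g",  # duplicate noted by EnronData.org
}


def canonical_id(custodian_id, aliases=ALIASES):
    """Normalize case and whitespace, then resolve known aliases."""
    cid = custodian_id.strip().lower()
    return aliases.get(cid, cid)


def unique_custodians(ids, aliases=ALIASES):
    """Return the sorted set of canonical custodian IDs."""
    return sorted({canonical_id(i, aliases) for i in ids})


sample = ["panus-s", "phanis-s", "whalley-g", "whalley-l", "lay-k"]
print(unique_custodians(sample))  # ['lay-k', 'panus-s', 'whalley-g']
```

Applied to the 150 CALO directory names, a map with these two entries would account for the drop from 150 to 148.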
EnronData.org has verified the duplicates identified by CALO, identified two more duplicates, and corrected two misspellings as shown in this Enron custodian list. This list was created before analyzing Corrada-Emmanuel’s custodian list, which correctly identified one of the additional duplicates.
After identifying custodians by ID, additional information can be associated with the custodians, including names, ranks, titles, and email addresses. A large part of this work has been performed by Jitesh Shetty and Jafar Adibi in their Ex-Employee Status Report. Combining this data with some additional data allows us to associate this information directly with the custodians in the dataset. The final results are available in this custodian information report. Corrada-Emmanuel has also created a custodian ID to email address mapping. Email addresses will be incorporated by EnronData.org at a future date.
The UC Berkeley ANLP has performed user categorization of about 1700 emails from the CALO email data set. The information provided in the ANLP derivative data set is a subset of the CALO data set and has been reorganized.
This UCB-ANLP to CALO mapping file provides the information to associate the ANLP data with emails in the larger CALO dataset.
The CALO dataset is perhaps the most widely used version of the Enron email data set and is available for download at http://www.cs.cmu.edu/~enron/. It is a derivative of the FERC dataset, has been referenced in many email research studies, and is also used by many commercial E-Discovery organizations. The CMU page describes this dataset as follows:
- This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes).
- It contains data from 150 custodians, mostly senior management of Enron, organized into folders.
- The corpus contains a total of about 0.5M messages.
- This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
- The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available.
- The dataset here does not include attachments, and some messages have been deleted “as part of a redaction effort due to requests from affected employees”.
- Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., recipient is specified in some parse-able format like “Doe, John” or “Mary K. Smith”) and to no_address@enron.com when no recipient was specified.
CALO correctly identified 8 duplicate or misspelled custodian IDs in the FERC dataset, resulting in 150 CALO custodians vs. 158 FERC custodians.
In addition to the above, the CALO dataset has a number of optimizations:
- Message-ID: New Message-IDs have been created and used in place of existing Message-IDs
- Date: Dates have been canonicalized, replacing the raw dates
- Headers: Some other headers are missing from the email
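To see how these header changes might be inspected in practice, the sketch below parses a message with Python's standard `email` package and reports its Message-ID, its date normalized to UTC, and the header names present. The sample message, its addresses, and the `header_summary` helper are hypothetical; the sketch does not reproduce how CALO itself rewrote the headers.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical message in a typical RFC 2822 layout; not taken from the dataset.
SAMPLE = (
    "Message-ID: <0001.example@localhost>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Hypothetical message\n"
    "\n"
    "Body text.\n"
)


def header_summary(raw):
    """Return the Message-ID, the Date converted to UTC, and the header names."""
    msg = message_from_string(raw)
    date = msg["Date"]
    date_utc = None
    if date:
        date_utc = parsedate_to_datetime(date).astimezone(timezone.utc).isoformat()
    return {
        "message_id": msg["Message-ID"],
        "date_utc": date_utc,
        "header_names": sorted(msg.keys()),
    }


print(header_summary(SAMPLE))
```

Running the same summary over a CALO message and its FERC counterpart would be one way to spot which headers were replaced or dropped.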
Removing the attachments makes the dataset much more manageable in size. Mark Dredze has created a version of the CALO dataset with attachment information brought over from the FERC dataset.
K. Krasnow Waterman discusses how these changes affect the email in Knowledge Discovery in Corporate Email: The Compliance Bot Meets Enron, 2006.