To reconstruct the Enron email dataset accurately, it is important to identify the correct number of custodians for which email exists. From this canonical list, we can build out user information, including actual names, ranks, titles, etc.
Various datasets have used a string generally consisting of last name and first initial (for example, whalley-g) to identify custodians. This ID appears in the FERC dataset as the ORIGIN field and in the CALO dataset as a directory name.
Using this ID, different versions of the Enron email dataset use different numbers of users. Notably, the following numbers of users are used:
- 158 users: The FERC dataset identifies 158 unique users using the iCONECT ORIGIN database column.
- 150 users: The CALO dataset identifies 150 unique users using the maildir user directory structure.
- 149 users: Andrés Corrada-Emmanuel has identified 149 unique users noting that phanis-s is a misspelling of panus-s.
- 148 users: EnronData.org has identified 148 unique users noting that whalley-l is a duplicate of whalley-g, both representing Lawrence “Greg” Whalley.
EnronData.org has verified the duplicates identified by CALO, identified two more duplicates, and corrected two misspellings, as shown in this Enron custodian list. This list was created before analyzing Corrada-Emmanuel’s custodian list, which correctly identified one of the additional duplicates.
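The de-duplication described above can be sketched as a small alias lookup. This is a minimal illustration, not the method EnronData.org actually used: only the two aliases named in this post are included, the function names are hypothetical, and lowercasing the input is an assumption about how raw IDs should be cleaned.

```python
# Alias map built only from the corrections named in this post: each key is
# a duplicate or misspelled custodian ID, each value is the ID it resolves to.
CUSTODIAN_ALIASES = {
    "phanis-s": "panus-s",    # misspelling noted by Corrada-Emmanuel
    "whalley-l": "whalley-g",  # duplicate noted by EnronData.org
}


def canonical_custodian(custodian_id):
    """Return the canonical ID for a raw ORIGIN value or maildir directory name."""
    # Assumption: IDs are case-insensitive and may carry stray whitespace.
    cleaned = custodian_id.strip().lower()
    return CUSTODIAN_ALIASES.get(cleaned, cleaned)


def unique_custodians(raw_ids):
    """Collapse raw IDs into a sorted list of canonical custodian IDs."""
    return sorted({canonical_custodian(raw) for raw in raw_ids})


# Hypothetical sample input; a real run would pass every ORIGIN value
# or every maildir directory name.
sample = ["panus-s", "phanis-s", "whalley-g", "whalley-l", "lay-k"]
print(unique_custodians(sample))
```

Keeping the aliases in one table means that further duplicates or misspellings can be added without touching the counting logic.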
After identifying custodians by ID, additional information can be associated with them, including names, ranks, titles, email addresses, etc. A large part of this work has been performed by Jitesh Shetty and Jafar Adibi in their Ex-Employee Status Report. Combining their data with some additional data allows us to associate this information directly with the custodians in the dataset. The final results are available in this custodian information report. Corrada-Emmanuel has also created a custodian ID to email address mapping. Email addresses will be incorporated by EnronData.org at a future date.
The UC Berkeley ANLP has performed user categorization of about 1700 emails from the CALO email data set. The information provided in the ANLP derivative data set is a subset of the CALO data set and has been reorganized.
This UCB-ANLP to CALO mapping file provides the information to associate the ANLP data with emails in the larger CALO dataset.
The FERC Enron Email Data Set may be the second data set users typically find if they look for a more comprehensive data set than the CALO Enron Email Data Set. This is because googling “enron email” brings up the CMU hosting page for the CALO email data set, which refers to the FERC data set.
Using the FERC data set has a few challenges, namely:
- Large size: The large size of the dataset (100+GB) means that it isn’t readily downloadable. An online iCONECT interface is available for browsing with attachments. The site is hosted by Lockheed Martin.
- iCONECT format: The data comes as static images and in a flat file database format. The latter are “iCONECT24/7 / Concordance databases in delimited record format, with attachments,” not a standard email format such as MIME, PST, or NSF. The format is described in the WMCU0356_UMD_Transmittal.pdf document.
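To give a feel for what reading a delimited record format involves, here is a minimal sketch. It assumes the files use the common Concordance default delimiters (ASCII 20 as the field separator and the thorn character, ASCII 254, as the text qualifier) and that the first line is a header row; neither assumption is confirmed for the FERC files, so check the transmittal document first. The function names and the SUBJECT field are hypothetical, and the sketch does not handle separators embedded inside a quoted value or multi-line field values.

```python
FIELD_SEP = "\x14"  # assumed field separator (Concordance default, ASCII 20)
QUOTE = "\xfe"      # assumed text qualifier (Concordance default, thorn)


def split_record(line):
    """Split one delimited line into field values, dropping the qualifiers."""
    values = []
    for field in line.rstrip("\r\n").split(FIELD_SEP):
        # Strip the qualifier only when it wraps the whole field.
        if len(field) >= 2 and field[0] == QUOTE and field[-1] == QUOTE:
            field = field[1:-1]
        values.append(field)
    return values


def read_records(lines):
    """Yield one dict per record, treating the first line as the header row."""
    rows = iter(lines)
    header = split_record(next(rows))
    for line in rows:
        if line.strip("\r\n"):  # skip blank lines
            yield dict(zip(header, split_record(line)))


# Hypothetical two-line sample: a header row and one record.
sample_lines = [
    "\xfeORIGIN\xfe\x14\xfeSUBJECT\xfe\n",
    "\xfelay-k\xfe\x14\xfeMeeting\xfe\n",
]
print(list(read_records(sample_lines)))
```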
The dataset is made available in the following formats, which are described in the Aspen Systems document:
- Enron Email database
- Enron Email (re-released) database
- Enron Email (.pst) database
- Enron Email (.pst) (re-released) database
- Scanned Documents database
- Scanned Documents (re-released) database
One of the EnronData Project’s goals is to take the FERC email and convert it into properly formatted PST and NSF files, similar to their original states. A few software vendors have been contacted to see whether iCONECT / Concordance databases can be reconstituted into PST / NSF files with attachments, but none has been able to do so to date. Without an established solution, the EnronData Project is working on its own conversion utilities.
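As a rough illustration of what such a conversion step might look like, the sketch below turns one parsed record into a MIME message using Python's standard `email` package. Note that it targets MIME rather than PST or NSF, which need dedicated tooling; MIME is shown only as a plausible intermediate form. It is not the EnronData Project's utility, and the record keys (FROM, TO, SUBJECT, BODY), the function name, and the sample addresses are all hypothetical.

```python
from email.message import EmailMessage


def record_to_mime(record, attachments=()):
    """Build a MIME message from one record dict plus (filename, bytes) pairs."""
    msg = EmailMessage()
    # Hypothetical field names; the real database columns may differ.
    msg["From"] = record["FROM"]
    msg["To"] = record["TO"]
    msg["Subject"] = record["SUBJECT"]
    msg.set_content(record["BODY"])
    for filename, payload in attachments:
        # The attachment type is unknown here, so fall back to a generic one.
        msg.add_attachment(
            payload,
            maintype="application",
            subtype="octet-stream",
            filename=filename,
        )
    return msg


# Hypothetical record and attachment, for illustration only.
message = record_to_mime(
    {
        "FROM": "sender@example.com",
        "TO": "recipient@example.com",
        "SUBJECT": "Test",
        "BODY": "Hello",
    },
    attachments=[("report.doc", b"abc")],
)
print(message.as_string())
```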