A blog about security, privacy, algorithms, and email in the enterprise. 



Whatever You Type May Be Held Against You: A Patchwork History of Incriminating Emails

When we were developing Timberwolf, we needed a large quantity of public emails to ingest, so we used Enron’s. Nick wrote a post a while back on the results of some of the analytics he’d run against the Enron emails using Datameer. Mostly, he looked at things like frequency-by-address and frequency-by-time, but he also did some fun text analysis. Did you know that “fiddlesticks” was the profanity of choice at Enron? It wasn’t, but you can see what was here. Anyway, that post, coupled with the recent revelations regarding certain Goldman Sachs employees’ opinions of their clients and their clients’ capital, got me thinking about how easily the contents of your professional email can become, to some extent, public knowledge, and how, given the opportunity, they can be used against you. The following six instances are but a few examples of loose lips that sank, or almost sank, their owners’ ships.

Incriminated: John Kiriakou, CIA
When: 2012
Case: United States of America vs. John Kiriakou
Backstory: Kiriakou was a CIA officer from 1990 to 2004. In 2007, he started to get a little chatty with the media. Among other things, he disclosed the identity of one covert officer involved in the interrogation of terror suspects, and admitted he'd lied to the Publications Review Board of the CIA in an attempt to get approval to include classified information in his book The Reluctant Spy: My Secret Life in the CIA's War on Terror.
Choice quote: "I laid it on thick."
Sources: Reuters, Talking Points Memo

Incriminated: Goldman Sachs and Merrill Lynch (BofA)
When: 2012
Case: Overstock vs Goldman Sachs, Overstock vs Bank of America
Backstory: Overstock was suing Goldman for naked short selling its stock, and filed a motion to unseal certain documents it felt contained information essential to proving its case. Goldman opposed the motion, but in doing so, one of its lawyers, Joe Floren, filed an unredacted version of Overstock's motion that contained, well, exactly the proof Overstock was looking for.
Choice quotes:

  • Peter Melz, Merrill Pro: "Fuck the compliance area – procedures, schmecedures"
  • Thomas Tranfaglia, Merrill Pro: "We are NOT borrowing negatives… I have made that clear from the beginning. Why would we want to borrow them? We want to fail them."
  • Unnamed GS exec: "We have to be careful not to link locates to fails [because] we have told the regulators we can’t"
  • Unnamed GS exec, regarding Overstock specifically: "Two months ago 107% of the floating [aka available stock] was short!"

Sources: Bloomberg, Rolling Stone

Incriminated: Google
When: 2011-2012
Case: Oracle America Inc. v. Google Inc.
Backstory: Before Google vs Oracle went to trial, the search giant tried to prevent a certain email from engineer Tim Lindholm to Google's head of mobile Andrew Rubin from being used as evidence. Why? In it, Lindholm voices his opinion that Google should negotiate a license for Java technology, which indicated, Oracle lawyer David Boies argued, that Lindholm had "specific, detailed working knowledge" of one of the patents at issue in the lawsuit. Google lost that particular battle, but won the war.
Choice quote: "We've been over a bunch of these [alternatives to Java], and think they all suck… We conclude that we need to negotiate a license for Java under the terms we need."
Sources: Businessweek, Computer World

Incriminated: Sarah Palin
When: 2011
Backstory: Under the Freedom of Information Act, the Alaska state government made public much of its former governor's correspondence. To be honest, most of the emails’ content was not exactly scandal-fodder. What they did confirm: Palin's grammar is beauty-queen polished, she enjoys a good (glowing) write-off, and she allocated some of the decision-making power to her husband.
Choice quotes:

  • "Do u remember who the barber is who's going to trim my hair?"
  • "I think we're fine if we include [the electricity required to operate a tanning bed] with the w/d on the 3rd floor…On a day like today-I wish the bed was ready to go for you to use right away!!" --from Governor's House manager Erika Fagerstrom.
  • "I'm getting calls from Soldotna about the next judge appointment. Is [Redacted] on the list, I'm getting calls from folks hoping he's not selected. Let me know whats happening so I can put to rest some of the rumors." --from Todd Palin to his wife’s political aide, Ivy Frye

Sources: MSNBC, The Guardian

Incriminated: Neville Thurlbeck, James Murdoch
When: 2008-2011
Backstory: After private detective Glenn Mulcaire was jailed for enabling News of the World’s royal editor Clive Goodman to hack into the voicemails of three royal staffers, his files were seized by Scotland Yard. The lawyers of Professional Footballers' Association head Gordon Taylor, who had filed a libel suit against the paper, obtained by court order a copy of one email that shed light on just how systemic the illegal practice was at the paper. The email was sent by NotW reporter Ross Hindley to Detective Mulcaire, and includes an attachment, titled "Transcript for Neville," that contains 32 voice messages between soccer star Gordon Taylor and his girlfriend. The "Neville" in question is Neville Thurlbeck, then the paper's chief reporter.
Source: The Guardian

Incriminated: Dennis Kozlowski
When: 2005
Case: People of the State of New York vs. Dennis Kozlowski and Mark Swartz
Backstory: The former CEO of Tyco had a penchant for dipping into the company piggybank--to the eventual tune of $150 million. After getting convicted, in 2005, of conspiracy, grand larceny, falsifying records, and securities fraud, Kozlowski was sentenced to 8 1/3 to 25 years in prison and ordered to pay a $70 million fine and $134 million in restitution. He’s up for parole in 2014, but his days of $6,000 shower curtains and urinating ice sculptures are probably over.
Choice quote: "Something funny . . . is likely apparent if any decent accountant looks at this." --from Kozlowski's lawyer to Tyco's in-house attorney.
Sources: New York, CNN



Email Analytics for a Marketing Director Using Timberwolf, HBase, and Datameer


As in the previous post, we picked out a bunch of questions we wanted to answer through analytics on big data. These were the sorts of questions we thought a marketing director might ask. Questions like:

  • Over a given period, what were the trending topics in our email conversations?
  • What are the most popular n-grams?
  • What are the subjects of the liveliest discussion threads?
  • What is the usage of a specific term over time?

Trending topics

For a marketing director, it might be good to know what folks are talking about. Using the Enron email corpus, we went back in time and pretended that 02/07/2002 was today. We then wanted to see what the most popular terms were for some prior months, and how many times they were mentioned. It looks like this:

This chart looks a little zany at first, and we could have cleaned it up with some smart filtering. We thought we'd let the raw data show through. Basically it is sorted by the first digit of the date. The dates like 1/2 appear because that is literally when the email thought it was sent. I guess the Romans had email after all. Weird data like this is one of the pitfalls of unstructured data. Of note, the most popular term in October of 2001 was enron, right when the scandal broke. The froms, thats, and thisses [sic] are there because we didn't filter out enough common words (stop words), something that needs to be done to get to the more meaningful underlying topics.
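We built the chart in Datameer, but the underlying idea fits in a few lines of Python. The following is only a minimal sketch under assumptions of our own: the emails are already available as (date, body) pairs, the stop-word list is a tiny hypothetical stand-in for a real one, and the sample messages are made up rather than taken from the Enron corpus.

```python
import re
from collections import Counter, defaultdict
from datetime import date

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "from", "that", "this", "to", "a", "is", "of", "and", "in", "for"}

def top_terms_by_month(emails, n=3):
    """Return {(year, month): [(term, count), ...]} with the n most common terms."""
    counts = defaultdict(Counter)
    for sent, body in emails:
        words = re.findall(r"[a-z']+", body.lower())
        # Count every word that is not a stop word under the month it was sent
        counts[(sent.year, sent.month)].update(
            w for w in words if w not in STOP_WORDS
        )
    return {month: c.most_common(n) for month, c in counts.items()}

# Made-up sample data, not taken from the Enron corpus.
sample = [
    (date(2001, 10, 16), "Enron stock is down. Enron is in the news."),
    (date(2001, 10, 17), "Is this about the Enron meeting?"),
    (date(2001, 11, 2), "Lunch on Friday? Lunch is from noon."),
]
trending = top_terms_by_month(sample, n=1)
print(trending)
```

Growing the stop-word list is the "smart filtering" step: the more filler words it removes, the more the real topics surface.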

Most popular n-grams

First of all, an n-gram in this sense is basically a common contiguous sequence of words. So an example n-gram is literally "a common contiguous sequence of words." Again, we had issues with filtering out all the cruft to get to the heart of the discussions, but with more time and effort we could have found them. We present what we found anyway:

A lot of the n-grams above are actually from the boilerplate confidentiality notices many companies attach to the bottom of their emails. Stuff like "TD TD class TD2 ALIGN" is from emails being sent as HTML instead of plain text. We can see that Vince J Kaminski is a popular guy.
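For the curious, counting n-grams can be sketched roughly as below. This is not how Datameer does it internally; it is a bare-bones illustration that assumes the bodies are plain strings and splits on whitespace. The function names and sample bodies are made up for the example.

```python
from collections import Counter

def ngrams(text, n):
    """Return every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def top_ngrams(bodies, n=2, k=5):
    """Count n-grams across all bodies and return the k most common."""
    counts = Counter()
    for body in bodies:
        counts.update(ngrams(body, n))
    return counts.most_common(k)

# Made-up sample bodies.
sample_bodies = [
    "please call vince kaminski today",
    "vince kaminski is out today",
    "forward this to vince kaminski",
]
print(top_ngrams(sample_bodies, n=2, k=1))
```

Stripping HTML tags and signature boilerplate from each body before counting would keep things like table markup and confidentiality notices out of the results.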

Liveliest discussion subjects

This analysis is actually a lot like the n-gram analysis, except that we look only at what shows up in the subject line. Basically, we counted the number of emails with each subject line, and sorted by the most common. The top subject is actually a totally blank subject line.
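The counting itself is about as simple as analytics gets. Here is a small sketch, assuming the subject lines have already been pulled out into a list of strings (the sample subjects are invented):

```python
from collections import Counter

def liveliest_subjects(subjects, k=3):
    """Return the k most common subject lines with their email counts."""
    # Strip surrounding whitespace so "" and "   " both count as blank
    counts = Counter(s.strip() for s in subjects)
    return counts.most_common(k)

# Made-up sample subjects; the empty string is a blank subject line.
sample_subjects = ["", "Gas prices", "", "Gas prices", "", "Lunch"]
print(liveliest_subjects(sample_subjects, k=2))
```

Whether "RE:" and "FW:" prefixes should be folded into the original subject is a judgment call; this sketch leaves them alone.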

Occurrences of a term over time

This is a bit like the analytics Google provides. We wanted to see how much a particular word, in this case Enron, showed up in emails every month. That way we can track how hot that term is currently, or was in the past. Once again, we can see that Enron has a spike in October 2001, but was actually most popular in May. The leftmost dates are from those weird emails with bogus dates - we can also see that Enron doesn't occur very often in them.
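As a rough illustration of the same report outside Datameer, the sketch below counts whole-word, case-insensitive matches of a term per month. It assumes (date, body) pairs as input; the sample data is invented, including one message with a bogus year-1 date to show how such emails end up on the far left.

```python
import re
from collections import Counter
from datetime import date

def term_counts_by_month(emails, term):
    """Return {"YYYY-MM": occurrences of term}, sorted by month."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for sent, body in emails:
        # Zero-padded key so that string order matches chronological order
        key = f"{sent.year:04d}-{sent.month:02d}"
        counts[key] += len(pattern.findall(body))
    return dict(sorted(counts.items()))

# Made-up sample data; the first message has a bogus date.
sample = [
    (date(1, 1, 2), "no mention here"),
    (date(2001, 5, 3), "Enron and enron again"),
    (date(2001, 10, 16), "Enron update"),
]
print(term_counts_by_month(sample, "enron"))
```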

Wrap up

This concludes the basic analysis we did aimed at Marketing Directors, or pretty much anyone interested in finding out what words and topics were the most popular from email. We did all of this on email partially due to the utility and size of the Enron corpus, but these same techniques apply to the Twitter firehose or anything else. Have a question? Something else you'd like to see? The comments await!




Email Analytics for Product Evaluation and IT Using HBase, Timberwolf, Datameer, and Karmasphere


Much like the previous entries, we wanted to take various analytics tools for a spin while also trying to answer quasi-real-world queries. We're using Datameer and Karmasphere this time around, and our data source is the Enron corpus. This time we're going to look at queries in two different categories - one in product evaluation and another two in IT. Short and sweet. The questions we want to answer are:

  • Who's been communicating with company X the most?
  • How many duplicate emails are there?
  • How much space do the duplicate emails take up?

Product Evaluation

If a company starts evaluating one of our products, it only makes sense to have the folks who've been talking to them the most follow up on their eval. The question arises: who is that person? We specifically decided to figure out who in Enron had been talking to TXU (a Texas electric utility) the most:

As one can see, it'd probably be best to have Farmer, Tisdale, or Hanks do any followups, supposing that they are in the correct department.
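The logic behind a report like this is a simple filter and count. Below is a minimal sketch, assuming each email is a (sender, recipients) pair and that the company can be recognized by its email domain. The addresses and the domain are hypothetical placeholders, not real Enron or TXU addresses.

```python
from collections import Counter

def top_contacts(emails, company_domain, k=3):
    """Count the messages each sender addressed to anyone at company_domain."""
    suffix = "@" + company_domain.lower()
    counts = Counter()
    for sender, recipients in emails:
        # One hit per message, however many recipients are at the company
        if any(r.lower().endswith(suffix) for r in recipients):
            counts[sender] += 1
    return counts.most_common(k)

# Hypothetical addresses and domain, for illustration only.
sample = [
    ("alice@enron.com", ["buyer@partner.example"]),
    ("alice@enron.com", ["buyer@partner.example", "bob@enron.com"]),
    ("bob@enron.com", ["ops@partner.example"]),
    ("carol@enron.com", ["bob@enron.com"]),
]
print(top_contacts(sample, "partner.example"))
```

This only counts outbound mail; counting replies coming back from the company as well would give a fuller picture of who is actually in a conversation with them.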

Duplicate emails

Emails can take up serious hard disk space. A company may not want to lose all record of an email by deletion, but what if they only deleted duplicates? How would they find them? With Karmasphere, one can write a query like this:

SELECT body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender, COUNT(*)
FROM enronData
GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender
HAVING ( COUNT(*) > 1 )

This query basically groups together all emails that have the same body and headers, and spits out every group with more than one member. We could then count the occurrences for each of these to find out how many duplicates we actually have. Is this worth it though? How much hard disk space do we actually save? Well, assuming one byte per character in each email, we can do something like this query to get an approximation:

SELECT copies, SUM(wasted_bytes)
FROM (
  -- one row per group of identical emails; every copy past the first is waste
  SELECT COUNT(*) AS copies, (COUNT(*) - 1) * LENGTH(body) AS wasted_bytes
  FROM enronData
  GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender
  HAVING ( COUNT(*) > 1 )
) tabulation
GROUP BY copies

This spits out the wasted space broken down by how many copies of an email exist. It would only be a simple matter of summing all of that up to see the actual wasted space. Unfortunately, all of this effectively adds up to a single number, so no charts this go around.
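To make the arithmetic concrete, here is the same calculation as a small Python sketch. It assumes each email is a dict keyed by the column names used in the queries above and, like the queries, treats one character as one byte. The helper function and sample messages are made up for illustration.

```python
from collections import Counter

FIELDS = ("body", "timesent", "subject", "torecipient",
          "bccrecipient", "ccrecipient", "sender")

def wasted_bytes_by_copies(emails):
    """Map copy count -> bytes used by redundant copies (1 byte per character)."""
    # Emails are duplicates when every field in FIELDS matches
    groups = Counter(tuple(e[f] for f in FIELDS) for e in emails)
    wasted = Counter()
    for key, copies in groups.items():
        if copies > 1:
            # key[0] is the body; every copy beyond the first is redundant
            wasted[copies] += (copies - 1) * len(key[0])
    return dict(wasted)

def make_email(body, subject):
    """Build a made-up email with fixed headers, for the example only."""
    return {"body": body, "timesent": "2001-10-16 09:00", "subject": subject,
            "torecipient": "bob@enron.com", "bccrecipient": "",
            "ccrecipient": "", "sender": "alice@enron.com"}

sample = ([make_email("hello", "hi")] * 3
          + [make_email("status report", "status")] * 2
          + [make_email("unique message", "misc")])
by_copies = wasted_bytes_by_copies(sample)
print(by_copies, sum(by_copies.values()))
```

The final sum is the single number in question: the space that deleting all but one copy of each duplicated email would free up.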

Wrap up

This concludes our foray into analytics for various roles. We had fun taking a spin with the various tools! If there's some aspect of any of this you have a question about, or if there's an angle of analysis you'd like to see, drop us a line in the comments below!



Email Analytics for Sales Managers Using HBase, Timberwolf, and Datameer


One of our goals while checking out the analytics packages on the market was to actually generate some relevant reports. The reports in this post are meant to be of general use to a sales manager. As our primary data source for these is the Enron corpus, they are all based on analyzing email. The idea is that we are answering a specific query, such as:

  • What months have the most email volume?
  • What time of day are most emails sent?
  • How quickly do we get responses to our emails?
  • Are we sending emails out to customers with inappropriate language?

We used Datameer for practically all of our reports.

When are emails sent?

Generating these general types of reports is pretty simple. We have a list of every email with its headers, including when it was sent. From this, we can group together all of the emails that share the same component, like the month of the year they were sent, or the time of day. This just gives us a quick look at general long term email trends. The results look like this:

From this we can pretty much gather that fewer emails are sent in the summer than in the fall/winter (although the underlying email data actually ended in March, so some months are a bit overrepresented). From the chart below, we can see that most folks send emails around 10 AM. If you want your sales pitch to be at the top of the inbox when someone is looking at their email, that would be a good time to send it. Or possibly at four in the morning, surprisingly.
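Both charts come from the same grouping trick, which can be sketched in a couple of lines. This is an illustration only, assuming the sent times have been parsed into datetime objects; the timestamps below are invented.

```python
from collections import Counter
from datetime import datetime

def volume_by(sent_times, part):
    """Count emails per date component, e.g. part="month" or part="hour"."""
    return Counter(getattr(ts, part) for ts in sent_times)

# Made-up timestamps.
sample_times = [
    datetime(2001, 10, 16, 10, 5),
    datetime(2001, 10, 17, 10, 40),
    datetime(2001, 11, 2, 4, 15),
    datetime(2002, 1, 9, 10, 0),
]
print(volume_by(sample_times, "month").most_common(1))
print(volume_by(sample_times, "hour").most_common(1))
```

Because the grouping is by month of the year rather than by calendar month, a corpus that does not cover whole years will overrepresent some months, which is the caveat mentioned above.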

When are we getting responses?

These reports effectively build off the previous ones. They are also much more interesting. The chart below shows the mean time, in minutes, before a particular sender gets a response to their emails from someone outside of Enron. If these were all sales emails, it could indicate that the sales folks with the lowest response times are sending the most effective emails. The chart therefore indicates that Dan Boyle is doing something right. Maybe the other salespeople should copy him.
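How replies get matched to the emails they answer is the tricky part, and the details of how we did it in Datameer are not reproduced here. The sketch below is one possible approach under loud assumptions: every email carries an "id" and an "in_reply_to" field (hypothetical field names, in the spirit of the Message-ID and In-Reply-To headers), and a sender is external if their address does not end in the company domain. All sample addresses and times are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def mean_response_minutes(emails, internal_domain="enron.com"):
    """Map original sender -> mean minutes until a reply from outside the company."""
    by_id = {e["id"]: e for e in emails}
    delays = defaultdict(list)
    for reply in emails:
        original = by_id.get(reply["in_reply_to"])
        if original is None:
            continue  # not a reply, or the original is not in our data
        if reply["sender"].lower().endswith("@" + internal_domain):
            continue  # only replies from outside the company count
        minutes = (reply["sent"] - original["sent"]).total_seconds() / 60
        delays[original["sender"]].append(minutes)
    return {sender: mean(ms) for sender, ms in delays.items()}

def msg(id_, sender, hour, minute, in_reply_to=None):
    """Build a made-up email sent on a fixed day, for the example only."""
    return {"id": id_, "sender": sender, "in_reply_to": in_reply_to,
            "sent": datetime(2001, 10, 16, hour, minute)}

sample = [
    msg(1, "seller_a@enron.com", 9, 0),
    msg(2, "buyer@customer.example", 9, 10, in_reply_to=1),
    msg(3, "seller_a@enron.com", 10, 0),
    msg(4, "buyer@customer.example", 10, 30, in_reply_to=3),
    msg(5, "seller_b@enron.com", 9, 0),
    msg(6, "buyer@customer.example", 11, 0, in_reply_to=5),
    msg(7, "seller_a@enron.com", 9, 5, in_reply_to=5),  # internal reply, ignored
]
response_times = mean_response_minutes(sample)
print(response_times)
```

A sender with only one or two replies can look misleadingly fast or slow, so a minimum reply count per sender would be a sensible addition before drawing conclusions.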

Are we sending inappropriate language?

Of course, Dan Boyle might just be getting his emails replied to so quickly because they are filled with profanities. The underlying report for this matched words in emails with a long list of profanities. It then displayed a list of every email and the profanities that were found there. It's not actually all that nice to look at. I thought a top ten would be better. This is cultural anthropology, folks:
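The matching step described above can be approximated in a few lines. This is a sketch rather than the actual Datameer report: it assumes plain-text bodies and a word list supplied as a set, and the list here is a deliberately mild, made-up stand-in.

```python
import re
from collections import Counter

# Mild, made-up stand-ins for a real profanity list.
WORD_LIST = {"fiddlesticks", "darn", "heck"}

def top_flagged_words(bodies, word_list=WORD_LIST, k=10):
    """Return the k most common words from word_list found in the bodies."""
    counts = Counter()
    for body in bodies:
        words = re.findall(r"[a-z]+", body.lower())
        counts.update(w for w in words if w in word_list)
    return counts.most_common(k)

# Made-up sample bodies.
sample_bodies = [
    "Darn, the deal fell through. Darn!",
    "What the heck happened?",
    "Everything is fine.",
]
print(top_flagged_words(sample_bodies))
```

Matching whole words keeps innocent words that merely contain a flagged word from being counted, at the cost of missing creative spellings.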

Wrap Up

It's pretty easy to make rather simple analyses like the above, and on top of Hadoop they don't take all that long to process. If you want to take a stab at trying to make your own reports, you can use Timberwolf to ease the pain of getting email data into HBase straight from an Exchange server. Do you have a specific report you'd like to see? Want to know how to make your own? Hit the comments!
