Overview

As in the previous post, we picked out a handful of questions we wanted to answer through analytics on big data. These were the sorts of questions we thought a marketing director might ask. Questions like:

  • Over a given period, what were the trending topics in our email conversations?
  • What are the most popular n-grams?
  • What are the subjects of the liveliest discussion threads?
  • What is the usage of a specific term over time?

Trending topics

For a marketing director, it might be good to know what folks are talking about. Using the Enron email corpus, we went back in time and pretended that 02/07/2002 was today. We then wanted to see what the most popular terms were for some prior months, and how many times they were mentioned. It looks like this:

This chart looks a little zany at first, and we could have cleaned it up with some smart filtering, but we thought we'd let the raw data show through. Basically it is sorted by the first digit of the date. The dates like 1/2 appear because that is literally when the email thought it was sent. I guess the Romans had email after all. Weird data like this is one of the pitfalls of unstructured data. Of note, the most popular term in October of 2001 was enron, right when the scandal broke. The froms, thats, and thisses [sic] show up because we didn't filter out enough common words, something that needs to be done to get to the more meaningful underlying topics.
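To make the idea concrete, here is a minimal sketch of counting top terms per month with a stop-word filter. It is not the code behind the chart; the function name `top_terms_by_month`, the tiny `STOP_WORDS` set, and the `(month, body)` input shape are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
              "from", "that", "this", "for", "on"}

def top_terms_by_month(emails, n=3):
    """emails: iterable of (month, body) pairs, e.g. ("2001-10", "some text").

    Returns {month: [(term, count), ...]} holding the n most common
    non-stop-word terms for each month.
    """
    counts = defaultdict(Counter)
    for month, body in emails:
        words = re.findall(r"[a-z']+", body.lower())
        counts[month].update(w for w in words if w not in STOP_WORDS)
    return {month: c.most_common(n) for month, c in counts.items()}

# Made-up sample messages, not taken from the Enron corpus.
sample = [
    ("2001-10", "Enron stock is falling and Enron is in the news"),
    ("2001-10", "Questions from the press about Enron"),
    ("2001-11", "The merger talks continue"),
]
result = top_terms_by_month(sample, n=1)
print(result["2001-10"])
```

Growing the stop-word set is the knob that would push words like "from" and "that" out of the results.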

Most popular n-grams

First of all, an n-gram in this sense is basically a common contiguous sequence of words. So an example n-gram is literally "a common contiguous sequence of words." Again, we had issues with filtering out all the cruft to get to the heart of the discussions, but with more time and effort we could have found them. We present what we found anyway:

A lot of the n-grams above are actually from the boilerplate confidentiality notices many companies attach to the bottom of their emails. Stuff like "TD TD class TD2 ALIGN" is from emails being sent as HTML instead of plain text. We can see that Vince J Kaminski is a popular guy.
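For the curious, counting n-grams can be sketched in a few lines. This is an illustration under simple assumptions (whitespace tokenization, lowercasing, no HTML stripping), not the pipeline we actually ran; `ngrams` and `most_common_ngrams` are names made up for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_common_ngrams(texts, n=3, top=10):
    """Count n-grams across all texts and return the most frequent ones."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts.most_common(top)

# Made-up sample messages, not taken from the Enron corpus.
texts = [
    "this email is confidential and privileged",
    "note this email is confidential",
]
top_trigrams = most_common_ngrams(texts, n=3, top=2)
print(top_trigrams)
```

Even this toy example shows how boilerplate floats to the top: the repeated notice text wins simply because it appears in every message.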

Liveliest discussion subjects

This analysis is actually a lot like the n-gram analysis, except that here we are looking specifically at what shows up in the subject line. Basically, we counted up the number of emails that contained each subject line, and sorted them from most to least common. The top subject is actually a totally blank subject line.
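Counting subject lines is simple enough to sketch directly. The version below is an illustration, not our actual job. The `strip_prefixes` option, which folds "Re:" and "Fwd:" replies into their parent thread, is an assumption added for this example; it is off by default to match the plain counting described above.

```python
import re
from collections import Counter

# Matches one or more leading reply/forward markers such as "RE: " or "Fwd: Re: ".
_REPLY_PREFIX = re.compile(r"^\s*((re|fw|fwd):\s*)+", re.IGNORECASE)

def count_subjects(subjects, strip_prefixes=False, top=5):
    """Count how many emails carry each subject line, most common first."""
    counts = Counter()
    for subject in subjects:
        if strip_prefixes:
            subject = _REPLY_PREFIX.sub("", subject)
        counts[subject.strip()] += 1
    return counts.most_common(top)

# Made-up sample subjects, not taken from the Enron corpus.
sample = ["", "Meeting", "RE: Meeting", "", "", "Fwd: Re: Meeting"]
raw = count_subjects(sample)
threaded = count_subjects(sample, strip_prefixes=True)
print(raw)
print(threaded)
```

With the prefixes left alone, the blank subject wins outright, just as it did in our results.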

Occurrences of a term over time

This is a bit like the analytics Google provides. We wanted to see how much a particular word, in this case Enron, showed up in emails every month. That way we can track how hot that term is currently, or was in the past. Once again, we can see that Enron has a spike in October 2001, but was actually most popular in May. The leftmost dates are from those weird emails with bogus dates, and we can also see that Enron doesn't occur very often in them.
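A per-month term count can be sketched as follows. This is a rough illustration rather than the code behind the chart; the function name `term_counts_by_month`, the `(date, body)` input shape, and the whole-word, case-insensitive matching are all assumptions.

```python
import re
from collections import Counter
from datetime import date

def term_counts_by_month(emails, term):
    """emails: iterable of (sent_date, body) pairs.

    Returns a list of ("YYYY-MM", count) pairs sorted by month, counting
    whole-word, case-insensitive occurrences of term.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for sent, body in emails:
        key = f"{sent.year:04d}-{sent.month:02d}"
        counts[key] += len(pattern.findall(body))
    return sorted(counts.items())

# Made-up sample messages, including one with a bogus early date.
sample = [
    (date(2001, 5, 14), "Enron announces results. Enron shares rise."),
    (date(2001, 10, 22), "The SEC is looking into Enron."),
    (date(2, 1, 1), "Message with a bogus date."),
]
series = term_counts_by_month(sample, "enron")
print(series)
```

Zero-padding the year keeps the bogus dates sorted to the far left, which is where they ended up in our chart too.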

Wrap up

This concludes the basic analysis we did aimed at Marketing Directors, or pretty much anyone interested in finding out what words and topics were the most popular from email. We did all of this on email partially due to the utility and size of the Enron corpus, but these same techniques apply to the Twitter firehose or anything else. Have a question? Something else you'd like to see? The comments await!

 
