
Yesterday and today, I attended the first-ever Big Data class at MIT Sloan. The lecturers were Erik Brynjolfsson and Alex 'Sandy' Pentland. I'd previously heard Erik speak at MIT in October (when I first heard in-depth about the components of Big Data), and I've since read his book Race Against the Machine: How the Digital Revolution is Accelerating Innovation, Driving Productivity, and Irreversibly Transforming Employment and the Economy (highly recommended). I had high expectations, and they were ultimately exceeded.

It will take me a bit of time to catch up on everything I wrote down in 50 pages of notes. Due to a combination of no hotel wifi, Amex fraud false positives and Verizon order complexity, I had only my T-Mobile BlackBerry for connectivity on the first morning of the conference. For the first time in a long time, I took notes on paper throughout the class. This gave me mixed feelings. It was certainly nice to create diagrams easily, use different fonts and means of emphasis, create my own notation for action items and areas to research. However, now I'm left with a fragile notebook that I'm paranoid about losing and hours of transcription ahead.

Following a quick and dirty data mine of my own notes, here are some of the most interesting topics, insights, theories, and quotes from the day:

Topics:

  • Balancing experienced gut vs data
  • Learning to discriminate correlation vs. causality
  • Effectiveness of different communication media for communicating and learning
  • Social graph patterns for creative vs cohesive groups
  • How to continuously run experiments and use Overall Evaluation Criteria
  • Stages of Organizational Evolution: Hubris, Measurement, Semmelweis Reflex, Fundamental Understanding.
  • Big Data Out in the Wild
  • Techniques for Building Viral Adoption
  • Email Analytics: Productivity and Information Diffusion
  • Privacy Legislative Issues in the US and Europe
  • Personal Data as Asset Class
  • The Matrix of Change

Insights and theories: 

  • Companies born on the web, such as Amazon, Facebook, and Google perform hundreds of experiments per day.
  • The Hawthorne Effect: letting people know that they are being experimented on changes their behavior.
  • Social metrics: betweenness, centrality, constraints, geodesic distance
  • Behavioral demographics (where you go, who you hang out with) are a more precise form of defining identity than iris scans or fingerprints.
  • Researchers are able to diagnose depression just by observing cell phone usage. 
  • The Panopticon: who needs a physical surveillance tower in the smart phone age?
  • Twyman's Law: Any statistic that appears interesting is almost certainly a mistake.

Vox populi:

  • "A wealth of information creates a poverty of attention." -Herbert Simon
  • "When physicists have data that is too noisy, they build a better tool for finer resolution." - Erik Brynjolfsson
  • "Big data is a mental prosthetic." - Erik Brynjolfsson
  • "People are bundles of habits formed by the people around them." - Pentland (more so than a person's friends or peers).
  • "Go get the data! Don't argue about designs."
  • "To have a great idea, have a lot of them" -Edison
  • Lord Lever's Quandary: "Half of my marketing budget is wasted. I just don't know which half."
  • "Where you spend your time is who you are." - Pentland
  • "In Hong Kong, you'll buy everything but your house on your phone." -Pentland referring to the all knowing Octopus card.
  • "70% of all workers are information workers." unattributed.
  • "People care about privacy, but if you offer them an Amazon Gift Card, they will turn it over." -Pentland
  • "Gender predicts information diffusion, but not productivity" -Erik Brynjolfsson, from data on email analytics
  • "People being rational is an abominable model, but all economics is based upon it." -Pentland
  • "We are not concerned about data privacy. We don't give data to anyone except the government." Manager from China Mobile

I also learned that, as CEO, mine is the HiPPO (Highest Paid Person's Opinion, a term coined by Ronny Kohavi of Microsoft; see http://exp-platform.com/whatsahippo.aspx), which is fraught with danger for the organization.

Overview

Like the previous post, we picked out a bunch of questions we wanted to answer through analytics on big data. These were the sorts of questions we thought a marketing director might ask. Questions like:

  • Over a given period, what were the trending topics in our email conversations?
  • What are the most popular n-grams?
  • What are the subjects of the liveliest discussion threads?
  • What is the usage of a specific term over time?

Trending topics

For a marketing director, it might be good to know what folks are talking about. Using the Enron email corpus, we went back in time and pretended that 02/07/2002 was today. We then wanted to see what the most popular terms were for some prior months, and how many times they were mentioned. It looks like this:

This chart looks a little zany at first, and we could have cleaned it up with some smart filtering. We thought we'd let the raw data show through. Basically it is sorted by the first digit of the date. The dates like 1/2 appear because that is literally when the email thought it was sent. I guess the Romans had email after all. Weird data like this is one of the pitfalls of unstructured data. Of note, the most popular term in October of 2001 was enron, right when the scandal broke. The froms, thats, and thisses [sic] show up because we didn't filter out enough common words, something that needs to be done to get to the more meaningful underlying topics.
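The counting and filtering described above can be sketched in a few lines of Python. This is only an illustration, not the Datameer workflow we actually used: the function name, the (month, body) input shape, and the tiny stop-word list are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import re

# Assumption: a tiny illustrative stop-word list; a real run needs a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "from", "that", "this",
              "is", "in", "on", "for"}

def top_terms_by_month(emails, n=3):
    """emails: iterable of (month, body) pairs, e.g. ("2001-10", "some text").
    Returns {month: [(term, count), ...]} with the n most frequent terms
    that are not stop words."""
    counts = defaultdict(Counter)
    for month, body in emails:
        words = re.findall(r"[a-z']+", body.lower())
        counts[month].update(w for w in words if w not in STOP_WORDS)
    return {month: c.most_common(n) for month, c in counts.items()}

# Made-up sample data, not taken from the Enron corpus.
sample = [
    ("2001-10", "Enron stock is falling and Enron is in the news"),
    ("2001-10", "This is from the Enron press office"),
    ("2001-09", "Lunch on Friday?"),
]
print(top_terms_by_month(sample, n=1))
```

With the stop words removed, "this", "from", and "that" no longer crowd out the terms a marketing director would care about.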

Most popular n-grams

First of all, an n-gram in this sense is basically a common contiguous sequence of words. So an example n-gram is literally "a common contiguous sequence of words." Again, we had issues with filtering out all the cruft to get to the heart of the discussions, but with more time and effort we could have found them. We present what we found anyway:

A lot of the n-grams above are actually from the boilerplate confidentiality notices many companies attach to the bottom of their emails. Stuff like "TD TD class TD2 ALIGN" is from emails being sent as HTML instead of plain text. We can see that Vince J Kaminski is a popular guy.
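For readers who want to see the mechanics, here is a minimal Python sketch of n-gram counting. It assumes plain whitespace tokenization and made-up sample text; the helper names are hypothetical and this is not how the tools we used implement it.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(texts, n=3, k=5):
    """Count n-grams across all texts and return the k most common."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase and split on whitespace; no HTML or
        # boilerplate stripping, which is why footers dominate real results.
        counts.update(ngrams(text.lower().split(), n))
    return counts.most_common(k)

# Made-up sample: the same footer text appears in both emails.
sample = [
    "this email is confidential",
    "hello this email is confidential",
]
print(top_ngrams(sample, n=4, k=1))
```

Because the footer repeats in every message, it wins the count, which is exactly the boilerplate effect we ran into.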

Liveliest discussion subjects

This analysis is actually a lot like the n-gram analysis, except instead we are looking at specifically what shows up in the subject line. Basically, we counted up the number of emails that contained each subject line, and ordered by the most common. The top subject is actually a totally blank subject line.
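A rough Python equivalent of the subject count is below. Folding reply and forward prefixes ("RE:", "FW:", "FWD:") into the original subject is an assumption added for this sketch so that a thread counts as one subject; our report may not have done that.

```python
from collections import Counter
import re

# Assumption: strip leading reply/forward prefixes so a thread shares one subject.
PREFIX = re.compile(r"^(?:(?:re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def top_subjects(subjects, k=10):
    """Count emails per normalized subject line and return the k most common."""
    counts = Counter(PREFIX.sub("", s.strip()) for s in subjects)
    return counts.most_common(k)

# Made-up sample, including blank subject lines.
sample = ["", "RE: Gas deal", "Gas deal", "", "", "Lunch"]
print(top_subjects(sample, k=3))
```

Note that the blank subject is kept as its own entry, which is how it can end up at the top of the list.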

Occurrences of a term over time

This is a bit like the analytics Google provides. We wanted to see how often a particular word, in this case Enron, showed up in emails every month. That way we can track how hot that term is currently, or was in the past. Once again, we can see that Enron has a spike in October 2001, but was actually most popular in May. The leftmost dates are from those weird emails with bogus dates - we can also see that Enron doesn't occur very often in them.
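The per-month tally can be sketched like this in Python. The input shape of (month, body) pairs and the function name are assumptions for the example, and the sample data is made up.

```python
from collections import Counter
import re

def term_counts_by_month(emails, term):
    """emails: iterable of (month, body) pairs.
    Returns {month: occurrences of term}, sorted by month key."""
    # Match the term as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for month, body in emails:
        counts[month] += len(pattern.findall(body))
    return dict(sorted(counts.items()))

# Made-up sample, including one email with a bogus date.
sample = [
    ("2001-05", "Enron Enron enron"),
    ("2001-10", "Enron news"),
    ("0001-02", "hello"),
]
print(term_counts_by_month(sample, "enron"))
```

Sorting by the month key pushes the bogus dates to the front, much like the leftmost entries in our chart.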

Wrap up

This concludes the basic analysis we did aimed at Marketing Directors, or pretty much anyone interested in finding out what words and topics were the most popular from email. We did all of this on email partially due to the utility and size of the Enron corpus, but these same techniques apply to the Twitter firehose or anything else. Have a question? Something else you'd like to see? The comments await!

 

Posted
Author: Nick

Overview

Much like the previous entries, we wanted to take various analytics tools for a spin while also trying to answer quasi-real world queries. We're using Datameer and Karmasphere this time around, and our data source is the Enron corpus. This time we're going to look at queries in two different categories - one in product evaluation and another two in IT. Short and sweet. The questions we want to answer are:

  • Who's been communicating with company X the most?
  • How many duplicate emails are there?
  • How much space do the duplicate emails take up?

Product Evaluation

If a company starts evaluating one of our products, it only makes sense to have the folks who've been talking to them the most follow up on their eval. The question arises: who is that person? We specifically decided to figure out who in Enron had been talking to TXU (a Texas electric utility) the most:

As one can see, it'd probably be best to have Farmer, Tisdale, or Hanks do any followups, supposing that they are in the correct department.
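One way to get a ranking like this is to count, per sender, the emails that went to at least one address at the company's domain. The Python below is a sketch under that assumption (outbound mail only); the addresses and the function name are hypothetical.

```python
from collections import Counter

def top_contacts(emails, domain, k=3):
    """emails: iterable of (sender, [recipient addresses]).
    Counts, per sender, the emails with at least one recipient at `domain`."""
    suffix = "@" + domain.lower()
    counts = Counter()
    for sender, recipients in emails:
        if any(r.lower().endswith(suffix) for r in recipients):
            counts[sender] += 1
    return counts.most_common(k)

# Made-up addresses for illustration only.
sample = [
    ("farmer@enron.example", ["buyer@txu.example"]),
    ("farmer@enron.example", ["buyer@txu.example", "boss@enron.example"]),
    ("hanks@enron.example", ["trader@txu.example"]),
    ("hanks@enron.example", ["friend@other.example"]),
]
print(top_contacts(sample, "txu.example"))
```

Counting inbound mail from the company as well would be a natural extension if the corpus includes it.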

Duplicate emails

Emails can take up serious hard disk space. A company may not want to lose all record of an email by deletion, but what if they only deleted duplicates? How would they find them? With Karmasphere, one can write a query like this:

SELECT body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender, COUNT(*)
FROM enronData
GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender
HAVING ( COUNT(*) > 1 )

This query basically groups all emails that have the same body and headers together, and spits them out with a count of how many copies exist. We could then add up those counts to find out how many duplicates we actually have. Is this worth it though? How much hard disk space do we actually save? Well, assuming one byte per character in each email, we can do something like this query to get an approximation:

SELECT copies, SUM(wasted_chars) AS wasted_chars
FROM (
  SELECT COUNT(*) AS copies, (COUNT(*) - 1) * LENGTH(body) AS wasted_chars
  FROM enronData
  GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender
  HAVING ( COUNT(*) > 1 )
) tabulation
GROUP BY copies

This spits out the wasted space, in characters, for each level of duplication (emails with two copies, three copies, and so on). It is then a simple matter of summing those rows to see the total wasted space. Unfortunately, all of this effectively adds up to a single number, so no charts this go-around.
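The same arithmetic can be checked on a small sample in Python. This sketch mirrors the grouping columns from the queries above and keeps the one-byte-per-character assumption; the function name and the sample records are made up.

```python
from collections import Counter

# Same grouping columns as the queries above.
FIELDS = ("body", "timesent", "subject", "torecipient",
          "bccrecipient", "ccrecipient", "sender")

def duplicate_stats(emails):
    """emails: iterable of dicts keyed by FIELDS.
    Returns (number of redundant copies, wasted characters)."""
    groups = Counter(tuple(e[f] for f in FIELDS) for e in emails)
    # Each group keeps one copy; the other (n - 1) copies are redundant.
    extra = sum(n - 1 for n in groups.values())
    # key[0] is the body, so len(key[0]) approximates one copy's size.
    wasted = sum((n - 1) * len(key[0]) for key, n in groups.items())
    return extra, wasted

# Made-up sample: three identical emails and one unique email.
base = dict(body="hello", timesent="t1", subject="hi", torecipient="a",
            bccrecipient="", ccrecipient="", sender="b")
sample = [base, dict(base), dict(base), dict(base, body="something else")]
print(duplicate_stats(sample))
```

Three identical five-character emails leave two redundant copies, so ten characters could be reclaimed.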

Wrap up

This concludes our segue into analytics for various roles. We had fun taking a spin with the various tools! If there's some aspect of any of this you have a question about, or if there's an angle of analysis you'd like to see, drop us a line in the comments below!


Overview

One of our goals while checking out the analytics packages on the market was to actually generate some relevant reports. The reports in this post are targeted to be of general use to a sales manager. As our primary data source for doing these is the Enron corpus, they are all based on analyzing email. The idea is that we are answering a specific query, such as:

  • What months have the most email volume?
  • What time of day are most emails sent?
  • How quickly do we get responses to our emails?
  • Are we sending emails out to customers with inappropriate language?

We used Datameer for practically all of our reports.

When are emails sent?

Generating these general types of reports is pretty simple. We have a list of every email with its headers, including when it was sent. From this, we can group together all of the emails that share a component, like the month of the year or the time of day they were sent. This just gives us a quick look at general long-term email trends. The results look like this:

From this we can pretty much gather that fewer emails are sent in the summer than in the fall/winter (although the underlying email data actually ended in March, so some months are a bit overrepresented). From the chart below, we can see that most folks send emails around 10 AM. If you want your sales pitch to be at the top of the inbox when someone is looking at their email, that would be a good time to send it. Or possibly at four in the morning, surprisingly.
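Grouping by a date component is easy to sketch in Python. This is only an illustration of the idea, not the Datameer workbook; it assumes the sent times have already been parsed into datetime objects, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

def volume_by(timestamps, part):
    """Count emails per date component; part is 'month' or 'hour'."""
    return Counter(getattr(ts, part) for ts in timestamps)

# Made-up sent times.
sample = [
    datetime(2001, 10, 1, 10, 5),
    datetime(2001, 10, 2, 10, 40),
    datetime(2001, 7, 1, 4, 0),
]
print(volume_by(sample, "month"))
print(volume_by(sample, "hour"))
```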

When are we getting responses?

These reports effectively build off the previous ones. They are also much more interesting. The chart below shows the mean time, in minutes, before a particular sender gets a response to their emails from someone outside of Enron. If these were all sales emails, it could indicate that the sales folks with the lowest response times are sending the most effective emails. The chart therefore suggests that Dan Boyle is doing something right. Maybe the other salespeople should copy him.
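A mean response time per sender could be computed along these lines in Python. The sketch assumes each reply can be linked back to the message it answers through some identifier, which is a simplification; the data shapes, names, and sample values are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def mean_response_minutes(sent, replies):
    """sent: {message_id: (sender, sent_at)}.
    replies: iterable of (in_reply_to_id, replied_at).
    Returns {sender: mean minutes until a reply arrived}."""
    delays = defaultdict(list)
    for in_reply_to, replied_at in replies:
        if in_reply_to not in sent:
            continue  # reply to a message we have no record of
        sender, sent_at = sent[in_reply_to]
        delays[sender].append((replied_at - sent_at).total_seconds() / 60)
    return {sender: sum(d) / len(d) for sender, d in delays.items()}

# Made-up sample data.
sent = {
    "m1": ("dan", datetime(2001, 10, 1, 10, 0)),
    "m2": ("dan", datetime(2001, 10, 1, 11, 0)),
    "m3": ("amy", datetime(2001, 10, 1, 9, 0)),
}
replies = [
    ("m1", datetime(2001, 10, 1, 10, 10)),
    ("m2", datetime(2001, 10, 1, 11, 30)),
    ("m3", datetime(2001, 10, 1, 10, 0)),
    ("unknown", datetime(2001, 10, 1, 12, 0)),
]
print(mean_response_minutes(sent, replies))
```

Emails that never received a reply are simply left out of the mean here, which flatters senders who are often ignored; a fuller report would track the reply rate as well.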

Are we sending inappropriate language?

Of course, Dan Boyle might just be getting his emails replied to so quickly because they are filled with profanities. The underlying report for this matched words in emails with a long list of profanities. It then displayed a list of every email and the profanities that were found there. It's not actually all that nice to look at. I thought a top ten would be better. This is cultural anthropology, folks:
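The word-list matching and top ten can be sketched in Python as follows. The flagged list here is a mild placeholder rather than the list we used, and the function name is hypothetical.

```python
from collections import Counter
import re

def top_flagged_words(bodies, flagged, k=10):
    """Count occurrences of flagged words across email bodies
    and return the k most common."""
    flagged = {w.lower() for w in flagged}
    counts = Counter()
    for body in bodies:
        words = re.findall(r"[a-z']+", body.lower())
        counts.update(w for w in words if w in flagged)
    return counts.most_common(k)

# Placeholder word list and made-up emails.
flagged = {"darn", "heck"}
sample = ["Darn it, darn!", "What the heck"]
print(top_flagged_words(sample, flagged))
```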

Wrap Up

It's pretty easy to make simple analyses like the ones above, and on top of Hadoop they don't take all that long to process. If you want to take a stab at making your own reports, you can use Timberwolf to ease the pain of getting email data into HBase straight from an Exchange server. Do you have a specific report you'd like to see? Want to know how to make your own? Hit the comments!

Categories: Big Data