One of our goals while evaluating the analytics packages on the market was to generate some genuinely relevant reports. The reports in this post are meant to be generally useful to a sales manager. Because our primary data source is the Enron corpus, they are all based on analyzing email. Each report answers a specific question, such as:
- What months have the most email volume?
- What time of day are most emails sent?
- How quickly do we get responses to our emails?
- Are we sending emails out to customers with inappropriate language?
We used Datameer for practically all of our reports.
When are emails sent?
Generating these general types of reports is pretty simple. We have a list of every email with its headers, including when it was sent. From this, we can group together all of the emails that share a component of the send date, like the month of the year or the time of day. This gives us a quick look at long-term email trends. The results look like this:
From this we can gather that fewer emails are sent in the summer than in the fall and winter (the underlying email data ended in March, though, so some months are a bit overrepresented). From the chart below, we can see that most folks send emails around 10 AM. If you want your sales pitch to be at the top of the inbox when someone is looking at their email, that would be a good time to send it. Or possibly at four in the morning, surprisingly.
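We built these reports in Datameer, but the grouping itself is easy to sketch in plain Python for anyone who wants to try it on a small sample. The following is a minimal sketch rather than what Datameer does internally: it assumes you already have the raw `Date` header of each email as a string, and the function name and sample headers are made up for illustration.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def volume_by_component(date_headers):
    """Count emails per month and per hour of day from raw Date headers."""
    by_month = Counter()
    by_hour = Counter()
    for raw in date_headers:
        # Parses an RFC 2822 date string; the hour is in the sender's
        # own timezone, as given by the offset in the header.
        sent = parsedate_to_datetime(raw)
        by_month[sent.month] += 1
        by_hour[sent.hour] += 1
    return by_month, by_hour


# Hypothetical sample headers, not taken from the Enron corpus.
sample = [
    "Mon, 14 May 2001 10:15:00 -0700",
    "Tue, 15 May 2001 10:45:00 -0700",
    "Wed, 05 Dec 2001 04:05:00 -0800",
]
by_month, by_hour = volume_by_component(sample)
print(by_month.most_common())
print(by_hour.most_common())
```

Counting by the sender's local hour is a choice; if you would rather compare everything in one timezone, convert each timestamp before reading the hour.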
When are we getting responses?
These reports effectively build on the previous ones. They are also much more interesting. The chart below shows the mean time, in minutes, before a particular sender gets a response to their emails from someone outside of Enron. If these were all sales emails, it could indicate that the salespeople with the lowest response times are sending the most effective emails. By that measure, Dan Boyle is doing something right. Maybe the other salespeople should copy him.
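For the curious, here is one way the response-time calculation could look in Python. This is a sketch under assumptions, not a description of the Datameer report: it assumes each message is a dict with hypothetical `id`, `in_reply_to`, `sender`, and `sent` fields, that replies are matched to originals through the In-Reply-To header, and that "outside of Enron" means any sender address not ending in `@enron.com`.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_response_minutes(messages, internal_domain="enron.com"):
    """Mean minutes until an outside reply, keyed by the original sender."""
    by_id = {m["id"]: m for m in messages}
    delays = defaultdict(list)
    for reply in messages:
        original = by_id.get(reply.get("in_reply_to"))
        if original is None:
            continue  # not a reply, or the original is not in our data
        if reply["sender"].lower().endswith("@" + internal_domain):
            continue  # only count replies from outside the company
        minutes = (reply["sent"] - original["sent"]).total_seconds() / 60
        if minutes >= 0:  # skip replies with inconsistent timestamps
            delays[original["sender"]].append(minutes)
    return {sender: mean(values) for sender, values in delays.items()}


# Hypothetical messages for illustration only.
messages = [
    {"id": "a1", "sender": "alice@enron.com", "sent": datetime(2001, 5, 14, 9, 0)},
    {"id": "r1", "in_reply_to": "a1", "sender": "pat@example.com",
     "sent": datetime(2001, 5, 14, 9, 30)},
    {"id": "a2", "sender": "alice@enron.com", "sent": datetime(2001, 5, 14, 10, 0)},
    {"id": "r2", "in_reply_to": "a2", "sender": "sam@example.org",
     "sent": datetime(2001, 5, 14, 11, 30)},
    {"id": "b1", "sender": "bob@enron.com", "sent": datetime(2001, 5, 14, 9, 0)},
    {"id": "r3", "in_reply_to": "b1", "sender": "alice@enron.com",
     "sent": datetime(2001, 5, 14, 9, 5)},
    {"id": "r4", "in_reply_to": "b1", "sender": "pat@example.com",
     "sent": datetime(2001, 5, 14, 9, 10)},
]
result = mean_response_minutes(messages)
print(result)
```

Real mail threads are messier than this: many messages lack an In-Reply-To header, so a fuller version would probably fall back to matching on subject lines and participants.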
Are we sending inappropriate language?
Of course, Dan Boyle might just be getting replies so quickly because his emails are filled with profanities. The underlying report for this matched the words in each email against a long list of profanities. It then displayed a list of every email and the profanities that were found there. It's not actually all that nice to look at. I thought a top ten would be better. This is cultural anthropology, folks:
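The word-matching step behind this report can also be sketched in a few lines of Python. Again, this is an illustration under assumptions and not the Datameer workflow itself: the function name is made up, the tokenizer is a simple regular expression, and the two-word list below is a tame stand-in for the real list of profanities.

```python
import re
from collections import Counter


def top_flagged_words(bodies, flagged, limit=10):
    """Return the most frequent flagged words across all email bodies."""
    flagged = {word.lower() for word in flagged}
    counts = Counter()
    for body in bodies:
        # Lowercase the text and split it into runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", body.lower()):
            if word in flagged:
                counts[word] += 1
    return counts.most_common(limit)


# Placeholder word list and email bodies, for illustration only.
flagged_words = {"darn", "heck"}
bodies = [
    "Darn it, the darn report is late.",
    "What the heck?",
]
top = top_flagged_words(bodies, flagged_words)
print(top)
```

Exact word matching like this misses creative spellings and flags innocent uses, so treat the counts as a rough indicator.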
It's pretty easy to make simple analyses like the ones above, and on top of Hadoop they don't take all that long to process. If you want to take a stab at making your own reports, you can use Timberwolf to ease the pain of getting email data into HBase straight from an Exchange server. Do you have a specific report you'd like to see? Want to know how to make your own? Hit the comments!