Email Analytics for a Marketing Director Using Timberwolf, HBase, and Datameer

Overview

Like the previous post, we picked out a bunch of questions we wanted to answer through analytics on big data. These were the sorts of questions we thought a marketing director might ask. Questions like:

  • Over a given period, what were the trending topics in our email conversations?
  • What are the most popular n-grams?
  • What are the subjects of the liveliest  discussion threads?
  • What is the usage of a specific term over time?

Trending topics

For a marketing director, it might be good to know what folks are talking about. Using the Enron email corpus, we went back in time and pretended that 02/07/2002 was today. We then wanted to see what the most popular terms were for some prior months, and how many times they were mentioned. It looks like this:

This chart looks a little zany at first, and we could have cleaned it up with some smart filtering. We thought we'd let the raw data show through. Basically it is sorted by the first digit of the date. The dates like 1/2 appear because that is literally when the email thought it was sent. I guess the Romans had email after all. Weird data like this is one of the pitfalls of unstructured data. Of note, the most popular term in October of 2001 was enron, right when the scandal broke. The froms, thats, and thisses [sic] are because we didn't filter common words like that out enough - something that needs to be done to get to the more meaningful underlying topics.

Most popular n-grams

First of all, a n-gram in this sense is basically a common contiguous sequence of words. So an example n-gram is literally "a common contiguous sequence of words." Again, we had issues with filtering out all the cruft to get to the heart of the discussions, but with more time and effort we could have found them. We present what we found anyway:

A lot of the n-grams above are actually from the boilerplate confidentiality notices many companies attach to the bottom of their emails. Stuff like "TD TD class TD2 ALIGN" is from emails being sent as HTML instead of plain text. We can see that Vince J Kaminski is a popular guy.

Liveliest discussion subjects

This analysis is actually a lot like the n-gram analysis, except instead we are looking at specifically what shows up in the subject line. Basically, we counted up the number of emails that contained each subject line, and ordered by the most common. The top subject is actually a totally blank subject line.

Occurrences of a term over time

This is a bit like the analytics Google provides. We wanted to see how much a particular word, in this case Enron, showed up in emails every month. That way we can track how hot that term is currently, or in the past. Once again, we can see that Enron has a spike in October 2001, but was actually most popular in May. The left most dates are from those weird emails with bogus dates - we can also see that Enron doesn't occur very often in them.

Wrap up

This concludes the basic analysis we did aimed at Marketing Directors, or pretty much anyone interested in finding out what words and topics were the most popular from email. We did all of this on email partially due to the utility and size of the Enron corpus, but these same techniques apply to the Twitter firehose or anything else. Have a question? Something else you'd like to see? The comments await!

 

Email Analytics for Product Evaluation and IT Using HBase, Timberwolf, Datameer, and Karmasphere

Overview

Much like the previous entries, we wanted to take various analytics tools for a spin while also trying to answer quasi-real world queries. We're using Datameer and Karmasphere this time around, and our data source is the Enron corpus. This time we're going to look at queries in two different categories - one in product evaluation and another two in IT. Short and sweet. The questions we want to answer are:

  • Who's been communicating with company X the most?
  • How many duplicate emails are there?
  • How much space do the duplicate emails take up?

Product Evaluation

If a company starts evaluating one of our products, it only makes sense to have the folks who've been talking to them the most follow up on their eval. The question arises: who is that person? We specifically decided to figure out who in Enron had been talking to TXU (a Texas oil company) the most:

As one can see, it'd probably be best to have Farmer, Tisdale, or Hanks do any followups, supposing that they are in the correct department.

Duplicate emails

Emails can take up serious hard disk space. A company may not want to lose all record of an email by deletion, but what if they only deleted duplicates? How would they find them? With Karmasphere, one can write a query like this:

SELECT body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender, COUNT(*) FROM enronData GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender HAVING ( COUNT(*) > 1 )

This query basically groups all emails that have the same headers together, and spits them out. We could then count the occurrences for each of these to find out how many duplicates we actually have. Is this worth it though? How much hard disk space do we actually save? Well, assuming one byte per character in each email, we can do something like this query to get an approximation:

SELECT copies, SUM(product) FROM (SELECT body, COUNT(*) as copies, LENGTH(body) as body_length, (copies - 1) * LENGTH(body) as product FROM enronData GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender HAVING ( copies > 1 )) tabulation GROUP BY copies

This spits out a long list of emails and their sizes grouped by their duplication. It would only be a simple matter of summing all of that up to see the actual wasted space. Unfortunately, all of this effectively adds up to a single number, so no charts this go around.

Wrap up

This concludes our segue into analytics for various roles. We had fun taking a spin with the various tools! If there's some aspect of any of this you have a question about, or if there's an angle of analysis you'd like to see, drop us a line in the comments below!

Email Analytics for Sales Managers Using HBase, Timberwolf, and Datameer

Overview

One of our goals while checking out the analytics packages on the market was to actually generate some relevant reports. The reports in this post are targeted to be of general use to a sales manager. As our primary data source for doing these is the Enron corpus, they are all based on analyzing email. The idea is that we are answering a specific query, such as:

  • What months have the most email volume?
  • What time of day are most emails sent?
  • How quickly do we get responses to our emails?
  • Are we sending emails out to customers with inappropriate language?

We used Datameer for practically all of our reports.

When are emails sent?

Generating these general types of reports is pretty simple. We have a list of every email with their headers, including when they were sent. From this, we can group all of the emails together that have the same component, like the month of the year they were sent, or the time of day. This just gives us a quick look at general long term email trends. The results look like this:

From this we can pretty much gather that fewer emails are sent in the summer than in the fall/winter (the underlying email data actually ended in March, so some months are a bit overrepresented, however). From the below chart, we can see that most folks send emails around 10 AM. If you want your sales pitch to be at the top of the inbox when someone is looking at their email, that would be a good time to send it. Or possibly at four in the morning, surprisingly.

When are we getting responses?

These reports effectively build off the previous ones. They are also much more interesting. The below chart shows the mean time, in minutes, before a particular sender gets a response to their emails from someone outside of Enron. If these were all sales emails, it could indicate that sales folks with the lowest response time are sending the most effective emails.  The below chart therefore indicates that Dan Boyle is doing something right. Maybe the other salespeople should copy him.

Are we sending inappropriate language?

Of course, Dan Boyle might just be getting his emails replied to so quickly because they are filled with profanities. The underlying report for this matched words in emails with a long list of profanities. It then displayed a list of every email and the profanities that were found there. It's not actually all that nice to look at. I thought a top ten would be better. This is cultural anthropology, folks:

Wrap Up

It's pretty easy to make rather simple analysis like the above, and on top of Hadoop they don't take all that long to process. If you want to take a stab at trying to make your own reports, you can use Timberwolf to ease the pain of getting email data into HBase straight from an Exchange server. Do you have a specific report you'd like to see? Want to know how to make your own? Hit the comments!

Getting Started with Big Data Analysis: Datameer

Overview

One can always analyze massive amounts of data with custom map reduce jobs on Hadoop, but usually it's a lot easier to use a pre-packaged analysis tool. Recently in-house we've been experimenting with a number of different analysis tools, and one of our favorites so far has been Datameer. As an analysis tool, it's pretty powerful out of the box and has a ton of capabilities for expansion.

The general workflow pattern for using Datameer is a bit different from many of the others tools we've used, but we really like the Excel-like nature of one of the main steps to generating a report. In general, one can figure everything out just by playing around. (For trickier tasks, they've provided a pretty good set of documentation.) The report we're going to create herein is very simple, but it will illustrate the basic steps required to work through the Datameer workflow.

Our starting point is that we have a subset of the Enron email corpus in one of our HBase tables. Getting email data into HBase is a little outside the scope of this article, but one could use our own Project Timberwolf to get email data out of Exchange and into HBase if they wanted a simple way of getting some starting data. The actual report we want to generate is indeed very simple - who sent the most messages? In the end, we'd also like to see the results of this in an actual chart. Let's get started.

Setting Up Datameer

There are Linux, OSX and Windows flavors of Datameer available, but for simplicity's sake these instructions use the Windows flavor. A trial version of Datameer can be found here. Simply run the installer like any other, and open the newly installed Datameer Analytics Solution (DAS) launcher. Eventually, a launch button will appear in the launcher. Clicking this Launch button will open the DAS in a browser window. The DAS actually runs as a web-based application, and on Windows will pick a random port to open in on localhost by default. Logging in with the default username and password of admin/admin will bring one to the Uploaded Files screen.

Now we can start creating our report.

Adding a Data Store

All of our email data is tucked away in HBase at the moment, so in order to access it we'll need to let Datameer know about our HBase instance. We can do this by first clicking on the Data Stores tab on the left and then the New button. For our type, we selected HBase 0.90.1 (it was the closest) and then Next. The following page wants to know about one's Zookeeper details - modern versions of HBase all work by accessing Zookeeper instead of HBase directly. We entered the quorum appropriate for us, and the default Zookeeper port is 2181.  We didn't select any permissions options, and on the next page we didn't add a description either. We saved the data store as "demo_ds". Upon saving, we're once again presented with a list of data stores, with "demo_ds" as an option.

Creating a Data Link

We've told Datameer about our HBase data store, but we still need to create a data link which is how we actually access our email data. To start this, we first click the "Data Links" tab on the left, and then "New," and we're given the option to select a data store. Select the one we just created, in our case "demo_ds". The next page asks us to choose a table for the data link, which is determined by what tables are actually in one's data store (ours was actually called "enron"). For the purposes of these instructions, one can just skip the rest of the new Data Link queries and go all the way to Save. We saved our data link as "demo_dl". The Data Links overview page should contain the new data link.

Creating a new Workbook from a Data Link

Workbooks are how we actually manipulate our email data in Datameer. They are analogous to a column-only Excel workbook, meaning that one only manipulates data by entire columns, and not individual cells or rows. One of the easiest ways to create a new workbook is from an existing data link. From the Data Links tab, if we click directly on the data link we just created, "demo_dl", and then click the Refresh Sample button on the resulting page, we'll be greeted with a view like this:

To create our new workbook, click the Link Data in new Workbook button.

Creating the Data for our new Report

After clicking the Link Data in new Workbook button, we'll be brought to our new workbook and the default worksheet, which is a view on all of the data as stored in our data store.

Default Data Link Worksheet

Each row represents a single email. Note how similar this view is to Excel. Remember that our end goal is to have a chart which tells us who sent the most messages. To do this correctly, we'd like to have a single worksheet which has the data setup perfectly for our chart at the end of this step. Basically, it would contain an email address in one column and the number of emails its owner sent in another. At the bottom of the worksheet are the worksheet tabs. We can just click on on the "+ New" button to create a new worksheet. It's probably best to rename the worksheet to something better than Sheet1. Right-clicking on the sheet name tab will bring up a context menu with a Rename option. We decided to rename the sheet to "SenderCount". To get data into our new worksheet, we can click anywhere in a column and enter a formula. Generally, we want to get data from elsewhere in our workbook, like our default worksheet created by our data link. We can do this with a formula like "=#demo_dl!h_Sender", which will grab the "h_Sender" column from the "demo_dl" worksheet. The result looks like this:

Simple Function

For clarity, let's rename column A in our new worksheet to something intelligible, like "Sender". We do this the same way as renaming a worksheet, by right-clicking and selecting Rename. Remember that in our data, each row represents a single email. Thus, the exact same sender email address will appear each time that user sent a message. This means that there will be duplicate entries in the column for anyone user who sent more than one email. What we'd like to do is group all of the senders by their email address and count how many times they appeared. We can do this with Datameer's grouping functions. In particular, we are interested in the GROUPBY and GROUPCOUNT functions. GROUPBY will group all the records where each entry in a specific column are the same. GROUPCOUNT will count the number of records in the group. If we change the Sender column's function to "=GROUPBY(#demo_dl!h_Sender)", we will group all of the records in the Sender column. Our worksheet looks like this:

Simple Groupby

Note that all of the duplicates have been removed via grouping. To get the actual count for each group we can enter in "GROUPCOUNT()" for the formula for column B. Note that the actual count for each group is the same thing as the count for each sender, and since each row represents one email, it's also how many times each sender sent an email! We should rename column B to "Count" while we're at it. The worksheet should now look like this:

Groupcount

We're on the home stretch for getting our data set up. We now know how many times each sender sent an email, but we want to know who among them did it the most. To figure this out, we'll need to sort our sheet. There's a button for doing this, named "Sort Sheet," above the worksheet. We want to sort over the Count column in a descending manner, and we only want the top ten entries. Applying the sort will actually create a brand new worksheet (which is read only). We renamed ours to "SenderCountSorted". The resulting worksheet should look like this:

Sort

Now save the workbook! We called ours "demo_wb". We still have to run it at least once for it to have data in it, so click the Analytics tab at the top, select our new workbook by radio button, and click the Run button. After a short time, the workbook should be populated with data from the data store.

Creating a Chart for our Report

We finally have our data in a presentation format that will be easy to create a chart out of. In Datameer, groups of charts and other widgets are called dashboards. Clicking the Dashboards tab at the top and then New will allow us to create a new dashboard. Select the dashboard to only have one column and leave everything else as default. We decided to name ours "demo_dashboard" on the Save screen. Our new default dashboard should look something like what's below.

Default Dashboard

If we want to show our email counts as a bar chart, we need only drag the bar chart widget on the left into our single column on our dashboard. Clicking the configure button allows us to populate it with data. We want to take our data from our new workbook, "demo_wb". On the next screen, we'd also like to use the SenderCountSorted table. On the last page, for readability, under the Style tab is the X Label Rotation option. Switch it to vertical. Every proper chart should have a title, and ours is no different. Under the Label tab is a title text field. We entitled our chart: "Top Ten Email Senders in a Subset of the Enron Corpus." Clicking save will bring us to a view of our new chart, like this:

Chart

This is actually still a view on the dashboard editor, so we'd be better off viewing it in view mode. If one clicks "Dashboards" on the top and then right on the "demo_dashboard" dashboard, they'll see it in view mode. It looks pretty rad.

Conclusions

As far big data analytics go, this process is pretty painless. Assuming HBase is good to go, and already has all of the data we need in it, creating the workbook and resulting chart in Datameer should take less than a half hour. If we needed to actually hand write our map reduce jobs to calculate all of this, it would take orders of magnitude longer, and in the end, we wouldn't even have a pretty chart! This is just the very tip of what one can do with Datameer on email data, so we recommend playing around with it and see what sort of reports they can make. Make a useful report? Have a question about all of this? The comments await!