One can always analyze massive amounts of data with custom map reduce jobs on Hadoop, but usually it's a lot easier to use a pre-packaged analysis tool. Recently in-house we've been experimenting with a number of different analysis tools, and one of our favorites so far has been Datameer. As an analysis tool, it's pretty powerful out of the box and has a ton of capabilities for expansion.
The general workflow pattern for using Datameer is a bit different from many of the others tools we've used, but we really like the Excel-like nature of one of the main steps to generating a report. In general, one can figure everything out just by playing around. (For trickier tasks, they've provided a pretty good set of documentation.) The report we're going to create herein is very simple, but it will illustrate the basic steps required to work through the Datameer workflow.
Our starting point is that we have a subset of the Enron email corpus in one of our HBase tables. Getting email data into HBase is a little outside the scope of this article, but one could use our own Project Timberwolf to get email data out of Exchange and into HBase if they wanted a simple way of getting some starting data. The actual report we want to generate is indeed very simple - who sent the most messages? In the end, we'd also like to see the results of this in an actual chart. Let's get started.
Setting Up Datameer
There are Linux, OSX and Windows flavors of Datameer available, but for simplicity's sake these instructions use the Windows flavor. A trial version of Datameer can be found here. Simply run the installer like any other, and open the newly installed Datameer Analytics Solution (DAS) launcher. Eventually, a launch button will appear in the launcher. Clicking this Launch button will open the DAS in a browser window. The DAS actually runs as a web-based application, and on Windows will pick a random port to open in on localhost by default. Logging in with the default username and password of admin/admin will bring one to the Uploaded Files screen.
Now we can start creating our report.
Adding a Data Store
All of our email data is tucked away in HBase at the moment, so in order to access it we'll need to let Datameer know about our HBase instance. We can do this by first clicking on the Data Stores tab on the left and then the New button. For our type, we selected HBase 0.90.1 (it was the closest) and then Next. The following page wants to know about one's Zookeeper details - modern versions of HBase all work by accessing Zookeeper instead of HBase directly. We entered the quorum appropriate for us, and the default Zookeeper port is 2181. We didn't select any permissions options, and on the next page we didn't add a description either. We saved the data store as "demo_ds". Upon saving, we're once again presented with a list of data stores, with "demo_ds" as an option.
Creating a Data Link
We've told Datameer about our HBase data store, but we still need to create a data link which is how we actually access our email data. To start this, we first click the "Data Links" tab on the left, and then "New," and we're given the option to select a data store. Select the one we just created, in our case "demo_ds". The next page asks us to choose a table for the data link, which is determined by what tables are actually in one's data store (ours was actually called "enron"). For the purposes of these instructions, one can just skip the rest of the new Data Link queries and go all the way to Save. We saved our data link as "demo_dl". The Data Links overview page should contain the new data link.
Creating a new Workbook from a Data Link
Workbooks are how we actually manipulate our email data in Datameer. They are analogous to a column-only Excel workbook, meaning that one only manipulates data by entire columns, and not individual cells or rows. One of the easiest ways to create a new workbook is from an existing data link. From the Data Links tab, if we click directly on the data link we just created, "demo_dl", and then click the Refresh Sample button on the resulting page, we'll be greeted with a view like this:
To create our new workbook, click the Link Data in new Workbook button.
Creating the Data for our new Report
After clicking the Link Data in new Workbook button, we'll be brought to our new workbook and the default worksheet, which is a view on all of the data as stored in our data store.
Each row represents a single email. Note how similar this view is to Excel. Remember that our end goal is to have a chart which tells us who sent the most messages. To do this correctly, we'd like to have a single worksheet which has the data setup perfectly for our chart at the end of this step. Basically, it would contain an email address in one column and the number of emails its owner sent in another. At the bottom of the worksheet are the worksheet tabs. We can just click on on the "+ New" button to create a new worksheet. It's probably best to rename the worksheet to something better than Sheet1. Right-clicking on the sheet name tab will bring up a context menu with a Rename option. We decided to rename the sheet to "SenderCount". To get data into our new worksheet, we can click anywhere in a column and enter a formula. Generally, we want to get data from elsewhere in our workbook, like our default worksheet created by our data link. We can do this with a formula like "=#demo_dl!h_Sender", which will grab the "h_Sender" column from the "demo_dl" worksheet. The result looks like this:
For clarity, let's rename column A in our new worksheet to something intelligible, like "Sender". We do this the same way as renaming a worksheet, by right-clicking and selecting Rename. Remember that in our data, each row represents a single email. Thus, the exact same sender email address will appear each time that user sent a message. This means that there will be duplicate entries in the column for anyone user who sent more than one email. What we'd like to do is group all of the senders by their email address and count how many times they appeared. We can do this with Datameer's grouping functions. In particular, we are interested in the GROUPBY and GROUPCOUNT functions. GROUPBY will group all the records where each entry in a specific column are the same. GROUPCOUNT will count the number of records in the group. If we change the Sender column's function to "=GROUPBY(#demo_dl!h_Sender)", we will group all of the records in the Sender column. Our worksheet looks like this:
Note that all of the duplicates have been removed via grouping. To get the actual count for each group we can enter in "GROUPCOUNT()" for the formula for column B. Note that the actual count for each group is the same thing as the count for each sender, and since each row represents one email, it's also how many times each sender sent an email! We should rename column B to "Count" while we're at it. The worksheet should now look like this:
We're on the home stretch for getting our data set up. We now know how many times each sender sent an email, but we want to know who among them did it the most. To figure this out, we'll need to sort our sheet. There's a button for doing this, named "Sort Sheet," above the worksheet. We want to sort over the Count column in a descending manner, and we only want the top ten entries. Applying the sort will actually create a brand new worksheet (which is read only). We renamed ours to "SenderCountSorted". The resulting worksheet should look like this:
Now save the workbook! We called ours "demo_wb". We still have to run it at least once for it to have data in it, so click the Analytics tab at the top, select our new workbook by radio button, and click the Run button. After a short time, the workbook should be populated with data from the data store.
Creating a Chart for our Report
We finally have our data in a presentation format that will be easy to create a chart out of. In Datameer, groups of charts and other widgets are called dashboards. Clicking the Dashboards tab at the top and then New will allow us to create a new dashboard. Select the dashboard to only have one column and leave everything else as default. We decided to name ours "demo_dashboard" on the Save screen. Our new default dashboard should look something like what's below.
If we want to show our email counts as a bar chart, we need only drag the bar chart widget on the left into our single column on our dashboard. Clicking the configure button allows us to populate it with data. We want to take our data from our new workbook, "demo_wb". On the next screen, we'd also like to use the SenderCountSorted table. On the last page, for readability, under the Style tab is the X Label Rotation option. Switch it to vertical. Every proper chart should have a title, and ours is no different. Under the Label tab is a title text field. We entitled our chart: "Top Ten Email Senders in a Subset of the Enron Corpus." Clicking save will bring us to a view of our new chart, like this:
This is actually still a view on the dashboard editor, so we'd be better off viewing it in view mode. If one clicks "Dashboards" on the top and then right on the "demo_dashboard" dashboard, they'll see it in view mode. It looks pretty rad.
As far big data analytics go, this process is pretty painless. Assuming HBase is good to go, and already has all of the data we need in it, creating the workbook and resulting chart in Datameer should take less than a half hour. If we needed to actually hand write our map reduce jobs to calculate all of this, it would take orders of magnitude longer, and in the end, we wouldn't even have a pretty chart! This is just the very tip of what one can do with Datameer on email data, so we recommend playing around with it and see what sort of reports they can make. Make a useful report? Have a question about all of this? The comments await!