Overview

Much like the previous entries, we wanted to take various analytics tools for a spin while also trying to answer quasi-real-world queries. We're using Datameer and Karmasphere this time around, and our data source is the Enron corpus. This time we're going to look at queries in two categories: one in product evaluation and two in IT. Short and sweet. The questions we want to answer are:

  • Who's been communicating with company X the most?
  • How many duplicate emails are there?
  • How much space do the duplicate emails take up?

Product Evaluation

If a company starts evaluating one of our products, it only makes sense to have the folks who've been talking to them the most follow up on their eval. The question arises: who is that person? We specifically decided to figure out who in Enron had been talking to TXU (a Texas electric utility) the most:

As one can see, it'd probably be best to have Farmer, Tisdale, or Hanks do any followups, supposing that they are in the correct department.
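Whatever tool draws the chart, the counting behind it is simple enough to sketch in Python. This is only a rough illustration under a few assumptions of ours: each email is a dict with `sender` and `torecipient` fields (names borrowed from the queries later in this post), recipients are separated by semicolons, and company X is identified by a hypothetical `txu.com` domain. It only counts mail going out to the company, and the sample rows are invented.

```python
from collections import Counter

def top_contacts(emails, domain, n=3):
    """Count, per sender, the emails addressed to at least one recipient at `domain`."""
    counts = Counter()
    suffix = "@" + domain.lower()
    for email in emails:
        recipients = [r.strip().lower() for r in email["torecipient"].split(";")]
        if any(r.endswith(suffix) for r in recipients):
            counts[email["sender"]] += 1
    return counts.most_common(n)

# Toy rows, invented for illustration only (not taken from the Enron corpus).
sample = [
    {"sender": "alice@enron.com", "torecipient": "bob@txu.com"},
    {"sender": "alice@enron.com", "torecipient": "carol@txu.com; dave@enron.com"},
    {"sender": "erin@enron.com", "torecipient": "bob@txu.com"},
    {"sender": "erin@enron.com", "torecipient": "frank@example.com"},
]
print(top_contacts(sample, "txu.com"))
```

A fuller version would probably also look at the cc and bcc fields and at mail coming back from the company.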

Duplicate emails

Emails can take up serious hard disk space. A company may not want to lose all record of an email by deletion, but what if they only deleted duplicates? How would they find them? With Karmasphere, one can write a query like this:

SELECT body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender, COUNT(*) FROM enronData GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender HAVING ( COUNT(*) > 1 )

This query groups together all emails that have the same body and headers, and returns every group that contains more than one email. We could then count the occurrences in each group to find out how many duplicates we actually have. Is this worth it though? How much hard disk space do we actually save? Well, assuming one byte per character in each email, we can do something like this query to get an approximation:

SELECT copies, SUM(product) FROM (SELECT body, COUNT(*) AS copies, LENGTH(body) AS body_length, (COUNT(*) - 1) * LENGTH(body) AS product FROM enronData GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender HAVING COUNT(*) > 1) tabulation GROUP BY copies

This returns, for each duplication count, the total number of bytes taken up by the redundant copies. Summing that column gives the actual wasted space. Unfortunately, all of this effectively adds up to a single number, so no charts this go-around.
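To try the idea end to end without a cluster, here is a minimal sketch that runs the same kind of grouping against an in-memory SQLite table. SQLite standing in for Karmasphere is our substitution, the rows are made up, and the byte count still relies on the one-byte-per-character assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enronData (body TEXT, timesent TEXT, subject TEXT, "
    "torecipient TEXT, bccrecipient TEXT, ccrecipient TEXT, sender TEXT)"
)
# Invented rows: one email stored three times, one stored twice, one unique.
rows = [
    ("hello", "t1", "hi", "a", "", "", "s1"),
    ("hello", "t1", "hi", "a", "", "", "s1"),
    ("hello", "t1", "hi", "a", "", "", "s1"),
    ("lunch?", "t2", "food", "b", "", "", "s2"),
    ("lunch?", "t2", "food", "b", "", "", "s2"),
    ("unique", "t3", "solo", "c", "", "", "s3"),
]
conn.executemany("INSERT INTO enronData VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

KEYS = "body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender"

# One result row per group of identical emails: how many copies exist and
# how many characters the redundant copies (all but one) take up.
per_group = conn.execute(
    f"SELECT COUNT(*) AS copies, (COUNT(*) - 1) * LENGTH(body) AS wasted "
    f"FROM enronData GROUP BY {KEYS} HAVING COUNT(*) > 1"
).fetchall()

duplicate_emails = sum(copies - 1 for copies, _ in per_group)
wasted_bytes = sum(wasted for _, wasted in per_group)
print(duplicate_emails, wasted_bytes)
```

The final two sums are the "single number" answers to the two IT questions: how many duplicates there are, and roughly how much space they take up.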

Wrap up

This concludes our segue into analytics for various roles. We had fun taking a spin with the various tools! If there's some aspect of any of this you have a question about, or if there's an angle of analysis you'd like to see, drop us a line in the comments below!

Posted by Nick

Overview

One of our goals while checking out the analytics packages on the market was to actually generate some relevant reports. The reports in this post are targeted to be of general use to a sales manager. As our primary data source for doing these is the Enron corpus, they are all based on analyzing email. The idea is that we are answering a specific query, such as:

  • What months have the most email volume?
  • What time of day are most emails sent?
  • How quickly do we get responses to our emails?
  • Are we sending emails out to customers with inappropriate language?

We used Datameer for practically all of our reports.

When are emails sent?

Generating these general types of reports is pretty simple. We have a list of every email with their headers, including when they were sent. From this, we can group all of the emails together that have the same component, like the month of the year they were sent, or the time of day. This just gives us a quick look at general long term email trends. The results look like this:

From this we can pretty much gather that fewer emails are sent in the summer than in the fall/winter (although the underlying email data ends in March, so some months are a bit overrepresented). From the chart below, we can see that most folks send emails around 10 AM. If you want your sales pitch to be at the top of the inbox when someone is looking at their email, that would be a good time to send it. Or possibly at four in the morning, surprisingly.
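The grouping itself is not tied to any one tool. As a rough sketch, assuming the sent times have already been parsed into Python `datetime` objects (the timestamps below are invented), counting by month or by hour looks like this:

```python
from collections import Counter
from datetime import datetime

def volume_by(sent_times, component):
    """Count emails per value of a datetime attribute such as 'month' or 'hour'."""
    return Counter(getattr(sent, component) for sent in sent_times)

# Invented timestamps, for illustration only.
sent_times = [
    datetime(2001, 10, 3, 10, 15),
    datetime(2001, 10, 9, 10, 40),
    datetime(2001, 11, 1, 16, 5),
    datetime(2001, 7, 20, 4, 30),
]
by_month = volume_by(sent_times, "month")
by_hour = volume_by(sent_times, "hour")
print(by_month.most_common(), by_hour.most_common())
```

Because the data does not cover a whole number of years, a fairer monthly comparison would divide each count by the number of times that month appears in the date range.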

When are we getting responses?

These reports effectively build off the previous ones. They are also much more interesting. The chart below shows the mean time, in minutes, before a particular sender gets a response to their emails from someone outside of Enron. If these were all sales emails, it could indicate that the sales folks with the lowest response times are sending the most effective emails. The chart therefore suggests that Dan Boyle is doing something right. Maybe the other salespeople should copy him.
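Here is one way the response-time calculation could be sketched. How replies get matched to the emails they answer is the hard part, and this post does not say how the report did it, so the sketch simply assumes each message carries a hypothetical `in_reply_to` field pointing at the `id` of the original. The messages are invented.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def mean_response_minutes(messages, internal_domain="enron.com"):
    """Mean minutes until the first external reply, keyed by the original sender."""
    by_id = {m["id"]: m for m in messages}
    first_reply = {}  # original message id -> time of the earliest external reply
    for m in messages:
        parent = by_id.get(m.get("in_reply_to"))
        if parent is None or m["sender"].endswith("@" + internal_domain):
            continue  # not a reply, or a reply from inside the company
        seen = first_reply.get(parent["id"])
        if seen is None or m["sent"] < seen:
            first_reply[parent["id"]] = m["sent"]
    delays = defaultdict(list)
    for parent_id, replied in first_reply.items():
        parent = by_id[parent_id]
        delays[parent["sender"]].append((replied - parent["sent"]).total_seconds() / 60)
    return {sender: mean(values) for sender, values in delays.items()}

# Invented messages, for illustration only.
messages = [
    {"id": 1, "sender": "a@enron.com", "sent": datetime(2001, 5, 1, 9, 0)},
    {"id": 2, "sender": "x@example.com", "sent": datetime(2001, 5, 1, 9, 30), "in_reply_to": 1},
    {"id": 3, "sender": "a@enron.com", "sent": datetime(2001, 5, 1, 10, 0)},
    {"id": 4, "sender": "y@example.com", "sent": datetime(2001, 5, 1, 11, 30), "in_reply_to": 3},
    {"id": 5, "sender": "b@enron.com", "sent": datetime(2001, 5, 1, 9, 5), "in_reply_to": 1},
]
response_times = mean_response_minutes(messages)
print(response_times)
```

Emails that never get an external reply are left out of the mean here, which flatters anyone who is mostly ignored; a real report would want to show the reply rate alongside.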

Are we sending inappropriate language?

Of course, Dan Boyle might just be getting his emails replied to so quickly because they are filled with profanities. The underlying report for this matched words in emails with a long list of profanities. It then displayed a list of every email and the profanities that were found there. It's not actually all that nice to look at. I thought a top ten would be better. This is cultural anthropology, folks:
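The matching step can be sketched in a few lines of Python. The word list here is a deliberately tame placeholder of ours, as are the sample bodies; the real report used a much longer list.

```python
import re
from collections import Counter

def top_flagged_words(bodies, flagged, n=10):
    """Return the n most frequent flagged words across all email bodies."""
    flagged = {word.lower() for word in flagged}
    counts = Counter()
    for body in bodies:
        # Split each body into lowercase words and keep the ones on the list.
        for word in re.findall(r"[a-z']+", body.lower()):
            if word in flagged:
                counts[word] += 1
    return counts.most_common(n)

# Placeholder bodies and word list, invented for illustration only.
bodies = ["Darn it, the report is late.", "Heck no. Darn!", "All good here."]
top = top_flagged_words(bodies, ["darn", "heck"])
print(top)
```

Exact word matching like this misses creative spellings and will also flag perfectly innocent uses, so treat the counts as a rough signal.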

Wrap Up

It's pretty easy to make simple analyses like the ones above, and on top of Hadoop they don't take all that long to process. If you want to take a stab at making your own reports, you can use Timberwolf to ease the pain of getting email data into HBase straight from an Exchange server. Do you have a specific report you'd like to see? Want to know how to make your own? Hit the comments!

Posted by Nick
Categories: Big Data

[Updated 12.19.12: We are now using MongoDB for Gander. Still, we spent several months getting decently cozy with HBase, so if you have any questions, feel free to ask in the comments or on Twitter!]

I recently needed to add Apache HBase to my pseudo-distributed installation of Apache Hadoop. In the process of installing it, though, I hit a number of poorly documented obstacles. I ultimately got it working, and here's what I did:

Download

First of all, I downloaded HBase 0.90.4 from http://www.apache.org/dyn/closer.cgi/hbase/. Specifically, I got "hbase-0.90.4.tar.gz". Later versions may work, but I have some third-party tools that won't work on later versions.

Install

I unzipped the contents of the package and moved it into /usr/local/hbase. So the directory structure is like:

 /usr/local/hbase/
     bin/
     conf/
     lib/
     ...

Configure Environment

I exported some variables in my ~/.bashrc:

export HBASE_HOME=/usr/local/hbase

I believe this one is usually required.

export PATH=${PATH}:${HBASE_HOME}/bin

This one isn't strictly necessary, but it puts hbase on the path as well as some other shell scripts they supply. You could also make a symlink to "/usr/local/hbase/bin/hbase" and put that in /usr/local/bin or something.

Configure

Next, I configured HBase by adding the following to these files:

/usr/local/hbase/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HBASE_LOG_DIR=/tmp/hbase/logs

Your JAVA_HOME may be different. And of course, if you already have JAVA_HOME in your environment, you shouldn't need to do this. Also, you don't have to set HBASE_LOG_DIR. Without it, though, HBase writes to /usr/local/hbase/logs, which means you'd need to give the account running HBase write permissions to /usr/local/hbase, which I didn't want to do.

/usr/local/hbase/conf/hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Of course, you'll want to use whatever port you configured hdfs for, which might not be 9000.
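If you want to double-check that the two settings agree, a small script can compare them. This is only a sketch: it assumes your HDFS URI lives in Hadoop's core-site.xml under the `fs.default.name` property, which is where Hadoop releases of this era keep it, and it works on the XML text directly so you can paste in your own files. The core-site.xml contents below are a made-up example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

def read_properties(xml_text):
    """Turn a Hadoop-style <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

def same_namenode(hbase_site_xml, core_site_xml):
    """True if hbase.rootdir points at the same host and port as fs.default.name."""
    rootdir = urlparse(read_properties(hbase_site_xml)["hbase.rootdir"])
    default_fs = urlparse(read_properties(core_site_xml)["fs.default.name"])
    return (rootdir.hostname, rootdir.port) == (default_fs.hostname, default_fs.port)

HBASE_SITE = (
    "<configuration><property><name>hbase.rootdir</name>"
    "<value>hdfs://localhost:9000/hbase</value></property></configuration>"
)
# Assumed example of a core-site.xml; yours may use a different host or port.
CORE_SITE = (
    "<configuration><property><name>fs.default.name</name>"
    "<value>hdfs://localhost:9000</value></property></configuration>"
)
print(same_namenode(HBASE_SITE, CORE_SITE))
```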

"Extra" steps

There were a few things I needed to do before my hbase installation worked correctly.

/etc/hosts

The first one was to edit my "/etc/hosts" and change this line...

127.0.1.1    ubuntu

into this...

127.0.0.1    ubuntu

This seems to be a necessary workaround for HBASE-5004, which is still an issue at the time I'm writing this (01/13/2012). The "127.0.1.1    ubuntu" entry is present in standard Ubuntu distributions, but yours might not look exactly like this. Either way, if you're encountering "Timed out" problems when using the hbase shell, it's probably related to /etc/hosts.
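To see whether your machine has the same kind of entry, something like the following sketch can scan a hosts file for it. It takes the file contents as text, so to check your own machine you would pass in the contents of /etc/hosts; the sample below just mirrors the lines discussed above.

```python
def loopback_alias_lines(hosts_text, address="127.0.1.1"):
    """Return (address, hostnames) pairs for every active entry using `address`."""
    hits = []
    for line in hosts_text.splitlines():
        # Drop anything after a '#' comment marker, then split on whitespace.
        entry = line.split("#", 1)[0].split()
        if entry and entry[0] == address:
            hits.append((entry[0], entry[1:]))
    return hits

sample_hosts = """127.0.0.1    localhost
127.0.1.1    ubuntu
# 127.0.1.1  commented-out
"""
print(loopback_alias_lines(sample_hosts))
```

An empty list means there is no such entry, and the timeouts (if you have them) are probably coming from somewhere else.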

Copy over jars

I needed to copy over certain jars from my hadoop installation to my hbase lib directory. Specifically, I needed to run:

cp ${HADOOP_HOME}/hadoop-core-*.jar   ${HBASE_HOME}/lib/
cp ${HADOOP_HOME}/lib/commons-configuration-*.jar   ${HBASE_HOME}/lib/

Why? Presumably the jars HBase ships with were incompatible with my version of Hadoop. All I know is that without it, I was getting errors like: "HBase is able to connect to ZooKeeper but the connection closes immediately".
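A quick way to tell whether the copy is needed (or has been done) is to compare jar file names between the two installations. This is a rough sketch that only looks at names matching `hadoop-core-*.jar`, using the same directory layout as the commands above; it does not inspect the jars themselves.

```python
import glob
import os

def hadoop_core_jars(directory):
    """List the hadoop-core jar file names found directly inside `directory`."""
    pattern = os.path.join(directory, "hadoop-core-*.jar")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))

def missing_from_hbase(hadoop_home, hbase_home):
    """Return Hadoop's hadoop-core jars that are not present in HBase's lib directory."""
    in_hbase = set(hadoop_core_jars(os.path.join(hbase_home, "lib")))
    return [jar for jar in hadoop_core_jars(hadoop_home) if jar not in in_hbase]

# Only runs the check when both variables are set in the environment.
if "HADOOP_HOME" in os.environ and "HBASE_HOME" in os.environ:
    print(missing_from_hbase(os.environ["HADOOP_HOME"], os.environ["HBASE_HOME"]))
```

An empty list means HBase's lib directory already contains the same hadoop-core jar that your Hadoop installation uses.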

Finish

This is what it took for me to get set up. Considering how quickly the Hadoop environment is changing, these steps will likely become obsolete (if they aren't already). But I hope someone derives some use from this.

Posted by Cameron
Categories: Uncategorized

I’ve personally experienced four revolutions in software. As we’ve gone from Unix BSD and VMS minicomputers to the PC, and then to the explosion of the web, I’ve seen the pendulum oscillate between centralized and decentralized environments. The momentum is now clearly towards friendly mobile computers coupled with powerful, scalable services. The technological philosophy behind the next fat wave is already in use at the majority of major web properties, such as Google, Facebook, and Twitter. It’s called Big Data, and it is starting to trickle into the most forward-looking enterprises. The specific area of Big Data we find fascinating is its ability to store and analyze unstructured data at web scale, and therefore at enterprise scale as well. This ability opens exciting new possibilities, built on open source projects like Hadoop, using commodity hardware.

Riparian Data was spun out of SoftArtisans, a company I founded in my basement which has grown to serve thousands of large organizations and enterprise customers around the world. Riparian Data will use our extensive experience with documents and other unstructured data sources to bring new value to many of these same enterprises.

Our first release is already available as a free open source project: Timberwolf imports Microsoft Exchange email into Hadoop / HBase. You can read more about it here, and download the source code here.

We look forward to working with you, our long time enterprise customers, and our many partners, in realizing the potential of this new wave.

Posted by wihl