We're in the process of adding Hive support to Timberwolf, which involves writing files into HDFS so that they can be loaded into Hive tables. Writing to HDFS involves FSDataOutputStreams and FSDataInputStreams, which is all well and good until you want to start writing tests. My normal approach when testing something that writes to a stream is to hand it a stream that's ultimately backed by a byte array (generally through ByteArrayOutputStream), then pull those bytes out and verify that they're what I expect them to be. In this case, I was writing a sequence file, so I figured I could use SequenceFile.Reader to pull out my key/value pairs and check that they're correct. That is, until I tried constructing an FSDataInputStream with a ByteArrayInputStream.

Turns out, FSDataInputStream imposes requirements on its backing streams that aren't reflected in the constructor's type signature: FSDataInputStream#FSDataInputStream. So I needed to get a stream that I could construct from a byte array that also implemented PositionedReadable and Seekable. As it turns out, there isn't one of those in the org.apache.hadoop.fs namespace, so I went ahead and rolled my own: SeekablePositionedReadableByteArrayInputStream. It's not complete, since I wasn't sure what exactly seekToNewSource should do and I didn't need it for my tests, but it gets enough of the job done. Maybe it'll help you, too?

Posted by sean in Timberwolf

Overview

As in the previous post, we picked out a bunch of questions we wanted to answer through analytics on big data. These were the sorts of questions we thought a marketing director might ask. Questions like:

  • Over a given period, what were the trending topics in our email conversations?
  • What are the most popular n-grams?
  • What are the subjects of the liveliest discussion threads?
  • What is the usage of a specific term over time?

Trending topics

For a marketing director, it might be good to know what folks are talking about. Using the Enron email corpus, we went back in time and pretended that 02/07/2002 was today. We then wanted to see what the most popular terms were for some prior months, and how many times they were mentioned. It looks like this:

This chart looks a little zany at first, and we could have cleaned it up with some smart filtering, but we thought we'd let the raw data show through. Basically it is sorted by the first digit of the date. Dates like 1/2 appear because that is literally when the email thought it was sent. I guess the Romans had email after all. Weird data like this is one of the pitfalls of unstructured data. Of note, the most popular term in October of 2001 was enron, right when the scandal broke. The froms, thats, and thisses [sic] are there because we didn't filter out enough common words, something that needs to be done to get to the more meaningful underlying topics.
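To make the filtering point concrete, here is a minimal Python sketch of counting top terms per month with a stop-word filter. It is not the pipeline we ran; the function name, the tiny stop-word set, and the sample emails are all made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "to", "of", "and", "from", "that", "this", "is", "in"}

def top_terms_by_month(emails, n=3):
    """Return {month: [(term, count), ...]} with the n most common terms.

    `emails` is an iterable of (month, body) pairs, e.g. ("2001-10", "text").
    """
    counts = defaultdict(Counter)
    for month, body in emails:
        words = re.findall(r"[a-z']+", body.lower())
        # Drop common words so the meaningful terms can surface.
        counts[month].update(w for w in words if w not in STOP_WORDS)
    return {month: c.most_common(n) for month, c in counts.items()}

# Made-up sample data, not taken from the Enron corpus.
sample = [
    ("2001-10", "Enron stock fell. This is from the Enron desk."),
    ("2001-10", "That Enron memo is from legal."),
    ("2001-11", "Gas prices and gas contracts"),
]
top = top_terms_by_month(sample, n=1)
print(top)
```

Without the STOP_WORDS check, words like "from" and "is" would compete with "enron" for the top spot, which is roughly what happened in our chart.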

Most popular n-grams

First of all, an n-gram in this sense is basically a common contiguous sequence of words. So an example n-gram is literally "a common contiguous sequence of words." Again, we had issues with filtering out all the cruft to get to the heart of the discussions, but with more time and effort we could have found them. We present what we found anyway:

A lot of the n-grams above are actually from the boilerplate confidentiality notices many companies attach to the bottom of their emails. Stuff like "TD TD class TD2 ALIGN" is from emails being sent as HTML instead of plain text. We can see that Vince J Kaminski is a popular guy.
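For anyone who wants to play along at home, n-gram counting can be sketched in a few lines of Python. This is an illustration only, assuming whitespace tokenization and lowercasing; the function names and sample bodies are made up, and it is not what the analytics tools do internally.

```python
from collections import Counter

def ngrams(text, n):
    """Return every contiguous run of n words in text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def most_common_ngrams(bodies, n=2, top=5):
    """Count n-grams across all email bodies and return the top few."""
    counts = Counter()
    for body in bodies:
        counts.update(ngrams(body, n))
    return counts.most_common(top)

# Made-up sample bodies, not taken from the Enron corpus.
sample = [
    "this message is confidential",
    "this message is intended for the recipient",
    "please call Vince about this message",
]
result = most_common_ngrams(sample, n=2, top=1)
print(result)
```

Even in this toy sample the boilerplate phrase wins, which is the same effect the confidentiality notices had on our results.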

Liveliest discussion subjects

This analysis is actually a lot like the n-gram analysis, except instead we are looking at specifically what shows up in the subject line. Basically, we counted up the number of emails that contained each subject line, and ordered by the most common. The top subject is actually a totally blank subject line.
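The counting itself is simple enough to sketch in Python. This is a hedged illustration with made-up subjects and a hypothetical function name; it only trims whitespace and does not try to fold "Re:" replies into their original thread.

```python
from collections import Counter

def liveliest_subjects(subjects, top=3):
    """Count emails per subject line and return the most common ones."""
    # Strip surrounding whitespace so "  " and "" both count as blank.
    normalized = (s.strip() for s in subjects)
    return Counter(normalized).most_common(top)

# Made-up sample subjects, not taken from the Enron corpus.
sample = ["", "Q3 numbers", "  ", "Q3 numbers", "", "Lunch?"]
result = liveliest_subjects(sample, top=2)
print(result)
```

Note how the blank subject comes out on top here too.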

Occurrences of a term over time

This is a bit like the analytics Google provides. We wanted to see how often a particular word, in this case Enron, showed up in emails every month. That way we can track how hot that term is currently, or was in the past. Once again, we can see that Enron has a spike in October 2001, but was actually most popular in May. The leftmost dates are from those weird emails with bogus dates - we can also see that Enron doesn't occur very often in them.
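As a rough sketch of the idea in Python (not the tool we used), counting one term per month might look like this. The function name and sample emails are made up, and months are assumed to be "YYYY-MM" strings so that sorting them as text is also chronological.

```python
import re
from collections import Counter

def term_by_month(emails, term):
    """Return [(month, occurrences)] of term, sorted by month.

    `emails` is an iterable of (month, body) pairs with "YYYY-MM" months.
    """
    # Whole-word, case-insensitive match.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for month, body in emails:
        counts[month] += len(pattern.findall(body))
    return sorted(counts.items())

# Made-up sample data, not taken from the Enron corpus.
sample = [
    ("2001-05", "Enron, Enron, and more Enron."),
    ("2001-10", "News about enron today."),
    ("2001-10", "Nothing relevant here."),
]
series = term_by_month(sample, "Enron")
print(series)
```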

Wrap up

This concludes the basic analysis we did aimed at Marketing Directors, or pretty much anyone interested in finding out what words and topics were the most popular from email. We did all of this on email partially due to the utility and size of the Enron corpus, but these same techniques apply to the Twitter firehose or anything else. Have a question? Something else you'd like to see? The comments await!

 

Posted by Nick

Overview

Much like the previous entries, we wanted to take various analytics tools for a spin while also trying to answer quasi-real-world queries. We're using Datameer and Karmasphere this time around, and our data source is the Enron corpus. This time we're going to look at queries in two different categories - one in product evaluation and two in IT. Short and sweet. The questions we want to answer are:

  • Who's been communicating with company X the most?
  • How many duplicate emails are there?
  • How much space do the duplicate emails take up?

Product Evaluation

If a company starts evaluating one of our products, it only makes sense to have the folks who've been talking to them the most follow up on their eval. The question arises: who is that person? We specifically decided to figure out who in Enron had been talking to TXU (a Texas energy company) the most:

As one can see, it'd probably be best to have Farmer, Tisdale, or Hanks do any followups, supposing that they are in the correct department.
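The underlying count is easy to sketch in Python. This is an assumption-laden illustration, not how Datameer computes it: it only looks at outbound mail, identifies the company by a recipient domain, and uses made-up addresses and a hypothetical function name.

```python
from collections import Counter

def top_contacts(emails, company_domain, top=3):
    """Rank senders by how many emails they sent to anyone at company_domain.

    `emails` is an iterable of (sender, [recipient addresses]) pairs.
    """
    suffix = "@" + company_domain.lower()
    counts = Counter()
    for sender, recipients in emails:
        # Count each email once, however many recipients match.
        if any(r.lower().endswith(suffix) for r in recipients):
            counts[sender] += 1
    return counts.most_common(top)

# Made-up sample data with placeholder domains.
sample = [
    ("alice@enron.example", ["bob@txu.example"]),
    ("alice@enron.example", ["carol@txu.example", "dave@enron.example"]),
    ("dave@enron.example", ["bob@txu.example"]),
    ("dave@enron.example", ["alice@enron.example"]),
]
result = top_contacts(sample, "txu.example", top=2)
print(result)
```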

Duplicate emails

Emails can take up serious hard disk space. A company may not want to lose all record of an email by deletion, but what if they only deleted duplicates? How would they find them? With Karmasphere, one can write a query like this:

SELECT body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender, COUNT(*)
FROM enronData
GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender
HAVING ( COUNT(*) > 1 )

This query basically groups together all emails that have the same body and headers, and spits out each group that occurs more than once. The COUNT(*) column tells us how many copies of each email exist, so summing the extras gives the number of duplicates we actually have. Is this worth it though? How much hard disk space do we actually save? Well, assuming one byte per character in each email, we can do something like this query to get an approximation:

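-- A column alias can't be referenced in the SELECT that defines it,
-- so the inner query repeats COUNT(*) instead of using "copies".
SELECT copies, SUM(product)
FROM (
    SELECT body,
           COUNT(*) AS copies,
           LENGTH(body) AS body_length,
           (COUNT(*) - 1) * LENGTH(body) AS product
    FROM enronData
    GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender
    HAVING ( COUNT(*) > 1 )
) tabulation
GROUP BY copies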

This spits out the wasted space grouped by how many copies each duplicated email has. It would then be a simple matter of summing all of that up to see the actual wasted space. Unfortunately, all of this effectively adds up to a single number, so no charts this go around.
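For readers without a Hive setup, the same arithmetic can be sketched in plain Python. This is an illustration of the grouping logic under the same one-byte-per-character assumption; the field names mirror the columns in the queries above, while the function names and sample emails are made up.

```python
from collections import Counter, defaultdict

FIELDS = ("body", "timesent", "subject", "torecipient",
          "bccrecipient", "ccrecipient", "sender")

def wasted_by_copies(emails):
    """Return {copies: wasted characters} over groups of identical emails."""
    # Group emails that match on the body and every header field.
    groups = Counter(tuple(e[f] for f in FIELDS) for e in emails)
    wasted = defaultdict(int)
    for key, copies in groups.items():
        if copies > 1:
            # Keep one copy; every extra copy of the body is wasted space.
            wasted[copies] += (copies - 1) * len(key[0])
    return dict(wasted)

def make(body):
    """Build a made-up email dict; only the body varies."""
    return {"body": body, "timesent": "2001-10-01", "subject": "hi",
            "torecipient": "b@example.com", "bccrecipient": "",
            "ccrecipient": "", "sender": "a@example.com"}

sample = [make("hello")] * 3 + [make("status report")] * 2 + [make("unique")]
by_copies = wasted_by_copies(sample)
total = sum(by_copies.values())
print(by_copies, total)
```

The final `sum` is the "single number" step: it adds up the per-group figures into the total space that deleting duplicates would reclaim.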

Wrap up

This concludes our segue into analytics for various roles. We had fun taking a spin with the various tools! If there's some aspect of any of this you have a question about, or if there's an angle of analysis you'd like to see, drop us a line in the comments below!

Posted by Nick

I’ve personally experienced four revolutions in software. As we’ve gone from BSD Unix and VMS minicomputers to the PC, and then to the explosion of the web, I’ve seen the pendulum oscillate between centralized and decentralized environments. The momentum is now clearly towards friendly mobile computers coupled with powerful, scalable services. The technological philosophy behind the next fat wave is already in use at the majority of major web properties, such as Google, Facebook, and Twitter. It’s called Big Data, and it is starting to trickle into the most forward-looking enterprises. The specific area of Big Data we find fascinating is its ability to store and analyze unstructured data at web scale, and therefore at enterprise scale as well. This ability opens exciting new possibilities, built on open source projects like Hadoop, using commodity hardware.

Riparian Data was spun out of SoftArtisans, a company I founded in my basement which has grown to serve thousands of large organizations and enterprise customers around the world. Riparian Data will use our extensive experience with documents and other unstructured data sources to bring new value to many of these same enterprises.

Our first release is already available as a free open source project: Timberwolf imports Microsoft Exchange email into Hadoop / HBase. You can read more about it here, and download the source code here.

We look forward to working with you, our long time enterprise customers, and our many partners, in realizing the potential of this new wave.

Posted by wihl