Photo credit: Gagan Moorthy

Much has already been said of the Obama for America Tech Team, but it's best to hear it from the (Trojan?) horses' mouths. Harper Reed was the CTO of the Obama for America Tech Team, celebrated for both his considerable engineering chops and his punk woodsman ethos. Michael Slaby, the Chief Innovation & Integration Officer, was perhaps less celebrated, but he managed to get the bleedin' edge tech team to work productively with the traditional campaign team. Couldn't have been easy, and by all accounts he did a stellar job.

The first part of the talk contrasted the use of technology in the '08 campaign vs the '12 campaign. The campaign staff were opportunistic consumers of technology in '08, and many of the applications were haphazard. There were two core problems in '08: 

  1. Operations and data were siloed by department.
  2. There was a huge gap between online and offline data.

In '12, the campaign addressed the first problem with Narwhal, a single shared data store for all of the campaign's applications (counting its chickens, the Romney tech team named its database Orca, after the narwhal's only predator). They addressed the second problem with Dashboard, an online dashboard for offline volunteers.

The biggest problem in '12 was managing the cultural conflict between new tech and traditional political campaign strategies. The new tech team came in as outsiders and were celebrated as outsiders. The outsiders considered the insiders' strategies to be "old"; the insiders considered the outsiders' strategies to be "risky." When you add to this the programmer's reflexive "I can fix this" response, you have the recipe for an internal battlefield. 

Why didn't that happen? Because Harper and Slaby believed technology is most effective when it's an empowering function that spreads across the entire organization–that is to say, not novelty for novelty's sake. Key to empowerment: ship good products. 

Technology also helps break down the hierarchy that is so entrenched in politics. A core reason the Obama tech team had a leg up on the Romney team was that the former participated in all leadership meetings; they weren't shunted off to a corner.

The tech team knew from the database failures in '08 that they needed to stand on the shoulders of tech giants. They aggressively hired people who had previous experience with rapid scaling. The average age of an engineer on the tech team was around 40. Eric Schmidt was Harper's intern.

The campaign's website was static HTML on S3. Harper said that's the fastest way to host anything, and it won't go down. Or rather, it won't go down if you prepare for failure, which the team did. You should fail fast, hard, and in a way that is safe for your organization. If Netflix is down and Amazon is down, you're down. If Amazon is down and Netflix is up, you messed up.

The team did a lot of A/B testing to see which elements would increase donations. They used a combination of Optimizely and custom-built systems. Also, everything they did was responsive, which enabled them to work across all devices. They designed for tablets, because a mobile web site built for tablet size will look okay on phones and desktops. Listen to the people you're engaging with--they will tell you what they want and what they don't.

Complex data analysis--and the actions resulting from it--gave Obama a competitive edge over Romney, but Slaby said what really mattered wasn't the technology itself, but the culture of innovation that enabled it.


Posted
Author: Claire Willett
 

Gander stores user data in Mongo, the Bon Iver of distributed databases. So far, we're digging Mongo, but it's definitely a mix of guns and roses. If you're using it or thinking about trying it, hopefully, some of the following advice and information will be helpful!

Tips 'n' Tricks

1. Pretty Printing from the Command Line Shell:

When using the shell, the results are often lumped together into a single line. Use the .pretty() method to format them nicely, as in:

 > db.emails.find({_id:/^b8037de0-1170/}, {headers:1}).pretty();

(The _id here is a regular expression anchored with ^, so it matches ids that start with the given prefix.)

If you are only looking at one element, indexing it directly will also give pretty results:

 > db.emails.find({_id:/^b8037de0-1170/}, {headers:1})[0]

or

 > db.emails.findOne({_id:/^b8037de0-1170/}, {headers:1})

2. Backing Up The Production Database Locally:

$ mongodump -h <your mongo instance> --port <port> -d Gander -u <user> -p '<password>'

(enclose the password in single quotes to prevent the shell from interpreting special characters).

This will create a subdirectory called ./dump that contains the exported database.

3. Restoring a Mongo Dump to a Local Meteor Instance

$ mongorestore -h localhost --port 3002

This assumes that the current directory has a subdirectory ./dump created from a previous backup and that Meteor is running locally.

The production database is named Gander, which is the database that will be created by this restore. By default, the local Meteor database is named 'meteor'. To get the restored data under that name, open the mongo command line tool and copy the database:

> db.copyDatabase( "Gander", "meteor" )

There may be a faster way to copy databases between servers, but I haven't tried it yet.

4. Retrieving Specific Subdocument Fields

To retrieve certain fields from the top level of a document, use the field selector:

> db.collection.find({}, {field1:1, field2:1});

If you want just some fields of an embedded subdocument, use dot notation and put the key in quotes:

> db.collection.find({}, {'field1.subfield1':1, 'field1.subfield2':1, field2:1});

5. Repairing a Corrupt Local Database

Over the weekend, my trusty MacBook shut down due to battery exhaustion. Usually it goes to sleep, but I guess it ran out of juice. When I rebooted Ubuntu and started Meteor, I got this nasty message:

[[[[[ ~/Projects/Mahogany ]]]]]
Unexpected mongo exit code 100. Restarting.
Unexpected mongo exit code 100. Restarting.
Unexpected mongo exit code 100. Restarting.
Can't start mongod
MongoDB had an unspecified uncaught exception.

So I basically had a corrupt local database. Riffing off these instructions, I did the following operations specific to the Meteor install of Mongo:

  cd .meteor/local
  rm db/mongod.lock
  /usr/local/meteor/mongodb/bin/mongod --dbpath db --repair --repairpath db1

and all was good again. Ensure that no other instances of Meteor and Mongo are running when you do this procedure.

Gotchas we've found

1. Keys Cannot Contain Periods:

A key cannot contain a period "." or start with a "$" (ref). This is particularly annoying if the key is an email address like "joe@smith.com". This also occurred with folder_name, specifically for the last_uid collection. For Gander, in order to escape the period, I used:

  a = addr.gsub('.','#DOT#') # Ruby encoding
  i = item.replace(/#DOT#/g, '.'); // JavaScript decoding

Unfortunately, this was discovered the hard way: no error message when trying to do an update:

   @user_coll.update({"_id" => user['_id']},{"$set" => {'address_book' => vips } })

The update would fail silently and not generate an exception.  
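As a rough illustration of the escaping approach, here is a plain JavaScript sketch of a matching encode/decode pair. The helper names (encodeKey, decodeKey, encodeKeys) are made up for this example and are not part of Gander or any Mongo driver. The sketch assumes the #DOT# placeholder never shows up in a real key, and it does nothing about keys that start with "$".

```javascript
// Hypothetical helpers: swap "." for a placeholder before storing a key,
// and swap it back after reading. Assumes "#DOT#" never appears in real keys.
const DOT_PLACEHOLDER = "#DOT#";

function encodeKey(key) {
  return key.split(".").join(DOT_PLACEHOLDER);
}

function decodeKey(key) {
  return key.split(DOT_PLACEHOLDER).join(".");
}

// Encode every top-level key of a plain object
// (for example, an address book keyed by email address).
function encodeKeys(obj) {
  const out = {};
  for (const [key, value] of Object.entries(obj)) {
    out[encodeKey(key)] = value;
  }
  return out;
}

const addressBook = { "joe@smith.com": { vip: true } };
const stored = encodeKeys(addressBook);
console.log(Object.keys(stored)[0]);            // joe@smith#DOT#com
console.log(decodeKey(Object.keys(stored)[0])); // joe@smith.com
```

Keeping the encode and decode steps next to each other makes it harder for the Ruby and JavaScript sides to drift apart on the placeholder.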

2. Exercise Caution When Using update

I was testing marking messages for deletion. I wanted to revert the change and retest, so I entered:

   db.emails.update({gander_status:'deleting'}, {gander_status:'gmail'});

Intuitively, you would think that this would simply change the value of the gander_status field. Wrong: it replaces the whole document, deleting all the other fields and leaving only gander_status (and _id, of course). The correct syntax uses $set:

   db.emails.update({gander_status:'deleting'}, {$set: {gander_status:'gmail'}});
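To see the difference without touching a database, here is a small plain JavaScript model of the two update styles. applyUpdate is a hypothetical helper written for this post, not a Mongo API, and it only handles top-level fields under $set.

```javascript
// Plain-JavaScript model of the two update styles; no database is involved.
// applyUpdate is a hypothetical helper, not part of any Mongo driver.
function applyUpdate(doc, update) {
  if (Object.prototype.hasOwnProperty.call(update, "$set")) {
    // Operator update: keep the document and overwrite only the named fields.
    return { ...doc, ...update.$set };
  }
  // Replacement update: only _id survives; everything else comes from `update`.
  return { _id: doc._id, ...update };
}

const email = { _id: 1, subject: "Hello", gander_status: "deleting" };

// Replacement style: the subject field is gone afterwards.
console.log(applyUpdate(email, { gander_status: "gmail" }));

// $set style: the subject field is preserved.
console.log(applyUpdate(email, { $set: { gander_status: "gmail" } }));
```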

3. Mongo Does Not (Easily) Support SSL

See http://docs.mongodb.org/manual/administration/ssl/ and a relevant discussion at: http://stackoverflow.com/questions/11310299/securing-mongodb-transport-in-the-cloud.

4. Count does not take into account skip and limit

Let's say you have the following code:

  • Example 1: limit

   var x = Emails.find();
   console.log("x=", x.count());
   var y = Emails.find({}, {limit: 20});
   console.log("y=", y.count());

You would expect:

   y = 20 (if x > 20)

  • Example 2: skip

   var x = Emails.find();
   console.log("x=", x.count());
   var y = Emails.find({}, {skip: 50});
   console.log("y=", y.count());

You would expect:

   y = x - 50 (if x > 50)

This is not the case. By default, Mongo's .count() does not take skip and limit into account, so in both of the above examples x = y: .count() returns the count for the entire query. Different drivers (e.g. the Mongo shell, Perl, and JavaScript) have other means of returning the adjusted count. I have not found a way in Meteor's driver to get it.
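One workaround is to compute the adjusted number yourself from the unadjusted count and the options you passed in. The sketch below is plain JavaScript; adjustedCount is a hypothetical helper, and it assumes that a missing or zero limit means "no limit".

```javascript
// Hypothetical helper: how many documents a cursor would actually yield,
// given the unadjusted total and the skip/limit options passed to find().
function adjustedCount(total, options = {}) {
  const skip = options.skip || 0;
  const remaining = Math.max(0, total - skip);
  // Assumption: a missing or zero limit means "no limit".
  return options.limit ? Math.min(options.limit, remaining) : remaining;
}

console.log(adjustedCount(120, { limit: 20 }));            // 20
console.log(adjustedCount(120, { skip: 50 }));             // 70
console.log(adjustedCount(120, { skip: 110, limit: 20 })); // 10
```

In Meteor this would presumably be called as adjustedCount(Emails.find().count(), {skip: 50}). Fetching the documents and taking the array length also gives the right number, at the cost of loading every matching document.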

Best Practices

Replica Set Configuration

  • Don't use IP addresses
  • Don't use /etc/hosts
  • Use DNS
    • Pick appropriate TTLs 

See Also

Books: MongoDB: The Definitive Guide. The title says it all.

Posted
Author: wihl

I spent yesterday afternoon at the Marriott in my beloved Midtown East, learning about data science from a handful of the people most equipped to teach it: Jeff Hammerbacher (@hackingdata), Amr Awadallah (@awadallah), and Josh Wills (@josh_wills).

First up was Cloudera's founder and Chief Scientist, Jeff Hammerbacher, whose other claim to data fame is his stint at Facebook, where he built and led the data team for two years. He also came up with the term "data scientist," mostly because he wanted to get the then research scientists to get off their cushy high horses and fix database bugs at 2am.

Jeff spoke a bit about getting Facebook's data science team up and running. Everyone started out as jacks of all trades, and didn't start to specialize until the team had surpassed thirty employees. Data scientists are most needed when you have small data teams, because they are so multipurpose/zoom-in-zoom-out.

Jeff asked how many people in the room had the official title of data scientist, and only about 2 out of 40 or so did. Some people think "Data Scientist" is just a marketing neologism for a job that already existed, but Jeff said that no, the word mattered, because it codified a role and the general duties associated with it: data modeling and analysis.

Posted
Author: Claire Willett
Categories: Big Data

Data is a trending topic right now, and data privacy is one of its trendiest subsets. To wit: Charles Duhigg’s investigative report on Target’s data mining for the New York Times spawned a series of follow-ups; in March, The Atlantic profiled NYU Law professor Helen Nissenbaum and her flow-based privacy framework; and the FTC just published a privacy report endorsing privacy-by-design and the “Do Not Track” button.

The demarcation line between what should be public vs private is a dynamic and jagged (some might say gerrymandered) one that depends on a piece of data’s original context vs the contexts in which it is eventually used. It seems perfectly reasonable for Foursquare to publish its users’ locations but less reasonable for a third-party dating application like Girls Around Me to provide these locations, along with Facebook profile photos, to its users. It seems reasonable that an online money management service like Mint serves up ads tailored to users’ credit ratings, but less reasonable that banks determine applicants’ loan rates based on their Facebook friends’ credit ratings.

Because we’re storing and analyzing corporate email, user privacy is something that we have to get right. Of course, an employer’s definition of “right” might be different than the employee’s, so we’ve been trying to figure out a definition that will please both. Companies are legally permitted to access their employees’ email, and usually this manifests in explicit/inappropriate language monitoring. As long as employees are aware of the monitoring, this sort of vocab dinging seems reasonable. But what about sentiment analysis, and the inferred knowledge of employees’ mind states it provides? Invaluable to the company, I think, but potentially detrimental, and sometimes errantly so, to the employee. Does explicit consent justify armchair psychology and any actions that result? Even if employees are fully and duly informed of all monitoring and tracking practices, I’m not sure.

Take, for example, Cataphora.

Cataphora is “behavioral modeling and monitoring” software that analyzes employees’ digital and mobile actions from legal, risk, compliance, HR, and brand management perspectives. The copy on its website doesn’t even try to address employees: there are callouts on its news page to articles with titles like “In Defense of Employer Monitoring” and “Finding Office Buck-Passers, Heroes, and Shirkers.” If employers are not monitoring employees’ digital activity, Cataphora CEO Elizabeth Charnock argues, they are making themselves vulnerable to leaks, blow-ups, and YouTube frittering-induced productivity slumps. In a blog post entitled “Getting Big Brother Right,” Rick Janowski brought up as a use case an employee on the verge of a breakdown due to non-work-related factors. Cataphora could identify and alert management to the employee’s mental state, allowing them to “provide a safety net for someone who might be prone temporarily to making bad decisions or being less diligent than they normally would be.” Aka remove him from fiscal and legal harm’s way before it’s too late. Ooh, Carnival Cruise is having a flash sale! I hear Alaska’s great this time of year!

You could argue that behavioral mining software is just one of the many new “transparent” office measures, which manifest physically in concepts like open and free range offices (a different desk every day!), and culturally in social enterprise platforms like Yammer, Rypple, and Trello. There’s been a push, lately, to besmirch the traditional office, with its many doors and walls and silos. Which is all very well and fine, but there is a point where public property ends and person begins. Perhaps the central tower is too zoomed in to see it.

Posted
Author: Claire Willett

[Image: social TV, via Bluefin Labs]

[Updated 03/14/2012]

This ongoing series examines some of the key, exciting players in Boston’s emerging Big Data arena. The companies I’m highlighting differ in growth stages, target markets and revenue models, but converge around their belief that the data is the castle, and their tools the keys. You can read about the first fifteen companies here, here, and here.

16) Bluefin Labs

  • Product: Bluefin Signals is an analytics platform that tracks social media conversations about TV.
  • Founders: Deb Roy, the former director of the MIT Media Lab, and Michael Fleischman
  • Technologies used: Deep machine learning, language grounding, semantic analysis, video fingerprinting
  • Target Industries: TV networks, ad agencies, brand advertisers
  • Location: Cambridge

17) Crimson Hexagon

  • Product: Crimson Hexagon is a BI platform for social media monitoring and analysis. It uses a patented human-trained statistical analysis algorithm to analyze unstructured text.
  • Founder: Gary King
  • Technologies used: Machine learning, automated content analysis, NLP
  • Target industries: Social Media and Digital Marketing, Social Media Consulting, Online Search
  • Location: Boston

18) DataXu

  • Product: DX3 is a digital marketing management platform with analytics, interactive report visualizations, and audience, inventory, and campaign management features.
  • Founders: Mike Baker, Bruce Journey, Willard Simmons
  • Technologies used: retargeting, complex event processing, combinatorial optimization, real-time data decisioning
  • Target industries: Mobile Networks, Mobile Ads, Digital Ads, Auto, Financial Services
  • Location: Boston

19) Icosystem

  • Product: Predictive analytics for complex networks. Hunch Engine is a recommendation engine for applications that recommend items to users based on implied preferred characteristics.
  • Founder: Dr. Eric Bonabeau
  • Technologies used: agent-based modeling, genetic algorithms, recommendation engine, complex network analysis
  • Target industries: Healthcare, Pharma, Gaming, Natural Resources, Advertising
  • Location: Cambridge

20) CloudBees

  • Product: CloudBees is a PaaS for Java consisting of a development platform (DEV@cloud) and a deployment platform (RUN@cloud). Users can deploy applications to any public cloud, hosted cloud, or private on-premise data center. Jenkins is a Continuous Integration server available as an open source project and an enterprise subscription.
  • Founder: Sacha Labourey
  • Technologies used: IaaS, PaaS, CI, Java, JVM-based languages + frameworks
  • Target industries: SaaS, Storage, eCommerce
  • Location: Woburn

 

 

Overview

Like the previous post, we picked out a bunch of questions we wanted to answer through analytics on big data. These were the sorts of questions we thought a marketing director might ask. Questions like:

  • Over a given period, what were the trending topics in our email conversations?
  • What are the most popular n-grams?
  • What are the subjects of the liveliest discussion threads?
  • What is the usage of a specific term over time?

Trending topics

For a marketing director, it might be good to know what folks are talking about. Using the Enron email corpus, we went back in time and pretended that 02/07/2002 was today. We then wanted to see what the most popular terms were for some prior months, and how many times they were mentioned. It looks like this:

This chart looks a little zany at first, and we could have cleaned it up with some smart filtering. We thought we'd let the raw data show through. Basically it is sorted by the first digit of the date. The dates like 1/2 appear because that is literally when the email thought it was sent. I guess the Romans had email after all. Weird data like this is one of the pitfalls of unstructured data. Of note, the most popular term in October of 2001 was enron, right when the scandal broke. The froms, thats, and thisses [sic] are because we didn't filter common words like that out enough - something that needs to be done to get to the more meaningful underlying topics.

Most popular n-grams

First of all, an n-gram in this sense is basically a common contiguous sequence of words. So an example n-gram is literally "a common contiguous sequence of words." Again, we had issues with filtering out all the cruft to get to the heart of the discussions, but with more time and effort we could have found them. We present what we found anyway:

A lot of the n-grams above are actually from the boilerplate confidentiality notices many companies attach to the bottom of their emails. Stuff like "TD TD class TD2 ALIGN" is from emails being sent as HTML instead of plain text. We can see that Vince J Kaminski is a popular guy.
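To make the n-gram idea a bit more concrete, here is a minimal plain JavaScript sketch of counting them. It is not the job we actually ran on the cluster: the function names are made up for this example, the tokenizer is deliberately naive (lowercase, split on anything that is not a letter, digit, or apostrophe), and there is no stop-word or boilerplate filtering.

```javascript
// Naive tokenizer: lowercase, then split on anything that is not
// a letter, digit, or apostrophe.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean);
}

// Count contiguous n-word sequences across a list of email bodies.
function countNgrams(texts, n) {
  const counts = new Map();
  for (const text of texts) {
    const words = tokenize(text);
    for (let i = 0; i + n <= words.length; i++) {
      const gram = words.slice(i, i + n).join(" ");
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }
  return counts;
}

// Return the k most frequent n-grams as [gram, count] pairs.
function topNgrams(texts, n, k) {
  return [...countNgrams(texts, n).entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, k);
}

const bodies = [
  "Please call Vince Kaminski today",
  "Vince Kaminski will call you back",
];
// "vince kaminski" comes out on top with a count of 2.
console.log(topNgrams(bodies, 2, 3));
```

Boilerplate like confidentiality notices would dominate this count just as it did ours, so a real run would want to strip signatures and HTML first.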

Liveliest discussion subjects

This analysis is actually a lot like the n-gram analysis, except instead we are looking at specifically what shows up in the subject line. Basically, we counted up the number of emails that contained each subject line, and ordered by the most common. The top subject is actually a totally blank subject line.

Occurrences of a term over time

This is a bit like the analytics Google provides. We wanted to see how much a particular word, in this case Enron, showed up in emails every month. That way we can track how hot that term is currently, or was in the past. Once again, we can see that Enron has a spike in October 2001, but was actually most popular in May. The leftmost dates are from those weird emails with bogus dates - we can also see that Enron doesn't occur very often in them.
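A stripped-down version of this report can be sketched in plain JavaScript. The input shape ({ sent, body }) and the function name are assumptions for this example, and the sketch counts emails that mention the term at least once rather than every occurrence.

```javascript
// Count how many emails per month mention a term (case-insensitive).
// Each email is assumed to look like { sent: Date, body: string }.
function termByMonth(emails, term) {
  const needle = term.toLowerCase();
  const counts = {};
  for (const { sent, body } of emails) {
    if (!body.toLowerCase().includes(needle)) continue;
    // Key such as "2001-10"; UTC getters avoid local-timezone surprises.
    const month = `${sent.getUTCFullYear()}-${String(sent.getUTCMonth() + 1).padStart(2, "0")}`;
    counts[month] = (counts[month] || 0) + 1;
  }
  return counts;
}

const sample = [
  { sent: new Date("2001-10-16T09:00:00Z"), body: "Enron restates earnings" },
  { sent: new Date("2001-10-22T14:30:00Z"), body: "More on ENRON today" },
  { sent: new Date("2001-05-03T11:00:00Z"), body: "Lunch on Friday?" },
];
// Two matching emails, both in October 2001.
console.log(termByMonth(sample, "enron"));
```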

Wrap up

This concludes the basic analysis we did aimed at Marketing Directors, or pretty much anyone interested in finding out what words and topics were the most popular from email. We did all of this on email partially due to the utility and size of the Enron corpus, but these same techniques apply to the Twitter firehose or anything else. Have a question? Something else you'd like to see? The comments await!

 

Posted
Author: Nick

The sessions I attended on the final day of the Strata Conference converged around ethicality, legality, and human nature. Earlier, someone tweeted that the data is here, and the talent will catch up. This is true, and the real question is, once the talent's caught up, what will they do with their catch (or cache)? It's a question of volition, not ability, and as such, it is rather difficult to answer.

My first session of the day, "If Data Wants to Be Free, Is Privacy a Prison?" focused, as Solon Barocas put it, on "the privacy implications of using public data to predict an individual's private propensities." There has been, recently, an outcropping of data usage cases where the line between what is public and what is private was blurred: the FBI's GPS surveillance, the suicide of Tyler Clementi, Target's recent pregnancy marketing debacle. The Supreme Court voted 9-0 that the car was an extension of the home, the one realm where privacy is generally considered sacrosanct. Generally. The Ravi trial has just gotten underway, and public opinion seems to side against the webcam-happy teen, but the public has been reared on after-school specials. (Ian Parker's piece in the New Yorker paints a more nuanced portrait of the situation and the parties involved.)

The Target case is, to me, the most interesting, because it is an example of Mosaic theory: using big data (here, running analytics against a data warehouse of Guest ID activity) to harvest a wealth of seemingly innocuous public information that nonetheless allows the harvester to infer potentially sensitive information about specific customers. Illegal? No--I'm sure Target has a very thorough terms of service buried somewhere on its site. But unethical? Maybe. Daniel Tunkelang tweeted that banning inference is akin to thought crime, and I see his point, but if the inference is algorithmically derived, is it thought or fact? Barocas said that Target's reluctance to ask its customers "are you pregnant" should have been an indication that the question was too sensitive to infer. I agree. Tip for man and machine alike: never ask a woman if she's pregnant!

The internet wants to be free, but many of its users want their data to be freed, and given the potentially brutal results of its being accessed and used to identify (imprisonment, loss of employment, destruction of property, even death), one can hardly fault them.

However, while data can be used against individuals, in the aggregate, the solutions it produces can be critical. Another of the day's sessions, "It's Not [Junk] Data Anymore," with Ben Goldacre, Kay Thaney, and Mark Hahnel, approached the public/private issue from a research perspective. When it comes to data sharing, researchers are at the opposite end of the spectrum from the general electronic device user: they have to be coaxed to share all but the most glorious results. Goldacre noted that between one third and two thirds of medical trials don't get published, and these tend to be the ones with actively negative or statistically insignificant results. Add that to science journals' penchant for false positives, and the average citizen has access to a worrisomely incomplete or inaccurate portrait of disease and medications.

To fill it in, Goldacre suggested that we "encourage sharing, mandate publication, and provide a common structure" for raw research, which tends to be ill-suited for long papers, anyway. I fear mandated publication could backfire, but the other components are promising. Hahnel's company Figshare makes it very simple for researchers to upload their research in a variety of forms, and provides them with a breadth of metrics as carrots. A simple approach with a clear goal and a clean UI, it seems to be garnering a lot of attention in the data science world, and I hope it has legs.

My last session of the day was Robbie Allen's "From Big Data to Big Insights." Allen's company, Automated Insights, makes software that creates automated content for a wealth of sports blogs, real estate and neighborhood watch blogs, financial tear sheets, insider trading reports, weather sites and other news sources that traditionally have a high proportion of quantitative content. As a writer and linguistic enthusiast, I find the concept of automated content, which assigns anything from a quotation mark up to several paragraphs to a key value, both fascinating and frightening. On the one hand, it's a great grunt work tool, but its ability to mimic the styles of human authors could lead to a map > territory situation. If the simulacrum is good enough, what chance has the real?

That being said, if software can mimic an individual author's style, perhaps it can also scramble or dilute it. As Solon Barocas said, anonymizing is extremely difficult; using an automated content program to spin stories from bare gists would be a godsend for those who want/need their words to remain anonymous.

In McLuhan's global village, do the doors have locks?

Posted
Author: Claire Willett

Overview

Much like the previous entries, we wanted to take various analytics tools for a spin while also trying to answer quasi-real world queries. We're using Datameer and Karmasphere this time around, and our data source is the Enron corpus. This time we're going to look at queries in two different categories - one in product evaluation and another two in IT. Short and sweet. The questions we want to answer are:

  • Who's been communicating with company X the most?
  • How many duplicate emails are there?
  • How much space do the duplicate emails take up?

Product Evaluation

If a company starts evaluating one of our products, it only makes sense to have the folks who've been talking to them the most follow up on their eval. The question arises: who is that person? We specifically decided to figure out who in Enron had been talking to TXU (a Texas energy utility) the most:

As one can see, it'd probably be best to have Farmer, Tisdale, or Hanks do any followups, supposing that they are in the correct department.

Duplicate emails

Emails can take up serious hard disk space. A company may not want to lose all record of an email by deletion, but what if they only deleted duplicates? How would they find them? With Karmasphere, one can write a query like this:

SELECT body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender, COUNT(*) FROM enronData GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender HAVING ( COUNT(*) > 1 )

This query basically groups together all emails that have the same body and headers, and spits them out. We could then count the occurrences for each of these to find out how many duplicates we actually have. Is this worth it though? How much hard disk space do we actually save? Well, assuming one byte per character in each email, we can do something like this query to get an approximation:

SELECT copies, SUM((copies - 1) * body_length) AS wasted_bytes FROM (SELECT body, COUNT(*) AS copies, LENGTH(body) AS body_length FROM enronData GROUP BY body, timesent, subject, torecipient, bccrecipient, ccrecipient, sender HAVING ( COUNT(*) > 1 )) tabulation GROUP BY copies

This spits out a long list of emails and their sizes grouped by their duplication. It would only be a simple matter of summing all of that up to see the actual wasted space. Unfortunately, all of this effectively adds up to a single number, so no charts this go around.
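For anyone who would rather see the arithmetic outside of SQL, the same idea can be sketched in plain JavaScript. This is only an illustration: the field names mirror the columns in the queries above, duplicateStats is a made-up name, and the estimate keeps the same one-byte-per-character assumption.

```javascript
// Group emails on the same columns as the GROUP BY above and estimate the
// space taken by the extra copies, assuming one byte per character of body.
const KEY_FIELDS = [
  "body", "timesent", "subject", "torecipient",
  "bccrecipient", "ccrecipient", "sender",
];

function duplicateStats(emails) {
  const groups = new Map();
  for (const email of emails) {
    // JSON.stringify of the field values gives an unambiguous grouping key.
    const key = JSON.stringify(KEY_FIELDS.map((field) => email[field]));
    const group = groups.get(key) || { copies: 0, bodyLength: email.body.length };
    group.copies += 1;
    groups.set(key, group);
  }
  let duplicateEmails = 0;
  let wastedBytes = 0;
  for (const { copies, bodyLength } of groups.values()) {
    duplicateEmails += copies - 1; // every copy beyond the first
    wastedBytes += (copies - 1) * bodyLength;
  }
  return { duplicateEmails, wastedBytes };
}

const inbox = [
  { body: "hello", timesent: "t1", subject: "hi", sender: "a@example.com" },
  { body: "hello", timesent: "t1", subject: "hi", sender: "a@example.com" },
  { body: "hello", timesent: "t2", subject: "hi", sender: "a@example.com" },
];
console.log(duplicateStats(inbox)); // { duplicateEmails: 1, wastedBytes: 5 }
```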

Wrap up

This concludes our segue into analytics for various roles. We had fun taking a spin with the various tools! If there's some aspect of any of this you have a question about, or if there's an angle of analysis you'd like to see, drop us a line in the comments below!

Posted
Author: Nick

Overview

One of our goals while checking out the analytics packages on the market was to actually generate some relevant reports. The reports in this post are targeted to be of general use to a sales manager. As our primary data source for doing these is the Enron corpus, they are all based on analyzing email. The idea is that we are answering a specific query, such as:

  • What months have the most email volume?
  • What time of day are most emails sent?
  • How quickly do we get responses to our emails?
  • Are we sending emails out to customers with inappropriate language?

We used Datameer for practically all of our reports.

When are emails sent?

Generating these general types of reports is pretty simple. We have a list of every email with their headers, including when they were sent. From this, we can group all of the emails together that have the same component, like the month of the year they were sent, or the time of day. This just gives us a quick look at general long term email trends. The results look like this:

From this we can pretty much gather that fewer emails are sent in the summer than in the fall/winter (the underlying email data actually ended in March, so some months are a bit overrepresented, however). From the below chart, we can see that most folks send emails around 10 AM. If you want your sales pitch to be at the top of the inbox when someone is looking at their email, that would be a good time to send it. Or possibly at four in the morning, surprisingly.
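The grouping behind these two charts is simple enough to sketch in plain JavaScript. This is not what Datameer does under the hood; the { sent } input shape and the helper names are assumptions for this example, and times are bucketed in UTC.

```javascript
// Tally emails by a component of their sent time.
// `emails` is assumed to be an array of { sent: Date } objects.
function countBy(emails, keyOf) {
  const counts = {};
  for (const email of emails) {
    const key = keyOf(email.sent);
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

const byMonth = (date) => date.getUTCMonth() + 1; // 1 = January
const byHour = (date) => date.getUTCHours();      // 0-23, in UTC

const sentLog = [
  { sent: new Date("2001-11-05T10:15:00Z") },
  { sent: new Date("2001-11-19T10:40:00Z") },
  { sent: new Date("2001-07-02T16:05:00Z") },
];
console.log(countBy(sentLog, byMonth)); // month 7: 1 email, month 11: 2 emails
console.log(countBy(sentLog, byHour));  // hour 10: 2 emails, hour 16: 1 email
```

Swapping in a different keyOf function (day of week, say) gives another report for free.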

When are we getting responses?

These reports effectively build off the previous ones. They are also much more interesting. The below chart shows the mean time, in minutes, before a particular sender gets a response to their emails from someone outside of Enron. If these were all sales emails, it could indicate that sales folks with the lowest response time are sending the most effective emails.  The below chart therefore indicates that Dan Boyle is doing something right. Maybe the other salespeople should copy him.
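As a rough sketch of the calculation behind this chart, here it is in plain JavaScript. The input shape ({ sender, sent, firstReply }) is hypothetical and assumes replies have already been matched to the emails they answer, which is the hard part in practice. The sender names in the sample are placeholders.

```javascript
// Mean minutes between a sender's email and the first reply to it.
// Input shape is hypothetical: { sender, sent: Date, firstReply: Date | null }.
function meanResponseMinutes(messages) {
  const totals = new Map(); // sender -> { minutes, replies }
  for (const { sender, sent, firstReply } of messages) {
    if (!firstReply) continue; // unanswered emails are left out of the mean
    const entry = totals.get(sender) || { minutes: 0, replies: 0 };
    entry.minutes += (firstReply - sent) / 60000; // milliseconds -> minutes
    entry.replies += 1;
    totals.set(sender, entry);
  }
  return [...totals.entries()]
    .map(([sender, { minutes, replies }]) => ({ sender, mean: minutes / replies }))
    .sort((a, b) => a.mean - b.mean); // fastest responses first
}

const replyLog = [
  { sender: "alice", sent: new Date("2001-10-01T09:00:00Z"), firstReply: new Date("2001-10-01T09:10:00Z") },
  { sender: "alice", sent: new Date("2001-10-02T09:00:00Z"), firstReply: new Date("2001-10-02T09:30:00Z") },
  { sender: "bob", sent: new Date("2001-10-01T12:00:00Z"), firstReply: new Date("2001-10-01T13:00:00Z") },
  { sender: "bob", sent: new Date("2001-10-03T12:00:00Z"), firstReply: null },
];
// alice averages 20 minutes, bob 60.
console.log(meanResponseMinutes(replyLog));
```

Leaving unanswered emails out flatters senders who rarely get a reply at all, so a fuller report would probably show the reply rate next to the mean.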

Are we sending inappropriate language?

Of course, Dan Boyle might just be getting his emails replied to so quickly because they are filled with profanities. The underlying report for this matched words in emails with a long list of profanities. It then displayed a list of every email and the profanities that were found there. It's not actually all that nice to look at. I thought a top ten would be better. This is cultural anthropology, folks:

Wrap Up

It's pretty easy to make rather simple analyses like the above, and on top of Hadoop they don't take all that long to process. If you want to take a stab at trying to make your own reports, you can use Timberwolf to ease the pain of getting email data into HBase straight from an Exchange server. Do you have a specific report you'd like to see? Want to know how to make your own? Hit the comments!

Posted
Author: Nick
Categories: Big Data

I'm combining these two because the first doesn't require much ink.

Big Data: Wall Street Style

Featuring: Jeff Sternberg, Jen Zeralli

This was a pure sales pitch for S&P Capital IQ. To be fair, some of the functionality behind their dashboard, especially the "companies you may be interested in" recommendation engine, is pretty cool, but a) I was hoping for some dirt on black box algorithms and b) SPCIQ's web front end has an offputtingly bad, 1995-all-html-all-the-time UI.

No shilling, but a few facts:

  • Of the more than $2.35 trillion that has been invested in IT over the last 10 years, the amount invested in Big Data technologies comes to somewhere around 4%.
  • SPCIQ gets 67k docs/day, which are stored in a document repository comprised of SQL (for the metadata), a filesystem, and Solr/Lucene for searching.
  • For their recommendation engine, they use signals from Hadoop and Hive to score each suggestion for each user.

That's pretty much it, IMHO. Onto 'flix.

Netflix Recommendations: Beyond the 5 Stars

In this session, Xavier Amatriain (@xamat) dissected the anatomy of a Netflix recommendation. Good stuff, though he was really hard to hear. Some facts:

  • Netflix recommendations are per account, not per person, which is why, as one Twitterer noted, your eight year-old is told she might enjoy Cape Fear.
  • The "continue watching" button is a very important recommendation validation
  • Netflix uses a combination of implicit (tracking user behavior) and explicit (asking users "this or that or the other" questions) methods to set taste preferences. They also take freshness and diversity into account to determine genre selections.
  • Netflix's similars are computed from different data sources including metadata, ratings, and viewing data, and can be treated as data/features. They are used in response to user actions.
  • Ranking of films uses popularity as a baseline, and is determined through a combination of scoring, sorting, and filtering, with the goal of finding the best possible ordering of a set of videos for a user within a specific context in real-time. Whew.
  • Predictive film ranking is akin to CTR forecasting for ad/search results.
  • Perhaps most importantly, Xavier is in the "a few good algorithms > massive amounts of data" camp. Netflix found that beyond a few thousand training samples, the accuracy of their recommendation model levels out.
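
To make the scoring, sorting, and filtering idea a bit more concrete, here is a minimal Python sketch. The weights, the field names, and the `rank_videos` helper are all made up for illustration; nothing here reflects Netflix's actual model, only the general shape of "popularity as a baseline plus a personalized signal."

```python
from dataclasses import dataclass


@dataclass
class Video:
    title: str
    popularity: float        # baseline signal, scaled 0..1 (assumed)
    predicted_rating: float  # personalized signal, scaled 0..1 (assumed)
    already_watched: bool = False


def rank_videos(videos, w_pop=0.4, w_pred=0.6, limit=10):
    """Filter, score, and sort candidates; the weights are illustrative only."""
    # Filter: drop titles the account has already watched.
    candidates = [v for v in videos if not v.already_watched]
    # Score: popularity is the baseline, the predicted rating personalizes it.
    scored = [(w_pop * v.popularity + w_pred * v.predicted_rating, v)
              for v in candidates]
    # Sort: highest score first, then keep the top `limit`.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [v.title for _, v in scored[:limit]]


catalog = [
    Video("Blockbuster", popularity=0.9, predicted_rating=0.3),
    Video("Niche Favorite", popularity=0.2, predicted_rating=0.95),
    Video("Seen It", popularity=0.8, predicted_rating=0.9, already_watched=True),
]
ranking = rank_videos(catalog)
print(ranking)
```

With these made-up weights the niche title outranks the blockbuster, which is the point of layering a personalized score on top of raw popularity.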


Posted
AuthorClaire Willett

Featuring: Marcel Salathé

Salathé's bread and butter is infectious diseases, but his research group at Penn State has found that the disease contagion patterns can be applied to social behaviors as well. Which is good, as he thinks that in the near future, understanding social dynamics is going to be as important as understanding the spread of germs. In this session, he took us through the definition and dynamics of social contagion.

Let's say Al bought a segway, and his neighbor Phil, upon seeing Al chugging merrily along astride it (can you be astride a segway?), is impelled to go buy one for himself. Contagion. Butttt, let's say Al buys a segway, immediately goes on vacation, and while he's in Aruba, Phil buys a segway. Not contagion. In both cases, the two men are now connected, but in the former case, the connection (owning a segway) is the result of one man influencing the other; in the latter, it is the result of the men's similarity.

The second case is an example of homophily, or the "birds of a feather flock together" phenomenon, and Salathé says that if you account for it, almost all contagion disappears. I'm not quite sure how this works--a cluster of points may have a smaller outside edge than the same number of scattered points, but, unless we're talking a completely isolated community, the edge is still there.

Salathé used his H1N1 Vaccine Twitter study to illustrate the dynamics of social contagion. He and his students collected all H1N1-related tweets, ran sentiment analysis on them, and used natural language processing to calculate the average sentiment score over time. They wanted to find out a) whether increased positive sentiment about the vaccine corresponded to increased vaccination rates, which it did, and b) whether Twitter users with similar sentiments are more likely to be connected (positive assortativity) than are users with different sentiments (negative assortativity). They found that positive assortativity did increase the chance of contagion, though negative assortativity did not. A few interesting Twitter assortativity findings:

  • Your follower count has no effect on your tweeting positively, but does have an effect on your tweeting negatively
  • If you have more negative friends, you're less likely to tweet negatively
  • The more positive tweets you're exposed to, the more likely you are to tweet negatively. This last one goes against the popular "spread the cheer" viral messaging technique.

Posted
AuthorClaire Willett

Featuring: Billy Bosworth (DataStax, @datastax), Jeremy Edberg (Netflix, @jedberg), STS Prasad (Walmart, @stsprasad), Ed Anuff (Apigee, @edanuff)

DataStax CEO Billy Bosworth moderated this panel about the business motives behind and effects of making the distributed computing jump. Edberg, Prasad and Anuff all said it was a matter of sink or swim (or scale up)--they couldn't go on supporting their customers in a meaningful way if they stuck with a data warehouse. Netflix needed a distributed, resilient system; Walmart needed to rapidly process data into what Prasad called "the social genome." All three companies ended up choosing Cassandra.

Challenges in moving to distributed computing:

  • Single-node loss, because it overloaded neighbor nodes
  • Rethinking ways to find and query data
  • Compaction--it causes a performance hit, so Walmart ended up using SSDs to compensate.

Nice surprises:

  • Counters made it easy to implement a system with real-time reporting.
  • The ability to administer Cassandra and get the best from it hasn't translated into a need to increase hiring.
  • No need to worry about disappearing data.
  • Speed of negative lookups is much faster than expected.

In conclusion:

  • The more data you put into your system, the better the system gets. (This has been a popular refrain).
  • Relational databases aren't going away--there's a place for both relational and distributed, and most companies will need both. NoSQL for real-time; SQL for batch-processing.
  • Just having the data isn't enough--you need the ability to rapidly extract insight from it.
  • Look to see more innovation on the business application side.

Interesting fact: Netflix uses multi-region rings--one Cassandra cluster across multiple geographic regions--both for resilience and so its US customers can travel abroad without loss of service.

Posted
AuthorClaire Willett

[Image via Datablog]

My original plan for this afternoon was to attend Jeremy Howard and Mike Bowles' session on predictive modeling, but, after a morning of focused web crawls, I decided to go listen to Simon Rogers (@smfrogers) and Michael Brunton-Spall (@bruntonspall) talk about data journalism instead. To cop a Britishism, it was brilliant. Rogers is the pioneering journalist behind The Guardian's uber-popular Datablog, and Brunton-Spall is one of the developers tasked with transforming reams of raw data into journalist-searchable information.

If you haven't ever read the Datablog, you should: it's a model for transparent, accessible business, giving readers a variety of ways to consume news, the numbers behind the news, and the methodology for obtaining these numbers. Datablog does a lot of the UK government's work for them, and a decent amount of our government's as well, turning paper and web documents into public Google spreadsheets, interactive charts and visualizations, and editorial stories. As Rogers noted, while data used to be the domain of long-form journalism, our new crawling, parsing, and processing skills make it highly suitable for short-form news as well. It's pretty easy to imagine it becoming a real-time news source (I'm sure Automated Insights would agree).

This session used a bunch of Datablog posts and datasets to illustrate the parts of data journalism, which boil down to:

1) collect sent data, recurring events, breaking news, and theories to be explored

2) figure out what to compare or show change, what the data means, what other data sets to use with it

3) shove the chosen data into spreadsheets

4) clean up the data: check for data in wrong format, merged cells, unnecessary columns of data, data measured in different units. 80% of their time is spent here

5) perform calculations on the data. recalculate if needed, sanity check the results

6) map the data in one or more formats (graphics, free viz tools, Google Fusion Tables, story, and/or just publish)
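
Since step 4 is where most of the time goes, here is a rough Python sketch of the kind of cleanup it describes: numbers in the wrong format, mixed units, and unnecessary columns. The sample rows and column names are invented for illustration and are not taken from any Datablog dataset.

```python
import csv
import io

# Made-up sample data: a thousands separator, mixed units, and a junk column.
RAW = """area,arrests,distance,unit,notes
North,"1,204",12,km,n/a
South,860,5,miles,checked
"""

KEEP = ["area", "arrests", "distance_km"]  # columns we actually want (assumed)
KM_PER_MILE = 1.609344


def clean_rows(text):
    """Yield rows with numeric arrests, one distance unit, and no extra columns."""
    for row in csv.DictReader(io.StringIO(text)):
        # Wrong format: strip thousands separators before converting to int.
        arrests = int(row["arrests"].replace(",", ""))
        # Mixed units: convert everything to kilometres.
        distance = float(row["distance"])
        if row["unit"].strip().lower() == "miles":
            distance *= KM_PER_MILE
        cleaned = {"area": row["area"].strip(), "arrests": arrests,
                   "distance_km": round(distance, 2)}
        # Unnecessary columns ("unit", "notes") are simply not copied over.
        yield {key: cleaned[key] for key in KEEP}


rows = list(clean_rows(RAW))
for row in rows:
    print(row)
```

A sanity check on the output (step 5) is then just a matter of eyeballing or asserting a few known values.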

While some of the Datablog posts are fairly light-hearted (e.g. US Plastic Surgery Statistics, though that is also a bit scary), most of them offer the public substantiated cultural, institutional, and environmental conclusions,  e.g. that the bulk of the arrests during the UK's summer of unrest took place in its poorest neighborhoods, or that the battle between the 99% and the 1% should actually be between the 99.99% and the .01%. 

To help The Guardian's journalists identify the needles in the data haystack, the developers came up with a guideline they call "The Philosophy of Interesting Information." What qualifies as interesting?

  • metadata, as revealed in Wikileaks cables. US soldiers are much better at entering tags than diplomats
  • the habitual--it betrays the people who published the info
  • distress
  • anomalies
  • visualizations

Journalists parse datasets for these qualities using Ajax Solr, which puts a more user-friendly interface atop Solr. It includes search, interactive graphs, and tag clouds, and looks quite nice, but is not available to the public.

Occasionally, the Datablog has turned to its readers for help in parsing massive amounts of PDFs. What they've found is that a) you need to recognize and reward contributors for their help or else they'll get bored midway through and b) for the crowd-sourced data to be effective, you need people to comb through it. Long story short: cool concept, great for tips, pretty bad for data.

Since many of the Datablog datasets have a geographic component, the journalists often use Google's Fusion Tables to visualize them. There are two types of Fusion Tables that really work: ones with borders and ones with dots. In the last part of the session, Rogers showed us how to create a dot one that displayed where all the session attendees were from, along with age and eye color. If you have a Google account, it's incredibly simple.

1) create/upload a spreadsheet/csv

2) create table based on that spreadsheet

3) visualize as map (geocode)

4) set window info (custom or automatic)

One thing to note is that Fusion Tables don't yet work with real-time databases, though the Google API team is working on it.


Posted
AuthorClaire Willett

Web mining consists of crawling the world wide web, and extracting and processing its structure, usage, or content. This tutorial, taught by Scale Unlimited founder Ken Krugler (@kkrugler), focused on content mining at a large scale (as you might surmise from its title). After a thorough overview of web mining, we did a focused crawl for ultimate frisbee images.

The first step in any type of web mining is crawling, or fetching and parsing web pages. There are four types of web crawls: broad (e.g. bingbot), focused, domain, and what Krugler calls the "don't crawl" crawl, wherein you leverage other people's crawl data, which besides usually being faster and cheaper, also reduces the load on your crawlees' servers. If you go for this option, you can either use public datasets like CommonCrawl or Wikipedia or commercial providers like Spinner and Infochimps.

Per Krugler, the general rule of crawling solutions is: don't roll your own! No matter your language, there's an open source option. Java has Nutch and Heritrix, Python has Scrapy, PHP has the more literally named php-crawler. Whichever solution you end up going with, make sure it's reliable, scalable, and fault-tolerant. Sure, a single server can fetch lots of pages, but scaling becomes an issue with post-processing.

Keep in mind that unless you're creating an index of the crawled sites, you're breaking the implicit "traffic-for-bandwidth" contract, and therefore risk having the wrath of the webmaster rain down on your inquisitive shoulders.

Today's lab had us doing a focused crawl, so Krugler took us through the basics. First, you need to start with good seed urls, or really high-quality pages. You can use a paid/free service to find these, or you can use search and manually enter them or make calls to the APIs. After you have your seeds, you fetch and parse new urls and give them a page score. When parsing, be sure to normalize the outlinks (e.g. http://basketball.com becomes http://www.basketball.com), and use a suffix filter to skip links to low-value pages like images and pdf files.
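
As a rough illustration of outlink normalization and suffix filtering, here is a minimal Python sketch using only the standard library. The suffix list and the `normalize` and `wanted` helpers are assumptions made for this example, not what was used in the lab. The www rewrite mentioned above is site-specific, so this sketch leaves hostnames alone apart from lowercasing them.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Suffixes treated as low-value for a text-focused crawl (an assumed list).
SKIP_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip")


def normalize(base_url, href):
    """Resolve a link against its page and reduce it to one canonical form."""
    parts = urlsplit(urljoin(base_url, href))
    host = (parts.hostname or "").lower()
    # Keep a port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it never changes what the server returns.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def wanted(url):
    """Suffix filter: skip links whose path ends in a low-value extension."""
    return not urlsplit(url).path.lower().endswith(SKIP_SUFFIXES)


page = "http://Example.com/teams/index.html"
links = ["../photos/final.JPG", "roster.html#top",
         "HTTP://EXAMPLE.COM:80/rules.pdf", "/about"]
outlinks = [u for u in (normalize(page, h) for h in links) if wanted(u)]
print(outlinks)
```

For an image-focused crawl like the ultimate frisbee one, the suffix list would obviously need to let image extensions through.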

You can score pages by tokenizing their text and then using either simple term-based scoring or a Support Vector Machine. An SVM is trained on "documents" that have features and a positive or negative class; it creates a statistical model that divides all the documents into separate positive or negative classes, and then uses this model to assign unknown documents to one of these classes.
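
Of the two options, term-based scoring is the easier one to sketch. The version below is a minimal Python take on it; the term list and the `page_score` helper are hypothetical, and a real crawl would likely weight terms rather than just count them.

```python
import re

# Terms that signal an on-topic page for an ultimate frisbee crawl (assumed list).
TOPIC_TERMS = {"ultimate", "frisbee", "disc", "layout", "huck"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def page_score(text, terms=TOPIC_TERMS):
    """Return the fraction of tokens that are topic terms (0.0 for empty pages)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in terms)
    return hits / len(tokens)


on_topic = page_score("Ultimate frisbee players huck the disc downfield")
off_topic = page_score("Quarterly earnings beat analyst expectations")
print(on_topic, off_topic)
```

Pages scoring above some cutoff would then have their links followed first.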

Graphics and small amounts of content (e.g. a definition page) are some of the wrenches in determining page quality. You can filter these pages out by setting a minimum real content threshold. By "real content," Krugler means the stuff that's not chrome, cruft or boilerplate. You can scrape off this junk with Boilerpipe.

After you have your page score, it's time to extract and score its outlinks before putting the url into the url state, or the database of all known urls. There are three attributes of extraction: broad, precise, and accurate. You get to pick two, and these depend on whether you're extracting unstructured (broad, accurate), semi-structured (broad, precise), or structured (precise, accurate) data.

Usually, no matter which type of data you're extracting, you'll need:

  • To clean the html. You can use a library like NekoHTML for this. Keep in mind that the end result won't match the original text.
  • A charset to convert bytes to characters. Tika works pretty well.
  • A boilerplate scraper like Boilerpipe.
  • Some means of identifying the page's language (http response header, meta tag, tag attribute, or text analysis).
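
To show what the bytes-to-characters step involves, here is a plain-Python stand-in (not Tika, which is what was recommended above). It checks the HTTP header first, then a meta tag, then falls back to a default; the helper names and the 2048-byte scan window are assumptions for this sketch.

```python
import re

META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def pick_charset(content_type, body, default="utf-8"):
    """Choose a charset from the HTTP header, then a meta tag, then a default."""
    header_match = re.search(r"charset=([\w-]+)", content_type or "", re.IGNORECASE)
    if header_match:
        return header_match.group(1).lower()
    # Only scan the start of the document, where meta tags normally live.
    meta_match = META_CHARSET.search(body[:2048])
    if meta_match:
        return meta_match.group(1).decode("ascii").lower()
    return default


def to_text(content_type, body):
    """Decode bytes to characters, replacing anything that cannot be decoded."""
    charset = pick_charset(content_type, body)
    try:
        return body.decode(charset, errors="replace")
    except LookupError:  # the page named a charset Python does not know
        return body.decode("utf-8", errors="replace")


html = ('<html><head><meta charset="iso-8859-1"></head>'
        '<body>Salath\xe9</body></html>').encode("latin-1")
text = to_text("text/html", html)
print(text)
```

The same header-then-meta-then-guess order applies to identifying a page's language.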

If you're doing unstructured extraction, your goal is to extract text without much additional processing (often there are just a few HTML fields). If you're doing semi-structured extraction, your goal is to find data in random text. Since it's not very format-specific, this can be applied broadly, often at the expense of accuracy. Easy patterns like telephone numbers, microformats, and NLP named entities all usually work well with semi-structured extraction.
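
Telephone numbers are the classic "easy pattern." A minimal Python sketch of that kind of semi-structured extraction follows; the regular expression is a deliberately loose assumption that only targets North American numbers, and the sample sentence is invented.

```python
import re

# A deliberately loose pattern for North American numbers (an assumption;
# real crawls usually need locale-aware rules).
PHONE = re.compile(r"\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b")


def find_phones(text):
    """Return phone numbers found in free text, normalized to NNN-NNN-NNNN."""
    return ["-".join(match.groups()) for match in PHONE.finditer(text)]


sample = "Call the league office at (617) 555-0142 or 617.555.0199 before Friday."
phones = find_phones(sample)
print(phones)
```

Because the pattern ignores page structure entirely, it can run over any page, which is exactly the broad-but-less-accurate trade-off described above.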

If you're doing structured extraction, your goal is to extract specific types of data, typically from one area of one site. You often do this with XPath, and if this is the case, Firebug is your friend, as are the div and span tags. If you run into pages that generate content with JavaScript, you will need Firebug or an equivalent to inspect the DOM. Options for processing pages with JavaScript include HTMLUnit, qt-webkit and headless Mozilla, but keep in mind that processing a page with JavaScript takes about 10 times longer than just loading page text.
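
For a feel of XPath-style structured extraction, here is a small Python sketch using the standard library's ElementTree, which supports only a limited XPath subset and needs well-formed markup (so assume the page has already been cleaned). The markup and class names are made up for the example.

```python
import xml.etree.ElementTree as ET

# A well-formed snippet standing in for a cleaned page (made-up markup).
PAGE = """
<html><body>
  <div class="player"><span class="name">Al</span><span class="team">Discs</span></div>
  <div class="player"><span class="name">Phil</span><span class="team">Hucks</span></div>
  <div class="ad">Buy a segway!</div>
</body></html>
"""


def extract_players(markup):
    """Pull name/team pairs out of the div and span tags that hold them."""
    root = ET.fromstring(markup.strip())
    players = []
    # Attribute predicates like [@class='player'] are part of the supported subset.
    for div in root.findall(".//div[@class='player']"):
        name = div.findtext("span[@class='name']")
        team = div.findtext("span[@class='team']")
        players.append((name, team))
    return players


players = extract_players(PAGE)
print(players)
```

The expressions are tied to one site's layout, which is why this approach is precise and accurate but not broad.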

 

Posted
AuthorClaire Willett

[Image via Apache Incubator]

Overview

On 02/22/2012 I attended a Hortonworks webinar detailing the overall capabilities of a new Hadoop tool/layer called HCatalog. The basic premise is that HCatalog provides an interface for accessing data stored anywhere in a regular Hadoop tool (Pig, Hive, or MapReduce) format. This makes it much easier to access data, since custom loaders for each data source become unnecessary.

Presentation

The talk itself was given by Alan F. Gates (github | twitter), who is one of the co-founders of Hortonworks. He's a committer on Pig and HCatalog, and wrote the O'Reilly book Programming Pig. HCatalog itself was apparently started at Yahoo.

One of the biggest strengths I saw in HCatalog came from some early slides where Alan basically said that "sharing data is hard." The specific example given was that a programmer using Pig might load and process some data, and dump it in HDFS somewhere for an analyst to use. That analyst wants to do their own work using Hive, since it has a SQL-like language they understand. The analyst has to figure out where the data is and then use a rather complicated command to load it into Hive. Then they finally can run whatever it is they want on it, but they still have to do it manually.

HCatalog attempts to solve these pain points in two ways. The first, as mentioned, is that it provides a layer of abstraction over the logical location of the data. Pig could instead then store the data into an arbitrary "ProcessedData" table in HCatalog, and the analyst could open that same HCatalog "ProcessedData" table with Hive. Additionally, in doing this, they won't have to worry about transforming the data from a form that Pig outputs into a form that Hive understands. Instead, it just works.

The second major strength is that the analyst in question doesn't even need to manually start anything. HCatalog currently provides a rudimentary event system over JMS, so upon completion, the Pig job above could notify the Hive job to start. No manual interaction required; instead, again, it just works.

Operations

There was a segue into operations aspects of using HCatalog. It is capable of treating disparate underlying data structures as being a part of the same table. This means that old data can coexist with new data with new columns, all in the same table. Therefore, calling something like alter table doesn't require reformatting any of the pre-existing data in the table; only newly written data uses the new layout. Missing columns in the old data simply get nulls. Another operations benefit comes from hiding the underlying file locations: tables can be physically moved around without causing issues with user applications.
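
The "old data simply gets nulls" behavior is easy to picture with a toy model. The Python sketch below is not HCatalog's API or storage format; the table, column names, and `read_table` helper are all invented to illustrate the idea of reading differently shaped partitions through one current schema.

```python
# Current table schema after an "alter table" added the `region` column.
TABLE_COLUMNS = ["user_id", "clicks", "region"]

# Each partition keeps the layout it was written with; nothing is rewritten.
PARTITIONS = [
    {"columns": ["user_id", "clicks"], "rows": [(1, 10), (2, 7)]},          # old data
    {"columns": ["user_id", "clicks", "region"], "rows": [(3, 4, "EU")]},   # new data
]


def read_table(partitions, table_columns):
    """Present every partition through the current schema, padding with None."""
    for partition in partitions:
        for row in partition["rows"]:
            stored = dict(zip(partition["columns"], row))
            # Columns the partition never stored come back as None (null).
            yield tuple(stored.get(column) for column in table_columns)


records = list(read_table(PARTITIONS, TABLE_COLUMNS))
print(records)
```

The reader does the reconciling at query time, so adding a column costs nothing for data already on disk.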

Future Work

The next release, 0.4, should come out next month. It will include the Hive, Pig, and MapReduce information. It can currently support any specific data that has a Hive SerDe (currently Text, Sequence, RCFile, JSON). This is because HCatalog just uses the Hive formats underneath. The JMS event notification will also be present. They claim to have "basic" HBase integration, but did not say what that entails. For future versions, they are hoping to improve said integration, particularly for the new security features. Currently HCatalog relies entirely on what is present in HDFS to perform its security model. They are hoping to soon have a complete REST API over JSON.

Future Directions

Basically, they want to be able to store semi-structured and unstructured data. They did not go into details about how. They did go into some detail about the data lifecycle process, and how HCatalog can fit into a few bits of it. One example was archiving, for legal reasons, etc. Most archiving goes to another Hadoop cluster or a data warehouse. Another area is replication, specifically trying to get the same data sets for global companies all over the world. Compaction is generally performed on data more than a few days or a week old, and it currently gets stuffed into .har files, an HDFS archiving format. The really old stuff gets deleted in the cleaning phase of the data lifecycle. The way HCatalog fits into the data lifecycle process is by providing basic implementations and interfaces for these phases. For example, metadata on an HCatalog table could say delete after a month - this would be a basic implementation. The interface could allow more sophisticated plugins to change this behavior.
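
To picture the "basic implementation plus pluggable interface" idea for the cleaning phase, here is a hypothetical Python sketch. None of these class or function names come from HCatalog; they only show how a delete-after-a-month rule could sit behind an interface that fancier plugins might replace.

```python
from abc import ABC, abstractmethod
from datetime import date, timedelta


class RetentionPolicy(ABC):
    """Hypothetical plugin interface: decide whether a partition should go."""

    @abstractmethod
    def should_delete(self, partition_date, today):
        ...


class DeleteAfterDays(RetentionPolicy):
    """Basic implementation driven by a table-level metadata setting."""

    def __init__(self, max_age_days):
        self.max_age = timedelta(days=max_age_days)

    def should_delete(self, partition_date, today):
        return today - partition_date > self.max_age


def clean(partitions, policy, today):
    """Return the partition dates that survive the cleaning phase."""
    return [p for p in partitions if not policy.should_delete(p, today)]


today = date(2012, 3, 1)
partitions = [date(2012, 1, 15), date(2012, 2, 10), date(2012, 2, 28)]
kept = clean(partitions, DeleteAfterDays(30), today)
print(kept)
```

A more sophisticated plugin would subclass `RetentionPolicy` and, say, archive instead of delete, without the caller changing.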

Another area they want to look into is partitioning data on different storage. It would be awesome if new fresh data could be stored in HBase, to be looked at piece by piece, and then after a few days be shoved into HDFS to be used for batch processing. They would like to expand the capabilities of HCatalog to other massively parallel processing platforms too, like Cassandra and MongoDB. Most companies have a bunch of different storage platforms, so supporting multiple data stores makes it easier for everyone to work with Hadoop. One last piece of future work includes storing HCatalog metadata in HBase instead of an RDBMS, because oftentimes there is simply too much metadata.

Posted
AuthorNick

This time next week, if all goes as planned, I’ll be on a plane back to Logan with a head full of data, a twitter account full of people who work with data, and a suitcase full of shorts. The Strata Conference, which starts this Tuesday, February 28th, and goes through Thursday, March 1st, is one of the country’s preeminent data science conferences. Certainly its sessions, which, in addition to the data scientists themselves, target CTOs/CIOs, marketers and journalists, are the broadest a’beam of any I’ve seen. And its speakers are (deservedly) bold-faced names to even the greenest of data geeks: Doug Cutting (Cloudera), Ben Goldacre (Bad Science), Hal Varian (Google), Mike Olson (Cloudera), JP Morgenthal (EMC), Alistair Croll (Bitcurrent), Usman Haque (Pachube), Coco Krumme (MIT Media Lab), O’Reilly’s own Edd Dumbill… It’s a lot to take in, especially if you, like me, are the type who panics at buffets and department stores. So, to help the both of us, I’ve made up a “Can’t Miss List (for the Social Strategist).” Annnd, since I’ll warrant that many, if not most, of you are not social strategists, I’m also giving you a “Can’t Miss List (for the CEO/Chief Scientist),” courtesy of our own CEO, David Wihl. You can thank, or debate with, both of us in the comments, on Twitter, or, best yet, at the show!

Can’t Miss List (for the Social Strategist)

Name Handle Presentation Time
Ken Krugler kkrugler Large scale web mining Tues, 9am
Mike Bowles, Jeremy Howard mike_bow, jeremyphoward The Two Most Important Algorithms in Predictive Modeling Today Tues, 1:30pm
Billy Bosworth Data as a Strategic Weapon - Walmart, Netflix and Apigee Panel Discussion Wed, 10:40am
Marcel Salathé marcelsalathe Understanding Social Contagion Wed, 11:30am
Jesper Andersen jandersen Building a Data Narrative: Discovering Haight Street Wed, 1:30pm
Philip Kromer mrflip Disambiguation: Embrace wrong answers & find truth Wed, 2:20pm
Xavier Amatriain xamat Netflix recommendations: beyond the 5 stars Wed, 4:00pm
Christopher Berry cjpberry Data Science in Marketing Analytics Wed, 4:50pm
Jim Adler, Solon Barocas jim_adler, s010n If Data Wants to Be Free, is Privacy a Prison? Thurs, 10:40am
Nathan Marz nathanmarz Storm: distributed and fault-tolerant realtime computation Thurs, 11:30am
Alyona Medelyan zelandiya Mining Unstructured Data: Practical Applications Thurs, 1:30pm
Ben Goldacre, Kaitlin Thaney Bengoldacre, kaythaney It's Not "Junk" [Data] Anymore Thurs, 2:20pm
Mark Hahnel figshare It's Not "Junk" [Data] Anymore Thurs, 2:20pm
Robbie Allen robbieallen From Big Data to Big Insights Thurs, 4:00pm
Marc Smith marc_smith Mapping social media networks (with no coding) using NodeXL Thurs, 4:50pm

 

Can't Miss List for the CEO/Chief Scientist

Name Handle Presentation Time
Michael Rys sqlservermike SQL and NoSQL Are Two Sides Of The Same Coin  Tues, 9:00am
Claudia Perlich From Knowing ‘What’ To Understanding ‘Why’  Tues, 9:45am
Monica Rogati mrogati The Model and the Train Wreck: A Training Data How-to  Tues, 11:00am
Jacob Perkins thedatachef Corpus Bootstrapping with NLTK  Tues, 11:30am
Ben Gimpert someben The Importance of Importance: An Introduction to Feature Selection Tues, 12:00pm
Matt Biddulph mattb Social Network Analysis Isn’t Just For People  Tues, 1:30pm
Robert Lefkowitz r0ml Array Theory vs. Set Theory in Managing Data  Tues, 2:15pm
Robert Lancaster rob1lancaster Survival Analysis for Cache Time-to-Live Optimization  Tues, 3:30pm
Eric Baldeschwieler jeric14 The Future of Hadoop: Becoming an Enterprise Standard Wed, 10:40am
Alexander Stojanovic stojanovic Unleash Insights On All Data With Microsoft Big Data Wed, 11:30am
Pascal Boillat Changing Data Standards from Wall Street to DC and Beyond Wed, 1:30pm
Jen Zeralli Big Data: Wall Street Style Wed, 2:20pm
Kuntal Malia Analytics in a Community-Driven Fashion Retailer Wed, 4:00pm
Leigh Dodds ldodds Linked Data: Turning the Web into a Context Graph Wed, 4:50pm
Kirkland Barrett Democratizing BI at Microsoft: 40,000 Users and Counting Thurs, 10:40am
Stefan Groschupf datameer Hadoop Analytics in Financial Services Thurs, 11:30am
Alyona Medelyan zelandiya Mining Unstructured Data: Practical Applications Thurs, 1:30pm
Robbie Allen robbieallen From Big Data to Big Insights Thurs, 4:00pm
Marc Smith marc_smith Mapping social media networks (with no coding) using NodeXL Thurs, 4:50pm
Posted
AuthorClaire Willett

This week, I took some time to evaluate Karmasphere Analyst. Particularly, I was interested in how it worked with Hadoop (as opposed to MapR, which it also supports).

Setting up

The setup for Karmasphere is rather painless: a simple installer on Windows and a shell script on Linux. However, the Windows version does require Cygwin. Once open, Karmasphere divides itself into three major steps.

Access

This is where you set up connections to existing HDFS databases. Karmasphere only supports Hive, but it's pretty nice about it... kind of. It will go through the process of installing Hive for you through a rather nice GUI, which allows you to easily specify a Derby database, MySQL database, or whatever other database you have a Java connector for. The downside to this is you can't easily use an already-existing Hive installation. This was a major shortcoming for me, but I get the impression that it should be possible to import an existing Hive database. I'll let you know as soon as the Karmasphere rep gets back to me.

Assemble

Once I decided to install a new Hive metastore (which was rather painless), importing new tables from sequence files was simple for all the steps that involved Karmasphere (making the sequence file was annoying though). I don't have a problem with how Karmasphere does this. My only real problem is that it seems to hide away the shell that interacts with the Hive cluster Karmasphere uses, which seems like it might be limiting. I could be wrong, but I don't see how you could ever import anything without working through Karmasphere.

Analyze

Supposedly, this is where the magic happens. The interface here was much simpler compared to other analytic tools. But that may be because there is no fancy drag-and-drop interface, or amazing visual features. It turns out Karmasphere is a glorified query writer. But in its defense, it's very glorified. I've written queries against Hive before, but I've never managed to write them as quickly or as painlessly as Karmasphere allows me to. The bells and whistles it brings to the table include:

  • immediate and clear feedback regarding any errors or warnings in your queries
  • one-click execution of any written queries
  • caching of past queries and results
  • effective sampling of data to test queries on smaller subsets
  • Table, column, and function library indexes
  • A "Query Plan" which shows you just how exactly your query will translate into Hadoop MapReduce jobs

Once you have your data, it's pretty simple to export that data into various useful mediums such as Excel files, SQL tables, or perhaps back into Hive. Also, there is some charting functionality that was relatively simple to use, although I didn't look too much into it since it wasn't of interest to me.

Conclusion

All this makes the tool worthwhile, but I'm not sure it's worth the price (we were unable to obtain pricing information at time of publication, but will update if they get back to us). Since, ultimately, you are just making queries, it doesn't add any analytic functionality that we couldn't do before. Technically, once you make your query, you don't even need Karmasphere anymore. Although once you have your data, it does let you do several things with that data that would otherwise be difficult to do (export, graphing, etc.).

If you're looking to analyze your unstructured data, I would say Karmasphere is ill-suited for the task, as unstructured data tends to take more than just the SQL-like queries Hive offers. All in all, this product is useful. But once my trial runs out, I will discontinue use.

Posted
AuthorCameron
CategoriesBig Data

Overview

One can always analyze massive amounts of data with custom MapReduce jobs on Hadoop, but usually it's a lot easier to use a pre-packaged analysis tool. Recently in-house we've been experimenting with a number of different analysis tools, and one of our favorites so far has been Datameer. As an analysis tool, it's pretty powerful out of the box and has a ton of capabilities for expansion.

The general workflow pattern for using Datameer is a bit different from many of the other tools we've used, but we really like the Excel-like nature of one of the main steps to generating a report. In general, one can figure everything out just by playing around. (For trickier tasks, they've provided a pretty good set of documentation.) The report we're going to create herein is very simple, but it will illustrate the basic steps required to work through the Datameer workflow.

Our starting point is that we have a subset of the Enron email corpus in one of our HBase tables. Getting email data into HBase is a little outside the scope of this article, but one could use our own Project Timberwolf to get email data out of Exchange and into HBase if they wanted a simple way of getting some starting data. The actual report we want to generate is indeed very simple - who sent the most messages? In the end, we'd also like to see the results of this in an actual chart. Let's get started.

Setting Up Datameer

There are Linux, OSX and Windows flavors of Datameer available, but for simplicity's sake these instructions use the Windows flavor. A trial version of Datameer can be found here. Simply run the installer like any other, and open the newly installed Datameer Analytics Solution (DAS) launcher. Eventually, a launch button will appear in the launcher. Clicking this Launch button will open the DAS in a browser window. The DAS actually runs as a web-based application, and on Windows will pick a random port to open on localhost by default. Logging in with the default username and password of admin/admin will bring one to the Uploaded Files screen.

Now we can start creating our report.

Adding a Data Store

All of our email data is tucked away in HBase at the moment, so in order to access it we'll need to let Datameer know about our HBase instance. We can do this by first clicking on the Data Stores tab on the left and then the New button. For our type, we selected HBase 0.90.1 (it was the closest) and then Next. The following page wants to know about one's Zookeeper details - modern versions of HBase all work by accessing Zookeeper instead of HBase directly. We entered the quorum appropriate for us, and the default Zookeeper port is 2181.  We didn't select any permissions options, and on the next page we didn't add a description either. We saved the data store as "demo_ds". Upon saving, we're once again presented with a list of data stores, with "demo_ds" as an option.

Creating a Data Link

We've told Datameer about our HBase data store, but we still need to create a data link which is how we actually access our email data. To start this, we first click the "Data Links" tab on the left, and then "New," and we're given the option to select a data store. Select the one we just created, in our case "demo_ds". The next page asks us to choose a table for the data link, which is determined by what tables are actually in one's data store (ours was actually called "enron"). For the purposes of these instructions, one can just skip the rest of the new Data Link queries and go all the way to Save. We saved our data link as "demo_dl". The Data Links overview page should contain the new data link.

Creating a new Workbook from a Data Link

Workbooks are how we actually manipulate our email data in Datameer. They are analogous to a column-only Excel workbook, meaning that one only manipulates data by entire columns, and not individual cells or rows. One of the easiest ways to create a new workbook is from an existing data link. From the Data Links tab, if we click directly on the data link we just created, "demo_dl", and then click the Refresh Sample button on the resulting page, we'll be greeted with a view like this:

To create our new workbook, click the Link Data in new Workbook button.

Creating the Data for our new Report

After clicking the Link Data in new Workbook button, we'll be brought to our new workbook and the default worksheet, which is a view on all of the data as stored in our data store.

Default Data Link Worksheet

Each row represents a single email. Note how similar this view is to Excel. Remember that our end goal is to have a chart which tells us who sent the most messages. To do this correctly, we'd like to end this step with a single worksheet that has the data set up perfectly for our chart. Basically, it would contain an email address in one column and the number of emails its owner sent in another. At the bottom of the worksheet are the worksheet tabs. We can just click on the "+ New" button to create a new worksheet. It's probably best to rename the worksheet to something better than Sheet1. Right-clicking on the sheet name tab will bring up a context menu with a Rename option. We decided to rename the sheet to "SenderCount". To get data into our new worksheet, we can click anywhere in a column and enter a formula. Generally, we want to get data from elsewhere in our workbook, like the default worksheet created by our data link. We can do this with a formula like "=#demo_dl!h_Sender", which will grab the "h_Sender" column from the "demo_dl" worksheet. The result looks like this:

Simple Function

For clarity, let's rename column A in our new worksheet to something intelligible, like "Sender". We do this the same way as renaming a worksheet, by right-clicking and selecting Rename. Remember that in our data, each row represents a single email. Thus, the exact same sender email address will appear each time that user sent a message. This means that there will be duplicate entries in the column for any user who sent more than one email. What we'd like to do is group all of the senders by their email address and count how many times they appeared. We can do this with Datameer's grouping functions. In particular, we are interested in the GROUPBY and GROUPCOUNT functions. GROUPBY will group all the records that share the same value in a specific column. GROUPCOUNT will count the number of records in each group. If we change the Sender column's function to "=GROUPBY(#demo_dl!h_Sender)", we will group all of the records by sender. Our worksheet looks like this:

Simple Groupby

Note that all of the duplicates have been removed via grouping. To get the actual count for each group we can enter in "GROUPCOUNT()" for the formula for column B. Note that the actual count for each group is the same thing as the count for each sender, and since each row represents one email, it's also how many times each sender sent an email! We should rename column B to "Count" while we're at it. The worksheet should now look like this:

Groupcount
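If the grouping step feels abstract, here is a rough plain-Python analogue of what GROUPBY and GROUPCOUNT produce together. This is only a sketch for intuition, not how Datameer computes it, and the addresses are made up rather than taken from the Enron data.

```python
from collections import Counter

# Made-up sample: one entry per email, holding that email's sender address.
h_sender = [
    "alice@example.com",
    "bob@example.com",
    "alice@example.com",
    "carol@example.com",
    "alice@example.com",
    "bob@example.com",
]

# Counter collapses identical addresses into one key (the grouping) and
# stores how many rows fell into each group (the count).
sender_count = Counter(h_sender)

for sender, count in sorted(sender_count.items()):
    print(sender, count)
```

Each key plays the role of a row in the Sender column, and each value is what ends up in the Count column.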

We're on the home stretch for getting our data set up. We now know how many times each sender sent an email, but we want to know who among them did it the most. To figure this out, we'll need to sort our sheet. There's a button for doing this, named "Sort Sheet," above the worksheet. We want to sort on the Count column in descending order, and we only want the top ten entries. Applying the sort will actually create a brand new worksheet (which is read-only). We renamed ours to "SenderCountSorted". The resulting worksheet should look like this:

Sort
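In plain Python, the same "sort descending, keep the top ten" step might look like the sketch below. The function name `top_senders` and the sample counts are made up for illustration.

```python
def top_senders(sender_count, limit=10):
    """Return (sender, count) pairs sorted by count, largest first."""
    ranked = sorted(sender_count.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]


# Made-up counts standing in for the SenderCount worksheet.
sample_counts = {
    "carol@example.com": 1,
    "alice@example.com": 3,
    "bob@example.com": 2,
}

for sender, count in top_senders(sample_counts):
    print(sender, count)
```

Like the Sort Sheet button, this leaves the original counts untouched and hands back a new, sorted result.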

Now save the workbook! We called ours "demo_wb". We still have to run it at least once for it to have data in it, so click the Analytics tab at the top, select our new workbook by radio button, and click the Run button. After a short time, the workbook should be populated with data from the data store.

Creating a Chart for our Report

We finally have our data in a format that will be easy to turn into a chart. In Datameer, groups of charts and other widgets are called dashboards. Clicking the Dashboards tab at the top and then New will allow us to create a new dashboard. Set the dashboard to have only one column and leave everything else as default. We decided to name ours "demo_dashboard" on the Save screen. Our new default dashboard should look something like what's below.

Default Dashboard

If we want to show our email counts as a bar chart, we need only drag the bar chart widget on the left into our single column on our dashboard. Clicking the configure button allows us to populate it with data. We want to take our data from our new workbook, "demo_wb". On the next screen, we'd also like to use the SenderCountSorted table. On the last page, for readability, under the Style tab is the X Label Rotation option. Switch it to vertical. Every proper chart should have a title, and ours is no different. Under the Label tab is a title text field. We entitled our chart: "Top Ten Email Senders in a Subset of the Enron Corpus." Clicking save will bring us to a view of our new chart, like this:

Chart
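For a quick look at the same kind of data outside of Datameer, a plain-text bar chart is enough. The sketch below is only a stand-in for the dashboard widget; the function name `text_bar_chart` and the counts are made up, and it assumes every count is positive.

```python
def text_bar_chart(title, rows, width=40):
    """Render (label, value) rows as horizontal bars scaled to `width`."""
    largest = max(value for _, value in rows)
    label_width = max(len(label) for label, _ in rows)
    lines = [title]
    for label, value in rows:
        # Scale each bar against the largest value; always draw at least one mark.
        bar = "#" * max(1, round(value / largest * width))
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)


# Made-up, already sorted counts standing in for SenderCountSorted.
rows = [
    ("alice@example.com", 30),
    ("bob@example.com", 15),
    ("carol@example.com", 3),
]

chart = text_bar_chart("Top Email Senders (made-up data)", rows, width=20)
print(chart)
```

Drawing the bars horizontally sidesteps the label rotation problem entirely, since the email addresses read left to right.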

This is actually still a view on the dashboard editor, so we'd be better off viewing it in view mode. If one clicks "Dashboards" on the top and then right on the "demo_dashboard" dashboard, they'll see it in view mode. It looks pretty rad.

Conclusions

As far as big data analytics goes, this process is pretty painless. Assuming HBase is good to go and already has all of the data we need in it, creating the workbook and resulting chart in Datameer should take less than half an hour. If we needed to hand-write our MapReduce jobs to calculate all of this, it would take orders of magnitude longer, and in the end, we wouldn't even have a pretty chart! This is just the very tip of what one can do with Datameer on email data, so we recommend playing around with it and seeing what sort of reports you can make. Make a useful report? Have a question about all of this? The comments await!

Posted
Author: Nick
Categories: Big Data