A blog about security, privacy, algorithms, and email in the enterprise. 


Boston Hadoop Users Group Meetup: More Data vs. Better Data vs. Better Algorithms

By this point, I’m guessing you’ve heard the term “big data” bandied about a few times. You’ve probably seen more than your fair share of tweets, blog posts, and Wall Street Journal articles with titles like “What’s So Big about Big Data” and “Hadoop: the (New) Elephant in the Room.” In case you haven’t bothered to read any of them, what’s so big about big data is: insight. Insight that goes well beyond “site A had 35 unique visitors on Monday, April 17” and “salesperson B sold 14 chickens last quarter.” Insight that can tell you: site A will have around 49 unique visitors on Monday, April 23 if you write an article about chickens and place it on the right, above the fold. Insight that incites revenue-generating actions.

The question is: what is the best way to attain this insight?

There are a few schools of thought on this. The first prioritizes sheer volume of data. The second wants only high-quality data. The third says: data schmata, all I need is a killer algorithm. Which camp is right? During this meetup, we’ll attempt to find out. To help us make the decision, we’ve rounded up some of Boston’s and one of San Francisco’s preeminent data scientists, who will present reasons and real-world scenarios for why more data, better data, and/or better algorithms are the key(s) to ecumenical insight. Their sessions and bios are as follows:

  • Speaker: Paolo Gaudiano, Icosystem
  • Arguing for: Better Algorithms
  • Session: It is often thought that the accuracy of a model depends heavily on data quality and quantity. However, the notion that numerical data are the only type of information needed to build an accurate model is flawed. We present a modeling approach that combines domain expertise and quantitative data to demonstrate that predictive models can be developed without quantitative data, and that in general any model built with both quantitative data and domain expertise will outperform models developed with either type of information alone. We will also mention real-world situations where this approach has been applied successfully.
  • Bio: Paolo Gaudiano is President and CTO of Icosystem, where he enjoys solving challenging business and technology problems for clients, while striving to ensure that Icosystem continues to be a stimulating, productive and fun company. He also serves as interim CEO of Infomous, Inc. and President of Concentric, Inc., two spinoffs created by Icosystem. After starting an academic career at Boston University, Paolo left his tenured position to pursue entrepreneurial opportunities with two start-ups, Artificial Life (as Chief Scientist) and Aliseo (as Founder and CEO). In 2001 he joined Icosystem, where he is able to nourish his multifaceted, interdisciplinary interests. He also continues to satisfy his passion for teaching through a position as Senior Lecturer at The Gordon Institute of Tufts University, and through a variety of speaking engagements. Paolo holds a B.S. in Applied Mathematics, an M.S. in Aerospace Engineering and a Ph.D. in Cognitive and Neural Systems.
  • Speaker: Christopher Bingham, Crimson Hexagon
  • Arguing for: Better Algorithms on More Data
  • Session: Often, analyzing more and more data doesn't improve your results: you just make the same mistakes at a larger scale. We'll discuss several techniques that leverage the quantity of data, increasing accuracy as you scale. Big data can thus lead to better analysis, not just bigger analysis.
  • Bio: Chris Bingham is the CTO and first employee of Crimson Hexagon, a leading provider of business intelligence based on social media analysis.
  • Speaker: Jeremy Rishel, Bluefin Labs
  • Arguing for: "D: All of the Above"
  • Session: At Bluefin Labs we analyze social TV at large scale, with 24/7 real-time systems looking at the content on over 100 networks and the conversation and audience dynamics about brands, advertising, shows, and more in public social media. The analytics derived about engagement patterns and audiences provide rich insights for brands, agencies, and TV networks. To do this we pursue "all of the above": more data, better data, and better algorithms. "More data" comes in many forms, including richer content streams and more granular sources. By including the broadest spectrum of data we're able to gain insights not possible in other ways. "Better data" in our world comes from a fundamental approach of human-machine collaboration and data management that permits us to achieve consistently high data quality. Finally, we are always pursuing "better algorithms", for example in understanding the connections between audiences, both as we learn more about social TV patterns and as engagement dynamics evolve. I'll be discussing some examples of each from the Bluefin platform and why all three - more data, better data, and better algorithms - are necessary.
  • Bio: Jeremy heads up Bluefin Labs' engineering, product, and data efforts. Jeremy was formerly the CTO and VP of Engineering at aPriori Technologies, which developed a groundbreaking approach to real-time analysis of complex design and manufacturing data to predict manufacturing methods and costs. Prior to that he led teams at i2 focused on transportation planning and optimization. Rishel earned BS degrees in Computer Science and Philosophy from MIT and served in the US Marine Corps for seven years, leaving active duty as a Captain.
  • Speaker: Josh Wills, Cloudera
  • Arguing for: Better Data
  • Session: When people are first introduced to Hadoop, one of the most common questions is, "when should I use Hadoop instead of a relational database?" In this talk, we'll walk through several use cases where Hadoop can solve problems better and faster than a relational database, even on relatively small data sets, in order to illustrate how Hadoop complements traditional data warehousing solutions.
  • Bio: Josh Wills is Cloudera's Director of Data Science, working with customers and engineers to develop Hadoop-based solutions across a wide range of industries. Prior to joining Cloudera, Josh worked at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+. He earned his Bachelor's degree in Mathematics from Duke University and his Master's in Operations Research from The University of Texas at Austin.


The meetup will take place in the Boylston Room at the Copley Marriott, 110 Huntington Avenue, Boston, from 6-8pm. As noted on the meetup page, this event is currently full, but you are welcome to add your name to the wait list!

After the meetup, we will adjourn to a nearby bar (location to be posted ASAP).

See you there!



Review: Karmasphere Analyst

This week, I took some time to evaluate Karmasphere Analyst. In particular, I was interested in how it works with Hadoop (as opposed to MapR, which it also supports).

Setting up

The setup for Karmasphere is rather painless: a simple installer on Windows and a shell script on Linux. However, the Windows version does require Cygwin. Once open, Karmasphere divides itself into three major steps.


This is where you set up connections to existing HDFS databases. Karmasphere only supports Hive, but it's pretty nice about it... kind of. It will go through the process of installing Hive for you through a rather nice GUI, which allows you to easily specify a Derby database, MySQL database, or whatever other database you have a Java connector for. The downside to this is you can't easily use an already-existing Hive installation. This was a major shortcoming for me, but I get the impression that it should be possible to import an existing Hive database. I'll let you know as soon as the Karmasphere rep gets back to me.


Once I decided to install a new Hive metastore (which was rather painless), importing new tables from sequence files was simple for all the steps that involved Karmasphere (making the sequence file was annoying though). I don't have a problem with how Karmasphere does this. My only real problem is that it seems to hide away the shell that interacts with the Hive cluster Karmasphere uses, which seems like it might be limiting. I could be wrong, but I don't see how you could ever import anything without working through Karmasphere.


Supposedly, this is where the magic happens. The interface here is much simpler than in other analytic tools. But that may be because there is no fancy drag-and-drop interface or amazing visual features. It turns out Karmasphere is a glorified query writer. But in its defense, it's very glorified. I've written queries against Hive before, but I've never managed to write them as quickly or as painlessly as Karmasphere allows me to. The bells and whistles it brings to the table include:

  • immediate and clear feedback regarding any errors or warnings in your queries
  • one-click execution of any written queries
  • caching of past queries and results
  • effective sampling of data to test queries on smaller subsets
  • table, column, and function library indexes
  • a "Query Plan" that shows exactly how your query will translate into Hadoop MapReduce jobs
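
To make the sampling and query-plan items above a bit more concrete, here is a minimal Python sketch of the kind of HiveQL you would otherwise write by hand. It is not a description of Karmasphere's internals. The table name `page_visits`, its columns, and the helper names `sampled_query` and `explain` are all hypothetical. The sketch only builds query strings, using Hive's `TABLESAMPLE` clause and `EXPLAIN` statement, and does not connect to a cluster.

```python
from typing import Sequence


def sampled_query(table: str, columns: Sequence[str],
                  bucket: int = 1, out_of: int = 100) -> str:
    """Build a HiveQL SELECT that reads roughly 1/out_of of the table's rows."""
    if not 1 <= bucket <= out_of:
        raise ValueError("bucket must be between 1 and out_of")
    cols = ", ".join(columns)
    # TABLESAMPLE ... ON rand() asks Hive to sample rows randomly
    # instead of relying on the table's bucketing column.
    return (f"SELECT {cols} FROM {table} "
            f"TABLESAMPLE(BUCKET {bucket} OUT OF {out_of} ON rand()) s")


def explain(query: str) -> str:
    """Prefix a query with EXPLAIN so Hive prints its plan instead of running it."""
    return f"EXPLAIN {query}"


# Hypothetical table and columns, for illustration only.
query = sampled_query("page_visits", ["site", "unique_visitors"], out_of=50)
print(query)
print(explain(query))
```

Running the sampled query first is a cheap way to check a query's logic before launching the full job, and the `EXPLAIN` output lists the stages Hive plans to run for it.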

Once you have your data, it's pretty simple to export it to various useful formats such as Excel files, SQL tables, or perhaps back into Hive. There is also some charting functionality that was relatively simple to use, although I didn't look too much into it since it wasn't of interest to me.


All this makes the tool worthwhile, but I'm not sure it's worth the price (we were unable to obtain pricing information at time of publication, but will update if they get back to us). Ultimately, you are just writing queries, so it doesn't add any analytic functionality that we didn't have before. Technically, once you've written your query, you don't even need Karmasphere anymore. That said, once you have your data, it does let you do several things that would otherwise be difficult (export, graphing, etc.).

If you're looking to analyze your unstructured data, I would say Karmasphere is ill-suited for the task, as unstructured data tends to need more than just the SQL-like queries Hive offers. All in all, this product is useful. But once my trial runs out, I will discontinue use.