Cloudera Data Science Day Recap: Data Science in the Age of Computational Reproduction


[Image via Wikibon]

I spent yesterday afternoon at the Marriott in my beloved Midtown East, learning about data science from a handful of the people most equipped to teach it: Jeff Hammerbacher (@hackingdata), Amr Awadallah (@awadallah), and Josh Wills (@josh_wills).

First up was Cloudera's founder and Chief Scientist, Jeff Hammerbacher, whose other claim to data fame is his stint at Facebook, where he built and led the data team for two years. He also coined the term "data scientist," mostly because he wanted the then-titled research scientists to get off their cushy high horses and fix database bugs at 2am.

Jeff spoke a bit about getting Facebook's data science team up and running. Everyone started out as a jack of all trades, and people didn't begin to specialize until the team had surpassed thirty employees. Data scientists are most needed on small data teams, because they are so multipurpose, able to zoom in and zoom out.

Jeff asked how many people in the room had the official title of data scientist, and only about 2 out of 40 or so did. Some people think "Data Scientist" is just a marketing neologism for a job that already existed, but Jeff said that no, the word mattered, because it codified a role and the general duties associated with it: data modeling and analysis.

He then talked a bit about data science as a discipline, which consists of:

  • data preparation
  • data presentation
  • experimentation
  • observation
  • data products

A few notes:

  • Re presentation: Dashboards are critical for data presentation. Equipping data scientists with the skills needed to make a good dashboard is important, because a dashboard is the first thing a data scientist will show the other team members.
  • Re experimentation: People say that the goal of data science is to find nuggets. Okay, great--what's a nugget? Cloudera looks for canonical entities related to their products, and maps the distribution and momentum of those.
  • Re data products: Data isn't just used to make better decisions--it's also used to make new products (recommenders, search rankings).

The next bit of his talk dealt with what he called the data scientist-computer symbiosis (and what David calls "melding with the machine").

His philosophy:

  • instrument everything
  • put all your data in one place
  • data first, questions later
  • store first, structure later (often the data model depends on the analysis you'd like to perform; see the sketch after this list)
  • keep raw data forever
  • let everyone party on the data
  • introduce tools to support the whole research cycle (think of the scope of the product as the entire cycle, not just the container)
  • modular and composable infrastructure
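
To make "store first, structure later" concrete, here's a minimal schema-on-read sketch in Python. The file name and field names are hypothetical--the point is that the raw log is stored untouched, and structure is imposed only at analysis time:

```python
import json
from collections import Counter

# Hypothetical raw event log: one JSON object per line, schema unknown up front.
# The file was stored as-is ("keep raw data forever"); structure is imposed
# only now, at analysis time, based on the question being asked.
def top_event_types(path, n=5):
    counts = Counter()
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            # The fields we pull out are chosen by today's analysis,
            # not by a schema fixed at write time.
            counts[event.get("type", "unknown")] += 1
    return counts.most_common(n)

if __name__ == "__main__":
    print(top_event_types("events.jsonl"))
```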

Lastly, he spoke about the future of data science. At Cloudera, the future will likely bring search, MPP, stream processing, graph computations, linear algebra, optimization, and simulation, plus, later on, "last mile" tech like data libraries, languages, and IDEs for data science.

Next up was Amr Awadallah, Cloudera's CTO. Previously, he was the VP of Product Intelligence Engineering at Yahoo during the time Doug Cutting created Hadoop and got the company to start using it. Amr talked about why he moved towards Hadoop and away from RDBMS-only.

1) moving data to compute doesn't scale
2) archiving = premature data death
3) can't explore original high-fidelity raw data

He made a good, repeated case for why Hadoop: flexibility, scalability, economics. He emphasized the importance of measuring your return on byte--the value to be extracted from a byte divided by the cost of storing that byte. Not all of your data deserves to fly in first class.
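
A back-of-the-envelope version of that metric, with invented numbers, just to show the shape of the calculation:

```python
# Hypothetical "return on byte": the value extracted from a dataset
# divided by the cost of storing it. All numbers here are invented.
def return_on_byte(value_extracted_usd, bytes_stored, cost_per_gb_usd):
    storage_cost_usd = (bytes_stored / 1e9) * cost_per_gb_usd
    return value_extracted_usd / storage_cost_usd

# 10 TB of clickstream logs driving ~$50k of decisions: the return looks
# very different on commodity disks vs. a pricey "first class" tier.
print(return_on_byte(50_000, 10e12, 0.05))  # commodity storage: 100.0
print(return_on_byte(50_000, 10e12, 1.50))  # premium storage: ~3.3
```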

Some cool use cases for Hadoop:

  • retail: price optimization
  • media: content targeting
  • finance: fraud detection
  • manufacturing: diagnostics

Josh Wills, who is Cloudera's Director of Data Science, spoke after Amr. I've heard Josh speak before--he was nice enough to give the closing argument at the Boston Hadoop User Group Meetup on Big Data vs. Better Data vs. Better Algorithms (Josh is for big data, made better through lots and lots of scrubbing). Like Amr and Jeff, Josh also has a cool consumer web background: he used to be the lead engineer on Google's ad auction.

Josh didn't invent the term data scientist, but he did come up with its most retweeted definition: a data scientist is "someone who is better at stats than any software engineer and better at software engineering than any statistician." I dig it.

Want to think like a data scientist? It's all about solving problems, not finding insights. Josh parallelizes everything: you need to make sure you're not solving the wrong problem, or rather, that you're not spending ages solving the wrong problem, so iterate quickly. You also need to think like Caligula, not a pilgrim. Don't be a miser; take in all the data you can get your hands on.

If you want to build a data product, the first step is to assemble a crackerjack data science team. Since you're dealing with full life cycles of data, you need people who can deploy, not just analyze. The second step is to choose good problems. Never build anything exactly once--do it a lot or not at all. The third step is to design the model. Josh thinks you don't need to obsess endlessly over your machine learning algorithm of choice; just make sure to mind the gap between your model and your business. The objective of Facebook's friend recommender is not, in fact, to get you to click on the suggestions--it's to get you to spend more time on Facebook. If the recs the algorithm serves up don't achieve that goal, then the algorithm needs tweaking. Otherwise, leave it, and work on your sponsored stories. Seriously, please work on them. They are terribly integrated right now.
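
A toy illustration of that gap, with invented numbers: score recommender variants on the business objective (time on site) rather than the proxy (clicks):

```python
# Hypothetical offline comparison of two recommender variants. B wins on the
# proxy metric (click-through rate) but loses on the metric the business
# actually cares about (time on site), so A is the variant to ship.
variants = {
    "A": {"clicks": 120, "impressions": 10_000, "avg_session_min": 18.4},
    "B": {"clicks": 180, "impressions": 10_000, "avg_session_min": 15.1},
}

for name, m in variants.items():
    ctr = m["clicks"] / m["impressions"]
    print(f"{name}: CTR={ctr:.2%}, avg session={m['avg_session_min']} min")

best = max(variants, key=lambda v: variants[v]["avg_session_min"])
print(f"Ship {best}: optimize the business objective, not the proxy.")
```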

Other things to keep in mind: 1) Amortize costs by keeping all yo data around and analyzing it over and over and over again. 2) Measure everything.

If you want to put these lessons to use, Josh is teaching an Introduction to Data Science course in NYC December 12-14.

The last speaker was Jo Maitland, the Research Director for the Infrastructure/Cloud channel at GigaOM Pro. She spoke about burgeoning data, the burgeoning data market, and where the twain are going. As you might know, there is a lot of data traipsing about the interwebs and server racks and commodity servers. Some of it is human generated, some machine generated; some of it is consumer web, some of it enterprise. Walmart handles more than 1 million customer transactions per hour. This stuff is so helpful in outing pregnant teens, y'all.

Anyways, there are a bunch of ways to handle this data. We can turn off our computers and go be ascetics, which sounds really lovely, or we can be data-pinching misers, which is time consuming and useless, or we can use Hadoop or NoSQL, which, conveniently, are being hardened for the enterprise. That's a new phrase for me--hardened for the enterprise. Not sure I'm on board, to be honest.

Jo tells the investors in the audience that the platform layer, c'est fini. Désolée. The action in the next few years is going to be in the applications that sit on Hadoop/NoSQL. Like ours. And also like:

  • Ops intelligence (Splunk)
  • Sales and Marketing (GoodData, Media Science, BloomReach)
  • Viz (Tableau, QlikTech, Palantir)
  • Biz intel (Platfora, WibiData)
  • Online ads (DataXu)
  • Data as a Service (FICO, DataSift, BlueKai)

And also like consumer-facing companies, e.g. Square, the creepily named PredPol (lets you predict crime in real time; very broken windows with brains), and 23andMe (collects human genome data and lets you do personal analysis of your ancestry).

Jo sees data democratization and trust growing, along with a shift to real-time data interaction. She also sees an emerging investment opportunity in cloud-based big data services and security.

And now for the final portion of the evening: the panel. Jo moderated. I don't think she got to ask as many of her questions as she would have liked. Alas, that's what happens when you turn the mikes over to the DBAs.

Q. 1: When to move data out of an RDBMS and into Hadoop?  A. 1: If it's unstructured.

Q. 2: What skillsets are important in terms of moving from biz analyst to data scientist? A. 2: Network security guys and people who tend to bridge the gaps between the biz analyst and ETL groups.

Q. 3 : Where should the data science team sit in an organization? A. 3: The most important thing for a data science team is for it to be with the data, so the team should sit with whichever group in the org handles the data, unless it's a large team, in which case it works like a matrix.

Q. 4: Aannd we finally get a Nate Silver reference, plus this: is there a tool to decide whether x amount of data is good enough to answer my question? A. 4: Amr: More data will always give you better results. Josh: I'm not against sampling data, but only if it's very homogeneous (read: boring).
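
A quick invented illustration of Josh's caveat: a small sample estimates a homogeneous dataset just fine, but tends to miss the rare, high-value events in a heterogeneous one:

```python
import random
random.seed(0)

# Invented data: a homogeneous stream vs. the same stream with a handful
# of rare, high-value outliers mixed in.
homogeneous = [random.gauss(100, 5) for _ in range(100_000)]
heterogeneous = homogeneous + [10_000.0] * 20

for name, data in [("homogeneous", homogeneous),
                   ("heterogeneous", heterogeneous)]:
    sample = random.sample(data, len(data) // 100)  # a 1% sample
    full_mean = sum(data) / len(data)
    sample_mean = sum(sample) / len(sample)
    # On homogeneous data the two means agree; on heterogeneous data the
    # 1% sample usually contains none of the 20 outliers and misses them.
    print(f"{name}: full mean={full_mean:.1f}, sample mean={sample_mean:.1f}")
```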

Q. 5: Why do I need both RDBMS and Hadoop?  A. 5: If your ingredients have already been prepped, sliced, etc., then you can cook fast, but only one thing. If your ingredients come in untouched, you can slice and dice them any which way.
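
In code terms, a toy contrast with invented data: the RDBMS side preps ingredients at write time to answer one known question fast, while the raw side keeps them untouched for questions nobody anticipated:

```python
import sqlite3

# RDBMS side: ingredients prepped at write time. The schema was designed
# for one known question, which it answers instantly.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")
db.execute("INSERT INTO daily_revenue VALUES ('2012-11-01', 1200.0)")
print(db.execute("SELECT SUM(revenue) FROM daily_revenue").fetchone()[0])

# Raw side: untouched events that can be sliced any which way, including
# questions the prepped schema never anticipated.
raw_events = [
    {"day": "2012-11-01", "sku": "A1", "price": 700.0, "browser": "ie6"},
    {"day": "2012-11-01", "sku": "B2", "price": 500.0, "browser": "chrome"},
]
ie6_revenue = sum(e["price"] for e in raw_events if e["browser"] == "ie6")
print(ie6_revenue)  # nobody prepped a table for "revenue from IE6 users"
```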

Q. 6: How do you protect data if it's all stored on one coast or in one city?  A. 6: You can run a single Hadoop instance across multiple data centers (but it's tricky). Google has three data centers for everything, and everything gets written twice.

Q. 7: NoSQL vs. Hadoop?  A. 7: NoSQL is very good for lookups and low latency, but not as good at throughput as HDFS.

And with that, my throughput: done. Many thanks to Cloudera for putting on such an informative, non-salesy event.
