I'm combining these two because the first doesn't require much ink.

Big Data: Wall Street Style

Featuring: Jeff Sternberg, Jen Zeralli

This was a pure sales pitch for S&P Capital IQ. To be fair, some of the functionality behind their dashboard, especially the "companies you may be interested in" recommendation engine, is pretty cool, but a) I was hoping for some dirt on black box algorithms and b) SPCIQ's web front end has an off-puttingly bad, 1995-all-html-all-the-time UI.

No shilling, but a few facts:

  • Of the more than $2.35 trillion that has been invested in IT over the last 10 years, the amount invested in Big Data technologies comes to somewhere around 4%.
  • SPCIQ gets 67k docs/day, which are stored in a document repository made up of SQL (for the metadata), a filesystem, and Solr/Lucene for searching.
  • For their recommendation engine, they use signals from Hadoop and Hive to score each suggestion for each user.
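The talk didn't go into how those signals actually get combined, so here's only a minimal sketch of what "score each suggestion for each user" could look like. The signal names, the weights, and the row layout are all my own assumptions, not anything SPCIQ showed; think of the rows as the output of some batch job.

```python
from collections import defaultdict

# Hypothetical signal weights: the talk did not say which signals
# are used or how they are weighted.
SIGNAL_WEIGHTS = {
    "viewed_peer": 1.0,
    "same_sector": 0.5,
    "in_watchlist_of_similar_user": 2.0,
}

def score_suggestions(rows):
    """Sum weighted signal strengths per (user, company) pair.

    rows: iterable of (user, company, signal_name, strength) tuples.
    Unknown signal names contribute nothing.
    """
    scores = defaultdict(float)
    for user, company, signal, strength in rows:
        scores[(user, company)] += SIGNAL_WEIGHTS.get(signal, 0.0) * strength
    return dict(scores)

def top_n(scores, user, n=3):
    """Return the n best (company, score) pairs for one user."""
    mine = [(c, s) for (u, c), s in scores.items() if u == user]
    # Highest score first; ties broken alphabetically for stable output.
    mine.sort(key=lambda cs: (-cs[1], cs[0]))
    return mine[:n]

rows = [
    ("alice", "ACME", "viewed_peer", 2.0),
    ("alice", "ACME", "same_sector", 1.0),
    ("alice", "GLOBEX", "in_watchlist_of_similar_user", 1.0),
    ("bob", "ACME", "same_sector", 4.0),
]
scores = score_suggestions(rows)
print(top_n(scores, "alice", n=1))
```

A plain weighted sum is just the simplest thing that fits the description; a real system could just as easily learn the weights or use a different model entirely.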

That's pretty much it, IMHO. On to 'flix.

Netflix Recommendations: Beyond the 5 Stars

In this session, Xavier Amatriain (@xamat) dissected the anatomy of a Netflix recommendation. Good stuff, though he was really hard to hear. Some facts:

  • Netflix recommendations are per account, not per person, which is why, as one Twitterer noted, your eight-year-old is told she might enjoy Cape Fear.
  • The "continue watching" button is a very important validation signal for recommendations.
  • Netflix uses a combination of implicit (tracking user behavior) and explicit (asking users "this or that or the other" questions) methods to set taste preferences. They also take freshness and diversity into account when determining genre selections.
  • Netflix's similars are computed from different data sources including metadata, ratings, and viewing data, and can be treated as data/features. They are used in response to user actions.
  • Ranking of films uses popularity as a baseline, and is determined through a combination of scoring, sorting, and filtering, with the goal of finding the best possible ordering of a set of videos for a user within a specific context in real-time. Whew.
  • Predictive film ranking is akin to CTR forecasting for ad/search results.
  • Perhaps most importantly, Xavier is in the "a few good algorithms > massive amounts of data" camp. Netflix found that beyond a few thousand training samples, the accuracy of their recommendation model levels out.
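
To make the ranking bullet a bit more concrete, here's a minimal sketch of the "popularity as a baseline, then score, sort, and filter" idea. To be clear, this is not Netflix's model: the `Video` fields, the linear blend, and the weights are assumptions I made up for illustration, and the "already watched" filter is just one plausible filtering rule.

```python
from dataclasses import dataclass

@dataclass
class Video:
    title: str
    popularity: float        # baseline, assumed normalized to [0, 1]
    predicted_rating: float  # personalized prediction, assumed in [0, 1]
    watched: bool = False

def rank(videos, w_pop=0.4, w_pred=0.6):
    """Order videos for one user: filter, score, then sort.

    w_pop and w_pred are hypothetical blend weights.
    """
    # Filter: drop anything the user has already watched.
    candidates = [v for v in videos if not v.watched]
    # Score: popularity baseline blended with the personalized prediction.
    scored = [
        (w_pop * v.popularity + w_pred * v.predicted_rating, v.title)
        for v in candidates
    ]
    # Sort: best score first; ties broken by title for stable output.
    scored.sort(key=lambda st: (-st[0], st[1]))
    return [title for _, title in scored]

videos = [
    Video("A", popularity=0.9, predicted_rating=0.2),
    Video("B", popularity=0.3, predicted_rating=0.9),
    Video("C", popularity=1.0, predicted_rating=1.0, watched=True),
]
print(rank(videos))
```

With these made-up numbers, the less popular "B" outranks "A" because the personalized prediction outweighs the popularity baseline, and "C" never shows up because it gets filtered out.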