A blog about security, privacy, algorithms, and email in the enterprise. 



Strata Day 2 Recap: Big Data, Wall Street Style and Netflix Recommendations

I'm combining these two because the first doesn't require much ink.

Big Data: Wall Street Style

Featuring: Jeff Sternberg, Jen Zeralli

This was a pure sales pitch for S&P Capital IQ. To be fair, some of the functionality behind their dashboard, especially the "companies you may be interested in" recommendation engine, is pretty cool, but a) I was hoping for some dirt on black box algorithms and b) SPCIQ's web front end has an off-puttingly bad, 1995-all-html-all-the-time UI.

No shilling, but a few facts:

  • Of the more than $2.35 trillion that has been invested in IT over the last 10 years, the amount invested in Big Data technologies comes to somewhere around 4%.
  • SPCIQ gets 67k docs/day, which are stored in a document repository composed of SQL (for the metadata), a filesystem, and Solr/Lucene for searching.
  • For their recommendation engine, they use signals from Hadoop and Hive to score each suggestion for each user.

That's pretty much it, IMHO. On to 'flix.

Netflix Recommendations: Beyond the 5 Stars

In this session, Xavier Amatriain (@xamat) dissected the anatomy of a Netflix recommendation. Good stuff, though he was really hard to hear. Some facts:

  • Netflix recommendations are per account, not per person, which is why, as one Twitterer noted, your eight-year-old is told she might enjoy Cape Fear.
  • The "continue watching" button is a very important recommendation validation.
  • Netflix uses a combination of implicit (tracking user behavior) and explicit (asking users "this or that or the other" questions) methods to set taste preferences. They also take freshness and diversity into account to determine genre selections.
  • Netflix's similars are computed from different data sources including metadata, ratings, and viewing data, and can be treated as data/features. They are used in response to user actions.
  • Ranking of films uses popularity as a baseline, and is determined through a combination of scoring, sorting, and filtering, with the goal of finding the best possible ordering of a set of videos for a user within a specific context in real-time. Whew.
  • Predictive film ranking is akin to CTR forecasting for ad/search results.
  • Perhaps most importantly, Xavier is in the "a few good algorithms > massive amounts of data" camp. Netflix found that beyond a few thousand training samples, the accuracy of their recommendation model levels out.
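
To make the "scoring, sorting, and filtering" bullet above a bit more concrete, here is a minimal Python sketch. It is not Netflix's algorithm: the `Video` fields, the weights, the maturity filter, and every number in the toy catalog are assumptions made up for illustration. It only shows the general shape of blending a popularity baseline with a per-user prediction, filtering by context, and sorting.

```python
from dataclasses import dataclass


@dataclass
class Video:
    title: str
    popularity: float        # baseline signal on a 0..1 scale (hypothetical)
    predicted_rating: float  # per-user prediction on a 0..1 scale (hypothetical)
    maturity: str            # "kids" or "adult" (hypothetical labels)


def rank(videos, allowed_maturity, w_pop=0.4, w_pred=0.6, top_n=10):
    """Return up to top_n titles, best first, for one user in one context."""
    # Filter: drop anything that does not fit the viewing context.
    candidates = [v for v in videos if v.maturity in allowed_maturity]
    # Score: popularity is the baseline, the prediction personalizes it.
    scored = [(w_pop * v.popularity + w_pred * v.predicted_rating, v)
              for v in candidates]
    # Sort: highest score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [v.title for _, v in scored[:top_n]]


# Toy catalog with made-up numbers.
catalog = [
    Video("Cape Fear", popularity=0.9, predicted_rating=0.2, maturity="adult"),
    Video("Kids Movie A", popularity=0.7, predicted_rating=0.9, maturity="kids"),
    Video("Kids Movie B", popularity=0.8, predicted_rating=0.6, maturity="kids"),
]

print(rank(catalog, allowed_maturity={"kids"}))
```

With a "kids" context the filter removes Cape Fear before scoring ever happens, which is the kind of per-person context a per-account recommender lacks.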








Strata Day 2 Recap: Data as a Strategic Weapon

Featuring: Billy Bosworth (DataStax, @datastax), Jeremy Edberg (Netflix, @jedberg), STS Prasad (Walmart, @stsprasad), Ed Anuff (Apigee, @edanuff)

DataStax CEO Billy Bosworth moderated this panel about the business motives behind and effects of making the distributed computing jump. Edberg, Prasad and Anuff all said it was a matter of sink or swim (or scale up)--they couldn't go on supporting their customers in a meaningful way if they stuck with a data warehouse. Netflix needed a distributed, resilient system; Walmart needed to rapidly process data into what Prasad called "the social genome." All three companies ended up choosing Cassandra.

Challenges in moving to distributed computing:

  • Single-node loss, because it overloaded neighbor nodes
  • Rethinking ways to find and query data
  • Compaction--it causes a performance hit, so Walmart ended up using SSDs to compensate.

Nice surprises:

  • Counters made it easy to implement a system with real-time reporting.
  • The ability to administer Cassandra and get the best from it hasn't translated into a need to increase hiring.
  • No need to worry about disappearing data.
  • Speed of negative lookups is much faster than expected.
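
The counters point above is easier to picture with a small example. The sketch below is an in-memory stand-in written in plain Python, not Cassandra and not any Cassandra driver API: the `MinuteCounters` class and its method names are hypothetical. The idea it illustrates is that each event only increments a counter keyed by (event, time bucket), so a real-time report is a cheap read of pre-aggregated counts rather than a scan over raw events.

```python
from collections import Counter
from datetime import datetime


class MinuteCounters:
    """Hypothetical in-memory stand-in for per-minute counter columns."""

    def __init__(self):
        self._counts = Counter()

    def increment(self, event, when, amount=1):
        # Bucket the timestamp to the minute; the bucket is part of the key.
        bucket = when.strftime("%Y-%m-%dT%H:%M")
        self._counts[(event, bucket)] += amount

    def report(self, event):
        # Reading a report is a lookup of already-aggregated counts.
        return {bucket: n
                for (name, bucket), n in sorted(self._counts.items())
                if name == event}


counters = MinuteCounters()
counters.increment("play", datetime(2012, 1, 1, 12, 0, 5))
counters.increment("play", datetime(2012, 1, 1, 12, 0, 40))
counters.increment("play", datetime(2012, 1, 1, 12, 1, 10))
print(counters.report("play"))
```

In Cassandra itself the increment would be a counter-column update rather than a Python dictionary write, but the write-time aggregation pattern is the same.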

In conclusion:

  • The more data you put into your system, the better the system gets. (This has been a popular refrain.)
  • Relational databases aren't going away--there's a place for both relational and distributed, and most companies will need both. NoSQL for real-time; SQL for batch processing.
  • Just having the data isn't enough--you need the ability to rapidly extract insight from it.
  • Look to see more innovation on the business application side.

Interesting fact: Netflix uses multi-region rings--one Cassandra cluster across multiple geographic regions--both for resilience and so its US customers can travel abroad without loss of service.

