big data conferences [Image via DataGotham]

The next few months are a Northeastern data-fiend’s dream. Nearly every week boasts at least one high-caliber conference, and they touch on everything from data journalism to Julia to in-memory data stores to cross-disciplinary analytics.  The following six caught my eye, but you can find more on the excellent conference discovery site lanyrd.com.

1) DataGotham

The pitch:

DataGotham is a celebration of New York City's data community that will bring together professionals from finance to fashion and from startups to the Fortune 500.

The speakers:

  • Michael P. Flowers, Director of the Mayor’s Financial Crime Task Force and the NYC Policy and Strategic Planning Analytics Team
  • Steven E. Koonin, Director of NYU's Center for Urban Science and Progress
  • Blake Shaw, Data Scientist at Foursquare
  • Adam Laiacano, Data Scientist at Tumblr
  • Alicia Rankin, Head of Research and Fan Insights for NFL
  • Jake Porway, Founder and Exec Director of DataKind
  • Matthew Israel, Director of the Art Genome Project at Art.sy

The sessions: Not yet available, but tutorials include “Data Journalism Fundamentals,” “MongoDB & R,” and “An Introduction to Julia”

The space: NYU’s Stern School of Business (Paulson Auditorium and classrooms)

The cost: $499, or $250 for academics and non-profits

The dates: Sept. 13-14

2) Strata Conference and Hadoop World

The pitch:

The O’Reilly Strata Conference explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World to create the largest gathering of the Apache Hadoop community in the world. Strata brings together decision makers using the raw power of big data to drive business strategy, and practitioners who collect, analyze, and manipulate that data—particularly in the worlds of finance, media, and government.

The speakers:

  • Mike Olson, Cloudera
  • Alistair Croll, Solve for Interesting
  • Edd Dumbill, O’Reilly Media
  • Jim Adler, Chief Privacy Officer, Intelius
  • Abhijit Bose, Director and Senior Data Scientist of Digital Analytics, American Express
  • Alice Brennan, Journalist, the New York World

The sessions: “Search and Real-time Analytics on Big Data,” “Moneyball for New York City,” “Analyzing Millions of GitHub Commits: What Makes Developers Happy, Angry, and Everything Inbetween?,” “Finance vs Machine Learning”

The space: New York Hilton

The cost: $595-$2045

The dates: Oct. 23-25

3) Big Data Innovation

The pitch:

The Big Data Innovation Summit is the largest gathering of Fortune 500 business executives leading Big Data initiatives.

The speakers:

  • Kurt Smith, Data Scientist at Twitter
  • Mohammad Sabah, Data Scientist at Facebook
  • Ashok Srivastava, Principal Scientist, NASA
  • Steve Hirsh, Chief Data Officer, NYSE
  • Arun Jacob, Director of Data Solutions, Walt Disney

The sessions: “Experimentation at eBay,” “Better Health through Data Science,” “Data-Infused Product Design and Insights at LinkedIn”

The space: Hyatt Regency Boston One

The extracurriculars: Networking drinks in the Exhibitions area, 6pm, 9/13

The cost: $595-$2045

The dates: Sept. 13-14

4) Government Big Data

The pitch:

This outstanding conference brings together the key government and industry experts who are shaping the direction of big data research and development across the Federal Government. They will provide you with an in-depth understanding of Federal agency strategy and plans, the status and forecast for key big data initiatives, and the latest tools and technologies being developed to exploit the massive amounts of information being collected at the Federal level.

The speakers:

  • Dr. Sasi Pillay, CTO for IT, NASA
  • Dr. Christopher White, Program Manager, Information Innovation, DARPA
  • Tasso Argyros, Co-President, Aster Data
  • Susie Adams, CTO, Microsoft Federal
  • Eric Braverman, Partner, McKinsey and Company

The sessions: “Perspectives from the Office of the Secretary of Defense,” “National Aeronautics and Space Agency Perspectives and Initiatives,” “2.0: Surveillance Solution in the Cloud,” “We Didn’t Try to Grow a Bigger Ox: Why USASearch Uses Hadoop”

The space: Holiday Inn Rosslyn at Key Bridge, Arlington, VA

The cost: $1290

The dates: Sept. 18-19

5) Visualized

The pitch:

VISUALIZED explores the evolution of communication at the intersection of big data, storytelling and design. Gain insight into designing data-driven narratives that connect with audiences and visualize the human experience.

The speakers:

  • Shaw Hwang, Design Technologist, Trulia
  • Katy Harris, Information Designer, Fathom
  • Hilary Mason, Chief Scientist, Bitly
  • Simon Rogers, Data Journalist, Editor at Large, The Guardian UK
  • Scott Belsky, CEO, Behance
  • Shan Carter, Interactive Graphics Editor, New York Times

The space: Times Center Manhattan

The cost: $799, or $699 if you donate a high-quality used book

The dates: Nov. 8-9

6) Text Analytics World

The pitch:

Text Analytics World is the full-spectrum conference that covers all aspects of text analytics. To solidify the business value you gain from text analytics, TAW delivers the latest methods/techniques, demonstrating their deployment across a wide range of industries large and small.

The speakers:

  • Sarah Ann Berndt, Taxonomist, Johnson Space Center
  • Anna Divoli, Senior Software Researcher, Pingar
  • Heather Edwards, Taxonomy Developer, AP
  • Sue Feldman, Research Vice President, Search and Discovery Technologies, IDC
  • Gregory Piatetsky-Shapiro, Editor, KDNuggets

The sessions: “Predictive Coding in E-Discovery,” “Crossing the Language Chasm: Extracting Information from Foreign Language Text,” “Unified Access to Enterprise Information,” “Big Data and Big Analytics Trends: The Promise and the Hype,” “Harnessing the power of text analytics to drive human capital”

The space: Seaport World Trade Center

The cost: $990-$1790

The dates: October 3-5

Posted
AuthorClaire Willett

I'm combining these two because the first doesn't require much ink.

Big Data: Wall Street Style

Featuring: Jeff Sternberg, Jen Zeralli

This was a pure sales pitch for S&P Capital IQ. To be fair, some of the functionality behind their dashboard, especially the "companies you may be interested in" recommendation engine, is pretty cool, but a) I was hoping for some dirt on black box algorithms and b) SPCIQ's web front end has an off-puttingly bad, 1995-all-html-all-the-time UI.

No shilling, but a few facts:

  • Of the more than $2.35 trillion that has been invested in IT over the last 10 years, the amount invested in Big Data technologies comes to somewhere around 4%.
  • SPCIQ gets 67k docs/day, which are stored in a document repository comprised of SQL (for the metadata), a filesystem, and Solr/Lucene for searching.
  • For their recommendation engine, they use signals from Hadoop and Hive to score each suggestion for each user.

That's pretty much it, IMHO. Onto 'flix.

Netflix Recommendations: Beyond the 5 Stars

In this session, Xavier Amatriain (@xamat) dissected the anatomy of a Netflix recommendation. Good stuff, though he was really hard to hear. Some facts:

  • Netflix recommendations are per account, not per person, which is why, as one Twitterer noted, your eight-year-old is told she might enjoy Cape Fear.
  • The "continue watching" button is a very important recommendation validation.
  • Netflix uses a combination of implicit (tracking user behavior) and explicit (asking users "this or that or the other" questions) methods to set taste preferences. They also take freshness and diversity into account to determine genre selections.
  • Netflix's similars are computed from different data sources including metadata, ratings, and viewing data, and can be treated as data/features. They are used in response to user actions.
  • Ranking of films uses popularity as a baseline, and is determined through a combination of scoring, sorting, and filtering, with the goal of finding the best possible ordering of a set of videos for a user within a specific context in real-time. Whew.
  • Predictive film ranking is akin to CTR forecasting for ad/search results.
  • Perhaps most importantly, Xavier is in the "a few good algorithms > massive amounts of data" camp. Netflix found that beyond a few thousand training samples, the accuracy of their recommendation model levels out.
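To make the "scoring, sorting, and filtering" idea concrete, here is a minimal sketch. It is not Netflix's actual method: the Video fields, the weights, and the "already watched" filter are all assumptions made up for illustration. The only thing taken from the talk is the shape of the pipeline, with popularity as the baseline signal.

```python
from dataclasses import dataclass


@dataclass
class Video:
    title: str
    popularity: float        # hypothetical signal, normalized to 0..1
    predicted_rating: float  # hypothetical personalized signal, 0..1
    already_watched: bool = False


def rank(videos, w_pop=0.5, w_pred=0.5, top_n=3):
    """Filter, score, and sort candidate videos for one user."""
    # Filter: drop titles this account has already watched (assumed rule).
    candidates = [v for v in videos if not v.already_watched]
    # Score: popularity is the baseline; the predicted rating adjusts it.
    scored = [(w_pop * v.popularity + w_pred * v.predicted_rating, v.title)
              for v in candidates]
    # Sort: highest score first, then keep the top of the list.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_n]]


catalog = [
    Video("A", popularity=0.9, predicted_rating=0.2),
    Video("B", popularity=0.5, predicted_rating=0.9),
    Video("C", popularity=0.8, predicted_rating=0.8, already_watched=True),
    Video("D", popularity=0.1, predicted_rating=0.3),
]
ranking = rank(catalog)
```

With equal weights, the personalized signal is enough to lift B over the more popular A, which is the whole point of going beyond a popularity-only list.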

 

 

 

 

 


Featuring: Marcel Salathé

Salathé's bread and butter is infectious diseases, but his research group at Penn State has found that the disease contagion patterns can be applied to social behaviors as well. Which is good, as he thinks that in the near future, understanding social dynamics is going to be as important as understanding the spread of germs. In this session, he took us through the definition and dynamics of social contagion.

Let's say Al bought a Segway, and his neighbor Phil, upon seeing Al chugging merrily along astride it (can you be astride a Segway?), is impelled to go buy one for himself. Contagion. Butttt, let's say Al buys a Segway, immediately goes on vacation, and while he's in Aruba, Phil buys a Segway. Not contagion. In both cases, the two men now have something in common, but in the first case, the shared trait (owning a Segway) spread through their connection; in the second, it is the result of the men's underlying similarity.

The second case is an example of homophily, or the "birds of a feather flock together" phenomenon, and Salathé says that if you account for it, almost all contagion disappears. I'm not quite sure how this works--a cluster of points may have a smaller outside edge than the same number of scattered points, but, unless we're talking a completely isolated community, the edge is still there.

Salathé used his H1N1 Vaccine Twitter study to illustrate the dynamics of social contagion. He and his students collected all H1N1-related tweets, ran sentiment analysis on them, and used natural language processing to calculate the average sentiment score over time. They wanted to find out a) whether increased positive sentiment about the vaccine corresponded to increased vaccination rates, which it did, and b) whether Twitter users with similar sentiment are more likely to be connected (positive assortativity) than are users with different sentiments (negative assortativity). They found that positive assortativity did increase the chance of contagion, though negative assortativity did not. A few interesting Twitter assortativity findings:

  • Your follower count has no effect on your tweeting positively, but does have an effect on your tweeting negatively
  • If you have more negative friends, you're less likely to tweet negatively
  • The more positive tweets you're exposed to, the more likely you are to tweet negatively. This last one goes against the popular "spread the cheer" viral messaging technique.
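For anyone who wants to play with assortativity on their own data, here is a minimal sketch of one standard way to measure it for a categorical label (Newman's assortativity coefficient). I don't know that this is the exact measure Salathé's group used; the function name, the edge list, and the "pos"/"neg" labels are all made up for illustration.

```python
from collections import Counter


def sentiment_assortativity(edges, sentiment):
    """Newman's assortativity coefficient for a categorical node label.

    edges: iterable of (user_a, user_b) undirected links.
    sentiment: dict mapping each user to a label such as "pos" or "neg".
    Returns a value near +1 when like connects to like, near -1 when
    opposites connect, and near 0 when links ignore the label.
    """
    pair_counts = Counter()
    for a, b in edges:
        # Count every undirected edge in both directions so the
        # label-by-label mixing matrix is symmetric.
        pair_counts[(sentiment[a], sentiment[b])] += 1
        pair_counts[(sentiment[b], sentiment[a])] += 1
    total = sum(pair_counts.values())
    labels = {label for pair in pair_counts for label in pair}
    # Fraction of edge ends that join two users with the same label.
    same = sum(pair_counts[(label, label)] for label in labels) / total
    # Fraction of edge ends attached to each label.
    ends = {label: sum(c for (i, _), c in pair_counts.items() if i == label) / total
            for label in labels}
    # Same-label fraction expected if links were made at random.
    expected = sum(p * p for p in ends.values())
    if expected == 1:
        return 0.0  # only one label present, so the measure is undefined
    return (same - expected) / (1 - expected)


labels = {"al": "pos", "phil": "pos", "cy": "neg", "dee": "neg"}
clustered = sentiment_assortativity([("al", "phil"), ("cy", "dee")], labels)
mixed = sentiment_assortativity([("al", "cy"), ("phil", "dee")], labels)
```

Here `clustered` comes out at 1.0 (everyone is linked to someone who agrees with them) and `mixed` at -1.0 (everyone is linked to someone who disagrees).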

 

 


Featuring: Billy Bosworth (DataStax, @datastax), Jeremy Edberg (Netflix, @jedberg), STS Prasad (Walmart, @stsprasad), Ed Anuff (Apigee, @edanuff)

DataStax CEO Billy Bosworth moderated this panel about the business motives behind and effects of making the distributed computing jump. Edberg, Prasad and Anuff all said it was a matter of sink or swim (or scale up)--they couldn't go on supporting their customers in a meaningful way if they stuck with a data warehouse. Netflix needed a distributed, resilient system; Walmart needed to rapidly process data into what Prasad called "the social genome." All three companies ended up choosing Cassandra.

Challenges in moving to distributed computing:

  • Single-node loss, because it overloaded neighbor nodes
  • Rethinking ways to find and query data
  • Compaction--it causes a performance hit, so Walmart ended up using SSDs to compensate.

Nice surprises:

  • Counters made it easy to implement a system with real-time reporting.
  • The ability to administer Cassandra and get the best from it hasn't translated into a need to increase hiring.
  • No need to worry about disappearing data.
  • Speed of negative lookups is much faster than expected.

In conclusion:

  • The more data you put into your system, the better the system gets. (This has been a popular refrain).
  • Relational databases aren't going away--there's a place for both relational and distributed, and most companies will need both. NoSQL for real-time; SQL for batch-processing.
  • Just having the data isn't enough--you need the ability to rapidly extract insight from it.
  • Look to see more innovation on the business application side.

Interesting fact: Netflix uses multi-region rings--one Cassandra cluster across multiple geographic regions--both for resilience and so its US customers can travel abroad without loss of service.

 


[Image via Datablog]

My original plan for this afternoon was to attend Jeremy Howard and Mike Bowles' session on predictive modeling, but, after a morning of focused web crawls, I decided to go listen to Simon Rogers (@smfrogers) and Michael Brunton-Spall (@bruntonspall) talk about data journalism instead. To cop a Britishism, it was brilliant. Rogers is the pioneering journalist behind The Guardian's uber-popular Datablog, and Brunton-Spall is one of the developers tasked with transforming reams of raw data into journalist-searchable information.

If you haven't ever read the Datablog, you should: it's a model for transparent, accessible business, giving readers a variety of ways to consume news, the numbers behind the news, and the methodology for obtaining these numbers. Datablog does a lot of the UK government's work for them, and a decent amount of our government's as well, turning paper and web documents into public Google spreadsheets, interactive charts and visualizations, and editorial stories. As Rogers noted, while data used to be the domain of long-form journalism, our new crawling, parsing, and processing skills make it highly suitable for short-form news as well. It's pretty easy to imagine it becoming a real-time news source (I'm sure Automated Insights would agree).

This session used a bunch of Datablog posts and datasets to illustrate the parts of data journalism, which boil down to:

1) collect sent data, recurring events, breaking news, and theories to be explored

2) figure out what to compare or how to show change, what the data means, and which other datasets to use with it

3) shove the chosen data into spreadsheets

4) clean up the data: check for data in the wrong format, merged cells, unnecessary columns of data, and data measured in different units. 80% of their time is spent here

5) perform calculations on the data, recalculate if needed, and sanity check the results

6) map the data in one or more formats (graphics, free viz tools, google fusion table, story, and/or just publish)
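Steps 4 and 5 are where the time goes, so here is a minimal sketch of what they can look like in code rather than in a spreadsheet. The sample rows, column names, and units are invented for illustration and are not from any Datablog dataset.

```python
import csv
import io

# Made-up sample: thousands separators, stray spaces, mixed units,
# and a "notes" column we don't need.
RAW = """region,arrests,area,notes
North,"1,200",10 km2,
South,  950 ,4 sq mi,check
"""

KM2_PER_SQ_MI = 2.58999


def clean_row(row):
    """Fix number formats, convert units to km2, drop unneeded columns."""
    arrests = int(row["arrests"].replace(",", "").strip())
    value, unit = row["area"].split(maxsplit=1)
    factor = KM2_PER_SQ_MI if unit == "sq mi" else 1.0
    return {
        "region": row["region"].strip(),
        "arrests": arrests,
        "area_km2": round(float(value) * factor, 2),
    }


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW))]

for r in rows:
    # Step 5: calculate, then sanity check before anything gets published.
    r["arrests_per_km2"] = round(r["arrests"] / r["area_km2"], 1)
    assert r["arrests"] >= 0 and r["area_km2"] > 0
```

Converting everything to one unit before dividing is the part that is easy to forget when the numbers arrive from different sources.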

While some of the Datablog posts are fairly light-hearted (e.g. US Plastic Surgery Statistics, though that is also a bit scary), most of them offer the public substantiated cultural, institutional, and environmental conclusions,  e.g. that the bulk of the arrests during the UK's summer of unrest took place in its poorest neighborhoods, or that the battle between the 99% and the 1% should actually be between the 99.99% and the .01%. 

To help The Guardian's journalists identify the needles in the data haystack, the developers came up with a guideline they call "The Philosophy of Interesting Information." What qualifies as interesting?

  • metadata, as revealed in Wikileaks cables. US soldiers are much better at entering tags than diplomats
  • the habitual--it betrays the people who published the info
  • distress
  • anomalies
  • visualizations

Journalists parse datasets for these qualities using Ajax Solr, which puts a more user-friendly interface atop Solr. It includes search, interactive graphs, and tag clouds, and looks quite nice, but is not available to the public.

Occasionally, the Datablog has turned to its readers for help in parsing massive amounts of PDFs. What they've found is that a) you need to recognize and reward contributors for their help or else they'll get bored midway through and b) for the crowdsourced data to be effective, you need people to comb through it. Long story short: cool concept, great for tips, pretty bad for data.

Since many of the Datablog datasets have a geographic component, the journalists often use Google's Fusion Tables to visualize them. There are two types of Fusion Tables that really work: ones with borders and ones with dots. In the last part of the session, Rogers showed us how to create a dot one that displayed where all the session attendees were from, along with age and eye color. If you have a Google account, it's incredibly simple.

1) create/upload a spreadsheet/csv

2) create table based on that spreadsheet

3) visualize as map (geocode)

4) set window info (custom or automatic)

One thing to note is that Fusion Tables don't yet work with real-time databases, though the Google API team is working on it.

 

 


Web mining consists of crawling the world wide web, and extracting and processing its structure, usage, or content. This tutorial, taught by Scale Unlimited founder Ken Krugler (@kkrugler), focused on content mining at a large scale (as you might surmise from its title). After a thorough overview of web mining, we did a focused crawl for ultimate frisbee images.

The first step in any type of web mining is crawling, or fetching and parsing web pages. There are four types of web crawls: broad (e.g. bingbot), focused, domain, and what Krugler calls the "don't crawl" crawl, wherein you leverage other people's crawl data, which besides usually being faster and cheaper, also reduces the load on your crawlees' servers. If you go for this option, you can either use public datasets like CommonCrawl or Wikipedia or commercial providers like Spinner and Infochimps.

Per Krugler, the general rule of crawling solutions is: don't roll your own! No matter your language, there's an open source option. Java has Nutch and Heritrix, Python has Scrapy, PHP has the more literally named php-crawler. Whichever solution you end up going with, make sure it's reliable, scalable, and fault-tolerant. Sure, a single server can fetch lots of pages, but scaling becomes an issue with post-processing.

Keep in mind that unless you're creating an index of the crawled sites, you're breaking the implicit "traffic-for-bandwidth" contract, and therefore risk having the wrath of the webmaster rain down on your inquisitive shoulders.

Today's lab had us doing a focused crawl, so Krugler took us through the basics. First, you need to start with good seed urls, or really high-quality pages. You can use a paid/free service to find these, or you can use search and manually enter them or make calls to the APIs. After you have your seeds, you fetch and parse new urls and give them a page score. When parsing, be sure to normalize the outlinks (e.g. http://basketball.com becomes http://www.basketball.com), and use a suffix filter to skip links to low-value pages like images and pdf files.
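Here is a minimal sketch of outlink normalization plus a suffix filter, using only Python's standard library. This is not the code from the lab: the CANONICAL_HOSTS table and the SKIP_SUFFIXES list are assumptions for illustration, and a real crawler would need many more normalization rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table of known aliases for a site's canonical host.
CANONICAL_HOSTS = {"basketball.com": "www.basketball.com"}
# Hypothetical list of low-value link suffixes to skip.
SKIP_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".pdf")


def normalize(url):
    """Lowercase scheme and host, map host aliases, drop the fragment."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    host = CANONICAL_HOSTS.get(host, host)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def keep(url):
    """Suffix filter: False for links to images, PDFs, and the like."""
    return not urlsplit(url).path.lower().endswith(SKIP_SUFFIXES)


def clean_outlinks(urls):
    """Normalize, filter, and de-duplicate outlinks, preserving order."""
    seen, result = set(), []
    for url in map(normalize, urls):
        if keep(url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


outlinks = clean_outlinks([
    "HTTP://Basketball.com",
    "http://www.basketball.com/#top",
    "http://www.basketball.com/logo.PNG",
    "http://www.basketball.com/rules.pdf",
    "http://www.basketball.com/teams?id=3",
])
```

Normalizing before de-duplicating is what keeps the first two links from being fetched as two different pages.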

You can score pages by tokenizing their text, using simple term-based scoring or using a Support Vector Machine, which is trained using "documents" that have features and a positive or negative class, creates a statistical model that divides all the documents into separate positive or negative classes, and then uses this model to assign unknown documents one of these classes.
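The simple term-based option can be sketched in a few lines. The term list and weights below are made up for an ultimate frisbee crawl and are not Krugler's; the SVM option would need a training set and a machine learning library, so it is left out here.

```python
import re
from collections import Counter

# Hypothetical on-topic terms and weights for an ultimate frisbee crawl.
TERM_WEIGHTS = {"ultimate": 1.0, "frisbee": 2.0, "disc": 1.0, "layout": 0.5}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def page_score(text):
    """Weighted count of on-topic terms, normalized by page length."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    weighted = sum(weight * counts[term] for term, weight in TERM_WEIGHTS.items())
    return weighted / len(tokens)


on_topic = page_score("Ultimate Frisbee: a disc sport")
off_topic = page_score("Stock market news")
```

Dividing by the token count keeps a long page that mentions frisbee once from outscoring a short page that is entirely about it.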

Graphics and small amounts of content (e.g. a definition page) are some of the wrenches in determining page quality. You can filter these pages out by setting a minimum real content threshold. By "real content," Krugler means the stuff that's not chrome, cruft or boilerplate. You can scrape off this junk with Boilerpipe.

After you have your page score, it's time to extract and score its outlinks before putting the urls into the url state, or the database of all known urls. There are three attributes of extraction: broad, precise, and accurate. You get to pick two, and these depend on whether you're extracting unstructured (broad, accurate), semi-structured (broad, precise), or structured (precise, accurate) data.

Usually, no matter which type of data you're extracting, you'll need:

  • To clean the html. You can use a library like NekoHTML for this. Keep in mind that the end result won't match the original text.
  • A way to detect the charset so you can convert bytes to characters. Tika works pretty well.
  • A boilerplate scraper like Boilerpipe.
  • Some means of identifying the page's language (http response header, meta tag, tag attribute, or text analysis).
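The first three language signals in that list can be checked cheaply before falling back to text analysis. This is a rough sketch under my own assumptions: the function name is made up, headers are assumed to arrive as a plain dict keyed by "Content-Language", and the regexes are deliberately simple rather than a full HTML parser.

```python
import re

# <html lang="fr"> (also matches xml:lang)
HTML_LANG = re.compile(r'<html[^>]*\blang=["\']?([A-Za-z-]+)', re.I)
# <meta http-equiv="Content-Language" content="es">
META_LANG = re.compile(
    r'<meta[^>]*http-equiv=["\']?content-language["\']?[^>]*'
    r'content=["\']?([A-Za-z-]+)',
    re.I,
)


def guess_language(headers, html):
    """Return a language code from header, tag attribute, or meta tag."""
    header = headers.get("Content-Language")
    if header:
        # The header may list several languages; take the first one.
        return header.split(",")[0].strip().lower()
    for pattern in (HTML_LANG, META_LANG):
        match = pattern.search(html)
        if match:
            return match.group(1).lower()
    return None  # caller would fall back to text analysis here


from_header = guess_language({"Content-Language": "de-DE, en"}, "")
from_attr = guess_language({}, '<html lang="fr"><body></body></html>')
from_meta = guess_language({}, '<meta http-equiv="Content-Language" content="es">')
```

Pages lie about their language often enough that the text-analysis fallback is still worth having; this only covers the declared signals.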

If you're doing unstructured extraction, your goal is to extract text without much additional processing (often there are just a few HTML fields). If you're doing semi-structured extraction, your goal is to find data in random text. Since it's not very format-specific, this can be applied broadly, often at the expense of accuracy. Easy patterns like telephone numbers, micro formats, and NLP named entities all usually work well with semi-structured extraction.
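Telephone numbers are the classic "easy pattern." Here is a minimal sketch that pulls US-style numbers out of free text with a regular expression. The pattern is my own simplification, not something shown in the tutorial, and it will miss international formats.

```python
import re

# US-style numbers: optional +1, optional parentheses around the area
# code, and space, dot, or dash separators. The lookarounds stop the
# pattern from matching inside longer digit runs such as order numbers.
PHONE = re.compile(
    r"(?<!\d)(?:\+?1[ .-]?)?\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})(?!\d)"
)


def find_phones(text):
    """Return every phone number found, normalized to NNN-NNN-NNNN."""
    return ["-".join(match.groups()) for match in PHONE.finditer(text)]


phones = find_phones(
    "Call (617) 555-0123 or 617.555.0199 today. Order #1234567890123."
)
```

Because the pattern doesn't care where on the page the number sits, it applies broadly, which is exactly the trade-off described above.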

If you're doing structured extraction, your goal is to extract specific types of data, typically from one area of one site. You often do this with XPath, and if this is the case, Firebug is your friend, as are the div and span tags. If you run into pages that generate content with javascript, you will need Firebug or an equivalent to inspect the DOM. Options for processing pages with javascript include HTMLUnit, qt-webkit and headless Mozilla, but keep in mind that processing a page with javascript takes about 10 times longer than just loading the page text.
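As a small illustration of XPath-style structured extraction, here is a sketch using Python's built-in ElementTree, which supports only a subset of XPath and needs well-formed markup (so real pages would have to go through an HTML cleaner first). The page snippet, class names, and fields are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, already-cleaned page fragment.
PAGE = """<html><body>
  <div class="product">
    <span class="name">Disc A</span>
    <span class="price">12.50</span>
  </div>
  <div class="product">
    <span class="name">Disc B</span>
    <span class="price">9.00</span>
  </div>
  <div class="ad"><span class="name">Buy now</span></div>
</body></html>"""

root = ET.fromstring(PAGE)

products = []
# Select only the divs we care about, then pull fields out of their spans.
for div in root.findall(".//div[@class='product']"):
    name = div.findtext("span[@class='name']")
    price = float(div.findtext("span[@class='price']"))
    products.append((name, price))
```

The expression is precise and accurate for this one layout, and it breaks the moment the site renames a class, which is why this approach is tied to one area of one site.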

 


This time next week, if all goes as planned, I’ll be on a plane back to Logan with a head full of data, a Twitter account full of people who work with data, and a suitcase full of shorts. The Strata Conference, which starts this Tuesday, February 28th, and goes through Thursday, March 1st, is one of the country’s preeminent data science conferences. Certainly its sessions, which, in addition to the data scientists themselves, target CTOs/CIOs, marketers and journalists, are the broadest a’beam of any I’ve seen. And its speakers are (deservedly) bold-faced names to even the greenest of data geeks: Doug Cutting (Cloudera), Ben Goldacre (Bad Science), Hal Varian (Google), Mike Olson (Cloudera), JP Morgenthal (EMC), Alistair Croll (Bitcurrent), Usman Haque (Pachube), Coco Krumme (MIT Media Lab), O’Reilly’s own Edd Dumbill… It’s a lot to take in, especially if you, like me, are the type who panics at buffets and department stores. So, to help the both of us, I’ve made up a “Can’t Miss List (for the Social Strategist).” Annnd, since I’ll warrant that many, if not most, of you are not social strategists, I’m also giving you a “Can’t Miss List (for the CEO/Chief Scientist),” courtesy of our own CEO, David Wihl. You can thank, or debate with, both of us in the comments, on Twitter, or, best yet, at the show!

Can’t Miss List (for the Social Strategist)

Name | Handle | Presentation | Time
Ken Krugler | kkrugler | Large scale web mining | Tues, 9am
Mike Bowles, Jeremy Howard | mike_bow, jeremyphoward | The Two Most Important Algorithms in Predictive Modeling Today | Tues, 1:30pm
Billy Bosworth | | Data as a Strategic Weapon - Walmart, Netflix and Apigee Panel Discussion | Wed, 10:40am
Marcel Salathé | marcelsalathe | Understanding Social Contagion | Wed, 11:30am
Jesper Andersen | jandersen | Building a Data Narrative: Discovering Haight Street | Wed, 1:30pm
Philip Kromer | mrflip | Disambiguation: Embrace wrong answers & find truth | Wed, 2:20pm
Xavier Amatriain | xamat | Netflix recommendations: beyond the 5 stars | Wed, 4:00pm
Christopher Berry | cjpberry | Data Science in Marketing Analytics | Wed, 4:50pm
Jim Adler, Solon Barocas | jim_adler, s010n | If Data Wants to Be Free, is Privacy a Prison? | Thurs, 10:40am
Nathan Marz | nathanmarz | Storm: distributed and fault-tolerant realtime computation | Thurs, 11:30am
Alyona Medelyan | zelandiya | Mining Unstructured Data: Practical Applications | Thurs, 1:30pm
Ben Goldacre, Kaitlin Thaney | Bengoldacre, kaythaney | It's Not "Junk" [Data] Anymore | Thurs, 2:20pm
Mark Hahnel | figshare | It's Not "Junk" [Data] Anymore | Thurs, 2:20pm
Robbie Allen | robbieallen | From Big Data to Big Insights | Thurs, 4:00pm
Marc Smith | marc_smith | Mapping social media networks (with no coding) using NodeXL | Thurs, 4:50pm

 

Can't Miss List for the CEO/Chief Scientist

Name | Handle | Presentation | Time
Michael Rys | sqlservermike | SQL and NoSQL Are Two Sides Of The Same Coin | Tues, 9:00am
Claudia Perlich | | From Knowing ‘What’ To Understanding ‘Why’ | Tues, 9:45am
Monica Rogati | mrogati | The Model and the Train Wreck: A Training Data How-to | Tues, 11:00am
Jacob Perkins | thedatachef | Corpus Bootstrapping with NLTK | Tues, 11:30am
Ben Gimpert | someben | The Importance of Importance: An Introduction to Feature Selection | Tues, 12:00pm
Matt Biddulph | mattb | Social Network Analysis Isn’t Just For People | Tues, 1:30pm
Robert Lefkowitz | r0ml | Array Theory vs. Set Theory in Managing Data | Tues, 2:15pm
Robert Lancaster | rob1lancaster | Survival Analysis for Cache Time-to-Live Optimization | Tues, 3:30pm
Eric Baldeschwieler | jeric14 | The Future of Hadoop: Becoming an Enterprise Standard | Wed, 10:40am
Alexander Stojanovic | stojanovic | Unleash Insights On All Data With Microsoft Big Data | Wed, 11:30am
Pascal Boillat | | Changing Data Standards from Wall Street to DC and Beyond | Wed, 1:30pm
Jen Zeralli | | Big Data: Wall Street Style | Wed, 2:20pm
Kuntal Malia | | Analytics in a Community-Driven Fashion Retailer | Wed, 4:00pm
Leigh Dodds | ldodds | Linked Data: Turning the Web into a Context Graph | Wed, 4:50pm
Kirkland Barrett | | Democratizing BI at Microsoft: 40,000 Users and Counting | Thurs, 10:40am
Stefan Groschupf | datameer | Hadoop Analytics in Financial Services | Thurs, 11:30am
Alyona Medelyan | zelandiya | Mining Unstructured Data: Practical Applications | Thurs, 1:30pm
Robbie Allen | robbieallen | From Big Data to Big Insights | Thurs, 4:00pm
Marc Smith | marc_smith | Mapping social media networks (with no coding) using NodeXL | Thurs, 4:50pm