[Image via Datablog]

My original plan for this afternoon was to attend Jeremy Howard and Mike Bowles' session on predictive modeling, but, after a morning of focused web crawls, I decided to go listen to Simon Rogers (@smfrogers) and Michael Brunton-Spall (@bruntonspall) talk about data journalism instead. To cop a Britishism, it was brilliant. Rogers is the pioneering journalist behind The Guardian's uber-popular Datablog, and Brunton-Spall is one of the developers tasked with transforming reams of raw data into journalist-searchable information.

If you haven't ever read the Datablog, you should: it's a model for transparent, accessible journalism, giving readers a variety of ways to consume news, the numbers behind the news, and the methodology for obtaining those numbers. Datablog does a lot of the UK government's work for them, and a decent amount of our government's as well, turning paper and web documents into public Google spreadsheets, interactive charts and visualizations, and editorial stories. As Rogers noted, while data used to be the domain of long-form journalism, our new crawling, parsing, and processing skills make it highly suitable for short-form news as well. It's pretty easy to imagine it becoming a real-time news source (I'm sure Automated Insights would agree).
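To get a feel for that "web document into spreadsheet" step, here's a minimal sketch using only Python's standard library. The HTML snippet and column names are invented for illustration; a real Datablog-style job would fetch a live government page and cope with far messier markup.

```python
# Extract an HTML table into CSV rows using only the standard library.
# The table below is a made-up example, not real Datablog data.
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the text of every <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

html = """
<table>
  <tr><th>Region</th><th>Arrests</th></tr>
  <tr><td>London</td><td>1135</td></tr>
  <tr><td>Manchester</td><td>208</td></tr>
</table>
"""

extractor = TableExtractor()
extractor.feed(html)

out = io.StringIO()
csv.writer(out).writerows(extractor.rows)
print(out.getvalue())
```

From here, the CSV can be uploaded straight into a public Google spreadsheet.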

This session used a bunch of Datablog posts and datasets to illustrate the parts of data journalism, which boil down to:

1) collect data sent in, recurring events, breaking news, and theories to be explored

2) figure out what to compare or how to show change, what the data means, and what other datasets to use with it

3) shove the chosen data into spreadsheets

4) clean up the data: check for data in the wrong format, merged cells, unnecessary columns, and data measured in different units (80% of their time is spent here)

5) perform calculations on the data, recalculate if needed, and sanity-check the results

6) map the data in one or more formats (graphics, free viz tools, Google Fusion Tables, a story, and/or just publish)
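Steps 4 and 5 are where most of the time goes, so here's a toy version of them in Python: normalizing a messy column measured in mixed units, then sanity-checking the calculation before publishing. The field name and unit conversion are invented for illustration, not taken from an actual Datablog dataset.

```python
# Step 4: clean up a column where values arrive in different units
# and formats. Step 5: calculate and sanity-check the result.
raw = [
    {"area": "12,500 ha"},   # hectares, with a thousands separator
    {"area": "48.2 km2"},    # square kilometres
    {"area": " 3100 ha "},   # stray whitespace
]

def to_km2(value):
    """Parse strings like '12,500 ha' or '48.2 km2' into km^2 floats."""
    value = value.strip().replace(",", "")
    number, unit = value.split()
    number = float(number)
    return number / 100 if unit == "ha" else number  # 100 ha = 1 km^2

areas = [to_km2(row["area"]) for row in raw]
total = sum(areas)

# Step 5: sanity-check before anything goes near a chart.
assert all(a > 0 for a in areas), "negative area: recheck the source"
print(f"total area: {total:.1f} km^2")
```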

While some of the Datablog posts are fairly light-hearted (e.g. US Plastic Surgery Statistics, though that is also a bit scary), most of them offer the public substantiated cultural, institutional, and environmental conclusions, e.g. that the bulk of the arrests during the UK's summer of unrest took place in its poorest neighborhoods, or that the battle between the 99% and the 1% should actually be between the 99.99% and the 0.01%.

To help The Guardian's journalists identify the needles in the data haystack, the developers came up with a guideline they call "The Philosophy of Interesting Information." What qualifies as interesting?

  • metadata, as revealed in the WikiLeaks cables. US soldiers are much better at entering tags than diplomats
  • the habitual: it betrays the people who published the info
  • distress
  • anomalies
  • visualizations

Journalists parse datasets for these qualities using Ajax Solr, which puts a more user-friendly interface atop Solr. It includes search, interactive graphs, and tag clouds, and looks quite nice, but is not available to the public.
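Ajax Solr is a JavaScript layer over Solr's standard HTTP search API, so under the hood a faceted search like the journalists' boils down to a query URL. Here's a sketch of what such a request might look like; the host, core name ("cables"), search term, and field names are all invented examples, not The Guardian's actual setup.

```python
# Build a Solr select-handler URL for a faceted full-text search.
# Faceting on a "tag" field is what would drive a tag cloud like
# the one in the journalists' interface.
from urllib.parse import urlencode

params = {
    "q": 'text:"detainee"',   # full-text search term (invented example)
    "facet": "true",
    "facet.field": "tag",     # facet counts per tag
    "rows": 10,
    "wt": "json",
}
url = "http://localhost:8983/solr/cables/select?" + urlencode(params)
print(url)
```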

Occasionally, the Datablog has turned to its readers for help in parsing massive numbers of PDFs. What they've found is that a) you need to recognize and reward contributors for their help or else they'll get bored midway through, and b) for crowdsourced data to be effective, you need people to comb through it. Long story short: cool concept, great for tips, pretty bad for data.

Since many of the Datablog's datasets have a geographic component, the journalists often use Google's Fusion Tables to visualize them. There are two types of Fusion Tables that really work: ones with borders and ones with dots. In the last part of the session, Rogers showed us how to create a dot map that displayed where all the session attendees were from, along with their ages and eye colors. If you have a Google account, it's incredibly simple.

1) create/upload a spreadsheet/csv

2) create table based on that spreadsheet

3) visualize as map (geocode)

4) set window info (custom or automatic)
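Step 1 is just a spreadsheet or CSV with one row per attendee, which you then upload. Here's a sketch of generating that file in Python; the location column is what the geocoder keys on, and the rows themselves are made up for illustration.

```python
# Build the attendee CSV for step 1 of the Fusion Tables walkthrough.
# All rows are invented sample data.
import csv
import io

attendees = [
    ("Santa Clara, CA", 34, "brown"),
    ("London, UK", 29, "blue"),
    ("Austin, TX", 41, "green"),
]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Location", "Age", "Eye color"])  # geocoded on Location
writer.writerows(attendees)
print(out.getvalue())
```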

One thing to note is that Fusion Tables don't yet work with real-time databases, though the Google API team is working on it.