Yo! Big Data Raps, vol. 2: Is it wrong to think it's love when it tries the way it does?

Our second music video tribute to big data's best and brightest stars Hilary Mason. Along with being the Chief Scientist of Bitly (translation: she knows all about your clicks), Hilary is also: cofounder of HackNY, co-organizer of Data Gotham,  author of fun social data hacks like Book Book -- Goose and One Random Tweet, lover of cats as speaking devices, and champion of programming for all (including the readers of Glamour).

In this mashup, Hilary makes the case for cozying up to your data. "At home, in your underwear, you're a total badass," she says. Preach it!

Cloudera Data Science Day Recap: Data Science in the Age of Computational Reproduction

I spent yesterday afternoon at the Marriott in my beloved Midtown East, learning about data science from a handful of the people most equipped to teach it: Jeff Hammerbacher (@hackingdata), Amr Awadallah (@awadallah), and Josh Wills (@josh_wills).

First up was Cloudera's founder and Chief Scientist, Jeff Hammerbacher, whose other claim to data fame is his stint at Facebook, where he built and led the data team for two years. He also came up with the term "data scientist," mostly because he wanted the then research scientists to get off their cushy high horses and fix database bugs at 2am.

Jeff spoke a bit about getting Facebook's data science team up and running. Everyone started out as jacks of all trades, and didn't start to specialize until the team had surpassed thirty employees. Data scientists are most needed when you have small data teams, because they are so multipurpose/zoom-in-zoom-out.

Jeff asked how many people in the room had the official title of data scientist, and only about 2 out of 40 or so did. Some people think "Data Scientist" is just a marketing neologism for a job that already existed, but Jeff said that no, the word mattered, because it codified a role and the general duties associated with it: data modeling and analysis.


Strata Day 3 Recap: Privacy, Junk, and 1 Million Monkeys

The sessions I attended on the final day of the Strata Conference converged around ethicality, legality, and human nature. Earlier, someone tweeted that the data is here, and the talent will catch up. This is true, and the real question is, once the talent's caught up, what will they do with their catch (or cache)? It's a question of volition, not ability, and as such, it is rather difficult to answer.

My first session of the day, "If Data Wants to Be Free, Is Privacy a Prison?" focused, as Solon Barocas put it, on "the privacy implications of using public data to predict an individual's private propensities." There has been, recently, an outcropping of data usage cases where the line between what is public and what is private was blurred: the FBI's GPS surveillance, the suicide of Tyler Clementi, Target's recent pregnancy marketing debacle. The Supreme Court voted 8-1 that the car was an extension of the home, the one realm where privacy is generally considered sacrosanct. Generally. The Ravi trial has just gotten underway, and public opinion seems to side against the webcam-happy teen, but the public has been reared on after-school specials. (Ian Parker's piece in the New Yorker paints a more nuanced portrait of the situation and the parties involved.)

The Target case is, to me, the most interesting, because it is an example of Mosaic theory, using big data (here, running analytics against a data warehouse of Guest ID activity) to harvest a wealth of seemingly innocuous public information that nonetheless allows the harvester to infer potentially sensitive information about specific customers. Illegal? No--I'm sure Target has a very thorough terms of service buried somewhere on its site. But unethical? Maybe. Daniel Tunkelang tweeted that banning inference is akin to thought crime, and I see his point, but if the inference is algorithmically derived, is it thought or fact? Barocas said that Target's reluctance to ask its customers "are you pregnant" should have been an indication that the question was too sensitive to infer. I agree. Tip for man and machine alike: never ask a woman if she's pregnant!

The internet wants to be free, but many of its users want their data to be freed, and given the potentially brutal results of its being accessed and used to identify (imprisonment, loss of employment, destruction of property, even death), one can hardly fault them.

However, while data can be used against individuals, in the aggregate, the solutions it produces can be critical. Another of the day's sessions, "It's Not [Junk] Data Anymore," with Ben Goldacre, Kay Thaney, and Mark Hahnel, approached the public/private issue from a research perspective. When it comes to data sharing, researchers are at the opposite end of the spectrum from the general electronic device user: they have to be coaxed to share all but the most glorious results. Goldacre noted that between one-third and two-thirds of medical trials don't get published, and these tend to be the ones with actively negative or statistically insignificant results. Add that to science journals' penchant for false positives, and the average citizen has access to a worrisomely incomplete or inaccurate portrait of disease and medications.

To fill it in, Goldacre suggested that we "encourage sharing, mandate publication, and provide a common structure" for raw research, which tends to be ill-suited for long papers, anyway. I fear mandated publication could backfire, but the other components are promising. Hahnel's company Figshare makes it very simple for researchers to upload their research in a variety of forms, and provides them with a breadth of metrics as carrots. A simple approach with a clear goal and a clean UI, it seems to be garnering a lot of attention in the data science world, and I hope it has legs.

My last session of the day was Robbie Allen's "From Big Data to Big Insights." Allen's company, Automated Insights, makes software that creates automated content for a wealth of sports blogs, real estate and neighborhood watch blogs, financial tear sheets, insider trading reports, weather sites and other news sources that traditionally have a high proportion of quantitative content. As a writer and linguistic enthusiast, I find the concept of automated content, which assigns anything from a quotation mark up to several paragraphs to a key value, both fascinating and frightening. On the one hand, it's a great grunt work tool, but its ability to mimic the styles of human authors could lead to a map > territory situation. If the simulacrum is good enough, what chance has the real?

That being said, if software can mimic an individual author's style, perhaps it can also scramble or dilute it. As Solon Barocas said, anonymizing is extremely difficult; using an automated content program to spin stories from bare gists would be a godsend for those who want/need their words to remain anonymous.

In McLuhan's global village, do the doors have locks?

Strata Schemata: An Agenda of Sorts

This time next week, if all goes as planned, I’ll be on a plane back to Logan with a head full of data, a twitter account full of people who work with data, and a suitcase full of shorts. The Strata Conference, which starts this Tuesday, February 28th, and goes through Thursday, March 1st, is one of the country’s preeminent data science conferences. Certainly its sessions, which, in addition to the data scientists themselves, target CTOs/CIOs, marketers and journalists, are the broadest a’beam of any I’ve seen. And its speakers are (deservedly) bold-faced names to even the greenest of data geeks: Doug Cutting (Cloudera), Ben Goldacre (Bad Science), Hal Varian (Google), Mike Olson (Cloudera), JP Morgenthal (EMC), Alistair Croll (Bitcurrent), Usman Haque (Pachube), Coco Krumme (MIT Media Lab), O’Reilly’s own Edd Dumbill… It’s a lot to take in, especially if you, like me, are the type who panics at buffets and department stores. So, to help the both of us, I’ve made up a “Can’t Miss List (for the Social Strategist).” Annnd, since I’ll warrant that many, if not most, of you are not social strategists, I’m also giving you a “Can’t Miss List (for the CEO/Chief Scientist),” courtesy of our own CEO, David Wihl. You can thank, or debate with, both of us in the comments, on Twitter, or, best yet, at the show!

Can’t Miss List (for the Social Strategist)

| Name | Handle | Presentation | Time |
| --- | --- | --- | --- |
| Ken Krugler | kkrugler | Large scale web mining | Tues, 9am |
| Mike Bowles, Jeremy Howard | mike_bow, jeremyphoward | The Two Most Important Algorithms in Predictive Modeling Today | Tues, 1:30pm |
| Billy Bosworth | | Data as a Strategic Weapon - Walmart, Netflix and Apigee Panel Discussion | Wed, 10:40am |
| Marcel Salathé | marcelsalathe | Understanding Social Contagion | Wed, 11:30am |
| Jesper Andersen | jandersen | Building a Data Narrative: Discovering Haight Street | Wed, 1:30pm |
| Philip Kromer | mrflip | Disambiguation: Embrace wrong answers & find truth | Wed, 2:20pm |
| Xavier Amatriain | xamat | Netflix recommendations: beyond the 5 stars | Wed, 4:00pm |
| Christopher Berry | cjpberry | Data Science in Marketing Analytics | Wed, 4:50pm |
| Jim Adler, Solon Barocas | jim_adler, s010n | If Data Wants to Be Free, is Privacy a Prison? | Thurs, 10:40am |
| Nathan Marz | nathanmarz | Storm: distributed and fault-tolerant realtime computation | Thurs, 11:30am |
| Alyona Medelyan | zelandiya | Mining Unstructured Data: Practical Applications | Thurs, 1:30pm |
| Ben Goldacre, Kaitlin Thaney | Bengoldacre, kaythaney | It's Not "Junk" [Data] Anymore | Thurs, 2:20pm |
| Mark Hahnel | figshare | It's Not "Junk" [Data] Anymore | Thurs, 2:20pm |
| Robbie Allen | robbieallen | From Big Data to Big Insights | Thurs, 4:00pm |
| Marc Smith | marc_smith | Mapping social media networks (with no coding) using NodeXL | Thurs, 4:50pm |

 

Can't Miss List for the CEO/Chief Scientist

| Name | Handle | Presentation | Time |
| --- | --- | --- | --- |
| Michael Rys | sqlservermike | SQL and NoSQL Are Two Sides Of The Same Coin | Tues, 9:00am |
| Claudia Perlich | | From Knowing ‘What’ To Understanding ‘Why’ | Tues, 9:45am |
| Monica Rogati | mrogati | The Model and the Train Wreck: A Training Data How-to | Tues, 11:00am |
| Jacob Perkins | thedatachef | Corpus Bootstrapping with NLTK | Tues, 11:30am |
| Ben Gimpert | someben | The Importance of Importance: An Introduction to Feature Selection | Tues, 12:00pm |
| Matt Biddulph | mattb | Social Network Analysis Isn’t Just For People | Tues, 1:30pm |
| Robert Lefkowitz | r0ml | Array Theory vs. Set Theory in Managing Data | Tues, 2:15pm |
| Robert Lancaster | rob1lancaster | Survival Analysis for Cache Time-to-Live Optimization | Tues, 3:30pm |
| Eric Baldeschwieler | jeric14 | The Future of Hadoop: Becoming an Enterprise Standard | Wed, 10:40am |
| Alexander Stojanovic | stojanovic | Unleash Insights On All Data With Microsoft Big Data | Wed, 11:30am |
| Pascal Boillat | | Changing Data Standards from Wall Street to DC and Beyond | Wed, 1:30pm |
| Jen Zeralli | | Big Data: Wall Street Style | Wed, 2:20pm |
| Kuntal Malia | | Analytics in a Community-Driven Fashion Retailer | Wed, 4:00pm |
| Leigh Dodds | ldodds | Linked Data: Turning the Web into a Context Graph | Wed, 4:50pm |
| Kirkland Barrett | | Democratizing BI at Microsoft: 40,000 Users and Counting | Thurs, 10:40am |
| Stefan Groschupf | datameer | Hadoop Analytics in Financial Services | Thurs, 11:30am |
| Alyona Medelyan | zelandiya | Mining Unstructured Data: Practical Applications | Thurs, 1:30pm |
| Robbie Allen | robbieallen | From Big Data to Big Insights | Thurs, 4:00pm |
| Marc Smith | marc_smith | Mapping social media networks (with no coding) using NodeXL | Thurs, 4:50pm |