The sessions I attended on the final day of the Strata Conference converged around ethicality, legality, and human nature. Earlier, someone tweeted that the data is here, and the talent will catch up. This is true, and the real question is, once the talent's caught up, what will they do with their catch (or cache)? It's a question of volition, not ability, and as such, it is rather difficult to answer.

My first session of the day, "If Data Wants to Be Free, Is Privacy a Prison?", focused, as Solon Barocas put it, on "the privacy implications of using public data to predict an individual's private propensities." There has recently been a spate of cases in which the line between what is public and what is private was blurred: the FBI's GPS surveillance, the suicide of Tyler Clementi, Target's pregnancy marketing debacle. In the GPS case, the Supreme Court ruled that the car was an extension of the home, the one realm where privacy is generally considered sacrosanct. Generally. The Ravi trial has just gotten underway, and public opinion seems to side against the webcam-happy teen, but the public has been reared on after-school specials. (Ian Parker's piece in the New Yorker paints a more nuanced portrait of the situation and the parties involved.)
The Target case is, to me, the most interesting, because it is an example of mosaic theory: using big data (here, running analytics against a data warehouse of Guest ID activity) to harvest a wealth of seemingly innocuous public information that nonetheless allows the harvester to infer potentially sensitive information about specific customers. Illegal? No -- I'm sure Target has a very thorough terms of service buried somewhere on its site. But unethical? Maybe. Daniel Tunkelang tweeted that banning inference is akin to thought crime, and I see his point, but if the inference is algorithmically derived, is it thought or fact? Barocas said that Target's reluctance to ask its customers "are you pregnant?" should have been an indication that the question was too sensitive to infer. I agree. Tip for man and machine alike: never ask a woman if she's pregnant!
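The mosaic mechanism is easy to sketch: no single purchase says anything sensitive, but weighted together they do. Here is a toy illustration in Python -- the item names and weights are purely hypothetical, not Target's actual model, though the kinds of items echo the reporting on the story.

```python
# Toy mosaic-theory inference: individually innocuous purchases combine
# into a prediction about a sensitive attribute. Weights are invented
# for illustration, NOT derived from any real retailer's model.
PREGNANCY_SIGNALS = {
    "unscented lotion": 0.25,
    "calcium supplements": 0.20,
    "zinc supplements": 0.15,
    "large tote bag": 0.10,
    "cotton balls": 0.10,
}

def pregnancy_score(purchases):
    """Sum the weights of any signal items found in a purchase history."""
    return sum(PREGNANCY_SIGNALS.get(item, 0.0) for item in purchases)

# No single item here is revealing, yet the basket crosses a (made-up)
# marketing threshold and the customer gets flagged.
basket = ["unscented lotion", "calcium supplements", "cotton balls", "bread"]
score = pregnancy_score(basket)
print(round(score, 2))   # 0.55
print(score >= 0.5)      # True: flagged for baby-product marketing
```

The point of the sketch is that the "question" is never asked; it is answered by aggregation, which is exactly what makes the inference feel different from a thought.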
The internet wants to be free, but many of its users want their data to stay under lock and key, and given the potentially brutal consequences of its being accessed and used to identify them (imprisonment, loss of employment, destruction of property, even death), one can hardly fault them.
However, while data can be used against individuals, in the aggregate the solutions it produces can be critical. Another of the day's sessions, "It's Not [Junk] Data Anymore," with Ben Goldacre, Kay Thaney, and Mark Hahnel, approached the public/private issue from a research perspective. When it comes to data sharing, researchers are at the opposite end of the spectrum from the general electronic device user: they have to be coaxed to share all but the most glorious results. Goldacre noted that between one-third and two-thirds of medical trials never get published, and these tend to be the ones with actively negative or statistically insignificant results. Add that to science journals' penchant for false positives, and the average citizen has access to a worrisomely incomplete or inaccurate portrait of diseases and medications.
To fill it in, Goldacre suggested that we "encourage sharing, mandate publication, and provide a common structure" for raw research, which tends to be ill-suited to the long-paper format anyway. I fear mandated publication could backfire, but the other components are promising. Hahnel's company Figshare makes it very simple for researchers to upload their research in a variety of forms, and provides them with a breadth of metrics as carrots. A simple approach with a clear goal and a clean UI, it seems to be garnering a lot of attention in the data science world, and I hope it has legs.
My last session of the day was Robbie Allen's "From Big Data to Big Insights." Allen's company, Automated Insights, makes software that generates automated content for a wealth of sports blogs, real estate and neighborhood watch blogs, financial tear sheets, insider trading reports, weather sites, and other news sources that traditionally have a high proportion of quantitative content. As a writer and linguistics enthusiast, I find the concept of automated content, which assigns anything from a quotation mark up to several paragraphs to a key value, both fascinating and frightening. On the one hand, it's a great grunt-work tool; on the other, its ability to mimic the styles of human authors could lead to a map > territory situation. If the simulacrum is good enough, what chance has the real?
That said, if software can mimic an individual author's style, perhaps it can also scramble or dilute it. As Solon Barocas noted, anonymization is extremely difficult; using an automated content program to spin stories from bare gists would be a godsend for those who want or need their words to remain anonymous.
In McLuhan's global village, do the doors have locks?