A blog about security, privacy, algorithms, and email in the enterprise. 

Data Privacy, Cataphora, and the Maginot Line between Transparent and Intrusive

Data is a trending topic right now, and data privacy is one of its trendiest subsets. To wit: Charles Duhigg’s investigative report on Target’s data mining for the New York Times spawned a series of follow-ups; in March, The Atlantic profiled NYU Law professor Helen Nissenbaum and her flow-based privacy framework; and the FTC just published a privacy report endorsing privacy-by-design and the “Do Not Track” button.

The demarcation line between what should be public vs. private is a dynamic and jagged (some might say gerrymandered) one that depends on a piece of data’s original context vs. the contexts in which it is eventually used. It seems perfectly reasonable for Foursquare to publish its users’ locations, but less reasonable for a third-party dating application like Girls Around Me to provide these locations, along with Facebook profile photos, to its users. It seems reasonable that an online money management service like Mint serves up ads tailored to users’ credit ratings, but less reasonable that banks determine applicants’ loan rates based on their Facebook friends’ credit ratings.

Because we’re storing and analyzing corporate email, user privacy is something we have to get right. Of course, an employer’s definition of “right” might be different from the employee’s, so we’ve been trying to figure out a definition that will please both. Companies are legally permitted to access their employees’ email, and usually this manifests as monitoring for explicit or inappropriate language. As long as employees are aware of the monitoring, this sort of vocab dinging seems reasonable. But what about sentiment analysis, and the inferred knowledge of employees’ mind states it provides? Invaluable to the company, I think, but potentially detrimental, and sometimes errantly so, to the employee. Does explicit consent justify armchair psychology and any actions that result? Even if employees are fully and duly informed of all monitoring and tracking practices, I’m not sure.
Take, for example, Cataphora.

Cataphora is “behavioral modeling and monitoring” software that analyzes employees’ digital and mobile actions from legal, risk, compliance, HR, and brand management perspectives. The copy on its website doesn’t even try to address employees—there are callouts on its news page to articles with titles like “In Defense of Employer Monitoring” and “Finding Office Buck-Passers, Heroes, and Shirkers.” If employers are not monitoring employees’ digital activity, Cataphora CEO Elizabeth Charnock argues, they are making themselves vulnerable to leaks, blow-ups, and YouTube frittering-induced productivity slumps. In a blog post entitled “Getting Big Brother Right,” Rick Janowski offered, as a use case, an employee on the verge of a breakdown due to non-work-related factors. Cataphora could identify and alert management to the employee’s mental state, allowing them to “provide a safety net for someone who might be prone temporarily to making bad decisions or being less diligent than they normally would be.” In other words, remove him from fiscal and legal harm’s way before it’s too late. Ooh, Carnival Cruise is having a flash sale! I hear Alaska’s great this time of year!
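To see why the "inferred mind state" piece of this worries me, here is a minimal sketch of lexicon-based sentiment scoring in Python. The lexicon, weights, and messages are all invented for illustration, and real systems are far more sophisticated, but the basic failure mode is similar in kind: the words carry the score, not the meaning.

```python
# Hypothetical lexicon of word -> sentiment weight. Real systems use far
# larger lexicons (or trained models), but score accumulation works similarly.
LEXICON = {
    "great": 1, "happy": 1, "thanks": 1,
    "deadline": -1, "stressed": -1, "quit": -2, "sorry": -1,
}

def score(message: str) -> int:
    """Sum the lexicon weights of every word in the message."""
    return sum(LEXICON.get(word.strip(".,!?").lower(), 0)
               for word in message.split())

# An employee venting harmlessly trips the alarm...
venting = "Sorry, stressed about the deadline. I could just quit!"
# ...while a genuinely checked-out one sails through.
checked_out = "Thanks, happy to help. Great."

print(score(venting))      # negative: flagged
print(score(checked_out))  # positive: not flagged
```

The venting employee scores as a crisis while the disengaged one scores as content: armchair psychology at scale, with the errors baked in.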

You could argue that behavioral mining software is just one of many new "transparent" office measures, which manifest physically in concepts like open and free-range offices (a different desk every day!), and culturally in social enterprise platforms like Yammer, Rypple, and Trello. There's been a push, lately, to besmirch the traditional office, with its many doors and walls and silos. Which is all well and fine, but there is a point where public property ends and the person begins. Perhaps the central tower is too zoomed in to see it.



Strata Day 3 Recap: Privacy, Junk, and 1 Million Monkeys

The sessions I attended on the final day of the Strata Conference converged around ethicality, legality, and human nature. Earlier, someone tweeted that the data is here, and the talent will catch up. This is true, but the real question is: once the talent's caught up, what will they do with their catch (or cache)? It's a question of volition, not ability, and as such it is rather difficult to answer. My first session of the day, "If Data Wants to Be Free, Is Privacy a Prison?" focused, as Solon Barocas put it, on "the privacy implications of using public data to predict an individual's private propensities." There has been, recently, a spate of data usage cases where the line between what is public and what is private was blurred: the FBI's GPS surveillance, the suicide of Tyler Clementi, Target's recent pregnancy marketing debacle. The Supreme Court ruled unanimously that the car was an extension of the home, the one realm where privacy is generally considered sacrosanct. Generally. The Ravi trial has just gotten underway, and public opinion seems to side against the webcam-happy teen, but the public has been reared on after-school specials. (Ian Parker's piece in the New Yorker paints a more nuanced portrait of the situation and the parties involved.)

The Target case is, to me, the most interesting, because it is an example of mosaic theory: using big data (here, running analytics against a data warehouse of Guest ID activity) to harvest a wealth of seemingly innocuous public information that nonetheless allows the harvester to infer potentially sensitive information about specific customers. Illegal? No--I'm sure Target has a very thorough terms of service buried somewhere on its site. But unethical? Maybe. Daniel Tunkelang tweeted that banning inference is akin to thought crime, and I see his point, but if the inference is algorithmically derived, is it thought or fact? Barocas said that Target's reluctance to ask its customers "are you pregnant?" should have been an indication that the answer was too sensitive to infer. I agree. Tip for man and machine alike: never ask a woman if she's pregnant!
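The mosaic mechanism is easy to caricature in a few lines of Python. The products and weights below are invented for illustration (this is emphatically not Target's actual model): no single purchase is sensitive on its own, but their weighted sum supports a sensitive guess.

```python
# Hypothetical purchase -> signal weights. Each item is innocuous alone.
SIGNALS = {
    "unscented lotion": 0.3,
    "calcium supplements": 0.25,
    "cotton balls": 0.15,
    "large tote bag": 0.1,
}

def pregnancy_score(purchases):
    """Sum the (invented) signal weights of a shopper's purchases."""
    return sum(SIGNALS.get(item, 0.0) for item in purchases)

basket = ["unscented lotion", "calcium supplements", "cotton balls"]
# Three unremarkable items combine to cross a sensitive threshold.
print(pregnancy_score(basket) > 0.5)
```

No item in the basket would raise an eyebrow at checkout; the inference lives entirely in the aggregation.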

The internet wants to be free, but many of its users want their data kept under lock and key, and given the potentially brutal consequences of its being accessed and used to identify them (imprisonment, loss of employment, destruction of property, even death), one can hardly fault them.

However, while data can be used against individuals, in the aggregate the solutions it produces can be critical. Another of the day's sessions, "It's Not [Junk] Data Anymore," with Ben Goldacre, Kay Thaney, and Mark Hahnel, approached the public/private issue from a research perspective. When it comes to data sharing, researchers are at the opposite end of the spectrum from the general electronic device user: they have to be coaxed to share all but the most glorious results. Goldacre noted that between a third and two-thirds of medical trials don't get published, and these tend to be the ones with actively negative or statistically insignificant results. Add that to science journals' penchant for false positives, and the average citizen has access to a worrisomely incomplete or inaccurate portrait of diseases and medications.

To fill it in, Goldacre suggested that we "encourage sharing, mandate publication, and provide a common structure" for raw research, which tends to be ill-suited for long papers anyway. I fear mandated publication could backfire, but the other components are promising. Hahnel's company Figshare makes it very simple for researchers to upload their research in a variety of forms, and provides them with a breadth of metrics as carrots. A simple approach with a clear goal and a clean UI, it seems to be garnering a lot of attention in the data science world, and I hope it has legs.

My last session of the day was Robbie Allen's "From Big Data to Big Insights." Allen's company, Automated Insights, makes software that creates automated content for a wealth of sports blogs, real estate and neighborhood watch blogs, financial tear sheets, insider trading reports, weather sites, and other news sources that traditionally have a high proportion of quantitative content. As a writer and linguistics enthusiast, I find the concept of automated content, which assigns anything from a quotation mark up to several paragraphs to a key value, both fascinating and frightening. On the one hand, it's a great grunt-work tool; on the other, its ability to mimic the styles of human authors could lead to a map > territory situation. If the simulacrum is good enough, what chance has the real?
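The key-value-to-prose idea is simple to sketch. The template logic below is hypothetical, not Automated Insights' actual system, but it shows how a numeric value (here, a point margin) can select the verb a human sportswriter might have chosen.

```python
def recap(home, away, home_pts, away_pts):
    """Generate a one-sentence game recap from a box-score key value."""
    margin = abs(home_pts - away_pts)
    winner, loser = (home, away) if home_pts > away_pts else (away, home)
    # Map the point margin to a verb, the way a human writer might.
    if margin >= 20:
        verb = "crushed"
    elif margin >= 10:
        verb = "beat"
    else:
        verb = "edged"
    high, low = max(home_pts, away_pts), min(home_pts, away_pts)
    return f"{winner} {verb} {loser}, {high}-{low}."

print(recap("Duke", "UNC", 85, 63))   # "Duke crushed UNC, 85-63."
print(recap("UNC", "Duke", 71, 68))   # "UNC edged Duke, 71-68."
```

Swap in a few synonym lists and some stylistic tics, and the output starts to sound like a particular author, which is where the fascination turns to fright.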

That being said, if software can mimic an individual author's style, perhaps it can also scramble or dilute it. As Solon Barocas said, anonymizing is extremely difficult; using an automated content program to spin stories from bare gists would be a godsend for those who want/need their words to remain anonymous.

In MacLuhan's global village, do the doors have locks?
