Data is a trending topic right now, and data privacy is one of its trendiest subsets. To wit: Charles Duhigg’s investigative report on Target’s data mining for the New York Times spawned a series of follow-ups; in March, The Atlantic profiled NYU Law professor Helen Nissenbaum and her flow-based privacy framework; and the FTC just published a privacy report endorsing privacy-by-design and the “Do Not Track” button.

The demarcation line between what should be public vs private is a dynamic and jagged (some might say gerrymandered) one that depends on a piece of data’s original context vs the contexts in which it is eventually used. It seems perfectly reasonable for Foursquare to publish its users’ locations but less reasonable for a third-party dating application like Girls Around Me to provide these locations, along with Facebook profile photos, to its users. It seems reasonable that an online money management service like Mint serves up ads tailored to users’ credit ratings, but less reasonable that banks determine applicants’ loan rates based on their Facebook friends’ credit ratings.

Because we’re storing and analyzing corporate email, user privacy is something that we have to get right. Of course, an employer’s definition of “right” might be different from the employee’s, so we’ve been trying to figure out a definition that will please both. Companies are legally permitted to access their employees’ email, and usually this manifests in monitoring for explicit or inappropriate language. As long as employees are aware of the monitoring, this sort of vocab dinging seems reasonable. But what about sentiment analysis, and the inferred knowledge of employees’ states of mind it provides? Invaluable to the company, I think, but potentially detrimental, and sometimes erroneously so, to the employee. Does explicit consent justify armchair psychology and any actions that result? Even if employees are fully and duly informed of all monitoring and tracking practices, I’m not sure.
Take, for example, Cataphora.

Cataphora is “behavioral modeling and monitoring” software that analyzes employees’ digital and mobile actions from legal, risk, compliance, HR, and brand management perspectives. The copy on its website doesn’t even try to address employees; its news page calls out articles with titles like “In Defense of Employer Monitoring” and “Finding Office Buck-Passers, Heroes, and Shirkers.” If employers are not monitoring employees’ digital activity, Cataphora CEO Elizabeth Charnock argues, they are making themselves vulnerable to leaks, blow-ups, and YouTube-frittering-induced productivity slumps. In a blog post entitled “Getting Big Brother Right,” Rick Janowski brought up as a use case an employee on the verge of a breakdown due to non-work-related factors. Cataphora could identify the employee’s mental state and alert management to it, allowing them to “provide a safety net for someone who might be prone temporarily to making bad decisions or being less diligent than they normally would be.” Aka remove him from fiscal and legal harm’s way before it’s too late. Ooh, Carnival Cruise is having a flash sale! I hear Alaska’s great this time of year!

You could argue that behavioral mining software is just one of the many new “transparent” office measures, which manifest physically in concepts like open and free-range offices (a different desk every day!), and culturally in social enterprise platforms like Yammer, Rypple, and Trello. There’s been a push lately to besmirch the traditional office, with its many doors and walls and silos. Which is all very well and fine, but there is a point where public property ends and the person begins. Perhaps the central tower is too zoomed in to see it.

Posted
Author: Claire Willett

Web mining consists of crawling the World Wide Web and extracting and processing its structure, usage, or content. This tutorial, taught by Scale Unlimited founder Ken Krugler (@kkrugler), focused on content mining at a large scale (as you might surmise from its title). After a thorough overview of web mining, we did a focused crawl for ultimate frisbee images.

The first step in any type of web mining is crawling, or fetching and parsing web pages. There are four types of web crawls: broad (e.g. bingbot), focused, domain, and what Krugler calls the "don't crawl" crawl, wherein you leverage other people's crawl data. Besides usually being faster and cheaper, this option also reduces the load on your crawlees' servers. If you go for it, you can use either public datasets like Common Crawl or Wikipedia or commercial providers like Spinn3r and Infochimps.

Per Krugler, the general rule of crawling solutions is: don't roll your own! No matter your language, there's an open source option: Java has Nutch and Heritrix, Python has Scrapy, and PHP has the more literally named php-crawler. Whichever solution you end up going with, make sure it's reliable, scalable, and fault-tolerant. Sure, a single server can fetch lots of pages, but scaling becomes an issue with post-processing.

Keep in mind that unless you're creating an index of the crawled sites, you're breaking the implicit "traffic-for-bandwidth" contract, and therefore risk having the wrath of the webmaster rain down on your inquisitive shoulders.

Today's lab had us doing a focused crawl, so Krugler took us through the basics. First, you need to start with good seed URLs, that is, really high-quality pages. You can use a paid or free service to find these, or you can run searches and enter the results manually or make calls to the search APIs. After you have your seeds, you fetch and parse pages, collect new URLs, and give each page a score. When parsing, be sure to normalize the outlinks (e.g. http://basketball.com becomes http://www.basketball.com), and use a suffix filter to skip links to low-value files like images and PDFs.
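
To make the normalize-and-filter step concrete, here is a minimal sketch in Python using only the standard library. It is not what we ran in the lab; the function names and the suffix list are my own assumptions. It resolves relative links, lowercases the host, drops fragments and default ports, and skips low-value suffixes. It deliberately does not add a "www." prefix, since that rewrite is only safe when you know both hosts serve the same pages.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical list of low-value suffixes; tune it for your own crawl.
SKIPPED_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip")


def normalize_outlink(base_url, href):
    """Resolve href against base_url and return a canonical URL, or None."""
    parts = urlsplit(urljoin(base_url, href.strip()))
    scheme = parts.scheme.lower()
    if scheme not in ("http", "https") or not parts.hostname:
        return None  # skip mailto:, javascript:, and similar links
    host = parts.hostname  # urlsplit already lowercases the hostname
    default_port = 80 if scheme == "http" else 443
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    # Drop the fragment and give bare hosts an explicit "/" path.
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


def has_skipped_suffix(url):
    """True when the URL path ends with one of the low-value suffixes."""
    return urlsplit(url).path.lower().endswith(SKIPPED_SUFFIXES)


def clean_outlinks(base_url, hrefs):
    """Normalize, filter, and de-duplicate a page's raw href values."""
    seen, kept = set(), []
    for href in hrefs:
        url = normalize_outlink(base_url, href)
        if url is None or has_skipped_suffix(url) or url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

Calling clean_outlinks with a page URL and its raw href values returns the surviving links in the order they were first seen.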

You can score pages by tokenizing their text and then applying either simple term-based scoring or a Support Vector Machine (SVM). An SVM is trained on "documents" that have features and a positive or negative class; it builds a statistical model that separates the documents into the two classes, and then uses that model to assign unknown documents to one of them.
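
The simple term-based option might look something like the sketch below. The term list and weights are hypothetical values I picked for an ultimate frisbee crawl, and dividing by page length is just one reasonable way to keep long pages from winning by volume. An SVM would replace score_page with a trained model.

```python
import re
from collections import Counter

# Hypothetical weighted terms for a focused crawl about ultimate frisbee.
TERM_WEIGHTS = {"ultimate": 1.0, "frisbee": 2.0, "disc": 1.0, "layout": 0.5}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_page(text, term_weights=TERM_WEIGHTS):
    """Return the weighted count of target terms divided by the token count."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    weighted_hits = sum(weight * counts[term] for term, weight in term_weights.items())
    return weighted_hits / len(tokens)
```

A page that never mentions any of the target terms scores 0.0, so a cutoff just above zero is enough to drop clearly off-topic pages.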

Graphics-heavy pages and pages with a small amount of content (e.g. a definition page) are some of the wrenches in determining page quality. You can filter these pages out by setting a minimum real content threshold. By "real content," Krugler means the stuff that's not chrome, cruft, or boilerplate. You can scrape off this junk with Boilerpipe.
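
As a rough illustration of a minimum real content threshold, the sketch below counts words outside a handful of tags that usually hold chrome. This is a crude stand-in of my own and not how Boilerpipe works; the skipped tag list and the 50-word default are assumptions.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects words that sit outside tags that usually hold chrome."""

    # Assumption: these tags rarely contain a page's real content.
    SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words.extend(data.split())


def has_enough_content(html, min_words=50):
    """True when the page has at least min_words words outside skipped tags."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return len(parser.words) >= min_words
```

A one-line definition page wrapped in a big navigation menu fails this check, while an article with a few paragraphs passes.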

After you have your page score, it's time to extract and score the page's outlinks before putting the URLs into the URL state, or the database of all known URLs. There are three attributes of extraction: broad, precise, and accurate. You get to pick two, and which two depends on whether you're extracting unstructured (broad, accurate), semi-structured (broad, precise), or structured (precise, accurate) data.
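
Here is a toy, in-memory version of the URL state, assuming each outlink simply inherits the score of the page it was found on (the tutorial didn't prescribe a link-scoring formula, so that part is my guess). The class and method names are hypothetical. A real crawl would keep this in a proper database.

```python
import heapq


class UrlState:
    """In-memory stand-in for the database of all known URLs."""

    def __init__(self):
        self._scores = {}      # url -> best link score seen so far
        self._fetched = set()  # urls already handed out for fetching
        self._queue = []       # min-heap of (-score, url), so best score pops first

    def add(self, url, score):
        """Record a URL, keeping only the highest score seen for it."""
        if url in self._fetched:
            return
        if score > self._scores.get(url, float("-inf")):
            self._scores[url] = score
            heapq.heappush(self._queue, (-score, url))

    def next_url(self):
        """Return the best-scoring unfetched URL, or None when none are left."""
        while self._queue:
            neg_score, url = heapq.heappop(self._queue)
            if url in self._fetched or self._scores.get(url) != -neg_score:
                continue  # stale heap entry left behind by a later, better score
            self._fetched.add(url)
            return url
        return None
```

After scoring a page, you would call add(outlink, page_score) for each cleaned outlink and then pull the next page to fetch with next_url().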

Usually, no matter which type of data you're extracting, you'll need:

  • A way to clean the HTML. You can use a library like NekoHTML for this. Keep in mind that the end result won't match the original text.
  • The right charset to convert bytes to characters. Tika's charset detection works pretty well.
  • A boilerplate scraper like Boilerpipe.
  • Some means of identifying the page's language (HTTP response header, meta tag, tag attribute, or text analysis).
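
For the language item, a cheap first pass is to check the declared signals before falling back to text analysis. The sketch below is my own assumption about how to order those checks: it looks at the Content-Language response header, then the lang attribute on the html tag, then a content-language meta tag. The regexes are simplistic (the meta pattern expects http-equiv to come before content), and declared languages are often wrong, so treat the result as a hint.

```python
import re

_HTML_LANG = re.compile(r"<html\b[^>]*\blang\s*=\s*[\"']?([A-Za-z-]+)", re.IGNORECASE)
_META_LANG = re.compile(
    r"<meta\b[^>]*http-equiv\s*=\s*[\"']?content-language[\"']?[^>]*"
    r"\bcontent\s*=\s*[\"']?([A-Za-z-]+)",
    re.IGNORECASE,
)


def _primary(tag):
    """Reduce a value like 'en-US, fr' to its first primary subtag, 'en'."""
    return tag.split(",")[0].strip().split("-")[0].lower()


def guess_language(headers, html):
    """Return a declared language code such as 'en', or None if none is declared."""
    lowered = {name.lower(): value for name, value in headers.items()}
    header_value = lowered.get("content-language")
    if header_value:
        return _primary(header_value)
    for pattern in (_HTML_LANG, _META_LANG):
        match = pattern.search(html)
        if match:
            return _primary(match.group(1))
    return None  # nothing declared; this is where text analysis would take over
```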

If you're doing unstructured extraction, your goal is to extract text without much additional processing (often there are just a few HTML fields). If you're doing semi-structured extraction, your goal is to find data in arbitrary text. Since it's not very format-specific, this can be applied broadly, often at the expense of accuracy. Easy patterns like telephone numbers, microformats, and NLP named entities all usually work well with semi-structured extraction.
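
Telephone numbers are a handy example of the semi-structured case. The sketch below assumes North American numbers written with separators, like (415) 555-0134 or 212.555.0199; the pattern is my own and will miss other formats, which is exactly the breadth-versus-accuracy trade-off at work.

```python
import re

# Assumption: 10-digit North American numbers with separators between groups.
PHONE_PATTERN = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)


def extract_phone_numbers(text):
    """Return every matched phone number as a bare string of digits."""
    return [re.sub(r"\D", "", match) for match in PHONE_PATTERN.findall(text)]
```

Because the pattern knows nothing about page layout, it can run over the text of any page in the crawl.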

If you're doing structured extraction, your goal is to extract specific types of data, typically from one area of one site. You often do this with XPath, and if this is the case, Firebug is your friend, as are the div and span tags. If you run into pages that generate their content with JavaScript, you will need Firebug or an equivalent to inspect the DOM. Options for processing pages with JavaScript include HtmlUnit, qt-webkit, and headless Mozilla, but keep in mind that processing a page with JavaScript takes about 10 times longer than just loading the page text.
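
To give a flavor of XPath-style structured extraction, here is a small sketch using Python's built-in ElementTree, which supports a limited XPath subset. It assumes the page has already been cleaned into well-formed XHTML, and the markup, class names, and player data are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-cleaned page fragment.
CLEANED_PAGE = """
<html><body>
  <div class="player"><span class="name">Ada</span><span class="team">Fury</span></div>
  <div class="player"><span class="name">Ben</span><span class="team">Riot</span></div>
  <div class="ad"><span class="name">Buy discs</span></div>
</body></html>
"""


def extract_players(xhtml):
    """Pull name and team out of every div whose class is exactly 'player'."""
    root = ET.fromstring(xhtml.strip())
    players = []
    for div in root.findall(".//div[@class='player']"):
        players.append({
            "name": div.findtext("span[@class='name']"),
            "team": div.findtext("span[@class='team']"),
        })
    return players
```

The div with class "ad" is ignored even though it contains a matching span, which is the precision you get from targeting one area of one site's layout.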

 

Posted
Author: Claire Willett