Web mining consists of crawling the World Wide Web and extracting and processing its structure, usage, or content. This tutorial, taught by Scale Unlimited founder Ken Krugler (@kkrugler), focused on content mining at large scale (as you might surmise from its title). After a thorough overview of web mining, we did a focused crawl for ultimate frisbee images. The first step in any type of web mining is crawling: fetching and parsing web pages. There are four types of web crawls: broad (e.g. bingbot), focused, domain, and what Krugler calls the "don't crawl" crawl, wherein you leverage other people's crawl data. Besides usually being faster and cheaper, this also reduces the load on your crawlees' servers. If you go this route, you can use public datasets like CommonCrawl or Wikipedia, or commercial providers like Spinn3r and Infochimps.
Per Krugler, the general rule of crawling solutions is: don't roll your own! No matter your language, there's an open source option: Java has Nutch and Heritrix, Python has Scrapy, and PHP has the more literally named PHP-Crawler. Whichever solution you go with, make sure it's reliable, scalable, and fault-tolerant. A single server can fetch lots of pages, but scaling becomes an issue once you add post-processing.
Keep in mind that unless you're creating an index of the crawled sites, you're breaking the implicit "traffic-for-bandwidth" contract, and therefore risk having the wrath of the webmaster rain down on your inquisitive shoulders.
Today's lab had us doing a focused crawl, so Krugler took us through the basics. First, you need to start with good seed URLs, i.e. really high-quality pages relevant to your topic. You can use a paid or free service to find these, or you can use a search engine, entering queries manually or making calls to its API. After you have your seeds, you fetch and parse new URLs and give each page a score. When parsing, be sure to normalize the outlinks (e.g. http://basketball.com becomes http://www.basketball.com), and use a suffix filter to skip links to low-value pages like images and PDF files.
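The normalization and suffix-filter steps can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's actual code; the exact normalization rules (which suffixes to skip, how to canonicalize hosts) are assumptions, and real crawlers apply many more.

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical suffix filter: extensions of low-value pages we skip.
SKIP_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".pdf"}

def normalize_url(url):
    """Normalize an outlink: lowercase scheme and host, strip the
    default port and fragment, and ensure a non-empty path."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if parts.scheme == "http" and host.endswith(":80"):
        host = host[:-3]
    path = parts.path or "/"
    return urlunparse((parts.scheme, host, path, parts.params, parts.query, ""))

def keep_outlink(url):
    """Suffix filter: drop links whose path ends in a low-value extension."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(sfx) for sfx in SKIP_SUFFIXES)
```

Deduplicating the URL state depends on normalization like this: without it, `http://Basketball.com:80/page` and `http://basketball.com/page` would be fetched twice.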
You can score pages by tokenizing their text and using simple term-based scoring, or by using a Support Vector Machine (SVM). An SVM is trained on "documents" that each have features and a positive or negative class label; it builds a statistical model that separates the positive documents from the negative ones, and then uses that model to assign one of those two classes to unknown documents.
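The simpler term-based approach might look like this sketch. The term list and weights here are made up for the ultimate frisbee example; in practice you'd derive them from your seed pages.

```python
import re

# Hypothetical seed terms and weights for a focused ultimate frisbee crawl.
POSITIVE_TERMS = {"ultimate": 2.0, "frisbee": 2.0, "disc": 1.0, "huck": 1.5}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_score(text):
    """Simple term-based page score: sum of the weights of matching
    tokens, normalized by token count so long pages aren't favored."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(POSITIVE_TERMS.get(t, 0.0) for t in tokens) / len(tokens)
```

A score like this then feeds the crawl frontier: outlinks from high-scoring pages get fetched before outlinks from low-scoring ones.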
Graphics-heavy pages and pages with only a small amount of content (e.g. a definition page) are some of the wrenches in determining page quality. You can filter these pages out by setting a minimum real-content threshold. By "real content," Krugler means the stuff that's not chrome, cruft, or boilerplate. You can scrape off this junk with Boilerpipe.
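A minimum real-content threshold could be as simple as the sketch below. The specific cutoffs (50 words, 10% of the raw page) are invented for illustration, and `extracted_text` stands for whatever your boilerplate stripper (e.g. Boilerpipe) returns.

```python
def passes_content_threshold(extracted_text, raw_html, min_words=50, min_ratio=0.1):
    """Hypothetical real-content filter: require a minimum word count in
    the boilerplate-stripped text AND a minimum ratio of stripped text to
    raw page size, so graphics-heavy and stub pages get skipped."""
    words = extracted_text.split()
    if len(words) < min_words:
        return False
    return len(extracted_text) / max(len(raw_html), 1) >= min_ratio
```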
After you have your page score, it's time to extract and score its outlinks before putting each URL into the URL state, i.e. the database of all known URLs. There are three attributes of extraction: broad, precise, and accurate. You get to pick two, depending on whether you're extracting unstructured (broad, accurate), semi-structured (broad, precise), or structured (precise, accurate) data.
Usually, no matter which type of data you're extracting, you'll need:
- To clean the HTML. You can use a library like NekoHTML for this. Keep in mind that the end result won't match the original text.
- Charset detection, to convert bytes to characters. Tika works pretty well.
- A boilerplate scraper like Boilerpipe.
- Some means of identifying the page's language (HTTP response header, meta tag, tag attribute, or text analysis).
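The charset step in the list above follows a lookup order that can be sketched like this. This is a hand-rolled illustration, not Tika's actual logic; a real detector would also fall back to statistical detection of the bytes themselves, which Tika provides.

```python
import re

# Matches charset declarations inside an HTML <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.I)

def detect_charset(content_type_header, body_bytes, default="utf-8"):
    """Sketch of charset detection in priority order: the HTTP
    Content-Type header first, then an HTML meta tag in the first
    few KB of the body, then a default."""
    if content_type_header:
        m = re.search(r"charset=([\w-]+)", content_type_header, re.I)
        if m:
            return m.group(1).lower()
    m = META_CHARSET.search(body_bytes[:4096])
    if m:
        return m.group(1).decode("ascii").lower()
    return default
```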
If you're doing unstructured extraction, your goal is to extract text without much additional processing (often there are just a few HTML fields). If you're doing semi-structured extraction, your goal is to find data in random text. Since it's not very format-specific, this can be applied broadly, often at the expense of accuracy. Easy patterns like telephone numbers, microformats, and NLP named entities all usually work well with semi-structured extraction.
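The telephone-number case is a nice illustration of semi-structured extraction: a pattern pulled out of otherwise free-form text. The regex below is a deliberately narrow, US-style sketch; real extractors handle far more formats, which is exactly where the accuracy trade-off mentioned above comes from.

```python
import re

# Hypothetical US-style phone pattern: 3-3-4 digits with optional
# parentheses and dash/dot/space separators. Intentionally simplistic.
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

def extract_phones(text):
    """Semi-structured extraction: find phone-number-like strings
    anywhere in unstructured page text."""
    return PHONE.findall(text)
```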