Naive Bayesian Classification of Email

[Editor's note: As I discussed in Part 1, now that we've gotten Gander's baseline functionality up and running, we're moving on to the fun stuff. Aka out of chaos, order. Aka prioritization. In this post, David tackles prioritization by categorization, here carried out with Naive Bayesian classification.]

Background

Bayesian Classification has been used quite successfully as one of the techniques for spam filtering. There are many freely available classifiers for various languages. For this exercise, I used the most common Ruby classifier.

The Experiment

I took a sample of 33,000 of my message headers. I grabbed a random set of 125 of these message headers and manually categorized them into the following buckets: Reply/Conversation, Graymail, Skim-only, JIRA/Confluence Updates, and Calendar Notices. As a human, I was able to categorize approximately 95% of the cases based solely on the header. Note 1: The category "Graymail" includes both messages I was unable to categorize and those that fall in between "Reply/Conversation" and "Skim-only." Note 2: This data set of 125 messages is not necessarily representative of anyone's typical email, including my own.

The Code

I used csplit (1) to break the large message header file into one file per message. I then created a subdirectory for each of the training data categories. I manually viewed a set of message files and moved each one into the appropriately categorized directory, i.e.:

    mv XX12345 0reply; mv XX23456 2skim

I then re-assembled the training set and the unsorted set back into single files using cat:

    cat 0reply/* > 0reply.txt

So I now had one very large sample data file with the training messages removed, and five training files, one for each category.

The next step was converting the message file into a YAML-acceptable format (2). I used the "- |" YAML syntax to preserve line breaks. iconv broke several times when hitting illegal characters like the ® symbol. Since sed could not handle these either, I manually edited some of the files in vi to remove the offending characters. These illegal UTF-8 characters might be an artifact of the Unicode-to-ASCII conversion done by WordPad. It would be best if everything could be left in Unicode in the first place. Another time.
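The same conversion could also be sketched in Ruby itself. This is a minimal sketch under assumptions, not the pipeline I actually ran: it assumes the raw header dump is already in memory as a string, splits on the delimiter line used in the csplit and sed commands, drops bytes that are not valid UTF-8 with String#scrub instead of hand-editing, and lets YAML.dump write the sequence. The names DELIMITER and headers_to_yaml and the sample addresses are hypothetical.

```ruby
require "yaml"

# Delimiter line that starts each message in the header dump.
DELIMITER = "Microsoft Mail Internet Headers Version 2.0"

# Hypothetical helper: turn one raw dump into a YAML sequence of messages.
def headers_to_yaml(raw)
  # Treat the input as UTF-8 and drop any byte sequences that are not valid UTF-8.
  clean = raw.dup.force_encoding(Encoding::UTF_8).scrub("")
  messages = clean.split(DELIMITER).map(&:strip).reject(&:empty?)
  # Serialize as a YAML sequence with one multi-line string per message.
  YAML.dump(messages)
end

# Made-up sample input; the second message contains a stray invalid byte.
raw = "#{DELIMITER}\nFrom: a@example.com\nSubject: Re: lunch\n" \
      "#{DELIMITER}\nFrom: jira@example.com\nSubject: [JIRA] Updated \xAE\n"

yaml = headers_to_yaml(raw)
messages = YAML.load(yaml)
puts messages.length
```

Scrubbing silently discards the offending bytes, which is the same trade-off as deleting them by hand; keeping everything in Unicode end to end would avoid the loss altogether.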

The actual Ruby code was very simple. I was very pleased that Ruby, the Classifier and YAML could swallow 50 MB of email in just a second or so on my MacBook (3). The literature was not exaggerating when multiple authors said that Bayesian Classification is very fast.

As the code runs, it first loads up the 5 different training sets. Then it loads in the big unsorted file. It iterates through each unsorted message and displays the message and the Classifier result. I then manually kept track of my ranking vs. the Classifier's and analyzed the results with a very simple Excel Pivot Table.
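To make the train-then-classify flow concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the Ruby standard library only. It is not the classifier library used in this experiment, and the TinyBayes class, the tokenizer, and the sample messages are all hypothetical stand-ins for the five training files and the unsorted file.

```ruby
# Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
# Illustrative only: NOT the classifier library used in the post.
class TinyBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => word => count
    @doc_counts  = Hash.new(0)                            # category => messages seen
    @vocab       = {}                                     # every word seen in training
  end

  def tokenize(text)
    text.downcase.scan(/[a-z0-9@._-]+/)
  end

  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocab[word] = true
    end
  end

  # Returns the category with the highest log posterior for the text.
  def classify(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text)
    @doc_counts.keys.max_by do |category|
      total_words = @word_counts[category].values.sum
      log_prior = Math.log(@doc_counts[category] / total_docs)
      log_likelihood = words.sum do |word|
        Math.log((@word_counts[category][word] + 1.0) / (total_words + @vocab.size))
      end
      log_prior + log_likelihood
    end
  end
end

# Hypothetical stand-ins for the per-category training files.
training = {
  "jira"     => ["Subject: [JIRA] Updated: (GAN-12) Fix login\nFrom: jira@example.com"],
  "calendar" => ["Subject: Invitation: Sprint review\nFrom: calendar@example.com"],
  "reply"    => ["Subject: Re: lunch tomorrow?\nFrom: alice@example.com"]
}

bayes = TinyBayes.new
training.each { |category, msgs| msgs.each { |m| bayes.train(category, m) } }

# Hypothetical stand-in for the big unsorted file.
unsorted = ["Subject: [JIRA] Created: (GAN-13) Crash\nFrom: jira@example.com"]
unsorted.each { |m| puts "#{bayes.classify(m)} <= #{m.lines.first.strip}" }
```

Working in log space avoids underflow when many small word probabilities are multiplied, and training is nothing more than counting words, which is consistent with how quickly the real classifier got through the data.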

The Results

After checking 40 messages, the results were (# of messages - %):

[Results table missing from the source.]

So the Classifier had high accuracy for JIRA and Calendar messages but was all over the place for the other categories. Interestingly, the categories for JIRA and Calendar had smaller training files, but those files were very consistent.
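The per-category tally I did in an Excel Pivot Table could also be computed in a few lines of Ruby. The sketch below assumes the manual and Classifier categories were recorded as pairs. The pairs shown are placeholders for illustration only and are not the results of this experiment; accuracy_by_category is a hypothetical helper name.

```ruby
# Each pair is [manual category, classifier category].
# Placeholder pairs for illustration only; these are NOT the experiment's results.
pairs = [
  ["jira", "jira"],
  ["jira", "jira"],
  ["calendar", "calendar"],
  ["reply", "skim"],
  ["reply", "reply"],
  ["skim", "graymail"]
]

# Hypothetical helper: for each manual category, count agreements and totals.
def accuracy_by_category(pairs)
  pairs.group_by(&:first).transform_values do |rows|
    correct = rows.count { |manual, predicted| manual == predicted }
    { correct: correct, total: rows.size, pct: (100.0 * correct / rows.size).round(1) }
  end
end

summary = accuracy_by_category(pairs)
summary.each do |category, s|
  puts format("%-10s %d/%d (%.1f%%)", category, s[:correct], s[:total], s[:pct])
end
```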

What Went Wrong

In this first experiment, done while learning Bayesian Classification, I can see a number of errors in the method:

  • The sample training data was not random enough.
  • The sample training data set size was too small. Many articles recommend a training set in the thousands or tens of thousands. According to the research, the more data the classifier has, the better the results.
  • The sample training data was not balanced enough (too much in a single category).

Next Steps

In researching Bayesian Classification, I found a number of techniques that could significantly improve accuracy:

  • Re-run the experiment with better training data.
  • Gather email headers including message preview via IMAP from Gmail in order to have a broader set of both training and unsorted messages.
  • Create more categories for consistent sets of email. The overall process might work better if there are more precisely defined subcategories that are easy to classify, and are then lumped together into a major category when displayed to the user.
  • Define and experiment with weighting criteria. For example, a specific X-Mailer value is a much stronger signal than the rest of the text.
  • Include the first 'n' characters of the email body (probably 250) as input to the classifier, rather than just the header. Apparently, short instances like email are harder to classify.
  • Remove "noisy" headers that are not useful to categorization but are factored into the word distribution.
  • Investigate unsupervised clustering techniques to complement predetermined or user defined classification.
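Two of the ideas above, removing noisy headers and weighting strong signals, can be prototyped as a preprocessing pass before the text reaches the classifier. The following is a rough sketch under assumptions: the list of noisy headers and the weight of 5 are guesses rather than measured values, repeating a line is only a crude way to weight it in a word-count-based classifier, and prepare_headers is a hypothetical helper name.

```ruby
# Headers assumed (not measured) to carry little signal for categorization.
NOISY_HEADERS = %w[received message-id x-originalarrivaltime].freeze
# Headers assumed to be strong signals; the value is a repeat count used as a crude weight.
HEADER_WEIGHTS = { "x-mailer" => 5 }.freeze

# Hypothetical helper: drop noisy headers (including their folded continuation
# lines) and repeat weighted ones. Lines that belong to no header are dropped.
def prepare_headers(raw)
  kept = []
  current = nil # name of the header the current line belongs to
  raw.each_line do |line|
    # A line starting with whitespace continues the previous (folded) header.
    current = line[/\A([^:\s]+):/, 1]&.downcase unless line.start_with?(" ", "\t")
    next if current.nil? || NOISY_HEADERS.include?(current)
    HEADER_WEIGHTS.fetch(current, 1).times { kept << line.chomp }
  end
  kept.join("\n")
end

# Made-up sample message headers.
raw = "Received: from mx.example.com\n\tby mail.example.com\n" \
      "X-Mailer: JIRA\nSubject: [JIRA] Updated\n"
prepared = prepare_headers(raw)
puts prepared
```

Any such weights would need to be checked against a held-out set of manually categorized messages, since over-weighting one header can easily swamp the rest of the text.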

Summary

For a single day of work spent learning Ruby and Bayesian Classification and wrangling lots of email with shell scripts, the overall results seem far ahead of what would have been possible had I started hand-coding classification logic. There are two nice results: 1) better training data leads to better results without requiring substantial additional coding, and 2) Naive Bayesian Classification can be used in conjunction with other techniques to get a much more accurate classification.

References

(1)

    csplit --digits=5 pawa.txt '/Microsoft Mail Internet Headers Version 2.0/' {*}

(2)

    sed 's/./  &/' | sed 's/  Microsoft Mail Internet Headers Version 2.0/- |/' | iconv -f UTF-8 -t ISO-8859-15

(3) There is no reason to load the entire unsorted message file in practice. I simply already had it in YAML from the prior manipulations.