A chronological, abridged version of my blog series, which you can read here.  

Before we begin, the bullets, for those in a hurry: 

  • We're algorithmically determining whether a given email to a given user is important or not important to that user.
  • Given sparsity of training data, this algorithm starts out unsupervised and becomes supervised over time.
  • The best way we've found to do this is through a neural net that takes certain email features as inputs -- primarily the sender and N-grams of the ( body | subject ) string.


The crux of my research is this: given a bunch of texts (could be anything: books, text messages, Facebook posts, whatever ~), can we have a computer figure out which ones are important to you, yes, you, and which ones are not? 

This particular problem has some interesting features. First of all, at its heart we have a binary classification problem: a text can fall into the category of "Important"/"Interesting", or "Not". This fact initially gave me lots of hope, because there exists a plethora of off-the-shelf techniques for performing arbitrary binary classification. So all I'd need to do is define a feature set, collect a bunch of training data, and boom: we're done. 

Kidding. Unfortunately, the idea of training data qua training data contradicts the nature of the problem we are trying to solve, because "Important" is not an objective classification. In fact, from one person to the next, there may be a 100% inversion of "important" to "not" on certain texts. Trying to build general rules from large amounts of pooled training data will lead to very unpredictable results; it will only be correct when all of the training data agrees about certain types of text. (E.g., if everyone hates text that contains the word "pineapple", then the classifier will do just fine on texts that contain "pineapple". But I don't think I need to convince you that there would be very few, if any, cases like this.) 

What we need is a set of training data for each individual end user.  Simple enough, right?

Wrong. Because said training data would need to be robust, by which I mean: there has to be enough of it, it has to be consistent, and it should be general enough to speak to all features in the particular domain we are classifying. For example, if we showed a user ten Spider-Man comic books and had them tag each as important or not, then tried to guess how the user would feel about "A Tale of Two Cities", we'd essentially just be guessing.

This type of data, as you might suspect, is too hard to come by to use as the trainer for our classifier.

So what, then? 

What we need is a new kind of algorithm. We need an algorithm that takes, as input, features and an expectation of their influence toward classification. The algorithm must be able to account for instances of new training data as they appear, weigh them against our feature-weight expectations, adjust those expectations accordingly, and proceed to classify. Essentially, we have an algorithm that starts out completely unsupervised but slowly gains supervision over time. 
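To make that concrete, here's a minimal sketch of what such an algorithm could look like, using a simple linear model rather than the full neural net discussed later. The class name, feature names, and prior weights are all invented for illustration; the point is only the shape of it: classification works from day one on a priori weights, and each labeled example nudges those weights toward the user's actual behavior.

```python
# Sketch: a classifier seeded with a priori feature-weight expectations
# (the unsupervised phase) that refines them online as labeled examples
# trickle in. All names and numbers here are hypothetical.

import math

class PriorSeededClassifier:
    def __init__(self, priors, learning_rate=0.1):
        # priors: feature -> expected influence toward "important"
        self.weights = dict(priors)
        self.lr = learning_rate

    def score(self, features):
        # features: feature -> observed value; returns Pr(important)
        z = sum(self.weights.get(f, 0.0) * v for f, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))

    def classify(self, features):
        return "important" if self.score(features) >= 0.5 else "not"

    def train(self, features, label):
        # label: 1 = important, 0 = not. One SGD step moves the
        # prior-seeded weights toward the observed behavior.
        err = label - self.score(features)
        for f, v in features.items():
            self.weights[f] = self.weights.get(f, 0.0) + self.lr * err * v

# Before any training data exists, classification runs purely on priors:
clf = PriorSeededClassifier({"sender_is_boss": 2.0,
                             "contains_unsubscribe": -1.5})
clf.classify({"sender_is_boss": 1.0})               # priors alone decide
clf.train({"contains_unsubscribe": 1.0}, label=1)   # user disagreed; adjust
```

Note that `train` only adjusts weights for features present in the example, so the priors for everything else survive until evidence arrives, which is exactly the gradual hand-off from unsupervised to supervised we're after.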

The real gotcha is, when we're at the completely unsupervised phase, even if we can linearly separate all text into two clusters, the algorithm needs to have some hint about which category corresponds to which cluster. 

Therefore, the algorithm needs to be able to account for the experimenter's a priori belief about the effect of certain features in the data. Or, more ideally, it needs to have some way to guess which features are more likely to contribute to a certain classification and give them initial weights accordingly. 

How's It Gonna Be:

Back propagation isn't for the faint of heart, nor for those who don't like math all that much. However, I envision that there is an incredibly flexible model that can be built using the back-prop algorithm as a base-line. Framing the problem as a neural network allows us to exploit any and all relationships between any and all features. It also allows us to gradually add in data and continuously train on more data as it arrives.

Without giving away the farm, here's the basic outline:

The in's: We have a variable number of input nodes, each representing one feature. 

The not-quite-in's and not-quite-out's: Then we have the hidden layer. Or several hidden layers. This parameter might need to be tuned, either live or a priori, in order to maximize run-time performance. Each hidden node has an array of FeatureConverters: objects that take as input a certain feature (identifiable via featureID or featureType or whatever) and the actual observed value for that feature, and spit out the current weight assigned to that feature at that node. 

The out's: Lastly, we've got the output nodes. Two of them, to be precise. One is proudly labeled "Important", the other: "not". 
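The three layers above can be sketched as a single forward pass. This is a toy, not the actual implementation: the layer sizes are arbitrary, the weights are random placeholders for whatever the FeatureConverters would produce, and the two softmaxed output nodes stand in for "Important" and "not".

```python
# Toy forward pass: variable input nodes -> one hidden layer -> two
# output nodes ("Important" / "not"). Sizes and weights are invented.

import math
import random

random.seed(0)

def forward(inputs, w_hidden, w_out):
    # Input layer -> hidden layer (one layer here; could be several)
    hidden = [math.tanh(sum(w * x for w, x in zip(ws, inputs)))
              for ws in w_hidden]
    # Hidden layer -> the two output nodes, softmaxed into probabilities
    scores = [sum(w * h for w, h in zip(ws, hidden)) for ws in w_out]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {"Important": exps[0] / total, "not": exps[1] / total}

n_features, n_hidden = 4, 3   # both variable; tunable live or a priori
w_hidden = [[random.uniform(-1, 1) for _ in range(n_features)]
            for _ in range(n_hidden)]
w_out = [[random.uniform(-1, 1) for _ in range(n_hidden)]
         for _ in range(2)]

probs = forward([1.0, 0.0, 0.5, 0.2], w_hidden, w_out)
# probs is a dict over the two output labels; the values sum to 1
```

Back-prop would then push error from those two output nodes back through the hidden layer, which is where the continuous retraining on newly arrived data happens.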

How this works, in practice:

For the sake of analogy (and, again, not to give away the farm), suppose we were classifying books. We receive a pile of books and have no idea which ones the user likes. We are told to make the neural net classify them as "Interesting" or "not". Suppose also that we are told which ones the user has read cover to cover, and which ones haven't been touched. There's our binary attribute (read: 1, unread: 0). Okay, now, remove it. If you don't, the impact of the other nodes (say, "written by Dickens") is basically nil, because the read/unread flag alone would predict the label perfectly.

In other words, we started with this chain: Features of book (author, length, style, etc...) ===> Read-cover-to-cover? ===> Interesting/Not

And removed the middle node!

Features of book (author, length, style, etc...) ===> Interesting/Not

This way, the network will determine which of the other features of the book gave rise to the classification that coincides with the user having read the book cover to cover. If our assumption is correct, this will give us a clue as to which features of the book actually make it interesting to the reader.
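The node-removal trick boils down to a small bit of bookkeeping when building training examples: the read-cover-to-cover flag becomes the label, and is deliberately excluded from the feature vector. A sketch, with invented book data:

```python
# Sketch: use the read/unread flag as the supervision signal, never as
# a feature, so the model must explain reading behavior via the other
# attributes. The books below are made up for illustration.

books = [
    {"author": "Dickens", "length": 350, "read_cover_to_cover": 1},
    {"author": "Dickens", "length": 800, "read_cover_to_cover": 1},
    {"author": "Unknown", "length": 120, "read_cover_to_cover": 0},
]

def to_example(book):
    label = book["read_cover_to_cover"]            # the label...
    features = {k: v for k, v in book.items()
                if k != "read_cover_to_cover"}     # ...is NOT a feature
    return features, label

examples = [to_example(b) for b in books]
```

Train on `examples` and the weights that emerge on "author", "length", and so on are the network's guess at what made those books get read.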

Details, details.

From whence, the feature set? 

Alright, I'll give them to you -- just remember, you asked. Here's a pretty picture:

[Image: skimbox neural net.png — a diagram of the email-feature dependencies]

Doesn't that look tasty? I had something like that for dinner last night.

Anyways, that tasty plate of spaghetti and meatballs is in fact a Bayesian network of the conditional dependencies among the factors which contribute to the classification of an email. In probabilistic terms, this equates to:

Pr(classification | attachments, trigrams(body), trigrams(subject), cc, bcc, from, In-Reply-To, seen)
  × Pr(from)
  × Pr(attachments | from)
  × Pr(trigrams(body) | bigrams(body))
  × Pr(bigrams(body) | monograms(body))
  × Pr(trigrams(subject) | bigrams(subject))
  × Pr(bigrams(subject) | monograms(subject))
  × Pr(monograms(body) | In-Reply-To)
  × Pr(monograms(subject) | from)
  × Pr(seen | trigrams(subject))
  × Pr(monograms(subject) | In-Reply-To)
  × Pr(cc | trigrams(body))
  × Pr(bcc | trigrams(body))

I actually wasn't trying to find an exact probabilistic model for classification. That would be kind of dumb in this domain, since one of the initial hurdles is the unfortunate lack of data, and without data, no probabilities can be estimated. Rather, the reason I drew all this spaghetti was that I wanted a clear picture of which factors influence the classification of an email, and which ones we don't need to care about.

According to the graph, just about everything comes down to the sender. That makes sense intuitively, since that is perhaps the first thing we look at when determining whether an email is important or not.

Therefore, I designed the neural net to consider only the following: Sender, N-grams. I threw in a few of the other headers (cc, bcc), but primarily the classification is performed by examining the sender and N-grams of the ( body | subject ) string.
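That feature extraction step can be sketched as follows. The field names and the choice of word-level trigrams are illustrative; the real system may well use character N-grams or different header handling.

```python
# Sketch of the feature set described above: the sender, plus N-grams
# (word trigrams here) of the subject and body strings, plus the cc/bcc
# headers as minor extras. The sample email is invented.

def ngrams(text, n=3):
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def extract_features(email):
    features = {"from:" + email["from"]: 1.0}       # the dominant signal
    for field in ("subject", "body"):
        for gram in ngrams(email[field]):
            features[field + ":" + gram] = 1.0
    for field in ("cc", "bcc"):                     # thrown in for good measure
        for addr in email.get(field, []):
            features[field + ":" + addr] = 1.0
    return features

email = {"from": "boss@example.com",
         "subject": "quarterly report due friday",
         "body": "please send the final numbers by end of day"}
feats = extract_features(email)
```

The resulting dictionary of binary features is exactly the shape of input the net's variable-sized input layer expects: one node per observed feature.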

There you have it, folks. Well, not all of it. We're currently working on other types of classification, some of which require processes almost as complex as the important/!important, others which are, blessedly, simpler. Details on those TK.