social graph classification

[Editor's note: As I discussed in Parts 1 and 2, now that we've gotten Gander's baseline functionality up and running, we're moving on to the fun stuff. Aka out of chaos, order. Aka prioritization. At-bat today is more classification, here using social graphing database Neo4j, its Ruby wrapper Neography, and its query language Cypher.]

There are two other currently defined means of automated classification (Naive Bayesian, and Regex). Each has distinct advantages, but neither is particularly good at promoting messages to a higher priority. From the research, a social graph holds the most promise of classifying email that requires more user attention.

Background Both Minkov and Dredze talk about using social graphs to classify email. It seems natural and intuitive - the messages read most readily are typically:

  1. in response to a sent message or part of an active dialogue.
  2. with frequent and common correspondants.

Identifying and classifying these requires pulling in a user's Sent Items in addition to their incoming email. Only the headers need be retrieved as we are simply interested in who is part of the social graph, not in what they are saying.

Keeping statistics on frequent correspondants could be done using a relational database. However there are significant advantages to using a graph database, as it provides an excellent infrastructure for arbitrary future classifications and a better exploration of the corpus using Minkov's Adaptive Graph Walks.

The Experiment I loaded up all 1500 of my Riparian Data sent items using a modified version of Mike Leonard's backup script. In order to load Sent Items from Gmail, the folder name must be "[Gmail]/Sent Mail". This returns all messages in a separate file per message, including the message body contents. I wrote a little script to extract just the Subject:, Date:, From:, and To: headers from each of the messages, which is the equivalent of the header subset that Activesync provides.

#!/bin/bashi = 0 for msg in SentItems/*.mbox; do headername=$(printf 'MSG%05d.msg' $i) echo "$headername: $msg" grep -m 4 -i -e "^To:" -e "^From:" -e "^Cc:" -e "^Date:" "$msg" > "ToFrom/$headername" i=$(( $i + 1 )) done

Installing the Graph Database After spending a few hours examining different graph database options, I elected to use Neo4J as the database. It is open source and free for development, but requires an OEM license for commercial usage. I was happy to see that enterprise versions and 7/24 support are available as potential options in the future. It also seems to have the most comprehensive toolset, language support, and maturity. I used Community Edition 1.8. The install was super easy. Just extract the zip to an arbitrary directory, and then enter "$ bin/neo4j start". There is only a single graph database per directory. If a server has to host multiple databases, then multiple instances of Neo4J have to be installed in separate directories and the default HTTP listening port has to be changed to a new value per database. To delete the database in order to start fresh (something I had to do several times until I got the hang of it), simply stop the server ("$ bin/neo4j stop"), delete the Neo4J directory and reinstall the .zip. By default, the server can be contacted via http://localhost:7474/webadmin which brings up the admin console and allows ad hoc queries. Ad Hoc queries can be done using a Neo4J SQL-like specific language ("Cypher"), as well as REST, Java and Ruby. I elected to use Ruby, specifically Max de Marzi's Neography. Graph databases are interesting in that there is no typing. There are simply nodes and relationships. Nodes have properties. Relationships have properties. Any node property can be indexed. Any relationship property can be indexed. Neo4J includes Lucene as an indexing and search engine. Any node can be related to any other node via any relationship. Loading the Emails and RelationshipsI got bogged down in a lot of useless email parsing code. Rather than continue to waste time with problems tangential to the experiment at hand, I simply grabbed several random sent item header files, and manually put just the basic information into a YAML file. A more robust email header parser remains to be done. For this first pass, I chose a minimalist graph model:

email-address -------> message -------> email-address sends receivesI created three indices: email addresses, dates and message id. 'email-address' has only one property: the email address. 'Message' has two properties:

  • MsgId: the unique name of the message file. If I were using CouchDB, this would be the unique id of the document.
  • Date

Walking the Social Graph I used Cypher to walk the social graph. It is a short and powerful declarative language like SQL or SPARQL. An alternative would have been to use Tinkerpop's Gremlin (Romiko Derbynew has a comparison between the two). Gremlin has the potential advantage of working across multiple graph databases, whereas Cypher is specific to Neo4J. I also could have simply walked the graph in Ruby. Note that Ruby and Java can invoke these other languages to return results, much like embedding SQL statements in a Java or C# application.

To see who I most frequently sent messages to:

START n= node(1) // couldn't get lookup by index working yetMATCH n-[:sends]-(msg)-[:receives]-correspondentRETURN correspondnt, count(*)ORDER BY count(*) desc;In other words, start with an emailaddr (node(1)) and walk the 'sends' relationship to the set of messages. From there, walk the 'receives' relationship to correspondents. Count the correspondents. Sort from highest to lowest.

==> +---------------------------------------------------------------+==> | correspondent                                      | count(*) |==> +---------------------------------------------------------------+==> | Node[9]{emailaddr:""}      | 3        |==> | Node[5]{emailaddr:""}       | 1        |==> | Node[12]{emailaddr:""} | 1        |==> | Node[7]{emailaddr:""}       | 1        |==> | Node[3]{emailaddr:""}           | 1        |==> +---------------------------------------------------------------+

That is now my set of most frequent destination addresses. If an incoming email matches someone on this list, the message would be prioritized higher. One could add a weighting factor for how high they should be prioritized based on how far down the list the correspondent is, or how recently a message was sent.

Graph Database Potential As I learn more about graph databases, they seem to offer promise as a potential easy and flexible means of addressing a good part of the above. The idea is that whole messages would be stored in a document store (CouchDB, MongoDB, even potentially flat files), while the graph database would allow arbitrary navigation of the corpus. This certainly looks much more promising than attempting to do this via clunky joins in a relational database, or a complex series of views in CouchDB.

The graph database could also store message previews. When a user displays a message list, they would simply be walking the message graph. When they click on a message to see it completely, it would be retrieved from the document store.

That's the idea at least - it certainly remains to be tested and proven.