Lies, Damn Lies, and the New York Post: Tracking Marathon Misinformation on Twitter

I spent much of yesterday afternoon glued to Twitter, constantly clicking the "20 new tweets" link, trying to find credible information about the events in Boston. Let me tell you--between celebrity "thoughts and prayers" simperings and breathless NY Post reports of double-digit death tolls, it was not easy going. Since the Arab Spring, "Twitter is great for communication during dangerous events" has been a popular refrain. It's true, but what's being communicated during these events sometimes isn't.

The spread of misinformation on Twitter drew a lot of notice from the mainstream media during Hurricane Sandy, when a user named @comfortablysmug started disseminating reports of the Stock Exchange being flooded and Governor Cuomo being trapped and Con Ed shutting down all power in lower Manhattan. Smug's tweets caught the eye of Buzzfeed reporter Andrew Kaczynski who, after confirming their baselessness, outed their author as one Shashank Tripathi, "the campaign manager of Christopher R. Wight, this year’s Republican candidate for the U.S. House from New York’s 12th Congressional District."

Kaczynski's admirable reporting resulted in deserved pillorying and, more importantly, ensured Smug's silence thereafter. But here are the rubs: 1) misinformation, especially of the hyperbolic kind, spreads faster than its corrections, and 2) there are countless Smugs, many of them with a good deal more followers.  By the time Kaczynski managed to take a screenshot of Smug's tweet about the MTA shutting down all service for the remainder of the week, it had been retweeted 540 times. Yesterday, the New York Post's tweet reporting 12 dead at the marathon had been retweeted 1722 times as of 5:15 PM. The Boston Police Department's report at that same time: 2 dead (it has since been updated to 3). The police department also brushed aside the Post's report of the suspect being a  "Saudi national who suffered shrapnel wounds in today's blast [and] is currently being guarded in a Boston hospital," saying "Honestly, I don't know where they're getting their information from, but it didn't come from us."

I know the Post loves blood, but this is disgusting. More than that, it's harmful, as evinced by anti-Arab reactions to the "breaking" Saudi tweet. I'm obviously no journalist, but back in high school, I did do a stint at my town newspaper. The very first thing I was told: check your sources. The second: cite them. The third: no confirmation? No story. Such commandments are clearly not in use at the Post, or at Before It's News, or Fox 11, where "Breaking" seems to be the newest euphemism for "spurious allegations."

The thing is, of course, that there were plenty of credible, on-the-ground sources providing information via Twitter yesterday (side note: someone should give about-to-graduate NU senior Taylor Dobbs a job; his reports were and continue to be clear-eyed and comprehensive). So why, then, did people retweet aggregators like Before It's News or truth-be-damned hype machines like the Post? 

The answer, in short, is: they saw the rumor tweet first. In "Why Rumors Spread Fast in Social Networks," Doerr et al. find that on a social graph, given n nodes with a density of more than 1, "after a surprisingly short time a news [story] spreads to all nodes." Generally, a rumor starts with a user with a small number of followers. One reason for this is that users with a small number of followers pick up a rumor from one of their followers and quickly pass it on, acting as "an automatic link between neighbors." The second reason is that, once a popular user picks up the rumor, after a few rounds "all popular nodes are informed." Interestingly, a rumor that starts with a small-degree node spreads to popular nodes faster, after which the remaining small-degree nodes all become informed.
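For the code-inclined, a toy simulation makes the mechanism concrete. This is only a sketch under my own assumptions, not the authors' code: it assumes a synchronous push-pull process (each round, every user contacts one random neighbor, and the rumor passes in whichever direction is possible) on a connected network given as an adjacency dict. The function name and the star-shaped example network are made up for illustration.

```python
import random


def push_pull_rounds(adj, start, rng, max_rounds=10_000):
    """Count synchronous push-pull rounds until every node has heard the rumor.

    adj maps each node to a non-empty list of neighbours (assumed connected).
    """
    informed = {start}
    rounds = 0
    while len(informed) < len(adj):
        if rounds >= max_rounds:
            raise RuntimeError("rumor did not reach everyone; is the graph connected?")
        newly_informed = set()
        for node, neighbours in adj.items():
            partner = rng.choice(neighbours)
            if node in informed:
                newly_informed.add(partner)   # push: tell the neighbour
            elif partner in informed:
                newly_informed.add(node)      # pull: hear it from the neighbour
        informed |= newly_informed
        rounds += 1
    return rounds


# One popular account (0) followed by five small accounts (1..5).
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}

print(push_pull_rounds(star, start=1, rng=random.Random(0)))  # 2
print(push_pull_rounds(star, start=0, rng=random.Random(0)))  # 1
```

Even when the rumor starts at a small account, it reaches the popular one in the first round and everyone else in the second, because each small account's only contact is the hub.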

If you skipped all that, the summary is:

Rumor spreading is extremely fast in social networks.

Why rumors, though? In Rumor Psychology, DiFonzo & Bordia say they function "to help people make sense and manage risk," but I'd say the same can be said for any updating news, rumor or true. During a disaster, or any event where facts are presently few, people will scrabble for any information they can find, and spread it so the burden of knowledge isn't theirs alone.

During the UK Riots, the Guardian published a fascinating visualization of the flow--and ebb--of rumors on Twitter. A story like "Police beat 16 year old girl" gets tweeted, generally by someone with a respectable number of followers, then is picked up by followers and followers' followers and so on. Sometimes, it is then picked up by some shoddy publication like the Daily Mail; but sometimes, it is questioned, after which point it gradually dies out.

That last bit is a pleasant takeaway, and it's not unique to the rumors the Guardian studied. Indeed, a recent study of rumor tweets during the Japanese earthquake found that if you call out a rumor tweet as such, you can help kill it. When receivers see a criticism before the original tweet, the likelihood of their spreading the rumor decreases, and the likelihood of their stopping the rumor increases by 150%.

We can't predict what we'll see on Twitter at any given point, but we can make sure to follow or keep a list of reputable journalists, and we can try to stay skeptical, especially during high anxiety events. 

So, to meld a bunch of highfalutin' advice: in times of trouble, stay classy, readers, and beware the irrational, however seductive.

How to Prioritize an Inbox, Part 3: Social Graph Classification

[Editor's note: As I discussed in Parts 1 and 2, now that we've gotten Gander's baseline functionality up and running, we're moving on to the fun stuff. Aka out of chaos, order. Aka prioritization. At-bat today is more classification, here using social graphing database Neo4j, its Ruby wrapper Neography, and its query language Cypher.]

There are two other currently defined means of automated classification (Naive Bayesian, and Regex). Each has distinct advantages, but neither is particularly good at promoting messages to a higher priority. From the research, a social graph holds the most promise of classifying email that requires more user attention.

Background

Both Minkov and Dredze talk about using social graphs to classify email. It seems natural and intuitive - the messages read most readily are typically:

  1. in response to a sent message or part of an active dialogue.
  2. with frequent and common correspondents.

Identifying and classifying these requires pulling in a user's Sent Items in addition to their incoming email. Only the headers need be retrieved as we are simply interested in who is part of the social graph, not in what they are saying.

Keeping statistics on frequent correspondents could be done using a relational database. However, there are significant advantages to using a graph database, as it provides an excellent infrastructure for arbitrary future classifications and a better exploration of the corpus using Minkov's Adaptive Graph Walks.

The Experiment

I loaded up all 1500 of my Riparian Data sent items using a modified version of Mike Leonard's backup script. In order to load Sent Items from Gmail, the folder name must be "[Gmail]/Sent Mail". This returns all messages in a separate file per message, including the message body contents. I wrote a little script to extract just the Subject:, Date:, From:, and To: headers from each of the messages, which is the equivalent of the header subset that ActiveSync provides.

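    #!/bin/bash
    # Copy the first four matching header lines of each saved message
    # into a numbered file under ToFrom/ (the directory must already exist).
    i=0
    for msg in SentItems/*.mbox; do
      headername=$(printf 'MSG%05d.msg' "$i")
      echo "$headername: $msg"
      # -m 4 stops after four matching lines; -i ignores header case.
      grep -m 4 -i -e "^To:" -e "^From:" -e "^Cc:" -e "^Date:" "$msg" > "ToFrom/$headername"
      i=$(( i + 1 ))
    done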
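A grep-based extraction like this will trip over folded (multi-line) headers and display names containing commas. As a rough alternative, here is a minimal sketch using Python's standard email package; it is not what I actually ran, and the function name, the returned dict layout, and the sample message are all made up for illustration.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_headers(raw_message):
    """Return lower-cased From/To/Cc addresses plus the raw Date header."""
    msg = message_from_string(raw_message)

    def addresses(header_name):
        # get_all returns every occurrence of the header (or [] if absent);
        # getaddresses splits "Name <addr>, addr2" style lists.
        values = msg.get_all(header_name, [])
        return [addr.lower() for _name, addr in getaddresses(values) if addr]

    return {
        "from": addresses("From"),
        "to": addresses("To"),
        "cc": addresses("Cc"),
        "date": msg.get("Date"),
    }


# Hypothetical sample message; the body is never inspected.
sample = (
    "From: Me <me@example.com>\n"
    "To: Alice <Alice@example.com>, bob@example.com\n"
    "Date: Mon, 15 Apr 2013 10:00:00 -0400\n"
    "Subject: hello\n"
    "\n"
    "Body text is ignored.\n"
)
headers = extract_headers(sample)
print(headers["to"])  # ['alice@example.com', 'bob@example.com']
```

Lower-casing the addresses keeps "Alice@example.com" and "alice@example.com" from becoming two different nodes later on.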

Installing the Graph Database

After spending a few hours examining different graph database options, I elected to use Neo4j. It is open source and free for development, but requires an OEM license for commercial usage. I was happy to see that enterprise versions and 24/7 support are available as potential options in the future. It also seems to have the most comprehensive toolset, language support, and maturity. I used Community Edition 1.8.

The install was super easy. Just extract the zip to an arbitrary directory, and then enter "$ bin/neo4j start". There is only a single graph database per directory. If a server has to host multiple databases, then multiple instances of Neo4j have to be installed in separate directories and the default HTTP listening port has to be changed to a new value per database. To delete the database in order to start fresh (something I had to do several times until I got the hang of it), simply stop the server ("$ bin/neo4j stop"), delete the Neo4j directory and re-extract the .zip.

By default, the server can be contacted via http://localhost:7474/webadmin, which brings up the admin console and allows ad hoc queries. Ad hoc queries can be done using a SQL-like language specific to Neo4j ("Cypher"), as well as REST, Java and Ruby. I elected to use Ruby, specifically Max de Marzi's Neography.

Graph databases are interesting in that there is no typing. There are simply nodes and relationships. Nodes have properties. Relationships have properties. Any node property can be indexed. Any relationship property can be indexed. Neo4j includes Lucene as an indexing and search engine. Any node can be related to any other node via any relationship.

Loading the Emails and Relationships

I got bogged down in a lot of useless email parsing code. Rather than continue to waste time with problems tangential to the experiment at hand, I simply grabbed several random sent item header files, and manually put just the basic information into a YAML file. A more robust email header parser remains to be done. For this first pass, I chose a minimalist graph model:

    email-address -------> message -------> email-address
                   sends            receives

I created three indices: email addresses, dates and message id. 'email-address' has only one property: the email address. 'Message' has two properties:

  • MsgId: the unique name of the message file. If I were using CouchDB, this would be the unique id of the document.
  • Date
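To make the shape of that model concrete, here is a toy in-memory version in Python. It is only a sketch, not the Neo4j/Neography code I used: the class name, method names, and example addresses are hypothetical, and plain dicts stand in for the three indices.

```python
from collections import defaultdict


class MailGraph:
    """Toy model of: email-address -sends-> message -receives-> email-address."""

    def __init__(self):
        self.messages = {}                 # MsgId -> {"date": ...}  (message id index)
        self.sends = defaultdict(set)      # address -> MsgIds it sent (address index)
        self.receives = defaultdict(set)   # MsgId -> addresses that received it
        self.by_date = defaultdict(set)    # date -> MsgIds (date index)

    def add_message(self, msg_id, date, sender, recipients):
        self.messages[msg_id] = {"date": date}
        self.sends[sender].add(msg_id)
        self.receives[msg_id].update(recipients)
        self.by_date[date].add(msg_id)

    def recipients_of(self, sender):
        """Walk 'sends' then 'receives'; yield one address per delivery."""
        for msg_id in sorted(self.sends.get(sender, ())):
            for address in sorted(self.receives[msg_id]):
                yield address


graph = MailGraph()
graph.add_message("MSG00000.msg", "2013-04-15", "me@example.com",
                  ["a@example.com", "b@example.com"])
graph.add_message("MSG00001.msg", "2013-04-16", "me@example.com",
                  ["a@example.com"])
walked = list(graph.recipients_of("me@example.com"))
print(walked)  # ['a@example.com', 'b@example.com', 'a@example.com']
```

An address shows up once per message it received, which is exactly what a frequency count needs.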

Walking the Social Graph

I used Cypher to walk the social graph. It is a short and powerful declarative language like SQL or SPARQL. An alternative would have been to use Tinkerpop's Gremlin (Romiko Derbynew has a comparison between the two). Gremlin has the potential advantage of working across multiple graph databases, whereas Cypher is specific to Neo4j. I also could have simply walked the graph in Ruby. Note that Ruby and Java can invoke these other languages to return results, much like embedding SQL statements in a Java or C# application.

To see who I most frequently sent messages to:

    START n=node(1)   // couldn't get lookup by index working yet
    MATCH n-[:sends]-(msg)-[:receives]-correspondent
    RETURN correspondent, count(*)
    ORDER BY count(*) desc;

In other words, start with an emailaddr (node(1)) and walk the 'sends' relationship to the set of messages. From there, walk the 'receives' relationship to correspondents. Count the correspondents. Sort from highest to lowest.

==> +-------------------------------------------------------+
==> | correspondent                              | count(*) |
==> +-------------------------------------------------------+
==> | Node[9]{emailaddr:"yyy@ripariandata.com"}  | 3        |
==> | Node[5]{emailaddr:"mmm@ripariandata.com"}  | 1        |
==> | Node[12]{emailaddr:"rr@softartisans.com"}  | 1        |
==> | Node[7]{emailaddr:"cc@ripariandata.com"}   | 1        |
==> | Node[3]{emailaddr:"ww@gmail.com"}          | 1        |
==> +-------------------------------------------------------+

That is now my set of most frequent destination addresses. If an incoming email matches someone on this list, the message would be prioritized higher. One could add a weighting factor for how high they should be prioritized based on how far down the list the correspondent is, or how recently a message was sent.
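As a rough illustration of such a weighting, the sketch below scores each correspondent by how often and how recently I wrote to them, then turns that into a 0-to-1 boost for incoming mail. Everything here is an assumption for the sake of example: the exponential decay, the 30-day half-life, the function names, and the sample addresses are mine, and none of it has been tested against real mail.

```python
from collections import Counter
from datetime import date


def correspondent_scores(sent, today, half_life_days=30.0):
    """Score recipients: each sent message counts, decayed by its age."""
    scores = Counter()
    for sent_on, recipients in sent:
        age_days = (today - sent_on).days
        # Weight is 1.0 for a message sent today, 0.5 after one half-life.
        weight = 0.5 ** (age_days / half_life_days)
        for address in recipients:
            scores[address] += weight
    return scores


def priority_boost(sender, scores):
    """Return a 0..1 boost for an incoming message from `sender`."""
    if not scores:
        return 0.0
    return scores.get(sender, 0.0) / max(scores.values())


# Hypothetical sent-items data: (date sent, recipient addresses).
sent = [
    (date(2013, 4, 15), ["yyy@example.com"]),
    (date(2013, 4, 15), ["yyy@example.com", "mmm@example.com"]),
    (date(2013, 3, 16), ["ww@example.com"]),   # 30 days old
]
scores = correspondent_scores(sent, today=date(2013, 4, 15))

print(priority_boost("yyy@example.com", scores))       # 1.0
print(priority_boost("ww@example.com", scores))        # 0.25
print(priority_boost("stranger@example.com", scores))  # 0.0
```

Dividing by the top score keeps the boost on a fixed scale, so it can be combined with the output of the other classifiers without retuning as the mailbox grows.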

Graph Database Potential

As I learn more about graph databases, they seem to offer promise as a potential easy and flexible means of addressing a good part of the above. The idea is that whole messages would be stored in a document store (CouchDB, MongoDB, even potentially flat files), while the graph database would allow arbitrary navigation of the corpus. This certainly looks much more promising than attempting to do this via clunky joins in a relational database, or a complex series of views in CouchDB.

The graph database could also store message previews. When a user displays a message list, they would simply be walking the message graph. When they click on a message to see it completely, it would be retrieved from the document store.

That's the idea at least - it certainly remains to be tested and proven.

Social Network Analysis: When, How, and Why Harry Met Sally

[Image credit: Marc Smith]

One of the most interesting segments of email analytics is social graphing, that is, mapping out the relationships of a given inbox. You can do this as a simple one:one tie, but it is more interesting and insightful if you weight the ties according to any number of criteria (number of responses, time between responses, tone of content, number of ties to others in your network, etc.). In the course of researching what these criteria might be, I’ve come across a bunch of very cool papers on social network analysis that I thought I’d share with you. For a quick overview on SNA, I recommend reading Valdis Krebs’s introduction.
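To give a feel for what weighting a tie might look like, here is a deliberately simple sketch. It is not taken from any of the papers below: the three features, the saturation constants, and the 0.5/0.3/0.2 weights are all arbitrary assumptions, chosen only to show several criteria being folded into one number between 0 and 1.

```python
def saturate(value, half_point):
    """Map a non-negative count onto 0..1, reaching 0.5 at half_point."""
    return value / (value + half_point)


def tie_strength(reply_count, median_reply_hours, shared_contacts,
                 weights=(0.5, 0.3, 0.2)):
    """Combine three illustrative criteria into one 0..1 tie weight."""
    w_replies, w_speed, w_shared = weights
    volume = saturate(reply_count, 5)                 # more replies, stronger tie
    speed = 1.0 / (1.0 + median_reply_hours / 24.0)   # faster replies, stronger tie
    overlap = saturate(shared_contacts, 5)            # more mutual ties, stronger tie
    return w_replies * volume + w_speed * speed + w_shared * overlap


close_colleague = tie_strength(reply_count=20, median_reply_hours=2.0, shared_contacts=8)
distant_contact = tie_strength(reply_count=1, median_reply_hours=72.0, shared_contacts=0)
print(close_colleague > distant_contact)  # True
```

In practice the weights would have to be fitted to something observable, such as which messages a user actually opens first.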

1)      Reputation Network Analysis for Email Filtering (from KDNuggets)

  • By: Jennifer Golbeck and James Hendler
  • Gist: Golbeck and Hendler take an inverse approach to spam: they highlight good messages and display the numerical reputation of their senders. This reputation score is inferred based on reputation scores users have manually entered for people they know.
  • Money quote: “The goal of this scoring system is not to give low ratings to bad senders, thus showing low numbers next to spam messages in the inbox. The main premise is to provide  higher ratings to non-spam senders, so users are able to identify messages of interest that they might not otherwise have recognized. This puts a lower burden on the user, since there is no need to rate all of the spam senders.”

2)      Assessing Vaccination Sentiments with Online Social Media: Implications for Infectious Disease Dynamics and Control

  • By: Marcel Salathé
  • Gist: One of the more compelling usages of SNA is as an epidemic/pandemic forecaster, the theory being that the communication patterns of nodes in a given network can mirror the physical contact patterns. Here, Salathé looks at Twitter data from 101,853 users and assesses their H1N1 vaccination sentiment over time. He finds that a) positive and negative sentiments form clusters, and b) there is a positive correlation between a cluster’s negative vaccine sentiments and its likelihood of disease outbreaks. Whether this is due to causation or homophily, I’m not sure.
  • Money quote: “We find that projected vaccination rates based on sentiments expressed on Twitter are in very good agreement with vaccination rates estimated by the CDC with traditional phone surveys.”

3)      Measuring Tie-Strength in Virtual Social Networks

  • By: Andrea Petroczi
  • Gist: This paper gives some good background on computer-mediated social networks and tie-strength, and gives a methodology for determining the latter based on the VTS-scale, which measures acquaintance and friendship among members of a given virtual community.
  • Money quote: “Both offline and on-line social networks can be described by 1) their participants, 2) the content, direction, and strength of their relations and ties, 3) their composition, derived from the social attributes of the participants, and 4), their complexity, which indicates the number of relations in a tie.”

4)      Analyzing Social Media Networks with NodeXL

  • By: Marc Smith
  • Gist: NodeXL is an add-in to Excel that allows users to visualize social networks. (To see it in action, check out Smith’s crowd-sourced Flickr gallery). This paper demonstrates how to use it on a given social media data set (in this case, an enterprise intranet social network). Those less pressed for time might want to check out the book version, co-authored by Derek Hansen and Ben Shneiderman.

5)      Semantic Social Network Analysis

  • By: Guillaume Erétéo
  • Gist: Users of so-called “enterprise 2.0” platforms often form heterogeneous social networks, and in this thesis, Erétéo proposes a way to analyze these networks (for the purpose of creating project teams, identifying experts, fostering communication, etc.) using the Semantic SNA Framework (SEMSNA) and semantic community detection and controlled labeling (SEMTAGp).
  • Money quote: “The ‘optimal partition’, imposed by mathematics, does not necessarily capture the actual community structure of the network.”