Note 1: The real title of this post is something like MRRRGGHSTUPIDSTUPIDWHYWHYUCRASSSSHURGGH. The subtitle is "How are Scanned PDFs a thing?"
Actually, I know the answer to that: scanned pdfs are a thing because they are super amazing at obfuscating data in plain sight. Which is why whenever a corporation or agency or politician is required by law to surrender its/his/her emails, they oblige by giving the requirer a Mall of America-sized dump of paper, which the poor requirer must then scan and begin some variation of the painful process I am about to detail.
Or, if you're the Guardian, you bunk the painful process and ask your readers to help you manually sort through the emails. But alas, I am not the Guardian, and my mom can only read so fast.
So! Let's begin, shall we?
Step 1: Download a free trial of Wondershare PDF Editor Pro.
You'll need this to turn your scanned pdf into a text pdf. Well, you won't need this if you have Acrobat, but if you don't have Acrobat, and don't plan on doing this on the reg, use the Wondershare free trial. If you are planning on doing this on the reg, you should probably get Acrobat, because Wondershare isn't all that great. Alternatively, if you are a developer and are the patient sort, try to muddle through the Tabula installation.
Step 2: Perform OCR on the scanned PDF.
To do this, open Wondershare and then select your pdf. Wondershare will ask you if you want to convert your pdf from image to text, which, duh. Then you have to wait. If your pdf is 1000 pages and you work off a MacBook, you're going to have to wait for 3-4 hours, if you're lucky. If you're not lucky, Wondershare will crash at around 2 hours and 45 minutes in. So my recommendation is to split that puppy into 200-page chunks or smaller. But eventually, you should have your text pdf. Wahoo!
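If you want to plan the chunks before you start splitting, here is a minimal Ruby sketch that works out the page ranges. It does not touch the PDF itself, and page_chunks is a made-up name; you'd still do the actual splitting in whatever PDF tool you have.

```ruby
# Hypothetical helper: returns [first_page, last_page] pairs covering
# pages 1..total_pages in chunks of at most chunk_size pages.
def page_chunks(total_pages, chunk_size = 200)
  (1..total_pages).each_slice(chunk_size).map { |pages| [pages.first, pages.last] }
end

# Print the ranges for a 1000-page scan.
page_chunks(1000).each do |first, last|
  puts "pages #{first}-#{last}"
end
```

The last chunk simply comes out shorter when the page count isn't a multiple of the chunk size.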
Step 3: Turn your PDF into a TXT file.
For this, you'll need Adobe Reader. Unlike Acrobat, Reader is free. Get it. Then open your PDF, and save it as TXT.
Step 4: Turn your TXT into a CSV.
In order to analyze your data in Excel or Fusion or what have you, you'll want it in tabular format. So open up your TXT in Text Edit or the Windows equivalent and change the extension to CSV.
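If you'd rather not fiddle with extensions by hand, the same step can be sketched in Ruby. This assumes your file ends in .txt; copy_as_csv is a hypothetical helper name, and it copies rather than renames so the original stays put.

```ruby
require 'fileutils'

# Hypothetical helper: copies a .txt file to the same name with a .csv
# extension and returns the new path. The original file is left alone.
def copy_as_csv(txt_path)
  csv_path = txt_path.sub(/\.txt\z/i, '') + '.csv'
  FileUtils.cp(txt_path, csv_path)
  csv_path
end

# Example call (the file name is a placeholder):
# copy_as_csv('emails.txt')
```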
Step 5: Clean up your CSV with Ruby.
Chances are, if you're dealing with a scanned PDF, your data is going to be a big ole mess. Here enters the only code of the post. You can tweak it to suit your headings and separators. Basically, it splits your CSV on "From:" so that each chunk is one email, and then splits each chunk into key-value pairs like 'header => value'. These are used to create the email objects, which then get printed out with missing fields left blank. Ultimately, everything ends up as one line per email, with each field's value separated by commas.
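To make the key-value idea concrete, here is a small sketch run on a made-up email string. The sample text, the comma separator, and the fields_from name are all assumptions for illustration; your scanned text will almost certainly be messier.

```ruby
# Hypothetical sample: one email's worth of text, with comma-separated
# "Header: value" fields.
sample = 'From: alice@example.com, Sent: 6/2/2008 10:15 AM, To: bob@example.com, Subject: Budget'

# Turns "Header: value" fields into a hash like { 'from' => '...' }.
# Splitting on the first colon only keeps times like "10:15" intact.
def fields_from(text)
  text.split(',').each_with_object({}) do |part, fields|
    header, value = part.split(':', 2)
    next if value.nil? # no colon, so no header to file this under
    fields[header.strip.downcase] = value.strip
  end
end

p fields_from(sample)
```

Any header that never shows up is simply absent from the hash, which is what lets the output leave that column blank.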
Alors, here is the script. Put it in your text editor and save it as palinparser.rb or whatever you like. Make sure you check where you're saving it to, as that's where your finished spreadsheet will end up. The only things you'll need to change are the headings (unless you're also analyzing an email set) and then the original file name.
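# One email: pulls out each header's value and prints them as one CSV row.
class PalinEmail
  def initialize(email)
    @from = ''
    @sent = ''
    @to = ''
    @cc = ''
    @subject = ''
    @body = ''
    # The main loop splits on "From:", which eats that header, so put it back.
    # Assumes fields look like "Header: value" separated by commas; tweak the
    # separator and the headings below to match your file.
    ('From:' + email).split(',').each do |part|
      pieces = part.split(':', 2)
      next if pieces.length < 2
      value = pieces[1].strip
      case pieces[0].strip.downcase
      when 'from' then @from = value
      when 'sent' then @sent = value
      when 'to' then @to = value
      when 'cc' then @cc = value
      when 'subject' then @subject = value
      when 'body' then @body = value
      end
    end
  end

  def to_s
    @from + ',' + @sent + ',' + @to + ',' + @cc + ',' + @subject + ',' + @body
  end
end

doc = File.open('YOURCURRENTFILENAME.csv')
out = File.new('YOURNEWFILENAME.csv', 'w')
out << "from,sent,to,cc,subject,body\n"
doc.each_line do |line|
  line.split('From:').each do |email|
    next if email.strip.empty?
    pem = PalinEmail.new(email)
    out << pem.to_s + "\n"
  end
end
doc.close
out.close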
Then, open up your terminal (Utilities-->Terminal). Check what directory you're in by typing "pwd". Change this to the directory where you've stored palinparser.rb by typing "cd [+ path to directory]". All set? Now, assuming you have Ruby, type "ruby palinparser.rb" into your terminal.
Et voila, you should have your nice neat spreadsheet.
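One thing to watch for: if a subject or body contains a comma of its own, gluing fields together with ',' will shift the columns in your spreadsheet. A minimal sketch of a workaround uses Ruby's standard csv library, which quotes such fields for you. email_row is a hypothetical helper, not part of the script above.

```ruby
require 'csv'

# Hypothetical helper: builds one CSV line from the six email fields,
# quoting any field that contains a comma, a quote, or a line break.
def email_row(from, sent, to, cc, subject, body)
  CSV.generate_line([from, sent, to, cc, subject, body])
end

# The body here has a comma in it, so it comes out wrapped in quotes.
puts email_row('alice@example.com', '6/2/2008', 'bob@example.com',
               'carol@example.com', 'Budget', 'Hi Bob, see attached')
```

Excel and most other spreadsheet tools read the quoted field back as a single cell.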
Step 6: Analyze
Along with basic Excel analysis like send frequencies over time:
You can also use Google Fusion Tables to make network graphs. If you've never used it before, details for doing so are here. All you need is a Google account. This one shows the relationship between sender and first recipient.
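Before uploading, it can help to boil the spreadsheet down to sender and first-recipient pairs, since that is all this graph draws. Here is a rough sketch, assuming each row is a hash with 'from' and 'to' keys and that multiple recipients are separated by semicolons. Both are assumptions about your data, and edge_counts is a made-up name.

```ruby
# Hypothetical helper: counts how often each sender wrote to each
# first recipient. Rows missing either field are skipped.
def edge_counts(rows)
  rows.each_with_object(Hash.new(0)) do |row, counts|
    from = row['from'].to_s.strip
    first_to = row['to'].to_s.split(';').first.to_s.strip
    next if from.empty? || first_to.empty?
    counts[[from, first_to]] += 1
  end
end

# Made-up rows for illustration.
rows = [
  { 'from' => 'alice', 'to' => 'bob; carol' },
  { 'from' => 'alice', 'to' => 'bob' },
  { 'from' => 'bob',   'to' => 'alice' }
]
edge_counts(rows).each { |(from, to), n| puts "#{from},#{to},#{n}" }
```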
The other thing I like using is word clouds, which can give you quick insight into the most-used words in a body of text. I used Stanford's Wordsift tool to make this word cloud of the bodies of the SP emails.
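If you just want the numbers behind a word cloud, a quick word count gets you most of the way. This is a sketch, not what Wordsift does under the hood: top_words is a hypothetical helper, and skipping words under four letters is a crude stand-in for a proper stop-word list.

```ruby
# Hypothetical helper: returns the `limit` most frequent words in `text`
# as [word, count] pairs, ignoring case and words shorter than 4 letters.
# Ties are broken alphabetically.
def top_words(text, limit = 10)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z']+/).each do |word|
    counts[word] += 1 if word.length >= 4
  end
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

# Made-up text for illustration.
top_words('Energy energy ENERGY pipeline pipeline budget the the', 3).each do |word, count|
  puts "#{word}: #{count}"
end
```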
That's all I have for today. Hope it's useful! If you have any questions, hit me up in the comments or on Twitter!