What to do with the docs

Andrew Tran
September 30, 2015

Homework for next week

  • Answer questions from this week's readings
  • Extract data from a PDF, clean it up and save as a spreadsheet
  • Critique a piece of data journalism

Notes about schedule

  • No class on October 21
  • However, your midterm will be due on October 21
  • Dates for portions of the Final Project are on the schedule
    • Structured like the story process of a newsroom
    • Pitch, revisions, repitch, storyboarding, draft, editing, final copy, last bits of copy-editing and visualization notes, then published.
  • November 11 - A visit from Matt Carroll and a screening of Spotlight at 6 or 6:30 p.m. There will be homework based on the visit.

What's wrong with this data visualization?

Success?

  • Your Freedom of Information request finally yielded a big brown envelope in the mail.
  • You are the lucky recipient of a juicy leak.
  • You’ve managed to scrape all the PDFs from that stone-age government portal.

Now all you have to do is the reporting.

Nope.

If only it were that easy.

  • Your next steps depend on what you’ve got and what you’re trying to do.
  • You might have one page or one million pages.
  • You could be starting with a tall stack of paper or a CSV file or anything in between.
  • Maybe you already know exactly what you’re looking for, or maybe that anonymous tip was maddeningly non-specific.

Are the documents on paper?

In 2012, U.S. vice-presidential candidate Paul Ryan was campaigning on the claim that government stimulus money was wasteful. Jack Gillum, a data journalist at the AP, wondered whether Ryan had himself previously accepted money from that same stimulus package.

He sent FOIA requests to more than 300 government agencies and got back stacks of paper.

Over the next seven weeks, the stack of pages the government sent to Gillum steadily grew taller on the corner of his desk—to 12 inches, then to 2 feet and higher. For each file, he scanned the pages electronically and uploaded them to the AP’s internal “APDocs” DocumentCloud server.

Paper documents

Step 1: Get them off the paper.

Get them scanned so you can search, analyze, collaborate, publish, or really do anything at all with your documents other than read them alone in your room.

Paper documents

According to a survey, journalists begin reporting with paper documents (as opposed to digital document files) about half the time. Much of that paper comes from governments, which usually respond to Freedom of Information requests on paper.

Why?

Paper is still popular among government agencies.

Redacting documents electronically is not trusted enough.

Lots of paper documents

If you have hundreds of pages of documents, scanning them one sheet at a time won't do.

Most copy machines have a sheet feeder.

Use one in school or at a Kinko's.

Most copy machines are also connected to the internet and can email you PDFs of the scanned documents.

The AP and Paul Ryan

The reporter scanned the documents as they came in, using his office copy machine.

He ended up with almost 9,000 pages, which he analyzed with software that helped add metadata to the PDF documents.

The story: Ryan asked for federal help as he championed cuts

Republican vice presidential candidate Paul Ryan is a fiscal conservative, champion of small government and critic of federal handouts. But as a congressman in Wisconsin, Ryan lobbied for tens of millions of dollars on behalf of his constituents for the kinds of largess he’s now campaigning against, according to an Associated Press review of 8,900 pages of correspondence between Ryan’s office and more than 70 executive branch agencies.

Text Files or Image Files?

When you scan a document, the computer doesn't recognize the result as text.

It considers it an image.

Scanned documents are merely images of text.

Text file formats:

  • Word files, TXT, RTF, HTML

Image file formats:

  • JPEG, TIFF, PNG, GIF

A PDF can be either a text file or an image file.

How to test

Open the PDF and try to:

  • Search (Ctrl+F)
  • Select
  • Copy and paste snippets

If successful, it's text. Congrats.

If it's not, it's an image. Sorry.
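The same test can be roughly automated when you have more PDFs than you want to open by hand. The sketch below is only one way to do it: it assumes the third-party pypdf package is installed, and the function names and the 20-character threshold are made up for illustration. It asks each page for its text and calls the file "text" if at least half the pages give something back.

```python
def looks_like_text_pdf(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is text-based from the text extracted per page.

    page_texts is a list with one string per page. The threshold is an
    arbitrary assumption: a page counts as "text" if it yields at least
    min_chars_per_page non-blank characters.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts
        if len((text or "").strip()) >= min_chars_per_page
    )
    # Call it a text PDF if at least half of the pages have real text.
    return pages_with_text >= len(page_texts) / 2


def extract_page_texts(pdf_path):
    """Return the extractable text of each page in the PDF at pdf_path.

    Assumption: the pypdf package is installed (pip install pypdf).
    The import sits inside the function so the checker above still
    works without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

To try it on a file of your own: looks_like_text_pdf(extract_page_texts("my_document.pdf")). A scanned PDF that has already been through OCR will also pass this test, because OCR adds a hidden text layer.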

Convert the image to text with OCR

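OCR stands for Optical Character Recognition: software that reads the letter shapes in an image and turns them into text you can search and copy.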
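For the curious, here is roughly what OCR looks like in code. This is a sketch under assumptions, not a recommendation of a particular tool: it assumes the open-source Tesseract engine is installed along with the pytesseract and Pillow packages, and both function names are invented for this example. The second function does a little cleanup, since OCR output tends to come back with words split across line ends and uneven spacing.

```python
import re


def ocr_image(image_path, language="eng"):
    """Run OCR on one scanned page image and return the recognized text.

    Assumptions: the Tesseract OCR engine is installed on the machine,
    plus the pytesseract and Pillow packages (pip install pytesseract pillow).
    """
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page_image:
        return pytesseract.image_to_string(page_image, lang=language)


def tidy_ocr_text(raw_text):
    """Light cleanup of OCR output.

    Caution: rejoining hyphenated line ends will also merge words that
    are genuinely hyphenated, so always spot-check against the scan.
    """
    # Rejoin words that were hyphenated across a line break.
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw_text)
    # Collapse runs of spaces and tabs into a single space.
    joined = re.sub(r"[ \t]+", " ", joined)
    # Trim stray whitespace at the edges of each line.
    return "\n".join(line.strip() for line in joined.splitlines()).strip()
```

For a scanned page saved as page1.png, tidy_ocr_text(ocr_image("page1.png")) would return the cleaned text. OCR makes mistakes, especially with numbers and poor scans, so check anything you plan to publish against the original page.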

Tools:

What if you just want the numbers?

Try copying and pasting the table data into Excel or Google Sheets. (The two programs sometimes handle the same pasted data differently.)

Try opening the PDF in a browser such as Chrome instead of Adobe Reader or Preview, then copy and paste again. (Each program can select and copy the text differently.)
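When a pasted table lands in a single column, a few lines of code can sometimes split it back into cells. The sketch below uses only Python's standard library. It assumes the columns in the pasted text are separated by a tab or by two or more spaces, which is common but not guaranteed, and the function names and sample rows are made up for illustration.

```python
import csv
import io
import re


def pasted_table_to_rows(pasted_text):
    """Split text copied from a PDF table into a list of rows of cells.

    Assumption: columns are separated by a tab or by two or more spaces.
    Single spaces are kept, so "Agency A" stays together in one cell.
    """
    rows = []
    for line in pasted_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        cells = re.split(r"\t| {2,}", line)
        rows.append([cell.strip() for cell in cells])
    return rows


def rows_to_csv(rows):
    """Return the rows as CSV text, ready to save and open in a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample, only to show the shape of the input.
sample = "Agency    Pages\nAgency A    120\nAgency B\t45\n"
print(rows_to_csv(pasted_table_to_rows(sample)))
```

Save the returned text to a file ending in .csv and it will open in Excel or Google Sheets. Check the row and column counts against the PDF, because cells that were empty in the original table will shift everything after them to the left.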

Tools:

Are the numbers on a website?