Andrew Tran
September 30, 2015
Now all you have to do is the reporting.
If only it were that easy.
In 2012, U.S. vice-presidential candidate Paul Ryan was campaigning on the claim that government stimulus money was wasteful. An AP data journalist wondered whether Ryan had previously accepted money from that same stimulus package.
He sent FOIA requests to 300+ government agencies and got back stacks of paper.
Over the next seven weeks, the stack of pages the government sent to Gillum steadily grew taller on the corner of his desk—to 12 inches, then to 2 feet and higher. For each file, he scanned the pages electronically and uploaded them to the AP’s internal “APDocs” DocumentCloud server.
Step 1: Get them off the paper.
Get them scanned so you can search, analyze, collaborate, publish, or really do anything at all with your documents other than read them alone in your room.
According to a survey, journalists begin reporting with paper documents (as opposed to digital document files) about half the time. Much of that paper comes from governments, which usually respond to Freedom of Information requests on paper.
Why?
Paper is still popular among government agencies.
Redacting documents electronically is not trusted enough.
If you have hundreds of pages of documents, then a single scanner won't do.
Most copy machines have a sheet feeder.
Use one in school or at a Kinko's.
Most copy machines are also connected to the internet and can send PDFs of scanned documents to an email address.
The reporter scanned the documents as they came in, using his office copy machine.
He ended up with almost 9,000 pages, which he analyzed in software that helped add metadata to the PDF documents.
The story: Ryan asked for federal help as he championed cuts
Republican vice presidential candidate Paul Ryan is a fiscal conservative, champion of small government and critic of federal handouts. But as a congressman in Wisconsin, Ryan lobbied for tens of millions of dollars on behalf of his constituents for the kinds of largess he’s now campaigning against, according to an Associated Press review of 8,900 pages of correspondence between Ryan’s office and more than 70 executive branch agencies.
When you scan a file, a computer doesn't recognize it as text.
It considers it an image.
Scanned documents are merely images of text.
Text file formats:
Image file formats:
How to test
Open the PDF and try to:
If successful, it's text. Congrats.
If it's not, it's an image. Sorry.
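The same check can be roughed out in code. The sketch below is a heuristic, not a PDF parser: it assumes that a PDF with real text declares at least one font, while an image-only scan usually declares none. It looks for the `/Font` name in the raw file and inside any Flate-compressed streams. The function name `declares_fonts` is made up for this example, and other stream filters or encrypted files are ignored.

```python
import re
import zlib

# Captures the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def declares_fonts(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the PDF mentions a font, so it probably holds real text."""
    # Older or uncompressed PDFs keep their font dictionaries in plain sight.
    if b"/Font" in pdf_bytes:
        return True
    # Newer PDFs may pack dictionaries into Flate-compressed streams,
    # so inflate each stream that looks like zlib data and search it too.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (a JPEG scan, for example); skip it
        if b"/Font" in inflated:
            return True
    return False
```

To try it on a file, read the file in binary mode and pass the bytes in, for example `declares_fonts(open("request.pdf", "rb").read())`, where `request.pdf` is a placeholder name. A result of False suggests the pages are images and need OCR. A result of True only suggests text is present, so the copy-and-paste test is still the final word.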
OCR stands for Optical Character Recognition
Tools:
Try copying and pasting the table data into Excel and Google Spreadsheets. (Each program sometimes behaves differently.)
Try opening the PDF in a browser such as Chrome instead of Adobe or Preview, then copying and pasting again. (Each viewer sometimes copies the table differently.)
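When a pasted table lands in a single column, a few lines of code can sometimes split it back apart. The sketch below assumes the copied text separates columns with a tab or with two or more spaces, which is common but not guaranteed. The function name `pasted_table_to_csv` is invented for this example.

```python
import csv
import io
import re

# Assumption: a tab, or a run of 2+ whitespace characters, marks a column break.
COLUMN_BREAK = re.compile(r"\s*\t\s*|\s{2,}")


def pasted_table_to_csv(pasted: str) -> str:
    """Turn text copied from a PDF table into CSV, one row per non-empty line."""
    rows = []
    for line in pasted.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(COLUMN_BREAK.split(line))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Cells that contain single spaces, such as an agency name, stay whole, because only tabs and wider gaps count as breaks. The resulting CSV opens directly in Excel or Google Spreadsheets. If the rows come out with uneven column counts, the spacing assumption does not hold for that PDF.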
Tools: