Setting up a reproducible data analysis workflow in R

Zip of files referred to in this walkthrough

The goal of reproducible data analysis is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.

For journalists

  • Builds trust among readers
  • Enhances transparency
  • Simplifies peer review
  • Promotes community

Thanks

These are all things I picked up from browsing other presentations and repos.

Much thanks to Jenny Bryan and Joris Muller from whom I cobbled many of these ideas and practices from.

Also to BuzzFeed, FiveThirtyEight, ProPublica, Chicago Tribune, Los Angeles Times, and TrendCT.org

Purpose

Why a clear data analysis workflow

  • Check analysis and track errors (?)
  • Share results with colleagues for stories or editing
  • Send methodology to sources for bullet-proofing
  • To easily adjust when presented with new data
  • Easily switch between work environments (desktop and laptop)
  • Scavenge and repurpose code in future projects

Constraints

  • Workflow has to be platform agnostic
  • Easy to deploy for yourself and others
  • Free open source software
  • Input has to be real raw data in whatever format it is (and wherever it is)
  • But have a backup for when internet is not accessible
  • Output has to work – whether html, PDF, or web app
  • IDE agnostic (be able to run it from a command line without Rstudio)

Four components

  1. Software
  • R
  • Rstudio
  • Git for version control
  1. Clear file organization
  2. One R script to pull it all together
  3. Hosting the html output internally or publicly with Github pages

Use projects to organize

Do not dump your scripts into a folder

One folder per project

  • RStudio project
  • Git repo
  • Can run parallel projects

Use portable file paths

DO NOT USE setwd()

  • Try the here package.
library(here)
#> here() starts at /Users/IRE/Projects/NICAR/2018/workflow
here()
#> [1] "/Users/IRE/Projects/NICAR/2018/workflow"
here("Test", "Folder", "text.txt")
#> [1] "/Users/IRE/Projects/NICAR/2018/workflow/Test/Folder/test.txt"
cat(readLines(here("Test", "Folder", "text.txt")))
#> You found the text file nested in these subdirectories!

Files organization

At minimum

name_of_project
|--data
    |--2017report.csv
    |--2016report.pdf
    |--summary2016_2017.csv
|--docs
    |--01-analysis.Rmd
    |--01-analysis.html
|--scripts
    |--exploratory_analysis.R
|--name_of_project.Rproj
|--run_all.R

Optimal

name_of_project
|--raw_data
    |--WhateverData.xlsx
    |--2017report.csv
    |--2016report.pdf
|--output_data
    |--summary2016_2017.csv
|--rmd
    |--01-analysis.Rmd
|--docs
    |--01-analysis.html
    |--01-analysis.pdf
    |--02-deeper.html
    |--02-deeper.pdf
|--scripts
    |--exploratory_analysis.R
    |--pdf_scraper.R
|--name_of_project.Rproj
|--run_all.R

Creating folder shortcut

folder_names <- c("raw_data", "output_data", "rmd", "docs", "scripts")

sapply(folder_names, dir.create)

Organization principles

  • Directory names are obvious to anyone looking
  • Reports and the script files are not in the same directory
  • Reports are sorted using 2-digit numbers. Tell your story clearly.

Source to the online data

Normal data file

if (!file.exists("data/bostonpayroll2013.csv")) {

  dir.create("data", showWarnings = F)
  download.file(
  "https://website.com/data/bostonpayroll2013.csv",
  "data/bostonpayroll2013.csv")
}

payroll <- read_csv("data/bostonpayroll2013.csv")

Dealing with a zip file

if (!file.exists("data/employment/2016-12/FACTDATA_DEC2016.TXT")) {
  
  dir.create("data", showWarnings = F)
  temp <- tempfile()
  download.file(
  "https://website.com/data/bostonpayroll2013.zip",
  temp)
  unzip(temp, exdir="data", overwrite=T)
  unlink(temp)
}

payroll <- read_csv("data/bostonpayroll2013.csv")

Operate without a net

Never save workspace to .RData on exiting RStudio and uncheck Restore .RData on startup.

This will make sure you’ve optimized your data ingesting and cleaning process and aren’t working with a misstep in your process.

Packaging packages

pkgs <- c('reshape2','geojson','readxl','ggplot2', 'leaflet','httr','rgeolocate','shiny','sp','dplyr', 'widyr', 'slickR', 'ggraph', 'svglite', 'geojsonio')

check <- sapply(pkgs,require,warn.conflicts = TRUE,character.only = TRUE)
if(any(!check)){
    pkgs.missing <- pkgs[!check]
    install.packages(pkgs.missing)
    check <- sapply(pkgs.missing,require,warn.conflicts = TRUE,character.only = TRUE)
  }

Deprecated Functions

The packrat package creates a snapshot of the versions you are using in a workspace.

  • Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa.
  • Portable: Easily transport your projects from one computer to another, even across different platforms.
  • Reproducible: Packrat records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go.
library(packrat)
init()
status()
restore()