Setting up a reproducible data analysis workflow in R
Zip of files referred to in this walkthrough
The goal of reproducible data analysis is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.
For journalists
These are all things I picked up from browsing other presentations and repos.
Many thanks to Jenny Bryan and Joris Muller, from whom I cobbled together many of these ideas and practices.
Thanks also to BuzzFeed, FiveThirtyEight, ProPublica, the Chicago Tribune, the Los Angeles Times, and TrendCT.org.
Why a clear data analysis workflow
DO NOT USE setwd()
here
library(here)
#> here() starts at /Users/IRE/Projects/NICAR/2018/workflow
here()
#> [1] "/Users/IRE/Projects/NICAR/2018/workflow"
here("Test", "Folder", "text.txt")
#> [1] "/Users/IRE/Projects/NICAR/2018/workflow/Test/Folder/text.txt"
cat(readLines(here("Test", "Folder", "text.txt")))
#> You found the text file nested in these subdirectories!
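To see why this holds up better than setwd(), it can help to picture roughly what here() is doing. The sketch below is a simplified base-R stand-in, not the package's actual implementation: it assumes the project root is marked by an .Rproj file (here itself checks for several kinds of markers), and the names find_project_root() and project_path() are made up for illustration.

```r
# Simplified, hypothetical stand-in for the idea behind here():
# walk up from a starting folder until one containing an .Rproj file is found.
find_project_root <- function(start = getwd()) {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (length(list.files(dir, pattern = "\\.Rproj$")) > 0) {
      return(dir)
    }
    parent <- dirname(dir)
    # dirname() of the filesystem root returns itself, so stop there.
    if (identical(parent, dir)) {
      stop("No .Rproj file found at or above ", start)
    }
    dir <- parent
  }
}

# Build a path from the project root, whatever the working directory is.
project_path <- function(..., start = getwd()) {
  file.path(find_project_root(start), ...)
}
```

Because the path is anchored to the folder holding the .Rproj file, the same call gives the same answer whether it runs from a script in a subfolder or from the project root, which is the property that makes hard-coded setwd() calls unnecessary.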
name_of_project
|--data
    |--2017report.csv
    |--2016report.pdf
    |--summary2016_2017.csv
|--docs
    |--01-analysis.Rmd
    |--01-analysis.html
|--scripts
    |--exploratory_analysis.R
|--name_of_project.Rproj
|--run_all.R
name_of_project
|--raw_data
    |--WhateverData.xlsx
    |--2017report.csv
    |--2016report.pdf
|--output_data
    |--summary2016_2017.csv
|--rmd
    |--01-analysis.Rmd
|--docs
    |--01-analysis.html
    |--01-analysis.pdf
    |--02-deeper.html
    |--02-deeper.pdf
|--scripts
    |--exploratory_analysis.R
    |--pdf_scraper.R
|--name_of_project.Rproj
|--run_all.R
folder_names <- c("raw_data", "output_data", "rmd", "docs", "scripts")
sapply(folder_names, dir.create)
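One thing to watch: dir.create() creates folders relative to the current working directory and warns if a folder already exists. A slightly more defensive version might look like the sketch below. The helper name create_project_folders() is hypothetical, and the demo uses a throwaway folder under tempdir() as a stand-in for your project root.

```r
# Hypothetical helper: create each folder under `root`, skipping any that exist.
create_project_folders <- function(root, folder_names) {
  paths <- file.path(root, folder_names)
  for (p in paths) {
    if (!dir.exists(p)) {
      dir.create(p, recursive = TRUE)
    }
  }
  # Report which folders are present afterwards.
  stats::setNames(dir.exists(paths), folder_names)
}

folder_names <- c("raw_data", "output_data", "rmd", "docs", "scripts")
demo_root <- file.path(tempdir(), "name_of_project")  # stand-in for a real project
created <- create_project_folders(demo_root, folder_names)
```

Running it a second time is harmless, so it can sit at the top of a setup script without generating warnings.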
library(readr)

# Download the payroll file only if it is not already saved locally
if (!file.exists("data/bostonpayroll2013.csv")) {
  dir.create("data", showWarnings = FALSE)
  download.file(
    "https://website.com/data/bostonpayroll2013.csv",
    "data/bostonpayroll2013.csv")
}
payroll <- read_csv("data/bostonpayroll2013.csv")
# Same pattern for a zipped download: fetch to a temp file, unzip into data/, clean up.
# Assumes the zip contains bostonpayroll2013.csv at its top level.
if (!file.exists("data/bostonpayroll2013.csv")) {
  dir.create("data", showWarnings = FALSE)
  temp <- tempfile()
  download.file(
    "https://website.com/data/bostonpayroll2013.zip",
    temp)
  unzip(temp, exdir = "data", overwrite = TRUE)
  unlink(temp)
}
payroll <- read_csv("data/bostonpayroll2013.csv")
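Both snippets follow the same check-then-download pattern, so it could be wrapped in a small function. This is only a sketch: download_if_missing() is a hypothetical name, and the URL in the usage comment is the placeholder from above rather than a real endpoint, so that call is left commented out.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing.
# Returns TRUE if a download happened, FALSE if the local copy was reused.
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    return(FALSE)
  }
  # Make sure the destination folder exists before downloading.
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  download.file(url, dest, mode = "wb")
  TRUE
}

# Usage (placeholder URL, not run):
# download_if_missing(
#   "https://website.com/data/bostonpayroll2013.csv",
#   "data/bostonpayroll2013.csv"
# )
```

The return value makes it easy to log whether a run used fresh or cached data, which matters when the source file changes over time.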
Never save workspace to .RData on exiting RStudio and uncheck Restore .RData on startup.
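Starting every session from a blank slate forces you to rebuild each object from your scripts. That confirms your data import and cleaning steps run from start to finish, and that you aren't relying on a leftover object from an earlier misstep.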
# Packages this analysis depends on
pkgs <- c('reshape2', 'geojson', 'readxl', 'ggplot2', 'leaflet', 'httr', 'rgeolocate', 'shiny', 'sp', 'dplyr', 'widyr', 'slickR', 'ggraph', 'svglite', 'geojsonio')
# Try to load each package; check is FALSE for any that are not installed
check <- sapply(pkgs, require, warn.conflicts = TRUE, character.only = TRUE)
if (any(!check)) {
  # Install whatever is missing, then load it
  pkgs.missing <- pkgs[!check]
  install.packages(pkgs.missing)
  check <- sapply(pkgs.missing, require, warn.conflicts = TRUE, character.only = TRUE)
}
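One side effect of require() is that it attaches every package it finds. If you only want to know what is missing before deciding what to install, a possible alternative is requireNamespace(), which checks that a package can be loaded without attaching it. The helper name missing_packages() below is hypothetical.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
# requireNamespace() loads a namespace if it can but does not attach it.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Usage (not run, since it would install packages):
# to_install <- missing_packages(pkgs)
# if (length(to_install) > 0) install.packages(to_install)
```

You would still call library() for the packages each script actually uses, which keeps the list of attached packages short and explicit.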
The packrat package creates a snapshot of the package versions you are using in a workspace.
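library(packrat)
init()     # set up a private package library for this project
status()   # compare the project's library with the recorded snapshot
restore()  # reinstall the package versions recorded in the snapshot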
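To get a feel for the kind of information a snapshot pins down, you can look up installed versions yourself with base R. This is not what packrat does internally, just a lightweight illustration; record_versions() is a hypothetical helper name.

```r
# Hypothetical helper: list the installed version of each named package.
# Not packrat's mechanism, only a base-R way to inspect versions.
record_versions <- function(pkgs) {
  versions <- vapply(
    pkgs,
    function(p) as.character(utils::packageVersion(p)),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(package = pkgs, version = versions, stringsAsFactors = FALSE)
}

# Base packages ship with R, so this runs on any installation.
versions <- record_versions(c("stats", "utils"))
```

Saving a table like this next to your analysis gives readers a quick record of what you ran, though it will not reinstall anything the way restore() does.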