Setting up a reproducible data analysis workflow in R
Zip of files referred to in this walkthrough
Find yourself repeating the same tasks over and over again?
Example:
Save your most common tasks as a function
Open rmd/04-function_example.Rmd and run the code chunks as you reach them.
Data with counts by state
sb <- read.csv("https://docs.google.com/spreadsheets/d/1gH6eUQVQsEmFagy0qzQDEuwb3cutMWaddCLc7ESbzjc/pub?gid=294374511&single=true&output=csv", stringsAsFactors=F)
# sb <- read.csv("sb.csv", stringsAsFactors=F)
What would be a problem with this data set if we tried to map it?
That’s right, it’s raw data that hasn’t been adjusted for population yet.
This is a problem that we have to deal with all the time.
Let’s make it easier on ourselves in the future by dealing with it now.
(You can bring in a Google Sheet if you publish it as a CSV and copy the link over.)
pop <- read.csv("https://docs.google.com/spreadsheets/d/16oW_uvRJCNoOnCeAkJH4fDouFokjaGUdGFUCaFdKd6I/pub?output=csv", stringsAsFactors=F)
# pop <- read.csv("pop.csv", stringsAsFactors=F)
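The commented-out lines hint at a local fallback for when the published sheet can't be reached. Here is a minimal sketch of that idea. The helper name read_csv_or_local is hypothetical, and the demo uses temporary files (with two population rows copied from this walkthrough) so it runs offline.

```r
# Hypothetical helper (not part of the original walkthrough):
# try the primary source first, then fall back to a local copy.
read_csv_or_local <- function(primary, local_copy) {
  tryCatch(
    suppressWarnings(read.csv(primary, stringsAsFactors = FALSE)),
    error = function(e) {
      message("Could not read ", primary, "; using ", local_copy)
      read.csv(local_copy, stringsAsFactors = FALSE)
    }
  )
}

# Offline demonstration: the "primary" source deliberately does not exist.
local_copy <- tempfile(fileext = ".csv")
write.csv(
  data.frame(Abbrev = c("AK", "AL"), Population = c(741894, 4863300)),
  local_copy,
  row.names = FALSE
)
missing_source <- file.path(tempdir(), "does-not-exist.csv")
pop_demo <- read_csv_or_local(missing_source, local_copy)
```

In real use you would pass the published Google Sheet link as the first argument and something like "pop.csv" as the second.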
Which ones match?
This is what it looks like when joined.
library(tidyverse)
sb_adjusted <- left_join(sb, pop, by=c("State_Abbreviation"="Abbrev"))
library(knitr)
kable(head(sb_adjusted, 3))
State_Abbreviation | Starbucks | State | Population |
---|---|---|---|
AK | 42 | Alaska | 741894 |
AL | 65 | Alabama | 4863300 |
AR | 37 | Arkansas | 6931071 |
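Before trusting a join, it helps to check which keys actually match. The sketch below uses base R's merge() as a stand-in for left_join() so it runs without extra packages. The rows are copied from the tables in this walkthrough, and AR is left out of the lookup on purpose to show what a mismatch looks like.

```r
# Counts by state abbreviation (values from the walkthrough)
sb_demo <- data.frame(
  State_Abbreviation = c("AK", "AL", "AR"),
  Starbucks = c(42, 65, 37),
  stringsAsFactors = FALSE
)

# Population lookup with AR deliberately missing, for illustration only
pop_demo <- data.frame(
  Abbrev = c("AK", "AL"),
  State = c("Alaska", "Alabama"),
  Population = c(741894, 4863300),
  stringsAsFactors = FALSE
)

# Keys present in the counts but absent from the lookup
unmatched <- setdiff(sb_demo$State_Abbreviation, pop_demo$Abbrev)

# all.x = TRUE keeps every row of sb_demo, like a left join
joined <- merge(
  sb_demo, pop_demo,
  by.x = "State_Abbreviation", by.y = "Abbrev",
  all.x = TRUE
)

# Unmatched rows show up with NA in the joined columns
rows_without_population <- joined$State_Abbreviation[is.na(joined$Population)]
```

If unmatched is empty, every state in your data found a population to pair with.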
And how it looks after some math.
sb_adjusted$per_capita <- sb_adjusted$Starbucks/sb_adjusted$Population*100000
kable(head(sb_adjusted, 3))
State_Abbreviation | Starbucks | State | Population | per_capita |
---|---|---|---|---|
AK | 42 | Alaska | 741894 | 5.661186 |
AL | 65 | Alabama | 4863300 | 1.336541 |
AR | 37 | Arkansas | 6931071 | 0.533828 |
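To see why the adjustment matters, compare which state "wins" before and after the math. This small sketch uses only the three rows shown above. The variable names are just for illustration.

```r
states <- c("AK", "AL", "AR")
starbucks <- c(42, 65, 37)
population <- c(741894, 4863300, 6931071)

# Same calculation as above: stores per 100,000 residents
per_capita <- starbucks / population * 100000

top_raw <- states[which.max(starbucks)]        # "AL" has the most stores
top_adjusted <- states[which.max(per_capita)]  # "AK" has the most per resident
```

The raw counts and the adjusted rates tell different stories, which is exactly the mapping problem raised earlier.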
First, establish some rules
All data sets you want to join with the population data set should have the state identifier in the first column and the values to adjust in the second column.
Now, rewrite the code so it’s more generalized, meaning it can deal with any data set you give it.
# Save the dataframe under a consistent name
any_df <- sb
# Rename the first column to "Abbrev"
colnames(any_df)[1] <- "Abbrev"
# Join by the similar name
df_adjusted <- left_join(any_df, pop, by="Abbrev")
# Do the calculations based on the values in the second column
df_adjusted$per_capita <- df_adjusted[,2] / df_adjusted$Population * 100000
kable(head(df_adjusted, 3))
Abbrev | Starbucks | State | Population | per_capita |
---|---|---|---|---|
AK | 42 | Alaska | 741894 | 5.661186 |
AL | 65 | Alabama | 4863300 | 1.336541 |
AR | 37 | Arkansas | 6931071 | 0.533828 |
Turn your lines of code into a function by wrapping them with function(arg1, arg2, ... ){ and }
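As a minimal sketch of that wrapping step, here is the per capita math on its own, first as loose lines and then as a function. The name per_100k is hypothetical and is not used elsewhere in this walkthrough.

```r
# Loose lines of code, tied to one specific pair of numbers
count <- 42
population <- 741894
loose_result <- count / population * 100000

# The same calculation wrapped in a function, so it works for any inputs
per_100k <- function(count, population) {
  count / population * 100000
}

wrapped_result <- per_100k(42, 741894)
alabama_rate <- per_100k(65, 4863300)
```

The values that used to be hard-coded become arguments, and the last expression in the body is what the function returns.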
Remember how there were two types of State ID data?
Full name and abbreviations.
We can write the function so you can tell it to join based on what type it should join by.
pc_adjust <- function(any_df, state_type){
  # Load the state population lookup table
  pop <- read.csv("https://docs.google.com/spreadsheets/d/16oW_uvRJCNoOnCeAkJH4fDouFokjaGUdGFUCaFdKd6I/pub?output=csv", stringsAsFactors=F)
  # pop <- read.csv("pop.csv", stringsAsFactors=F)
  # State type options are either "Abbrev" or "State"
  # Rename the first column so it matches the join column in pop
  colnames(any_df)[1] <- state_type
  df_adjusted <- left_join(any_df, pop, by=state_type)
  # Note: this scales to a rate per 1,000,000 residents (the earlier examples used per 100,000)
  df_adjusted$per_capita <- df_adjusted[,2] / df_adjusted$Population * 1000000
  return(df_adjusted)
}
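One thing the function does not do is complain when state_type is something other than the two supported options. Here is a sketch of a stricter variant, under a few assumptions: the name pc_adjust_checked is hypothetical, the population table is passed in as an argument instead of being downloaded, and base R's merge() stands in for left_join() so the example runs without dplyr.

```r
# Hypothetical variant of pc_adjust() for illustration only
pc_adjust_checked <- function(any_df, pop, state_type = c("Abbrev", "State")) {
  # Stop early with a clear error if state_type is not a supported option
  state_type <- match.arg(state_type)
  # Remember the name of the value column before joining
  value_col <- names(any_df)[2]
  names(any_df)[1] <- state_type
  # merge(..., all.x = TRUE) keeps every row of any_df, like a left join
  out <- merge(any_df, pop, by = state_type, all.x = TRUE)
  # Same scale as pc_adjust(): per 1,000,000 residents
  out$per_capita <- out[[value_col]] / out$Population * 1000000
  out
}

# Demo rows copied from the tables in this walkthrough
pop_demo <- data.frame(
  Abbrev = c("AK", "AL", "AR"),
  State = c("Alaska", "Alabama", "Arkansas"),
  Population = c(741894, 4863300, 6931071),
  stringsAsFactors = FALSE
)
sb_demo <- data.frame(
  State_Abbreviation = c("AK", "AL", "AR"),
  Starbucks = c(42, 65, 37),
  stringsAsFactors = FALSE
)

checked <- pc_adjust_checked(sb_demo, pop_demo, "Abbrev")

# An unsupported state_type now fails loudly instead of producing a bad join
bad_call <- tryCatch(
  pc_adjust_checked(sb_demo, pop_demo, "County"),
  error = function(e) e
)
```

Passing pop in as an argument is a design choice for this sketch. It makes the function testable offline, at the cost of one more argument to remember.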
Test it out
Start out with the sb dataframe.
kable(head(sb, 3))
State_Abbreviation | Starbucks |
---|---|
AK | 42 |
AL | 65 |
AR | 37 |
Apply the function pc_adjust to sb with the state type Abbrev.
test <- pc_adjust(sb, "Abbrev")
kable(head(test, 3))
Abbrev | Starbucks | State | Population | per_capita |
---|---|---|---|---|
AK | 42 | Alaska | 741894 | 56.61186 |
AL | 65 | Alabama | 4863300 | 13.36541 |
AR | 37 | Arkansas | 6931071 | 5.33828 |
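Note that these per_capita values are ten times the ones from the step-by-step version. That is because the function multiplies by 1,000,000 while the earlier lines multiplied by 100,000. A quick check with the Alaska row:

```r
count <- 42
population <- 741894

per_100k <- count / population * 100000      # scale used in the step-by-step code
per_million <- count / population * 1000000  # scale used inside pc_adjust()

# The two rates differ only by a factor of ten
same_rate <- isTRUE(all.equal(per_million, per_100k * 10))
```

Either scale is fine for comparing states, as long as you label it and use the same one everywhere.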
Success!
Alright, we’ve got it working with Starbucks data.
Let’s try it with Dunkin’ Donuts data.
dd <- read.csv("https://docs.google.com/spreadsheets/d/1TWuWZpfDUMWmMpc7aPqUQ-g1a1J0rUO8_cle_zcPyI8/pub?gid=1983903926&single=true&output=csv", stringsAsFactors=F)
# dd <- read.csv("dd.csv", stringsAsFactors=F)
kable(head(dd))
State | Dunkin |
---|---|
Alabama | 18 |
Alaska | 0 |
Arizona | 59 |
Arkansas | 7 |
California | 2 |
Colorado | 8 |
The state identification is spelled out this time and not abbreviated.
Fortunately, we accounted for that when making the function.
Run this code.
dd_adjusted <- pc_adjust(dd, "State")
kable(head(dd_adjusted))
State | Dunkin | Abbrev | Population | per_capita |
---|---|---|---|---|
Alabama | 18 | AL | 4863300 | 3.7011905 |
Alaska | 0 | AK | 741894 | 0.0000000 |
Arizona | 59 | AZ | 2988248 | 19.7440105 |
Arkansas | 7 | AR | 6931071 | 1.0099449 |
California | 2 | CA | 39250017 | 0.0509554 |
Colorado | 8 | CO | 5540545 | 1.4439013 |
Yay, we did it!
pc_adjust() is your tiny perfect function
Keep going!
For everyone to use.
Select File > New Project > New Directory > R Package
Give the package a one-word name. Some tips on figuring out the best name.
Three components
R/ folder, where you save your function code
DESCRIPTION file, for package metadata
NAMESPACE file, which is only necessary if you’re submitting to CRAN
Welcome script
Questions about which License to use? Check out the options.
Also, notice that I added Imports: dplyr to the DESCRIPTION file because this function won’t work without the left_join function from dplyr.
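If you want to double-check that a DESCRIPTION file really lists dplyr, base R can read it with read.dcf(). The sketch below writes a throwaway DESCRIPTION to a temporary folder first so it is self-contained. The package name, title, and version are placeholders, and only the Imports: dplyr field comes from this walkthrough.

```r
# Placeholder metadata for illustration; only Imports = "dplyr" is from the walkthrough
desc_fields <- data.frame(
  Package = "mypackage",
  Title = "Per Capita Adjustments for State Data",
  Version = "0.1.0",
  Imports = "dplyr",
  stringsAsFactors = FALSE
)

# Write the fields in the same key: value format a DESCRIPTION file uses
desc_path <- file.path(tempdir(), "DESCRIPTION")
write.dcf(desc_fields, file = desc_path)

# Read the file back and split the Imports field into package names
desc <- read.dcf(desc_path)
imports <- trimws(strsplit(desc[1, "Imports"], ",")[[1]])
has_dplyr <- "dplyr" %in% imports
```

In your own project, point desc_path at the DESCRIPTION file in the package folder instead.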
Copy and paste the pc_adjust function you made into a new script file.
pc_adjust <- function(any_df, state_type){
  # Load the state population lookup table
  pop <- read.csv("https://docs.google.com/spreadsheets/d/16oW_uvRJCNoOnCeAkJH4fDouFokjaGUdGFUCaFdKd6I/pub?output=csv", stringsAsFactors=F)
  # pop <- read.csv("pop.csv", stringsAsFactors=F)
  # State type options are either "Abbrev" or "State"
  # Rename the first column so it matches the join column in pop
  colnames(any_df)[1] <- state_type
  df_adjusted <- left_join(any_df, pop, by=state_type)
  # Note: this scales to a rate per 1,000,000 residents
  df_adjusted$per_capita <- df_adjusted[,2] / df_adjusted$Population * 1000000
  return(df_adjusted)
}
Name the file after the function, pc_adjust.R, and save it into the R/ folder.
To your script
Go back to your pc_adjust.R script and add these lines above the code.
#' Population adjuster
#'
#' This function appends state population data
#' @param any_df The name of the dataframe you want to append to
#' @param state_type If the state identification is abbreviations, use "Abbrev"; if it is the full state name, use "State"
#' @keywords per capita
#' @import dplyr
#' @export
#' @examples
#' pc_adjust(dataframe, "Abbrev")
What is all that gibberish?
These special comments above the function will be compiled into the correct format.
Watch.
Run these lines in console.
install.packages("roxygen2")
library(roxygen2)
roxygenise()
It wrote to the NAMESPACE file and created a pc_adjust.Rd file based on the special comments.
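For reference, with the @export and @import dplyr tags from the comment block, the generated NAMESPACE file should look roughly like this. Treat it as a sketch, since the exact contents depend on your roxygen2 version and on any other functions in the package.

```text
# Generated by roxygen2: do not edit by hand

export(pc_adjust)
import(dplyr)
```

The export line is what makes pc_adjust available to people who load your package, and the import line is what makes left_join available inside it.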
Find and open pc_adjust.Rd in the man folder.
This would’ve been tough to put together by hand.
Press Cmd + Shift + B (Ctrl + Shift + B on Windows or Linux) to build the package.
Now, you have your package forever and ever.
Just run
library(whateveryoucalledyourpackage)
and you can run pc_adjust whenever you want.
Type
?pc_adjust
This is what your special comments above your R function helped generate.
Hold up.
Let others download and use your R package.
Here’s how: the easy way and the official way.
This means you have to add some clean documentation, such as a README.md file.
install.packages("devtools")
library(devtools)
install_github("andrewbtran/abtnicarr")
library(abtnicarr)
From Giora Simchoni:
Keep adding functions to your package.
Perhaps, create a Shiny version of it for those who don’t use R.
Over time you’ll build up a bunch that you’ll rely on over and over again.
If it’s awesome, submit it to CRAN.
This was an extremely simple version of making a package.
For more detail, check out the free book R Packages from Hadley Wickham.