Creating functions and packages [link]
Intro to R Markdown [link]

Creating functions and packages

Sisyphus

Find yourself repeating the same tasks over and over again?

Example:

Analyzing data that’s published regularly
Always replacing that one county name so it matches a lookup file
Adjusting summary state data for population.

Reproducibility

Save your most common tasks as a function

Introducing the example

Data with counts by state

sb <- read.csv("https://docs.google.com/spreadsheets/d/1gH6eUQVQsEmFagy0qzQDEuwb3cutMWaddCLc7ESbzjc/pub?gid=294374511&single=true&output=csv", stringsAsFactors=F)

# sb <- read.csv("sb.csv", stringsAsFactors=F)

What would be a problem with this data set if we tried to map it?

That’s right, it’s raw data that hasn’t be adjusted for population yet.

This is a problem that we have to deal with all the time.

Let’s make it easier on ourselves in the future by dealing with it now.

Start with a look up file

(You can bring in a Google Sheet if you publish as a CSV and copy the link over)

pop <- read.csv("https://docs.google.com/spreadsheets/d/16oW_uvRJCNoOnCeAkJH4fDouFokjaGUdGFUCaFdKd6I/pub?output=csv", stringsAsFactors=F)

# pop <- read.csv("pop.csv", stringsAsFactors=F)

Figure out the columns to join by

Which ones match?

This is what it looks like when joined.

library(tidyverse)
sb_adjusted <- left_join(sb, pop, by=c("State_Abbreviation"="Abbrev"))

library(knitr)
kable(head(sb_adjusted, 3))

State_Abbreviation	Starbucks	State	Population
AK	42	Alaska	741894
AL	65	Alabama	4863300
AR	37	Arkansas	6931071

And how it looks after some math.

sb_adjusted$per_capita <- sb_adjusted$Starbucks/sb_adjusted$Population*100000
kable(head(sb_adjusted, 3))

State_Abbreviation	Starbucks	State	Population	per_capita
AK	42	Alaska	741894	5.661186
AL	65	Alabama	4863300	1.336541
AR	37	Arkansas	6931071	0.533828

Turn your code into a function

Adjust the code for general data sets

First, establish some rules

All data sets you want to join with the population data set:

First column will contain either the full state name or the state abbreviation
Second column will have values you want to adjust for population

Now, rewrite the code so it’s more generalized. As in, it can deal with any data set you give it.

# Save the dataframe as a consistent name
any_df <- sb

# Rename the first column to "Abbrev"
colnames(any_df)[1] <- "Abbrev"

# Join by the similar name
df_adjusted <- left_join(any_df, pop, by="Abbrev")

# Do the calculations based on the values in the second column
df_adjusted$per_capita <- df_adjusted[,2] / df_adjusted$Population * 100000

kable(head(df_adjusted, 3))

Abbrev	Starbucks	State	Population	per_capita
AK	42	Alaska	741894	5.661186
AL	65	Alabama	4863300	1.336541
AR	37	Arkansas	6931071	0.533828

Function and variables

Turn your lines of code into a function by wrapping it with

function(arg1, arg2, ... ){ and }

Remember how there were two types of State ID data?

Full name and abbreviations.

We can write the function so you can tell it to join based on what type it should join by.

pc_adjust <- function(any_df, state_type){
  pop <- read.csv("https://docs.google.com/spreadsheets/d/16oW_uvRJCNoOnCeAkJH4fDouFokjaGUdGFUCaFdKd6I/pub?output=csv", stringsAsFactors=F)
  # pop <- read.csv("pop.csv", stringsAsFactors=F)
  # State type options are either "Abbrev" or "State"
colnames(any_df)[1] <- state_type
df_adjusted <- left_join(any_df, pop, by=state_type)
df_adjusted$per_capita <- df_adjusted[,2] / df_adjusted$Population * 1000000
return(df_adjusted)
}

Test it out

Start out with sb dataframe.

kable(head(sb, 3))

State_Abbreviation	Starbucks
AK	42
AL	65
AR	37

Apply the function pc_adjust to sb with the variable Abbrev.

test <- pc_adjust(sb, "Abbrev")
kable(head(test, 3))

Abbrev	Starbucks	State	Population	per_capita
AK	42	Alaska	741894	56.61186
AL	65	Alabama	4863300	13.36541
AR	37	Arkansas	6931071	5.33828

Success!

Test it on another data set

Alright, we’ve got it working with Starbucks data.

Let’s try it with Dunkin’ Donuts data.

dd <- read.csv("https://docs.google.com/spreadsheets/d/1TWuWZpfDUMWmMpc7aPqUQ-g1a1J0rUO8_cle_zcPyI8/pub?gid=1983903926&single=true&output=csv", stringsAsFactors=F)

# dd <- read.csv("dd.csv", stringsAsFactors=F)

kable(head(dd))

State	Dunkin
Alabama	18
Alaska	0
Arizona	59
Arkansas	7
California	2
Colorado	8

The state identification is spelled out this time and not abbreviated.

Fortunately, we accounted for that when making the formula.

Run this code.

dd_adjusted <- pc_adjust(dd, "State")

kable(head(dd_adjusted))

State	Dunkin	Abbrev	Population	per_capita
Alabama	18	AL	4863300	3.7011905
Alaska	0	AK	741894	0.0000000
Arizona	59	AZ	2988248	19.7440105
Arkansas	7	AR	6931071	1.0099449
California	2	CA	39250017	0.0509554
Colorado	8	CO	5540545	1.4439013

Yay, we did it!

pc_adjust() is your tiny perfect function

Keep going!

Turn your function into a package

For everyone to use.

Select File > New Project > New Directory > R Package

Name the package

One word. Some tips on figuring out the best name.

Three components

An R/ folder where you save your function code - more details
A basic DESCRIPTION file for package metadata - more details
A basic NAMESPACE file, which is only necessary if you’re submitting to CRAN - more details

Welcome script

Edit the DESCRIPTION file

Questions about which License to use? Check out the options.

Also, notice that I added Imports: dplyr because this function won’t work without the left_join function from dplyr.

Create a new script

Copy and paste the pc_adjust function you made into a new script file.

pc_adjust <- function(any_df, state_type){
  pop <- read.csv("https://docs.google.com/spreadsheets/d/16oW_uvRJCNoOnCeAkJH4fDouFokjaGUdGFUCaFdKd6I/pub?output=csv", stringsAsFactors=F)
  # pop <- read.csv("pop.csv", stringsAsFactors=F)
  # State type options are either "Abbrev" or "State"
colnames(any_df)[1] <- state_type
df_adjusted <- left_join(any_df, pop, by=state_type)
df_adjusted$per_capita <- df_adjusted[,2] / df_adjusted$Population * 1000000
return(df_adjusted)
}

Save as new script in the R folder

Name the file after the function, pc_adjust and save it into the R/ folder

Add documentation

To your script

Go back to your pc_adjust.R script and add these lines above the code.

#' Population adjuster
#'
#' This function appends state population data
#' @param any_df The name of the dataframe you want to append to
#' @param state_type if state identification is abbreviations, use "Abbrev" if full state name, use "State"
#' @keywords per capita
#' @import dplyr
#' @export
#' @examples
#' pc_adjust(dataframe, "Abbrev")

What is all that gibberish?

These special comments above the function will be compiled into the correct format.

Watch.

Run these lines in console.

install.packages("roxygen2")
library(roxygen2)
roxygenise()

It wrote to the NAMESPACE file and created a pc_adjust.Rd file based on the special comments.

Find and open pc_adjust.Rd in the man folder.

This would’ve been tough to put together by hand.

Build your package

Press Cmd + Shift + B to build the package.

Now, you have your package forever and ever.

Just run

install.packages("whateveryoucalledyourpackage")

and you can run pc_adjust whenever you want.

Your help file

Type

?pc_adjust

This is what your special comments above your R function helped generate.

Hold up.

Upload package folder to Github

Let others download and use your R package.

Here’s how the easy way and here’s the official way.

This means you have to add some clean documentation, such as a readme.MD file.

install.packages("devtools")
library(devtools)

install_github("andrewbtran/abtnicarr")
library(abtnicarr)

Next steps

Keep adding functions to your package.

Perhaps, create a Shiny version of it for those who don’t use R.

Over time you’ll build up a bunch that you’ll rely on over and over again.

If it’s awesome, submit it to CRAN.

This was an extremely simple version of making a package.

For better details, check out the free book from Hadley Wikham.

Reproducibility and R Markdown

Andrew