JRN-418

Data Journalism at CCSU, Fall 2015

Introduction to R

Switching from Excel to R for data analysis can seem daunting. Over time, the open-source statistical programming language has consistently grown in popularity among those who work with numbers, with thousands of user-created libraries to expand on its power.

Though it was first created primarily to make it easier to create statistical models and output very basic visuals to explore data, it's expanded to the point that people can use R to do a multitude of advanced processes such as scrape websites, communicate with APIs, and publish beautiful interactive charts and maps.

All with just a few lines of code.

The practice of "Reproducible Research" has been spreading outside the world of academia to other areas, like non-profits and journalism. It's the idea that analyses should be published with the original data, as well as the methodology or software code so that others can verify or build on them.

Others can reproduce your practice, but the big draw is that for other projects, so can you.

In Excel, a user might run a formula, do some sorting, create a couple of pivot tables— but the work you did on that data cannot be replicated quickly on another spreadsheet with a similar data structure. The whole process has to be repeated step by step.

The point of doing data analysis in R is that a user can write a script to slice up and analyze a spreadsheet, then it can be saved, and then brought back to be used on another spreadsheet with just a few tweaks in the code.

Difficulty level

Easy.

This tutorial is based on a presentation I gave at the Boston University Storytelling with Data Workshop. It's meant to ease seasoned Excel users into the world of R. With GIFs.

Setting up R
Navigating RStudio
Screen Shot 2015-06-11 at 7.54.34 PM

We will be working out of RStudio, which is a powerful shell that runs on top of R.

Data structures and variables in R
Screen Shot 2015-06-11 at 7.56.46 PM

Get acquainted with the different types of data structures within R.

Spreadsheets in Excel are usually recognized as Matrices and Dataframes. When working in R, it's important to understand how they're made up by individual vectors and lists.

Start by creating some arrays and assigning them to variables using the arrow. The c() command stands for combine.

Type in sports in the console and this is what you get: A dataframe made up of different types of arrays (String, Number, Boolean).

Screen Shot 2015-06-11 at 8.45.23 PM
Opening a file

Note: You can type out or copy the code in this tutorial into RStudio's console (bottom left window) line by line but the point is to reproduce this in the future. That means you should put all your code in a script in the top left window. File > New File > R Script

openfile

First, you have to set the directory to where your files (like csvs) are.

Or, you can set it via the RStudio menu up top: Session > Set Working Directory > Choose Directory

Looking up number of observations
totalrows
Focus on first five rows
top5
Checking structure of the data
structure
If you look closely at this column in Excel, you see the $ symbol disappears when you click into it. The raw data has no $ sign but only puts it there for cosmetic purposes. In R, it doesn't intuitively know that a $ means currency. That will cause errors when you try to run math on it because R won't let you perform calculations on strings. Here's how to check and see what kind of variables are in your dataframe.
Focusing on Columns or Rows or Cells
Sometimes you want to look at a single cell, column, or row. To do so, add brackets to the dataframe in the console, like earnings[1,3]. Think "Data RoCks." As in, Dataframe[Rownumber,Columnnumber]
Screen Shot 2015-06-11 at 9.13.41 PM
Most of the time, though, it's difficult to reference columns by number — especially when there's many, many columns. Here's an alternative: dataframe$NameOfColumn
Screen Shot 2015-06-11 at 9.18.45 PM
Changing format of a column
numericformat
Sort a data frame based on a column
sort
Creating a formula
formula
Filtering data by column content
filter
Calculations on columns
colcalc
Data to Columns in Excel
What if you want to split up a name into separate columns? It's pretty easy to do in Excel.
datatocolumns
It's a bit more complicated in R, but here's how. The gist of it is: Creating new columns by copying the original name column and deleting everything before or after a comma.
Pivot table for simple count
How many employees are there per department?
countper
Advanced pivot table
What's the sum of income for all employees in each department? Here's how to do it in R. Other options other than "sum" include "median" and "mean."
Check it and save file as a spreadsheet (CSV)
You could also save as a text file, if you want.
All set for now
Hopefully, this gets you on your way to working with R. If you wanted to run the same analysis on the second spreadsheet included in the files, you'd only have to adjust some of the column names, but you could run the entire thing and it would export you a new CSV in seconds. Some of the steps in R were more complicated to pull off than in Excel. There are actually some great libraries that are more efficient, but I wanted to show you how to execute the basics before we get into the shortcuts that people have put together in packages. The next R tutorial will expand on translating advanced Excel techniques to R.