Switching from Excel to R for data analysis can seem daunting. Over time, the open-source statistical programming language has consistently grown in popularity among those who work with numbers, with thousands of user-created libraries to expand on its power.
Though it was first created primarily to make it easier to build statistical models and output very basic visuals to explore data, it has expanded to the point that people can use R for a multitude of advanced tasks, such as scraping websites, communicating with APIs, and publishing beautiful interactive charts and maps.
All with just a few lines of code.
The practice of "Reproducible Research" has been spreading outside the world of academia to other areas, like non-profits and journalism. It's the idea that analyses should be published with the original data, as well as the methodology or software code so that others can verify or build on them.
Others can reproduce your work, but the big draw is that you can too, on your own future projects.
In Excel, you might run a formula, do some sorting, and create a couple of pivot tables, but the work you did on that data cannot be replicated quickly on another spreadsheet with a similar data structure. The whole process has to be repeated step by step.
The point of doing data analysis in R is that you can write a script to slice up and analyze a spreadsheet, save it, and bring it back later to use on another spreadsheet with just a few tweaks to the code.
Easy.
This tutorial is based on a presentation I gave at the Boston University Storytelling with Data Workshop. It's meant to ease seasoned Excel users into the world of R. With GIFs.
We will be working out of RStudio, which is a powerful development environment that runs on top of R.
Get acquainted with the different types of data structures within R.
Spreadsheets from Excel are usually represented in R as matrices or data frames. When working in R, it's important to understand how those are built from individual vectors and lists.
Start by creating some vectors and assigning them to variables using the arrow (<-). The c() command stands for combine.
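As a minimal sketch of that step, the block below builds three vectors with c() and assigns each one with the arrow. The variable names and values are hypothetical, chosen only for illustration, and are not taken from the original presentation.

```r
# Hypothetical example values, chosen only for illustration.
sport_names <- c("Hockey", "Baseball", "Football")  # character vector (text)
players     <- c(6, 9, 11)                          # numeric vector (players per side)
uses_ball   <- c(FALSE, TRUE, TRUE)                 # logical vector (TRUE/FALSE)

# class() reports what kind of data each vector holds
class(sport_names)
class(players)
class(uses_ball)

# c() can also combine an existing vector with new values
more_sports <- c(sport_names, "Basketball")
length(more_sports)
```

Each vector holds only one type of data, which is roughly what a single, consistently formatted column in a spreadsheet looks like.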
Type sports in the console and this is what you get: a data frame made up of different types of vectors (character strings, numbers, and TRUE/FALSE logical values).
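For that to work, sports has to exist as a data frame first. Here is a minimal sketch of one way to build it, assuming hypothetical column names and values (the data from the original presentation is not reproduced here).

```r
# Hypothetical columns and values, for illustration only.
sports <- data.frame(
  sport     = c("Hockey", "Baseball", "Football"),  # character (String)
  players   = c(6, 9, 11),                          # numeric (Number)
  uses_ball = c(FALSE, TRUE, TRUE),                 # logical (Boolean)
  stringsAsFactors = FALSE                          # keep text as text, not factors
)

sports          # typing the name prints the whole table
str(sports)     # lists each column along with its type
sports$players  # each column can be pulled out as a plain vector
```

The data frame is the closest thing R has to a worksheet: every column is a vector of one type, and all the columns have the same length.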
Note: You can type or copy the code in this tutorial into RStudio's console (bottom left window) line by line, but the point is to reproduce this in the future. That means you should put all your code in a script in the top left window: File > New File > R Script
First, you have to set the directory to where your files (like csvs) are.
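In code, this is done with setwd(). The sketch below assumes a hypothetical folder name, so swap in the folder that actually holds your files. It only changes directory if that folder exists, then confirms where R is pointed.

```r
# Hypothetical folder; replace with the folder that holds your own files.
data_dir <- file.path(path.expand("~"), "Documents", "r-tutorial")

# Only switch if the folder exists, so the script does not stop with an error
if (dir.exists(data_dir)) {
  setwd(data_dir)
}

getwd()                           # shows the current working directory
list.files(pattern = "\\.csv$")   # lists any csv files R can see there
```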
Or, you can set it via the RStudio menu up top: Session > Set Working Directory > Choose Directory