JRN-418

Data Journalism at CCSU, Fall 2015

Scraping data without programming using Import.io

Download the app from import.io.

Start it up and click on New.

You've got a lot of options.

Go ahead and click on Magic.

This is a list of ready-made scrapers for some popular sites.

Click on the Zillow button.

This spreadsheet was taken from the Zillow website.

Do you see the similarities to the data from the original website?

Click on Download CSV at the bottom right.

You can select how many pages of data to download.

This is what the spreadsheet looks like. It's rough and needs cleaning up, but now you can run all kinds of data analysis on it.
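If you want a head start on that analysis, here's a minimal sketch in Python using pandas (the filename zillow.csv is a placeholder; use whatever your download is actually called):

    import pandas as pd

    # Load the CSV exported from import.io.
    # "zillow.csv" is a placeholder filename.
    df = pd.read_csv("zillow.csv")

    # Get a quick feel for the data: the columns and some summary stats.
    print(df.columns.tolist())
    print(df.describe())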

What do you think you can do with this type of data?

Scraping from a website

Click New > Extractor.

Let's take a look at Airbnb data.

Type airbnb.com into the URL bar at the top of the app.

Search for Hartford, CT.

Click the ON button at the top left.

The page will reload and alert you that JavaScript needs to be enabled.

Click the JS button at the top right to turn on JavaScript.

Are you sure? Yes.

We need to train the import.io app to recognize the data and sort it into a data frame.

Hover over the data you want to train it on.

Hover over the $90 until it turns orange.

Click it.

We want to turn each of these listings into a row of a spreadsheet.

Select Many rows.

This is also a link; would you like to add it, too?

Yup.

We're going to need it later.

We could keep clicking +New Column, then hovering over and clicking every facet we want...

Or we can just click Suggest Data and it will guess what we want.

Go ahead and click it.

OK, that's a lot of interesting data.

It's pretty much what we want.

Click Done at the top right.

You've created an API.

Name it and click Publish.

You'll be brought to your import.io library.

The problem here is that you've only got one page of data.

That's just the 18 results on that page; there might be hundreds more.

Fortunately, there's an option for Bulk Extract.

Give it a list of URLs with the same structure as the original, and it will pull the data from each one.

We need to get the links for the other pages.

Right-click the 2 link at the bottom and copy the link.

The link looks like https://www.airbnb.com/s/Hartford--CT--United-States?page=2.
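Just to illustrate that structure, here's a quick sketch using Python's standard library (purely optional; the tutorial itself needs no code):

    from urllib.parse import urlparse, parse_qs

    url = "https://www.airbnb.com/s/Hartford--CT--United-States?page=2"
    parts = urlparse(url)

    # Everything before the "?" stays the same from page to page...
    print(parts.scheme + "://" + parts.netloc + parts.path)

    # ...and only the page number in the query string changes.
    print(parse_qs(parts.query)["page"])  # ['2']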

It appears there are 13 pages in total.

You could paste this link into the import.io box 13 times, changing the number from 2 to 3 to 4, and so on.

But here's a quicker way.

Open up Google Spreadsheets.

The only thing that changes from page to page in the URL is the number at the very end.

So let's make a formula that lets us change it easily.

Paste the URL into a cell and delete the 2 at the end.

Type the 2 into the column to the right.

You're separating the base URL from the page number.

Drag the URL down so that it fills 13 rows.

Put in the page numbers so they count up incrementally: 2, 3, 4, and so on.

Now, let's use a formula in the next column to bring those two together. If the base URL is in cell A1 and the page number is in B1, =A1&B1 (or =CONCATENATE(A1,B1)) joins them back into a full link.

Drag the formula down so it fills in for every row.

Now you have a list of URLs.
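(Side note: if you'd rather skip the spreadsheet entirely, the same list can be generated in a couple of lines of Python. This sketch assumes the pages run from 1 through 13:)

    # Build one search URL per page of Hartford results.
    base = "https://www.airbnb.com/s/Hartford--CT--United-States?page="

    for page in range(1, 14):  # pages 1 through 13
        print(base + str(page))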

Copy that column over to the import.io app and paste it into the Bulk Extract box.

Click Run Query.

It will tell you that 14/14 URLs have been processed. (One URL failed because it looked for page 14, and the results only go up to page 13.)

Click the Export button at the top right when you're done.

Now you have a new data set to analyze.

The data set also includes links to each Airbnb listing's own page, which has more detailed information like amenities and neighborhood.

Might be useful to scrape if you want to do a deeper dive into Airbnb data...
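If you do, here's a minimal sketch of pulling that column of links out with pandas, ready to paste into another Bulk Extract run (the filename airbnb.csv and the column name listing_link are assumptions; check your export for the real ones):

    import pandas as pd

    # "airbnb.csv" and "listing_link" are placeholder names;
    # check your export for the real filename and column header.
    df = pd.read_csv("airbnb.csv")

    # One listing URL per line, ready for the Bulk Extract box.
    for url in df["listing_link"].dropna().unique():
        print(url)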