Monday, March 23, 2015

Using importHTML to scrape websites

This is a quick post to show how you can scrape data from websites using Google Spreadsheets. (There are ways to do this in Excel as well.)

Let's start with the page we want to scrape: this table of rushing stats on NFL.com. The fact that the data is tabular gives us hope we can scrape it.

Start a new spreadsheet in Google Sheets. Go to the first cell and put in this formula:

=IMPORTHTML("http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=RUSHING&conference=null&season=2014&seasonType=POST&d-447263-s=RUSHING_YARDS&d-447263-o=2&d-447263-n=1","table",1)

The function you are calling is IMPORTHTML and it takes three arguments:

  • URL: This is the URL you are pulling from. It needs to be in quotes.
  • QUERY: Whether you are looking for a table or a list. This refers to the HTML of the page: in our case it is a table, so the function looks for the <table> tag in the code. This should also be in quotes; "list" is your other option.
  • INDEX: Which table (or list) on the page you want. We started with 1 and found the right one, but some pages have more than one table; if you put in 2, the function would pull the second table on the page (see the example after this list).
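For example, if that stats page happened to have a second table further down (it may not), a variation like this would pull it instead, same URL, just a different index:

=IMPORTHTML("http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=RUSHING&conference=null&season=2014&seasonType=POST&d-447263-s=RUSHING_YARDS&d-447263-o=2&d-447263-n=1","table",2)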
What you get in return is all of the tabular material pulled into your sheet. It is tied to the formula, so if you want to keep it, or don't want it to change, copy the content and paste special as values.


Not every page will work, but it is simple enough to try. There are plenty of other discussions on the web about this function, so you can search if you have trouble.
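One small trick if a page fails to load: you can wrap the formula in Google Sheets' IFERROR function so a failed import shows a message instead of an error. A minimal sketch, assuming you put the URL in cell A1 and want the first table (the message text is just an example):

=IFERROR(IMPORTHTML(A1,"table",1),"Could not import this page")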
