Monday, February 16, 2015

Data cleaning with Open Refine, Mr. People

Data Visualization, Spring 2015

Mr. People

Mr. People is an online tool to help you parse a list of names. We’re covering it first because it is quick and easy to show. It helps you take names that are all in one field and split them into first name, last name, etc. so you can better work with the data.
  • Go to Mr. People.
  • Use the example data there to parse into a Table.
  • Review the results. Note where it was able to parse and where it couldn’t.
  • If you parse to tab-delimited text, you can then copy ‘n’ paste it Excel.

More tools

Open Refine

Open Refine used to be called Google Refine, as it was a project by a Google engineer. It is free and open source.

Getting started

  • Open Refine is available for Mac, Windows and Linux. Go to openrefine.org. Find Download, and download the Mac Kit to your desktop. Double-click on the .dmg file to open.
    • If you are on a lab machine, instead of dragging to the Applications folder, drag it out of the window to the Desktop.
    • If on a Mac, and upon launch you get a message about SE Java 6 runtime, go here and install it.
    • If on a Mac and you get a message that the application needs to be moved to the trash, follow these steps:
      • Launch Terminal
      • Type this and then hit return. You might also need your computer password: codesign -f -s - /Applications/Google\ Refine.app/Contents/MacOS/JavaApplicationStub
      • Go to your Applications folder and control-click on the Google Refine icon and choose Open. The dialog that comes up will have an Open button. Use that, and then after that a simple double-click on the app will be fine.
  • Instead of dragging to Applications (you probably don’t have access), drag it to your own Documents folder.
  • Double-click the Refine gem icon in your Documents folder to launch the app. This will open it in Chrome at http://127.0.0.1:3333/. You can use Firefox or Safari if you have to, but stay away from Internet Explorer.

Start a project

  • Download Data To Clean. Make a copy (because you always do!)
  • Open and inspect it in Excel. What might be wrong with it? Close it
  • In Refine, go to Create Project. Use Choose Files to find your worksheet and then click Next.
  • Review the import settings. We don’t need to change anything this time, but you might in other cases. Name the project at the top and Create.
    • If you ever have a problem where Refine won't open data from and Excel file, try converting it to a .csv file and importing.

Getting around

  • Change the number of rows you can see
  • Page through the data using next, last, previous, first.

Fill down

We want the city to be present in all our data. We “know” that the city was entered only for the first value, and that the other rows get that same value until it reaches a new city. So we’ll fix that.
  • Under the Area column, click the drop down to Edit cells > Fill down.
  • Page through your data to see what it did further down when the city changed to Hereford.

Create a new column

You usually want to keep your original data so you can refer back to it, so we’ll make a new column before we change our data significantly.
  • On the Street column, use the dropdown to Edit Column > Add a column based on this column.
  • Name the new column “New Street.”
  • Note the Expression statement. It simply has “value” which is saying just to take the current value of the old column and put it in the new column. The preview shows you what the old and new column will be. While we won’t do so right now, it is possible to make change to the values (the content) as you create the new column using GREL, a programming language specific to Refine.
  • Click OK to create the new column.

Check out Undo

  • Click on the Undo/Redo column and note that you can back out of the changes you’ve made. This is handy … allowing you to experiment in Refine.
  • Try the Undo/Redo steps.

Search and replace

We’re going to fix the funky character that is there instead of spaces. We are going to use the GREL programming function “replace”. GREL is very powerful, and there is lots to learn, but we’ll stick with this one string statement today.
  • In your New Street column, go choose the dropdown and go to Edit Cells > Transform.
  • Now we’re going to use the Expression replace(value,”search”,”replace”) to make a change. Type** this into the Expression: replace(value,“”,””).
    • ** Be sure to type the expression into Refine and DO NOT copy and paste from this doc. If you copy and paste you will get a parse error because it won’t understand the double quote marks.
  • Remember that “value” means whatever is already in the cell, so we’ll keep that as it is. Since the thing we are searching for is the funky å character, copy and paste one of those from the preview window into the first set of quote. We are replacing it with a space, so put a space between the second pair of quotes.
  • You can see the before and after in the preview window. Work on it until it looks right.
  • Click OK. Note that Refine tells you how many cells were edited. Sometimes that is helpful to know if you got everything right.

Change case

Now we’ll fix the title case on all the street names.
  • Go to the dropdown for NewStreets and choose Edit cells > Common transformations > To title case.
  • This changes all the words in these cells. Sometimes title case won’t work if the contents aren’t all proper names, so you might have to go UPPERCASE.

Text facets

Now we’ll fix some of the street names.
  • Choose the dropdown for New Streets and choose Facet > Text facet.
  • This groups all the the cells that are the same together in the pane on the left, and tells you how many rows match that value.
  • You can pick through that list and fix multiple cells at once by hovering over the item, choosing Edit, and then making your changes. Edit both the records for Armarda Pub, Gravelly Hill and compare them to see the different. Are there any common transformations that might fix that?
    • On New Streets go to Edit cells > Common transformations > Collapse consecutive whitespace. Do the same for Trim leading and trailing whitespace.

Clustering

There are still 200+ choices for this field, and that is quite a lot to go through. There is a tool that can help you find similar fields.
  • In the New Street facet pane on the left, click the Cluster button.
  • This will bring up a window that finds clusters of cells based on an algorithm. There are a couple of different Methods and several Keying functions for each. You’ll use these to find likely matches and merge them. There is a link there if you want to learn more about the different algorithms.
  • If you click on a value in the “Values in Cluster” column, it will activate the Merge checkbox and and set the New Cell Value column to that value. Sometimes you want to edit that to a new value.
    • Make this first value “High Street, Erdington”.
    • Choose the best value for the other matches (note you can skip a group if needed.)
    • Click Merge Selected & Re-Cluster.
  • Change the Keying Function to “ngram-fingerprint”.
    • Note you can change the Ngram size. Change it to 10 to see what happens. Make your changes and then Merge Selected & Re-Cluster.
  • Change the Keying Function to “metaphone 3”. With this one, change the sliders to the right of the values and see what happens. (I haven’t used these much, but could see how it might be useful to focus on certain things.)
  • Continue on with the other Keying Functions, then try changing the Method.

Exporting

Once you’ve done the cleaning you want to do, you can export your file back out of Refine in one of several formats.
  • Go under the Export link at the top-right of the page to Excel (or whatever format you want).
  • This will download the file to our Downloads folder.

Resources to help with Refine


0 comments:

Post a Comment

Note: Only a member of this blog may post a comment.