Monday, February 16, 2015

Data cleaning with Excel

Data Visualization, Spring 2015


Sometimes you are in Excel and you might be able to do some cleaning there instead of going back and forth to Refine. These are just some of the formulas that you might find useful. (Here also is a list of all the text functions for Excel.)


Sometimes you want to pull just part of a field out, like the Zip code from an address. If what you want is consistent, you can pull just that from a cell into a new column. In this example, I wanted to make a real date out of separate month and year columns.
  • Open the original Use of Force AISD file.
  • Note the date column. It wasn’t there when I started. I had to build it from the date_of_incident (really a month) and year_of_incident columns. So delete that “date” column now.
  • Create a new column to the left of the “time_of_incident” column. We are going to put parts of C and D there, but in a mmm-yyyy format, so Feb-2009. Then I’ll convert that later to a real date.
  • The formula to get part of a string from a direction is: =direction(cell,number_of_chars)
  • In the new E2 cell, enter this (don’t copy/paste): =LEFT(C2,3)&”-”&D2
  • Copy that formula all the way down.
  • Now click on the heading for that column and try to change the format of that column (command-1) to a different date format, 3/14/01. Nothing happens. Excel can’t change it to a date because it is text. This is a difficult thing to fix, but we will. We will create a new column, set the date value to be mmm-yyyy, and then copy and paste the values of our strings into that field.
  • Create another new column to the left of “time_of_incident”.
  • Click on the column header and do command-1 to get Format.
  • Click on Category > Custom, and Type in mmm-yyyy. We’ve prepared this column to accept our new date format, but no matter how hard we try we can’t copy and paste from Excel into this column and make it work. Excel will always thing we are pasting just text. So we’ll fake it out by copying the contents of the column into Text Wrangler, then copying from there back into Excel.
  • Click on the header of your created column with your built dates and copy that column.
  • Launch Text Wrangler and paste the text into a new window.
  • Make sure everything is selected again, and copy it.
  • Now go to the first cell of your new column and paste it in. Now the dates should line up on the right instead of the left of the cell, which is an indication that it is a date.
  • Click on the header of that column, do command-1 to get your formatting window, and change the date to *3/14/01. Make sure the dates are right.

There is also a MID function that can help you pluck text from the middle of a cell.

Text to columns

We want the time_of_incident column to have a separate start time and end time. We will separate the content based on the hyphen.
  • Create two empty columns to the left of subject_ethnicity. (you can do this by highlighting two columns at once, then right-clicking on the first of those and choosing Insert.)
  • Copy and paste the time_of_incident into the left-most of the new columns. (We are preserving the original.)
  • Click on the time_of_incident column. Go to the Data ribbon to Text to Columns.
  • Choose Delimited, then hit Next, then for Delimiters, check the Other box and put in a hyphen. This will split the column at the hyphen. Hit Next.
  • On the next screen, you’ll see a preview of what will happen. Hit Finish.
  • You now have two columns. Name the first one Start Time and the other End Time.


Now that we created these two time columns, one of them is kinda whacked. If you edit one of the cells for End Time, you’ll see there is a leading pace. We can fix that.
  • Create a new column to the right of End Time. Your End Time should be column H and the blank one column I. (If not, that’s OK, but you’ll need to adjust the formulas below.)
  • The formula for to remove leading and trailing whitespace is: =TRIM(cell)
    • In cell I2, put in this formula: =TRIM(H2)
    • Copy that formula down the column.
  • Now we need to copy the values of these cells back into this column, so we have the text instead of the formula.
    • Click on the I2 column header and copy the contents.
    • Right-click on the I2 column to select it then choose Paste Special, and then choose “Values”.
  • Now you have your new, trimmed End Time column. You can delete the bad column.

Filtering to cluster

You can manually cluster and fix fields in Excel using Filter.
  • Go to the loc_name field and highlight the column.
  • Go to the Data ribbon and then click the Filter button (it looks like a funnel).
  • Click on the dropdown that was created for that column, and then look at the values. We’re going to fix Berkman Dr.
  • Click on the “Select All” button so no items are selected, and then choose the two Berkman values. Close the filter window.
  • Now you can copy the first cell in the list (which is correct) and then use the magical copy point to apply it to the rest of the values there.
    • If yuo have a lot of data, sometimes the magical copy point won’t work all the way to the bottom of the list. You might have to copy the value of the correct cell into your clipboard, and then select all the cells, and then copy the value into them.

Other tools to help


Post a Comment

Note: Only a member of this blog may post a comment.