Wednesday, February 18, 2015

Data cleaning with Regular Expressions

Regular expressions (or regex for short) is a programming language used to match patterns and is very useful to search and parse data. This lesson is just an introduction to a very powerful programming skill. We’ll be using the Regex Cheat Sheet. Download a copy for yourself.
  • Download the file regex-OpiodDrugs and open it using Text Wrangler or code editor of your choice. Notepad++ is a decent PC one.
    • This is a list of drugs and their dosage that were in a single column in Excel. We want to split this into three different fields: Drug, Amount, Type.
    • You couldn’t do a simple search and replace to separate these … the combinations are too complex. We need MORE POWA!
    • We’ll do this by creating three groups of patterns using Regex, and then we will replace those patterns putting a tab in between so we have have different columns.
  • Copy the contents of OpiodDrugs.
  • Go to http://regex101.com/ and paste the contents into the “TEST STRING” field.
    • We will put our expression in the top REGULAR EXPRESSION field. As we create groups, they will be displayed in the MATCH INFORMATION on the right.
    • Since we will be using this on an entire file, we need to add some modifiers in the field after the "/" in REGULAR EXPRESSION. Add "gm", which means "global" and "multi-line".
  • Capturing the Drug
    • We know that the drug is always at the beginning of the string, so we will put in the character for that: ^
    • We know we want our drug name as a group so we will put in parenthesis to hold them: ^()
    • We know we want to grab certain types of characters, mainly letters, so we are going to build an expression inside brackets to say why type of characters we want. In fact, we want everything until we find a number, which starts the Amount part of the string. So, we’ll fill that bracket with the kinds of characters we need. Start with the brackets: ^([])
    • We know we only have uppercase letters, so we’ll start with A-Z. That means anything between the uppercase A and the uppercase Z: ^([A-Z])
    • Well, that got the first character. We need more of them, so let’s use the * to get "zero or more of".: ^([A-Z]*)
    • Take a minute to look over at the MATCH INFORMATION. This is where you’ll see our progress in making our groups.
    • That got the first word of each drug, but some drugs have a slash between them, and we need them both. Slashes are special characters in Regex because they can be a command, too. As such we have to “escape” the slash to let Regex know we are talking about the piece of text and not a command. You do that with a backslash. So, we’ll add the escaped slash to our brackets: ^([A-Z\/]*)
    • We’re almost there. We have one drug, the FENTANYL TRANSDERMAL PATCH that has spaces. In fact, the TRANSDERMAL PATCH part of that is probably a “Type” like TABLET, but I think we’ll have to deal with that later. We’re going to add a space to our character brackets so we can keep this in our Drug group: ^([A-Z\/ ]*)
  • Capturing the Amount
    • Looking at this, our amount always have numbers, and then some characters that follow it, until you get to a space. So, if we can grab all those combinations to a space, we’ll be golden. We can also deal with the space between the groups so we don't include them in our group.
    • We’ll start by creating our new group with new parenthesis: ^([A-Z\/ ]*) ()
    • Now we’ll add our first character class, which are numbers. It could be any number from 0-9. We put them in the character brackets: ^([A-Z\/ ]*) ([0-9])
    • OK, with the last one, we got a single number, but we need more than one, we we’ll add the * to catch “zero or more”: ^([A-Z\/ ]*) ([0-9]*)
    • OK, we now we need to grap the slash, which we have to escape again with the backslash. Let’s add it: ^([A-Z\/ ]*) ([0-9\/]*)
    • Now we need to get the series of letters for the amount. We could put in just the characters we need - MCGHR - but we’ll go ahead and grab all uppercase letters: ^([A-Z\/ ]*) ([0-9\/A-Z]*)
    • We got all the amounts!
  • Get the rest: The type
    • Now we need to account for the space between the Amount and the Type, so add your space after. When you do, look what happens to Match 30 and 31. We lost our Amount there ... why? Because that group does not have a space after it. We can account for that with the * command, which is for "zero or more of".
    • OK, we’re going to start our third group by adding some parentheses.: ^([A-Z\/ ]*) ([0-9\/A-Z]*) *()
    • We know we need just uppercase letters, so let’s add those in our character brackets: ^([A-Z\/ ]*) ([0-9\/A-Z]*) *([A-Z])
    • Make sure to check all your groups, because Match 30 & 31 are problematic again. Add our * and we are good: ^([A-Z \/]*) ([0-9\/A-Z]*) *([A-Z]*)
    • We’re almost there. We need to account for the spaces and commas, so we'll add both of those in the brackets: ^([A-Z \/]*) ([0-9\/A-Z]*) *([A-Z ,]*)
  • So now we’ve captured all our content and we are storing them into three groups. We’ll be able to reference that group in a search and replace in Text Wrangler.
    • Copy your Regular Expression from regex101 and then go over Text Wrangler and open the Opiods file if you haven’t already.
    • Do Cmd-F to open the find and replace window. In the Find section, paste in your Regular Expression.
    • In your replace section we will reference our groups using \1 for the first group, and \2 for the second, etc. In between these groups, we have to put a tab, which is a command character written as \t. So the replace field will look like this: \1\t\2\t\3
    • Once last thing, on matching you have to click the GREP button so the search and replace will recognize you are using Regular Expressions.
    • Hit Replace All, and you’ll end up with tabs where you want them. Copy and paste it into Excel so you can see.
  • So, there you have it … your data in separate fields. You will need to run TRIM on those columns as we didn't get quite all spaces, but it's a start.

Rubular.com is the regex tester I've used in the past that has a simplier interface, but I like the colors in regex101.

We actually wrote a longer expression than we had to, but did so to show the power of the individual characters. This string below would do the same thing … can you use your handout of common regex functions to figure out how?

^([\D]*) ([\w\/]*) *(.*)

compared to:
^([A-Z \/]*) ([0-9\/A-Z]*) *([A-Z ,]*)

We still have to trip the drug names because of extra spaces. There is probably a way around that, too.

0 comments:

Post a Comment

Note: Only a member of this blog may post a comment.