I earlier posted about exciting world of Regex here. Let’s try to learn some more in this followup post.
First if you want to play with Regex, here is a nice tool you can use:
By default it comes up with a sample pattern and the matches in the given sample text. You can type in the pattern you are testing, the actual text in the middle box and if you chose Replace, you get to see what the output will look like.
Let’s use this tool to convert our RSVP list in free text to a spreadsheet list.
If you read my original post, we had 4 RSVPs. It looked somewhat like this:
Will you attend:
Will you be here for lunch:
Here is my session in Regexr testing this:
The bottom portion shows the Replaced text – converted and cleaned, readied to be copied out to Excel. The spreadsheet will look like this:
Here is the actual regular expression used in this example:
RSVP[0-9]+:\n+Name:[ ]*(.+)\nWill you attend:(.*)\nWill you be here for lunch:(.*)\nHow many:.*([0-9]+).*[\n]*
Let’s go through the regular expression for arriving at this:
First line would have, RSVP:
In Regex any number can be represented by [0-9]+ **. That is a digit is represented by any character between 0 and 9. A whole number would then be one or more digits.
Thus, RSVP: can be written as,
Next text is actually on another line. This can represented by \n. You may know this is newline. Now we are on the second line.
This can be matched by, Name: (.*)
That would be any character after Name: . If you notice, we have a space after Name:. What if the user didn’t add any space or added many spaces in between? We can rewrite the above as,
That’s a space inside the square brackets. [ ]* represents 0 or more spaces. This way, we will match it, even if there is no space between the label “Name:” and the name that followis it. If used with a +, it will mean one or more spaces – at least one space must be there.
Next we have (.*). Dot (.) is any character and .* means zero or more characters. By enclosing a pattern within parantheses (), we make a “Capture Group”. This is what will be remembered (like in a variable) when this pattern is matched with a text. This can be returned and useful in Replace functionality. Each captured group will have a number associated with it – like $1*** for the first, $2 for the second etc. This is also called back reference. More on this later.
And we continue with the rest of the lines. The replace pattern will look like this:
That’s all captured variables ($1 – $4) separated by tab characters recognized by Excel as cell separators (we could also use Comma). End it with \n to push to the next line.
Voila! We have successfully converted a free-form text into a spreadsheet.
That was a rudimentary example. And if you noticed, I left out some complications – for e.g., what if they said 1 or 2 people will be coming and I want to take the maximum, meaning 2.
The fact is, Regex is never used alone in a tool like this. Often times, you will build this (and more regex) into a script/program to achieve the full desired results. Perl and Python are heavily invested in Regex. Other programming languages (including Java) do support Regex using some library API.
To be contd…
** As an aside, this is similar to the BNF form for writing language grammar (I’ve considered only positive whole numbers for this discussion):
*** Sometimes you will see \1 instead of $1.