Regular Expressions (Regex or Regexp)

I use regular expressions (or regex or regexp, as programmers fondly call them) a lot in my line of work as a programmer. They make a lot of mundane editing tasks much easier and even interesting. But most of the programmers I know either don’t know about them or, even if they have heard of them, are afraid to use them. So here is my attempt at explaining them to everyone, whether you are a programmer or an ordinary user.

Real world Example 1
You all know the vowels. Remember?

A E I O U*

And remember the articles? There is a rule in English grammar to use an article when referring to a noun: one of a, an, the. The general rule is that we use “an” when the noun begins with one of the above five vowels*. Remember that?

* I will ignore the H in hour or honor for simplicity’s sake.

For a non-native speaker of English, this can be confusing, especially if their native language has no concept of articles (they may just use “one” or drop the article altogether).

Imagine someone with little experience using articles wrote the flyer you asked for:

“A home insurance program is for the          regular joe like you and me. This home insurance program is a boon to the middle            class family                        to protect ­a asset. It’s a umbrella coverage that can cover a item or the complete home.

The insurance agency is upstairs. You take a elevator          and go up. You can talk to the advisor and get it all setup quickly. It is a operation, that is                  simple and easy to use, why wouldn’t you?”

All is well and you are about to print hundreds of these, when you realize the formatting is off. Look at the long runs of spaces in the middle. And as a native speaker, you find every “a” that should be “an” glaring (let’s say a reader has marked them with a thick pen). Now, you can go back to the computer and correct each entry individually. On top of that, what if the company name is actually “Affordable Home Insurance”, not just Home Insurance? You have to carefully change the “a”s there too.

This is only a few lines, so all is well. But what if you are proofreading a draft in which an author made this mistake at random throughout his “first book” in English? Will you go back, look for each one, and correct them one by one?

If you are a patterns kind of guy (or girl), you might think: hmm, if only there were a way to tell the editor (or word processor) to find every word that begins with a vowel and change any “a” before it to “an”. And wherever there is a run of spaces, just replace it with a single space. Bingo! With a couple of replacements, your entire document is proofread.

You, my friend, just did a regular expression on me! This is possible and available in most modern editors.

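(For the “an” problem you would have said \ba[ ]+([aeiouAEIOU]). For finding runs of spaces, you could simply say [ ]+. These could easily be replaced with an $1 and “ ” (a single space) on regexr.com. The \b keeps the pattern from matching the “a” at the end of a word like “umbrella”, and stopping at the vowel lets every occurrence on a line be fixed rather than only the first. More on these later.)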
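Here is a minimal sketch of those two replacements in Python, using the standard re module. The function name fix_flyer and the sample sentence are only for illustration, and the pattern is a slightly adjusted variant that captures the “a” itself so that a capital “A” stays capital. Like the rule above, it knows nothing about exceptions such as “hour”.

```python
import re

def fix_flyer(text):
    """Collapse runs of spaces and turn "a" into "an" before a vowel.

    Illustrative helper (not from any library). It follows the simplified
    vowel rule only, so it ignores exceptions such as "an hour".
    """
    # Two or more spaces in a row become a single space.
    text = re.sub(r"[ ]{2,}", " ", text)
    # \b makes sure the "a" is a whole word, not the end of "umbrella".
    # Group 1 keeps the original case of the article, group 2 is the vowel.
    text = re.sub(r"\b([aA])[ ]+([aeiouAEIOU])", r"\1n \2", text)
    return text

sample = ("It's a umbrella coverage that can cover a item or the complete home. "
          "You take a elevator          and go up.")
print(fix_flyer(sample))
# It's an umbrella coverage that can cover an item or the complete home. You take an elevator and go up.
```

Most editors with regex search and replace accept the same pattern; only the way groups are referenced in the replacement ($1 versus \1) tends to differ.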

Real world Example 2
For a slightly more complicated example, imagine you are collecting RSVPs for a birthday party you are arranging. You created a simple questionnaire and emailed it to people, like this:

Name:
Will you attend:
Will you be here for lunch:
How many:

And you got a lot of replies, say 100. You somehow managed to copy them all into a text file like the one below.

RSVP 1:

Name: Sam Varadarajan
Will you attend: Yes
Will you be here for lunch: Yes
How many: may be 2 or 3

RSVP 2:
Name: John Doe
Will you attend: X
Will you be here for lunch: –
How many: 1

RSVP 3:
Name: Jane Doe
Will you attend: YES
Will you be here for lunch:
How many: 2+

RSVP 4:
Name: Roboto Monstero
Will you attend: NO
Will you be here for lunch:
How many: 0

ETC

Then you realize: uh-oh, it would have been nice to have them all listed in a table, like in Excel.

#  Name             Coming  Lunch?  How many?
1  Sam Varadarajan  Yes     Yes     2 or 3
2  John Doe         X               1
3  Jane Doe         YES             2+
4  Roboto Monstero  NO              0
   Total (minimum)                  5

You could sit at your computer and copy each entry into the corresponding cell in Excel. Eventually, you will get there. But what if you are doing this for a community project, say a marathon registration, and there are literally hundreds of people responding? (Of course, the questionnaire will be longer too.)

The pattern-matching girl in you will immediately think: if only I could tell the computer (the editor or word-processing software you are using) to take each RSVP (from “RSVP” through “How many: …”) and collapse it to one line. In doing so, pick up only the numbers for some responses, and treat YES/yes/Y/X as the same. This is doable.
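To show that it really is doable, here is a rough sketch in Python using the standard re module. The names RSVP_PATTERN, is_yes and parse_rsvps are made up for this example, and the sketch assumes every reply keeps the four questions in the same order, as in the sample file above. For “How many” it takes the first number it finds, which gives the minimum head count.

```python
import re

rsvp_text = """RSVP 1:

Name: Sam Varadarajan
Will you attend: Yes
Will you be here for lunch: Yes
How many: may be 2 or 3

RSVP 2:
Name: John Doe
Will you attend: X
Will you be here for lunch: –
How many: 1

RSVP 3:
Name: Jane Doe
Will you attend: YES
Will you be here for lunch:
How many: 2+

RSVP 4:
Name: Roboto Monstero
Will you attend: NO
Will you be here for lunch:
How many: 0
"""

# One RSVP block: the header, then the four answers, one per line.
# Each (.*) captures the rest of a line (a dot never crosses a line break).
RSVP_PATTERN = re.compile(
    r"RSVP[ ]+(\d+):\s*"
    r"Name:[ ]*(.*)\n"
    r"Will you attend:[ ]*(.*)\n"
    r"Will you be here for lunch:[ ]*(.*)\n"
    r"How many:[ ]*(.*)"
)

# YES / yes / Y / X all count as a yes; anything else (NO, a dash, blank) does not.
YES_ANSWER = re.compile(r"\s*(yes|y|x)\s*", re.IGNORECASE)

def is_yes(answer):
    return YES_ANSWER.fullmatch(answer) is not None

def parse_rsvps(text):
    rows = []
    for number, name, attend, lunch, how_many in RSVP_PATTERN.findall(text):
        first_number = re.search(r"\d+", how_many)  # "may be 2 or 3" -> 2
        rows.append({
            "number": int(number),
            "name": name.strip(),
            "coming": is_yes(attend),
            "lunch": is_yes(lunch),
            "minimum": int(first_number.group()) if first_number else 0,
        })
    return rows

rows = parse_rsvps(rsvp_text)
for row in rows:
    print(row["number"], row["name"], row["coming"], row["lunch"], row["minimum"], sep="\t")
print("Total (minimum)", sum(row["minimum"] for row in rows))
```

The tab-separated lines it prints can be pasted straight into a spreadsheet, and the minimum total for the four sample replies comes to 5.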

In all such scenarios, where you are dealing with some (fuzzy?) text patterns and you want a tool or technique to find and replace them blindly so as to standardize the text, you are looking at regular expressions.

Or more formally, a regular expression is a search pattern that describes the shape of the text you want, so you can find it without spelling it out exactly. Here is what Wikipedia has to say about them:

In theoretical computer science and formal language theory, a regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations.

If you have used PCs (or Unix) for a long while, you have surely heard of star-dot-star (*.*). This means all the files. In this rudimentary form, * stands for any run of characters in a string, which makes it a simple cousin of regular expressions. This pattern (*.*) then matches files like MyFirstDocument.doc and UsefulProgram.exe. Another character you might have come across is the question mark “?”, which stands for any single character.

There you have it. Armed with ? and *, you can start matching up words.

?_*.???
A_beautifulbook.doc

*+*=*
2+2=4

Note: * = all and ? = one character applies only in some commands, particularly at the OS command prompt. As you enter the regex world, forget these meanings. There, a dot (“.”) represents any single character and dot-star (“.*”) represents any run of such characters; * and ? on their own become quantifiers that modify whatever precedes them.
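To see the two notations side by side, here is a small sketch in Python. It uses the standard fnmatch module for the command-prompt style wildcards and the re module for regex. The regex versions are my own hand translations of the wildcard patterns above; note that a literal dot and a literal plus have to be escaped with a backslash, because both have special meanings in regex.

```python
import fnmatch
import re

# Command-prompt style wildcards: * = any run of characters, ? = exactly one.
wildcard_examples = [
    ("*.*", "MyFirstDocument.doc"),
    ("?_*.???", "A_beautifulbook.doc"),
    ("*+*=*", "2+2=4"),
]

# Hand-translated regex versions: "." = any one character, ".*" = any run,
# "\." = a literal dot, "\+" = a literal plus, ".{3}" = exactly three characters.
regex_examples = [
    (r".*\..*", "MyFirstDocument.doc"),
    (r"._.*\..{3}", "A_beautifulbook.doc"),
    (r".*\+.*=.*", "2+2=4"),
]

wildcard_results = [fnmatch.fnmatchcase(text, pattern)
                    for pattern, text in wildcard_examples]
regex_results = [re.fullmatch(pattern, text) is not None
                 for pattern, text in regex_examples]

print(wildcard_results)  # [True, True, True]
print(regex_results)     # [True, True, True]
```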

All the above examples are real-life (user-level?) situations. But we programmers face this type of challenge every day. We get files in one format and need to reformat them so that users can consume them in a more presentable form. Or we parse a log to make some sense of it. One could use regex in all such scenarios. In fact, almost all programming languages support it in some way, from Perl and JavaScript, which build regex into the language syntax, to Python, Java, and C#, which provide it through standard libraries.

With that, you are now entering the exciting (or scary!) world of regular expressions: an innocent-looking English phrase, but a very powerful tool indeed. We will continue in the next blog post.

Contd…
