Monday, February 9, 2009

RegExp Options Explained

Now we will get to the various options you may give your regexp-engine. Please keep in mind that this depends on the engine you are working with. I will explain the options that apply to most implementations of regular expressions.

You should use The RegexCoach and play around with the examples while checking and unchecking the options.




(i) -> case insensitive matching

/(cat.*?)(cat)(.*?)$/i

The Cat and the Cat?

If we use case insensitive matching it will not matter whether "cat" is capitalized or not.
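If you prefer to try this in code instead of the RegexCoach, here is a minimal sketch in Python. It assumes Python's re module, where the flag re.IGNORECASE plays the role of the (i) option; the variable names are only for illustration.

```python
import re

# Same pattern and sample string as above.
pattern = r"(cat.*?)(cat)(.*?)$"
text = "The Cat and the Cat?"

# Without the option the capitalized "Cat" does not match.
without_i = re.search(pattern, text)

# With re.IGNORECASE (the (i) option) the match succeeds.
with_i = re.search(pattern, text, re.IGNORECASE)

print(without_i)         # None
print(with_i.groups())   # ('Cat and the ', 'Cat', '?')
```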

(s) -> single line matching

/(cat.*?)(cat)(.*?)$/is

the cat and
the cat?

With this option the wildcard (dot) also matches the newline, so a match can span several lines without the need to put a \n into the pattern. In other words, the string is regarded as a single line even if it contains more than one line.
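A small sketch of the same experiment in Python, assuming its re module, where the (s) option is called re.DOTALL. The sample string uses \n for the line break.

```python
import re

pattern = r"(cat.*?)(cat)(.*?)$"
text = "the cat and\nthe cat?"

# Without (s) the dot stops at the newline, so the two cats never meet.
no_s = re.search(pattern, text, re.IGNORECASE)

# With re.DOTALL the dot also matches the newline.
with_s = re.search(pattern, text, re.IGNORECASE | re.DOTALL)

print(no_s)              # None
print(with_s.groups())   # ('cat and\nthe ', 'cat', '?')
```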

(m) -> multi line matching

(cat)(.*?)[\?]$

the cat and dog?
the cat?
cat?

The multi line option changes the behavior of the anchors ^ and $. They will work for the start and end of each line and not only for the start and end of the whole string.
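To see the difference in code, here is a minimal Python sketch, assuming the re module and its re.MULTILINE flag as the counterpart of (m). It collects all matches with and without the option.

```python
import re

pattern = r"(cat)(.*?)[\?]$"
text = "the cat and dog?\nthe cat?\ncat?"

# Without (m) the $ only fits at the very end of the string.
whole_string = [m.group(0) for m in re.finditer(pattern, text)]

# With re.MULTILINE the $ fits at the end of every line.
per_line = [m.group(0) for m in re.finditer(pattern, text, re.MULTILINE)]

print(whole_string)  # ['cat?']
print(per_line)      # ['cat and dog?', 'cat?', 'cat?']
```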

(x) -> ignore unescaped whitespace

(cat)\s(.*?) [\?]$

the cat and dog?

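We have an escaped whitespace \s and an unescaped space after the second group (). Without (x) the unescaped space must be matched literally, so the sample string does not match because there is no space before the ?. If you turn (x) on, the engine ignores the unescaped space in the pattern and the string matches.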
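The effect can be checked with a short Python sketch. It assumes the re module, where the (x) option is called re.VERBOSE; other engines may differ in detail.

```python
import re

# Note the unescaped space between the second group and [\?].
pattern = r"(cat)\s(.*?) [\?]$"
text = "the cat and dog?"

# Without (x) the space is a literal and must appear before the "?".
strict = re.search(pattern, text)

# With re.VERBOSE the unescaped space in the pattern is ignored.
relaxed = re.search(pattern, text, re.VERBOSE)

print(strict)             # None
print(relaxed.group(0))   # cat and dog?
```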

(g) -> match globally

(cat)(.*?)

the cat and dog?
cats and dogs

If you want to use the expression several times on a string, you need to make it global. The engine will then continue on the string after the first match instead of stopping there. This is very useful if you use the match in a while loop.
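Python has no (g) option, so the following is only a sketch of the same idea, assuming its re module: a while loop that restarts the search where the previous match ended. The names positions and pos are made up for the example.

```python
import re

pattern = re.compile(r"(cat)(.*?)")
text = "the cat and dog?\ncats and dogs"

positions = []
pos = 0
while True:
    # Continue searching after the end of the previous match.
    m = pattern.search(text, pos)
    if m is None:
        break
    positions.append(m.start())
    pos = m.end()

print(positions)  # [4, 17]

# finditer does the same walk for you.
same = [m.start() for m in pattern.finditer(text)]
```

The loop always terminates here because every match contains at least the three characters of "cat", so pos moves forward each time.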

Friday, January 30, 2009

Look Ahead Behind and Around

The look-ahead is used to only match the string if something lies ahead or does not lie ahead of the current match. The trick is that the look-ahead does not consume parts of the string. A following match can therefore still use the part of the string that the previous look-ahead examined.

positive look-ahead

(truck|car)s(?=\s)

Do trucks like cars?

This will only match trucks because cars is not followed by a whitespace.

negative look-ahead

(truck|car)s(?!\s)

Do trucks like cars?

This will only match cars because it is not followed by a whitespace.
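Both look-ahead examples can be tried in a few lines of Python. This is a minimal sketch assuming the re module; it also checks that the whitespace seen by the look-ahead is not part of the match.

```python
import re

text = "Do trucks like cars?"

# Positive look-ahead: a whitespace must follow the plural s.
positive = re.search(r"(truck|car)s(?=\s)", text)

# Negative look-ahead: no whitespace may follow the plural s.
negative = re.search(r"(truck|car)s(?!\s)", text)

print(positive.group(0))  # trucks
print(negative.group(0))  # cars

# The look-ahead did not consume the space: it sits right after the match.
next_char = text[positive.end()]
```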

positive look-behind

(?<=\s)(truck|car)

Are cars scary?

This will only match the first car and not the car in s-car-y because a whitespace needs to be behind the match.

negative look-behind

(?<!\s)(truck|car)

Are cars scary?

This will match the second car in scary because an s is not a whitespace.
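The two look-behind examples as a small Python sketch, assuming the re module. The start positions show which of the two "car" occurrences was matched.

```python
import re

text = "Are cars scary?"

# Positive look-behind: a whitespace must stand directly before the match.
behind = [m.start() for m in re.finditer(r"(?<=\s)(truck|car)", text)]

# Negative look-behind: no whitespace may stand directly before the match.
not_behind = [m.start() for m in re.finditer(r"(?<!\s)(truck|car)", text)]

print(behind)      # [4]   the car in "cars"
print(not_behind)  # [10]  the car in "scary"
```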

look-around

(?<=[\ss])(truck|car)(?=[sy])

Are cars scary?

This will match both occurrences of "car": each has a whitespace or an s on the left and an s or a y on the right.

(?<=[\ss])(truck|car)(?![sy])

Do cars like carpets ?

Now only the car in carpets will match, because the car in cars is followed by an s.
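Here is a minimal Python sketch of the two look-around patterns, assuming the re module. It lists the start position of every match so you can see which "car" was hit.

```python
import re

# Whitespace or s on the left, s or y on the right.
both = [m.start() for m in
        re.finditer(r"(?<=[\ss])(truck|car)(?=[sy])", "Are cars scary?")]

# Whitespace or s on the left, but neither s nor y on the right.
only_carpets = [m.start() for m in
                re.finditer(r"(?<=[\ss])(truck|car)(?![sy])",
                            "Do cars like carpets ?")]

print(both)          # [4, 10]
print(only_carpets)  # [13]
```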

Wednesday, January 28, 2009

Alternation

Sometimes you do not need exact matches, or you need to decide between several options. You can define two or more options for the regexp-engine to match. This works like a logical or.

(vehicle|car|van|truck)

I love my car!
I like my van!
My vehicle likes me!
Do trucks like cars ?

This group matches anything with vehicle or car or van or truck.

If you put an s behind the group, only the plural will match.

(vehicle|car|van|truck)s

Do trucks like cars ?
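The alternation examples can be reproduced with a short Python sketch, assuming the re module. The four sample lines are joined into one string with \n.

```python
import re

text = ("I love my car!\n"
        "I like my van!\n"
        "My vehicle likes me!\n"
        "Do trucks like cars ?")

# Any of the four alternatives matches.
singular = re.findall(r"(vehicle|car|van|truck)", text)

# With the s behind the group only the plurals match.
plural = [m.group(0) for m in re.finditer(r"(vehicle|car|van|truck)s", text)]

print(singular)  # ['car', 'van', 'vehicle', 'truck', 'car']
print(plural)    # ['trucks', 'cars']
```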

Repetition

You have already learned to use the plus and the star for repeating. It is also possible to specify the repetition of groups or character classes more exactly.

{20} repeat 20 times
{20,} repeat 20 times or more
{0,20} repeat 0 to 20 times (some engines also accept the short form {,20}, others treat it as literal text)
{20,30} repeat at least 20 times and not more than 30 times.

\w{2,3}

Yes we can do it together!

Use the step over button in the RegExCoach to see what happens.

Another example would be to check the number of digits.

\d{6}

123456

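This will match 6 digits in a row. Note that without anchors it also matches the first 6 digits of a longer number; use ^\d{6}$ if the number must have exactly 6 digits.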

You can also use the repetition after a group () .

([acnwe]){2,3}

Yes we can do it together, can we?

The character class must match 2 or 3 times.
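The repetition examples from this post as a minimal Python sketch, assuming the re module. fullmatch is used for the digit check because it requires the whole string to match.

```python
import re

# Two or three word characters at a time, as many as possible first.
chunks = re.findall(r"\w{2,3}", "Yes we can do it together!")
print(chunks)  # ['Yes', 'we', 'can', 'do', 'it', 'tog', 'eth', 'er']

# Exactly six digits: fullmatch rejects longer numbers.
six_ok = re.fullmatch(r"\d{6}", "123456") is not None
seven_ok = re.fullmatch(r"\d{6}", "1234567") is not None
print(six_ok, seven_ok)  # True False

# Repetition after a group: 2 or 3 characters from the class in a row.
runs = [m.group(0) for m in
        re.finditer(r"([acnwe]){2,3}", "Yes we can do it together, can we?")]
print(runs)  # ['we', 'can', 'can', 'we']
```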

Tuesday, January 27, 2009

Quoting

If you need to match strings with lots of special characters you can use quoting within regular expressions.

\Q to turn quoting on
\E to turn quoting off

(\w*)\s(\w*)\s([\Q(^%&$!/)\E]+)\s\!

He thought (%&$!/) !

The first two groups fetch the words. The third group fetches the third "word", and the rest is just one whitespace and an exclamation point.

The trick is that we do not need to escape every single character that has a special meaning.

\Q some special string \E

If you do not put the \E, quoting stays on until the end of the regexp and is turned off there automatically.
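Not every engine knows \Q and \E. Python's re module, for example, does not, so the sketch below swaps in a different technique: re.escape, which adds the backslashes for you. It illustrates the same idea and is not the \Q...\E syntax itself.

```python
import re

text = "He thought (%&$!/) !"

# re.escape quotes the special characters, similar in spirit to \Q...\E.
quoted = re.escape("(^%&$!/)")

# Build the pattern from above with the quoted part inside the class.
pattern = r"(\w*)\s(\w*)\s([" + quoted + r"]+)\s!"
match = re.search(pattern, text)

print(match.groups())  # ('He', 'thought', '(%&$!/)')

# Quoting a whole literal string works the same way.
literal = re.search(re.escape("(%&$!/)"), text)
```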

Monday, January 26, 2009

Escaping and special characters

The escape character in regexps is the backslash. To match a backslash itself you need to escape it as well, so you write two backslashes. Most functionality in regexps starts with a special character. If you want to match strings that contain special characters, you need to escape them with a leading backslash.

199\s\$

199 $ bargain.

As we have learned in anchoring, the $ sign is used to anchor the regexp at the end of the string. If you need to match a $ sign you need to escape it with a backslash.

There are some other non printable characters that can be matched with an escape sequence. Please note that some of these work like a character class.

Let me introduce the most important escape sequences.

\s – whitespace (not only the space 0x20, but also tab, carriage return and newline)
\n – newline (well known to C programmers)
\t – tab
\d – digit – [0-9]
\w – word character (letters, digits and the underscore)

The following example will make things a little clearer.

\d\d\d\s\$\s\w*\.

199 $ bargain.

The 3 \d match any number with 3 digits.
The \s matches the whitespace
The \$ matches the $ because $ unescaped has a different meaning in regexps.
Then we have another whitespace \s
\w* Zero or more word characters.
\. A dot escaped because the dot unescaped is used as a wildcard in regexps.

The above regexp will also match a string like

999 $ watch.

Other special characters can be notated in hex code.
\x30 will match a character of hex 30
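A short Python sketch of the escape sequences from this post, assuming the re module. Raw strings (the r prefix) keep Python itself from touching the backslashes before the regexp-engine sees them.

```python
import re

pattern = r"\d\d\d\s\$\s\w*\."

# Both sample strings from above match.
bargain = re.search(pattern, "199 $ bargain.")
watch = re.search(pattern, "999 $ watch.")
print(bargain.group(0))  # 199 $ bargain.
print(watch.group(0))    # 999 $ watch.

# Hex 30 is the character "0".
zero = re.fullmatch(r"\x30", "0")
print(zero is not None)  # True
```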

Thursday, January 22, 2009

Anchoring

It is possible to anchor your regexp to each end or both ends of the string. To make sure the match starts with the beginning of the string you need to use the ^ anchor. Remember that the ^ within a character class [] has a different meaning.

^[19]+

199 mountains reach the skies.

^[19]+ will not match

mountains reach 199 skies.

This is because a 1 or 9 needs to be at the beginning of the string to make the regexp match. Try removing the anchor and see what happens.

Using the $ at the end of your regexp anchors it to the end of the string.

skies.$

mountains reach the 199 skies.

[1-9].*$

mountains reach the 199 skies.

^.*(1.+9).*$

mountains reach the 199 skies.

Form a group around 1 anything 9 and anchor it to both ends of the string.
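All anchoring examples from this post can be tried with a minimal Python sketch, assuming the re module. The variable names are only for illustration.

```python
import re

# ^ anchors the match to the beginning of the string.
at_start = re.search(r"^[19]+", "199 mountains reach the skies.")
not_at_start = re.search(r"^[19]+", "mountains reach 199 skies.")
unanchored = re.search(r"[19]+", "mountains reach 199 skies.")
print(at_start.group(0))    # 199
print(not_at_start)         # None
print(unanchored.group(0))  # 199

text = "mountains reach the 199 skies."

# $ anchors the match to the end of the string.
at_end = re.search(r"skies.$", text)
from_digit = re.search(r"[1-9].*$", text)
print(at_end.group(0))      # skies.
print(from_digit.group(0))  # 199 skies.

# Anchored to both ends, with a group around 1 anything 9.
both_ends = re.search(r"^.*(1.+9).*$", text)
print(both_ends.group(1))   # 199
```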