Monday, February 9, 2009

RegExp Options Explained

Now we will get to the various options you may give your regexp-engine. Please keep in mind that this depends on the engine you are working with. I will explain the options that apply to most implementations of regular expressions.

You should use The RegexCoach and play around with the examples while checking and unchecking the options.




(i) -> case insensitive matching

/(cat.*?)(cat)(.*?)$/i

The Cat and the Cat?

If we use case-insensitive matching, it does not matter whether "cat" is capitalized or not.
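The effect of (i) can be sketched in Python. The choice of Python is an assumption of this example (the course itself uses The Regex Coach); there the option is passed as the flag re.IGNORECASE rather than written after the closing slash.

```python
import re

pattern = r"(cat.*?)(cat)(.*?)$"
text = "The Cat and the Cat?"

# Without the flag, the capitalized "Cat" is not matched by "cat".
no_flag = re.search(pattern, text)

# re.IGNORECASE plays the role of the (i) option.
with_flag = re.search(pattern, text, re.IGNORECASE)

print(no_flag)
print(with_flag.groups())
```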

(s) -> single line matching

/(cat.*?)(cat)(.*?)$/is

the cat and
the cat?

With this option the wildcard (dot) also matches the newline character, so a match can span several lines without an explicit \n in the pattern. In other words, the string is treated as a single line even if it contains more than one line.
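A small Python sketch of the same experiment, assuming Python's re module, where the (s) option is called re.DOTALL:

```python
import re

pattern = r"(cat.*?)(cat)(.*?)$"
text = "the cat and\nthe cat?"

# Without (s) the dot stops at the newline, so the two cats
# on different lines cannot be part of one match.
without_s = re.search(pattern, text, re.IGNORECASE)

# With re.DOTALL the dot also matches "\n".
with_s = re.search(pattern, text, re.IGNORECASE | re.DOTALL)

print(without_s)
print(with_s.groups())
```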

(m) -> multi line matching

(cat)(.*?)[\?]$

the cat and dog?
the cat?
cat?

The multi line option changes the behavior of the anchors ^ and $. They then match at the start and end of each line, not only at the start and end of the whole string.
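To see the difference outside The Regex Coach, here is a minimal Python sketch (Python is an assumption of this example; its name for the (m) option is re.MULTILINE):

```python
import re

pattern = r"(cat)(.*?)[\?]$"
text = "the cat and dog?\nthe cat?\ncat?"

# Without (m), $ only matches at the very end of the string,
# so only the last line can match.
without_m = re.findall(pattern, text)

# With re.MULTILINE, $ matches at the end of every line.
with_m = re.findall(pattern, text, re.MULTILINE)

print(without_m)
print(with_m)
```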

(x) -> ignore unescaped whitespace

(cat)\s(.*?) [\?]$

the cat and dog?

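We have an escaped whitespace \s and an unescaped space after the second group (). Without (x), the unescaped space must match a real space before the ?, so the sample string does not match. If you turn (x) on, the engine ignores the unescaped space in the pattern and the string matches.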
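The same check as a Python sketch, assuming Python's re module, where the (x) option is called re.VERBOSE:

```python
import re

pattern = r"(cat)\s(.*?) [\?]$"
text = "the cat and dog?"

# Without (x) the unescaped space is a literal space that must
# appear right before the "?", which it does not in this text.
without_x = re.search(pattern, text)

# With re.VERBOSE the unescaped space in the pattern is ignored.
with_x = re.search(pattern, text, re.VERBOSE)

print(without_x)
print(with_x.groups())
```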

(g) -> match globally

(cat)(.*?)

the cat and dog?
cats and dogs

If you want to apply the expression several times to a string, you need to make it global. The engine then continues searching the string after the end of the previous match. This is very useful if you use the match in a while loop.
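Not every language has a (g) option. Python's re module, used here as an assumed example language, has no such flag; re.finditer gives the same "continue after the previous match" behavior:

```python
import re

pattern = re.compile(r"(cat)(.*?)")
text = "the cat and dog?\ncats and dogs"

# search() stops after the first match, like a non-global match.
first = pattern.search(text)

# finditer() keeps going after each match, like a global match.
all_matches = [(m.start(), m.group(1)) for m in pattern.finditer(text)]

print(first.start())
print(all_matches)
```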

Friday, January 30, 2009

Look Ahead Behind and Around

The look-ahead is used to match only if something lies ahead, or does not lie ahead, of the current match. The trick is that the look-ahead does not consume parts of the string. A following match can therefore still use the part of the string that the previous look-ahead inspected.

positive look-ahead

(truck|car)s(?=\s)

Do trucks like cars?

This will only match trucks because cars is not followed by a whitespace.

negative look-ahead

(truck|car)s(?!\s)

Do trucks like cars?

This will only match cars because it is not followed by a whitespace.
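Both look-ahead examples can be tried in Python as well (an assumption of this sketch; the syntax (?=...) and (?!...) is the same there). The last line shows that the whitespace checked by the look-ahead is not part of the match.

```python
import re

text = "Do trucks like cars?"

# Positive look-ahead: the plural must be followed by whitespace.
positive = [m.group(0) for m in re.finditer(r"(truck|car)s(?=\s)", text)]

# Negative look-ahead: the plural must NOT be followed by whitespace.
negative = [m.group(0) for m in re.finditer(r"(truck|car)s(?!\s)", text)]

# The look-ahead does not consume the space: the match ends before it.
m = re.search(r"(truck|car)s(?=\s)", text)
char_after_match = text[m.end()]

print(positive, negative, repr(char_after_match))
```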

positive look-behind

(?<=\s)(truck|car)

Are cars scary?

This will only match the first car and not the car in s-car-y because a whitespace needs to be behind the match.

negative look-behind

(?<!\s)(truck|car)

Are cars scary?

This will only match the second car, the one in scary, because the s before it is not a whitespace.
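A short Python sketch of the two look-behind examples (Python is an assumed example language here). It records the position where each match starts, so you can see which car was found.

```python
import re

text = "Are cars scary?"

# Positive look-behind: a whitespace must stand directly before the match.
positive_starts = [m.start() for m in re.finditer(r"(?<=\s)(truck|car)", text)]

# Negative look-behind: no whitespace may stand directly before the match.
negative_starts = [m.start() for m in re.finditer(r"(?<!\s)(truck|car)", text)]

print(positive_starts)  # position of the car in "cars"
print(negative_starts)  # position of the car in "scary"
```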

look-around

(?<=[\ss])(truck|car)(?=[sy])

Are cars scary?

This will match both occurrences of "car": each has a whitespace or s on the left and an s or y on the right.

(?<=[\ss])(truck|car)(?![sy])

Do cars like carpets ?

Now only the car in carpets will match, because the car in cars is followed by an s.
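The combined look-around examples, sketched in Python under the assumption that your engine uses the same syntax as Python's re module:

```python
import re

# Look-behind and positive look-ahead together.
both = [m.start() for m in
        re.finditer(r"(?<=[\ss])(truck|car)(?=[sy])", "Are cars scary?")]

# Look-behind with a negative look-ahead.
text = "Do cars like carpets ?"
not_followed = [m.start() for m in
                re.finditer(r"(?<=[\ss])(truck|car)(?![sy])", text)]

print(both)          # the car in "cars" and the car in "scary"
print(not_followed)  # only the car in "carpets"
```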

Wednesday, January 28, 2009

Alternation

Sometimes you do not need exact matches, or you need to decide between several options. You can define two or more options for the regexp-engine to match. This works like a logical or.

(vehicle|car|van|truck)

I love my car!
I like my van!
My vehicle likes me!
Do trucks like cars ?

This group matches anything with vehicle or car or van or truck.

If you put an s behind the group, only the plural will match.

(vehicle|car|van|truck)s

Do trucks like cars ?
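As a sketch, the alternation examples look like this in Python (an assumed example language; the | syntax is the same in most engines):

```python
import re

text = ("I love my car!\n"
        "I like my van!\n"
        "My vehicle likes me!\n"
        "Do trucks like cars ?")

# Any of the four alternatives may match.
singular = re.findall(r"(vehicle|car|van|truck)", text)

# With an s behind the group, only the plural forms match.
plural = [m.group(0) for m in re.finditer(r"(vehicle|car|van|truck)s", text)]

print(singular)
print(plural)
```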

Repetition

Just as you used the plus and star for repetition, you can also specify the repetition of groups or character classes more exactly.

{20} repeat 20 times
{20,} repeat 20 times or more
{,20} repeat 0 to 20 times (not every engine accepts the omitted minimum; {0,20} is the safe way to write it)
{20,30} repeat at least 20 times and not more than 30 times.

\w{2,3}

Yes we can do it together!

Use the step over button in the RegExCoach to see what happens.

Another example would be to check the number of digits.

\d{6}

123456

This will match exactly 6 digits in a row. Keep in mind that without anchors it also matches the first 6 digits of a longer number.

You can also use the repetition after a group () .

([acnwe]){2,3}

Yes we can do it together, can we?

The character class must match 2 or 3 times.
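The repetition examples above can be checked with a few lines of Python (an assumption of this sketch; the {m,n} syntax is shared by most engines). The string "1234567" is an extra sample added here to show what happens without anchors.

```python
import re

# \w{2,3}: runs of two or three word characters.
words = re.findall(r"\w{2,3}", "Yes we can do it together!")

# \d{6}: exactly six digits in a row.
six = re.fullmatch(r"\d{6}", "123456")
# Unanchored, it also finds six digits inside a longer number.
inside_longer = re.search(r"\d{6}", "1234567").group(0)

# Repetition applied to a group around a character class.
classes = [m.group(0) for m in
           re.finditer(r"([acnwe]){2,3}", "Yes we can do it together, can we?")]

print(words)
print(inside_longer)
print(classes)
```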

Tuesday, January 27, 2009

Quoting

If you need to match strings with lots of special characters you can use quoting within regular expressions.

\Q to turn quoting on
\E to turn quoting off

(\w*)\s(\w*)\s([\Q(^%&$!/)\E]+)\s\!

He thought (%&$!/) !

The first two groups fetch the words. The third group fetches the third "word", and the rest is just one whitespace and an exclamation point.

The trick is that we do not need to escape every single character that has a special meaning.

\Q some special string \E
If you do not put the \E, quoting is automatically turned off at the end of the regexp.
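Quoting is one of the options that depends on the engine. Python's re module, for example, does not support \Q and \E. A rough equivalent there, shown as a sketch, is to let re.escape add the backslashes before the special string is put into the pattern:

```python
import re

text = "He thought (%&$!/) !"
specials = "(^%&$!/)"

# re.escape() backslash-escapes the special characters, which stands
# in for wrapping the string in \Q ... \E.
pattern = r"(\w*)\s(\w*)\s([" + re.escape(specials) + r"]+)\s!"
m = re.search(pattern, text)

# The escaped string can also be used as a literal pattern on its own.
literal = re.search(re.escape("(%&$!/)"), text)

print(m.groups())
print(literal.group(0))
```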

Monday, January 26, 2009

Escaping and special characters

The escape character in regexps is the backslash. To match a backslash you just need to escape it as well. Most functionality in regexps starts with a special character. If you want to match strings that contain special characters, you need to escape them with a leading backslash.

199\s\$

199 $ bargain.

As we learned in the anchoring lesson, the $ sign is used to anchor the regexp to the end of the string. If you need to match a literal $ sign, you need to escape it with a backslash.

There are some other non printable characters that can be matched with an escape sequence. Please note that some of these work like a character class.

Let me introduce the most important escape sequences.

\s – white space (not only the space 0x20, but also tab, carriage return and newline)
\n – new line (well known to C programmers)
\t – tab
\d – digit – [0-9]
\w – word character: letters, digits and the underscore

The following example will make things a little clearer.

\d\d\d\s\$\s\w*\.

199 $ bargain.

The 3 \d match any number with 3 digits.
The \s matches the whitespace
The \$ matches the $ because $ unescaped has a different meaning in regexps.
Then we have another whitespace \s
\w* Zero or more word characters.
\. A dot escaped because the dot unescaped is used as a wildcard in regexps.

The above regexp will also match a string like

999 $ watch.

Other special characters can be written in hex code.
\x30 will match the character with hex code 30, which is the digit 0.
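The escape sequences above behave the same way in Python's re module, which is used here only as an assumed example language:

```python
import re

# The escaped $ matches a literal dollar sign.
price = re.search(r"199\s\$", "199 $ bargain.").group(0)

# Three digits, whitespace, a dollar sign, whitespace, a word, a literal dot.
pattern = r"\d\d\d\s\$\s\w*\."
bargain = re.fullmatch(pattern, "199 $ bargain.")
watch = re.fullmatch(pattern, "999 $ watch.")

# \x30 is the character with hex code 30, the digit 0.
zero = re.fullmatch(r"\x30", "0")

print(price)
print(bargain is not None, watch is not None, zero is not None)
```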

Thursday, January 22, 2009

Anchoring

It is possible to anchor your regexp to each end or both ends of the string. To make sure the match starts with the beginning of the string you need to use the ^ anchor. Remember that the ^ within a character class [] has a different meaning.

^[19]+

199 mountains reach the skies.

^[19]+ will not match

mountains reach 199 skies.

It does not match because the 1 or 9 needs to be at the beginning of the string to make the regexp match. Try removing the anchor and see what happens.

Using the $ at the end of your regexp anchors it to the end of the string.

skies.$

mountains reach the 199 skies.

[1-9].*$

mountains reach the 199 skies.

^.*(1.+9).*$

mountains reach the 199 skies.
This forms a group around 1, anything, 9 and anchors the regexp to both ends of the string.
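Here is the anchoring lesson as a small Python sketch (Python is an assumption of this example; ^ and $ work the same in most engines):

```python
import re

# ^ anchors the match to the beginning of the string.
at_start = re.search(r"^[19]+", "199 mountains reach the skies.")
not_at_start = re.search(r"^[19]+", "mountains reach 199 skies.")
# Without the anchor the digits are found anywhere in the string.
unanchored = re.search(r"[19]+", "mountains reach 199 skies.")

text = "mountains reach the 199 skies."
# $ anchors the match to the end of the string.
end = re.search(r"skies.$", text).group(0)
tail = re.search(r"[1-9].*$", text).group(0)
# Anchored on both ends, with a group around "1 anything 9".
group = re.search(r"^.*(1.+9).*$", text).group(1)

print(at_start.group(0), not_at_start, unanchored.group(0))
print(end, "|", tail, "|", group)
```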

Wednesday, January 21, 2009

Grouping

You can form groups in regular expressions and work with the results of groups. Let’s return to our initial example string. A group is simply formed by putting round brackets.

([0-9]*)([^0-9]*)

199 mountains reach the skies.


The first group will contain the 199 and the second group will contain the rest of the string. If you are using the RegEx Coach you may highlight the groups by clicking on 1 or 2 at the bottom.





If you do not need the result of a group, you can put ?: right after the opening bracket. We will learn how to work with the results of groups later.

(?:[0-9]*)(?:[^0-9]*)
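In code, the results of groups are usually available on the match object. The following Python sketch (Python is an assumed example language) compares capturing groups with non-capturing groups, which are written (?:...):

```python
import re

text = "199 mountains reach the skies."

# Two capturing groups: digits first, then everything that is not a digit.
capturing = re.match(r"([0-9]*)([^0-9]*)", text)

# The same pattern with non-capturing groups stores no group results.
non_capturing = re.match(r"(?:[0-9]*)(?:[^0-9]*)", text)

print(capturing.groups())
print(non_capturing.groups())
print(non_capturing.group(0))  # the overall match is still available
```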

Sunday, January 18, 2009

Character Classes

Sometimes you do not want to match text or numbers exactly. In a character class you can list all possible characters, or a range of characters, that are allowed to match. Let’s move away from our sample text a little and try some numeric and alphanumeric matching.

[0-9]*

0815

This will match any number. The characters must be in the range of 0 to 9.


[19]*.[the]*

199 mountains reach the skies.

Use the step over button ->> in The RegEx Coach to see what matches.

You would read this as: a 1 or 9, zero or more times, then a dot (any character), then t, h or e, zero or more times.

Negated character classes

It is also possible to create character classes by excluding characters or ranges. This is done by putting a ^ after the opening bracket.

[^0-9].*

mountains reach the skies.

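This will match the whole string. It will not match the leading 199 in
199 mountains reach the skies.
because the first character of a match must not be a digit. There the match can only start at the space after 199.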
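The character class examples can be verified with a short Python sketch (an assumption of this example; re.match only tries a match at the start of the string, re.search looks everywhere):

```python
import re

# A class with a range: only digits are allowed.
digits = re.fullmatch(r"[0-9]*", "0815")

text = "199 mountains reach the skies."
# 1 or 9 repeated, then any character, then t, h or e repeated.
mixed = re.match(r"[19]*.[the]*", text).group(0)

# Negated class: the first character must not be a digit.
plain = re.fullmatch(r"[^0-9].*", "mountains reach the skies.")
at_start = re.match(r"[^0-9].*", text)
later = re.search(r"[^0-9].*", text).group(0)

print(repr(mixed))
print(at_start)
print(repr(later))
```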

Thursday, January 15, 2009

Greediness

As you might have seen, the wildcard behaves greedily.

Referring to my last post we had the regexp

t.+l

This text is all about the complicated stuff.

There is an l in all, but the greedy wildcard with the repetition matches up to the l in complicated, the last l in the string.

To make your match non-greedy, you simply put a ? after the + or *.

t.+?l

Now you can see that the regular expression matches two times.

This text is all about the complicated stuff.

This is very important because in most cases you will want to match the first occurrence of something rather than the last.
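A minimal Python sketch of the greedy and the non-greedy version (Python is an assumed example language; the ? modifier works the same in most engines):

```python
import re

text = "This text is all about the complicated stuff."

# Greedy: .+ runs as far as possible, up to the last l in the string.
greedy = re.findall(r"t.+l", text)

# Non-greedy: .+? stops at the first l it can reach.
lazy = re.findall(r"t.+?l", text)

print(greedy)
print(lazy)
```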

Wednesday, January 14, 2009

Wildcard , Plus and Star

So what you've learnt so far is some very basic matching. We will now learn to use the wildcard dot, plus and star.

The dot is used to replace a single character. Let’s return to our sample string. Let’s match from t to l. Start The Regex Coach and enter t, put five dots and l. Now copy our sample string into the field below.

t.....l

This text is all about the complicated stuff.

Now edit your regexp and put more dots between the letters and watch what happens.

The dot is a wildcard for any single character: a letter, a digit, punctuation or whitespace (by default it does not match a newline). But it has to match exactly once.

If we don’t know how many characters are between our boundaries it is possible to use repetition with the +. Leave your sample text unchanged and type in the regular expression

t.+l

This text is all about the complicated stuff.

Putting the + after the wildcard dot means that the dot must match at least once and may match several times.

By using the star instead of the plus you will have almost the same effect. The difference is that with a star the preceding element does not need to match at all.

Try the difference between + and * .

e.*x and e.+x

You may also use them without the wildcard. Try l+ f+

l+.*f+
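If you want to double-check what The Regex Coach highlights, here is a Python sketch of the same experiments (Python is an assumption of this example):

```python
import re

text = "This text is all about the complicated stuff."

# t, exactly five arbitrary characters, then l.
five_dots = re.search(r"t.....l", text).group(0)

# Star: the dot may match zero times, so "ex" in "text" is enough.
star = re.search(r"e.*x", text).group(0)
# Plus: at least one character is required between e and x.
plus = re.search(r"e.+x", text)

# Plus applied to plain letters instead of the wildcard.
letters = re.search(r"l+.*f+", text).group(0)

print(five_dots)
print(star, plus)
print(letters)
```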

Tools

There are various regexp tools out there. I recommend that you get The RegEx Coach by Dr. Edmund Weitz. With this tool you get highlighted text matching while you interactively write your regular expressions. Get it now and install it.


Mac users can go for RegexPlor

It is not as powerful as The RegExCoach but runs on PPC Macs.


RegExp Basics

Let’s start with some very simple stuff. A regular expression can just contain the exact letters you want to match.

text

This text is all about the complicated stuff.

The regular expression /text/ will match the string. This is very simple, you might think, so let’s move on. With regular expressions you can also find multiple matches within a string.

te

This text is all about the complicated stuff.

/te/ will match two times.

t

This text is all about the complicated stuff.

/t/ will even match six times. The capitalized T will not match.

come

will not match the string but comp will match.
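The counts above can be reproduced with a few lines of Python (an assumed example language; any engine should give the same numbers):

```python
import re

text = "This text is all about the complicated stuff."

# Plain letters match themselves, as often as they occur.
count_text = len(re.findall(r"text", text))
count_te = len(re.findall(r"te", text))
count_t = len(re.findall(r"t", text))  # the capital T is not counted

# "come" does not occur in the string, "comp" does.
come = re.search(r"come", text)
comp = re.search(r"comp", text)

print(count_text, count_te, count_t)
print(come, comp.group(0))
```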


Monday, January 12, 2009

Learning RegExp Introduction

This is an online course about regular expressions. These are cryptic little things that do a lot of magic in most modern programming languages. The first time you encounter regular expressions you might wonder how someone could write such cryptic hieroglyphs. The aim of this course is that you learn by doing simple text matching. Unlike most webpages about regular expressions, I will guide you step by step and not overwhelm you with all the possibilities. After you get familiar with the examples, we will move on to some more complicated things.