
Regular Expression

Publish date 01/02/2012

A regular expression (regex) is a pattern that describes a set of strings.
It is well suited to searching for a particular string in a larger text, or to checking user input. For example, if you expect an e-mail address, you can verify that the input really looks like one.

For the record, not all regex engines (the software that does the matching) interpret expressions in the same way. There are small differences between them.
So always test your code!

The most popular regex engines are:

  • Perl 5
  • open source PCRE engine (Used by PHP)
  • .NET regular expression library
  • regular expression package of Java JDK

The basics

The simplest regex is a literal string: it matches exactly that string.

The regex cat will match cat in About cats and dogs.
Note that regex engines are case sensitive by default. cat does not match Cat, unless you tell the regex engine to ignore differences in case.

Without that option, you can match both with:
cat|Cat or
[Cc]at
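
The case rules above can be tried out with a short sketch. It uses Python's re module, which is an assumption made for illustration (Python is not one of the engines listed above), and the sample strings are made up.

```python
import re

text = "About cats and dogs"

# A literal pattern matches itself; matching is case sensitive by default.
lower = re.search(r"cat", text)
upper = re.search(r"Cat", text)

# Two ways to accept either case of the first letter.
alt = re.findall(r"cat|Cat", "Cat and cat")
cls = re.findall(r"[Cc]at", "Cat and cat")

# Or tell the engine to ignore differences in case altogether.
ignore = re.findall(r"cat", "Cat and cat and CAT", re.IGNORECASE)

print(lower.group())  # cat
print(upper)          # None
print(alt, cls)       # ['Cat', 'cat'] ['Cat', 'cat']
print(ignore)         # ['Cat', 'cat', 'CAT']
```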

The dot or period . will match any single character (except newline characters).
So c.t will match cat, but also cit, cBt, c9t, ...
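
A small sketch of the dot, again assuming Python's re module. The word list is invented for the example, and re.DOTALL is Python's own switch for letting the dot match newlines; other engines name this option differently.

```python
import re

words = ["cat", "cit", "cBt", "c9t", "ct", "c\nt"]

# The dot needs exactly one character, and by default not a newline.
matched = [w for w in words if re.fullmatch(r"c.t", w)]

# With DOTALL the dot also accepts the newline.
with_newline = re.fullmatch(r"c.t", "c\nt", re.DOTALL) is not None

print(matched)       # ['cat', 'cit', 'cBt', 'c9t']
print(with_newline)  # True
```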

Set of characters

You can let the regex engine accept any one of several characters at a position, for example:
gr[ae]y to match either gray or grey.

You can also specify a range of characters: [0-9] matches a single digit between 0 and 9.
The same works with letters, but remember that it is case sensitive.
So [a-zA-Z] will match any single letter, upper or lower case.
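
To see character sets and ranges at work, here is a minimal sketch assuming Python's re module; the sample strings are made up.

```python
import re

# A set matches exactly one of the listed characters.
colours = re.findall(r"gr[ae]y", "gray, grey, groy")

# A range matches one character between the two end points.
digits = re.findall(r"[0-9]", "Room 42b")

# Ranges are case sensitive, so both cases are listed here.
letters = re.findall(r"[a-zA-Z]", "R2-D2")
lower_only = re.findall(r"[a-z]", "R2-D2")

print(colours)     # ['gray', 'grey']
print(digits)      # ['4', '2']
print(letters)     # ['R', 'D']
print(lower_only)  # []
```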

Negated Character

You can also tell the engine to match anything but a certain character.
This is done with a caret ^ placed directly after the opening square bracket. But the negated set still has to match a character.
q[^u] does not mean: "a q not followed by a u". It means: "a q followed by a character that is not a u". It will not match the q in the string Iraq. It will match the q and the space after the q in Iraq is a country.

If you want the q to match in both cases, you can use negative lookahead.
q(?!u)
A group that starts with (?! is a negative lookahead: it succeeds only if the text that follows does not match its contents.
The exclamation point ! makes it negative; an equals sign, as in (?=u), would make it a positive lookahead.
This is already more complex, so make sure you know what you are doing.
Grab a book and learn how regex engines work internally.
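
The difference between the negated set and the negative lookahead can be checked with a short sketch. It assumes Python's re module and reuses the Iraq strings from the text.

```python
import re

negated = re.compile(r"q[^u]")     # q followed by a character that is not u
lookahead = re.compile(r"q(?!u)")  # q that is not followed by u

# The negated set needs a character after the q, so it fails at the end.
end_negated = negated.search("Iraq")
mid_negated = negated.search("Iraq is a country").group()

# The lookahead consumes nothing, so only the q itself is matched.
end_lookahead = lookahead.search("Iraq").group()
mid_lookahead = lookahead.search("Iraq is a country").group()
followed_by_u = lookahead.search("quit")

print(end_negated)          # None
print(repr(mid_negated))    # 'q '
print(end_lookahead)        # q
print(mid_lookahead)        # q
print(followed_by_u)        # None
```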

Repetition

There are 4 ways to make a token repeat.
The question mark ?
Tells the engine to attempt to match the preceding token zero times or once, in effect making it optional.

The plus sign +
Tells the engine to attempt to match the preceding token once or more.

The star or asterisk *
Tells the engine to attempt to match the preceding token zero or more times.

Braces {3}
Repeat the preceding token exactly 3 times.
You can also set lower and upper limits:
{2,4} Repeat between 2 and 4 times.
{4,} Repeat at least 4 times.
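
The four quantifiers can be compared side by side in a small sketch, assuming Python's re module; the sample strings are invented.

```python
import re

# ? makes the preceding token optional.
optional = re.findall(r"colou?r", "color colour")

# + needs at least one occurrence, * accepts none at all.
one_or_more = re.findall(r"[0-9]+", "a1 b22 c333")
zero_or_more = re.findall(r"ab*", "a ab abbb")

# Braces give an exact count or a range.
exactly_three = re.findall(r"[0-9]{3}", "12 345 6789")
two_to_four = re.findall(r"[0-9]{2,4}", "1 22 333 55555")
at_least_four = re.findall(r"[0-9]{4,}", "123 1234 123456")

print(optional)       # ['color', 'colour']
print(one_or_more)    # ['1', '22', '333']
print(zero_or_more)   # ['a', 'ab', 'abbb']
print(exactly_three)  # ['345', '678']
print(two_to_four)    # ['22', '333', '5555']
print(at_least_four)  # ['1234', '123456']
```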

For example, to match HTML tags without any attributes:
<[A-Za-z][A-Za-z0-9]*>

The following would not be a good solution, because this regex would also match <5>, which is not a valid HTML tag.
<[A-Za-z0-9]+>
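
A quick comparison of the two tag patterns, sketched with Python's re module (an assumption for illustration); the sample string is made up.

```python
import re

strict = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")  # first character must be a letter
loose = re.compile(r"<[A-Za-z0-9]+>")           # any mix of letters and digits

sample = "<p> <h1> <5> <div>"

strict_tags = strict.findall(sample)
loose_tags = loose.findall(sample)

print(strict_tags)  # ['<p>', '<h1>', '<div>']
print(loose_tags)   # ['<p>', '<h1>', '<5>', '<div>']
```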

Or

If you want to search for the text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.

The tricky part: the engine tries the alternatives from left to right and stops at the first one that succeeds. So in the next example it will test Get before GetValue.
Get|GetValue
Even if you have GetValue in your string, it will match only Get.
You can fix this by adding word boundaries \b
\b(Get|GetValue)\b
This would also work:
\bGet(Value)?\b
The question mark makes Value optional.
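
The ordering problem and both fixes can be reproduced in a few lines. This is a sketch assuming Python's re module; the sample sentences are invented.

```python
import re

text = "call GetValue here"

# The first alternative that succeeds wins, so only Get is matched.
naive = re.search(r"Get|GetValue", text).group()

# Word boundaries force the engine to try the longer alternative.
bounded = re.search(r"\b(Get|GetValue)\b", text).group()
optional = re.search(r"\bGet(Value)?\b", text).group()
short = re.search(r"\bGet(Value)?\b", "call Get here").group()

# A longer list of alternatives works the same way.
animals = re.findall(r"cat|dog|mouse|fish", "a dog chased a cat")

print(naive)     # Get
print(bounded)   # GetValue
print(optional)  # GetValue
print(short)     # Get
print(animals)   # ['dog', 'cat']
```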

Start and end of string

In short, outside square brackets the caret ^ matches the start of the string and the dollar sign $ matches the end of the string.
^a matches a in abc but not in bca. Likewise, c$ matches c in abc but not in bca.
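
Anchors are most useful for validating a whole input. The sketch below assumes Python's re module; the all-digits check is a made-up example.

```python
import re

# ^ ties the match to the start of the string.
start_hit = re.search(r"^a", "abc") is not None
start_miss = re.search(r"^a", "bca") is not None

# $ ties the match to the end of the string.
end_hit = re.search(r"c$", "abc") is not None
end_miss = re.search(r"c$", "bca") is not None

# Both together: the whole string must consist of digits.
all_digits = re.search(r"^[0-9]+$", "12345") is not None
not_all_digits = re.search(r"^[0-9]+$", "123a5") is not None

print(start_hit, start_miss)       # True False
print(end_hit, end_miss)           # True False
print(all_digits, not_all_digits)  # True False
```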

Shorties

To make your life easier, there are shorthand notations for the most commonly used character classes.
\d (digit) is short for [0-9].
\w (word) stands for "word character", usually [A-Za-z0-9_]. Notice the underscore and digits!
\s (space) stands for "whitespace character". Which characters this includes depends on the regex flavor,
but it is mostly [ \t\r\n].
In plain words: space, tab, carriage return and line feed.

Note: each of them matches only 1 character.
If you want \d to match a whole number like 1563, you'll need to add the plus sign +
\d+
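
A short sketch of the shorthand classes, assuming Python's re module. Note that in Python 3 these classes also accept non-ASCII digits, letters and spaces unless the re.ASCII flag is given, which is one of the flavor differences mentioned earlier; the ASCII samples below are not affected.

```python
import re

# Without a quantifier, \d matches one digit at a time.
single_digits = re.findall(r"\d", "1563")

# With + it matches whole numbers.
numbers = re.findall(r"\d+", "Order 1563, item 42")

# \w covers letters, digits and the underscore, but not the hyphen.
words = re.findall(r"\w+", "user_1 logged-in")

# \s+ treats any run of whitespace as one separator.
fields = re.split(r"\s+", "a b\tc\r\nd")

print(single_digits)  # ['1', '5', '6', '3']
print(numbers)        # ['1563', '42']
print(words)          # ['user_1', 'logged', 'in']
print(fields)         # ['a', 'b', 'c', 'd']
```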

Word Boundaries

Word boundaries are also very handy.
You already know that \b matches at the start or end of a word;
\B matches wherever there is no word boundary.
Take this sample text:
mendream men. menkind amen
Here \bmen\B will match the men in mendream and menkind, but not the word men nor the men in amen.
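
The word boundary behaviour can be verified with a small sketch, assuming Python's re module. The trailing \w* in the second pattern is an addition made here only to show which whole words were hit.

```python
import re

sample = "mendream men. menkind amen"

# men at the start of a word, with more word characters after it.
prefix_hits = re.findall(r"\bmen\B", sample)
prefix_words = re.findall(r"\bmen\B\w*", sample)

# men as a complete word.
whole_word = re.findall(r"\bmen\b", sample)

print(prefix_hits)   # ['men', 'men']
print(prefix_words)  # ['mendream', 'menkind']
print(whole_word)    # ['men']
```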

Examples

Here are some examples that I found useful:

Pattern for a typical e-mail address (use it with case-insensitive matching, because it only lists upper-case letters):

$pattern='^[A-Z0-9._-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$';
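
As a rough check of the e-mail pattern, here is a sketch assuming Python's re module; the addresses are invented. It also shows two limits of the pattern: it needs the ignore-case option, and the {2,4} part rejects longer top-level domains such as .museum.

```python
import re

EMAIL = r"^[A-Z0-9._-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"

def looks_like_email(value):
    # Hypothetical helper name; IGNORECASE is needed for lower-case input.
    return re.search(EMAIL, value, re.IGNORECASE) is not None

ok = looks_like_email("john.doe@example.com")
no_tld = looks_like_email("john.doe@example")
long_tld = looks_like_email("curator@example.museum")
case_sensitive = re.search(EMAIL, "john.doe@example.com") is not None

print(ok)              # True
print(no_tld)          # False
print(long_tld)        # False
print(case_sensitive)  # False
```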

Pattern for a URL:
$pattern="@((https?://|ftp://)([-\w\.]+)+(:\d+)?(/([\w/_\.\-\+\~\=\:]*(\?\S+)?)?)?)@";

Pattern for a date:
US style (mm-dd-yyyy):
$pattern="^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$";
European style (dd-mm-yyyy):
$pattern="^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$";
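
The two date patterns can be tried with the sketch below, which assumes Python's re module and made-up dates. Keep in mind that the patterns only check the shape of the input: a day such as 31 February still passes.

```python
import re

US_DATE = re.compile(r"^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$")
EU_DATE = re.compile(r"^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$")

us_ok = US_DATE.search("12-31-2011") is not None
us_wrong_order = US_DATE.search("31-12-2011") is not None
eu_ok = EU_DATE.search("31-12-2011") is not None

# Shape only: 31 February does not exist, yet the pattern accepts it.
eu_impossible = EU_DATE.search("31.02.2012") is not None

print(us_ok, us_wrong_order)  # True False
print(eu_ok, eu_impossible)   # True True
```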

Pattern for a jpg, gif or png image:
$pattern="([^\s]+(?=\.(jpg|gif|png))\.\2)";

Pattern to check if a password is strong enough:
$pattern="((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,16})";
(8 to 16 character string, with at least 1 upper case letter, 1 lower case letter and 1 digit; anchor the pattern with ^ and $ if the upper limit of 16 has to be enforced)
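
A sketch of the password check, assuming Python's re module; the passwords are invented. re.fullmatch is used here as Python's way of requiring the whole string to match, which has the same effect as anchoring the pattern; an unanchored search accepts an over-long password because it is satisfied with the first 16 characters.

```python
import re

PASSWORD = r"((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,16})"

strong = re.fullmatch(PASSWORD, "Passw0rdOK") is not None
no_upper = re.fullmatch(PASSWORD, "password1") is not None
too_short = re.fullmatch(PASSWORD, "Sh0rt") is not None

# 24 characters: rejected as a whole, accepted by an unanchored search.
long_password = "ThisPassw0rdIsWayTooLong"
too_long_full = re.fullmatch(PASSWORD, long_password) is not None
too_long_search = re.search(PASSWORD, long_password) is not None

print(strong)           # True
print(no_upper)         # False
print(too_short)        # False
print(too_long_full)    # False
print(too_long_search)  # True
```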

 
