Perl Regular Expression Tutorial
Contents
- Overview
- Simple Regular Expressions
- Metacharacters
- Forbidden Characters
- Things To Remember
Overview
A regular expression is a string of characters which tells the
searcher which string (or strings) you are looking for. The following
explains the format of regular expressions in detail. If you are
familiar with Perl, you already know the syntax. If you are familiar
with Unix, you should know that there are subtle differences between
Perl's regular expressions and Unix's regular expressions.
Simple Regular Expressions
In its simplest form, a regular expression is just a word or phrase to search
for. For example,
gauss
would match any subject line containing the string "gauss". Thus,
subjects with "gauss", "gaussian" or "degauss" would all be matched,
as would a subject containing the phrases "de-gauss the monitor" or
"gaussian elimination." Here are some more examples:
carbon
- Finds any subject line containing the string "carbon", including
subjects that mention carbon, carbonization, hydrocarbons or
carbon-based life forms.
hydro
- Finds any subject line containing the string "hydro". Subjects with
"hydro", "hydrogen" or "hydrodynamics" are found, as well as subjects
containing the words "hydroplane" or "hydroelectric".
oxy
- Finds any subject with the string "oxy" in the subject line. This
could be used to find subjects on oxygen, boxy houses or oxymorons.
top ten
- Note that spaces may be part of the regular expression. The above
expression could be used to find top ten lists. (Note that it would
also find articles on how to stop tension.)
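The substring behavior described above can be tried out with a minimal sketch. It assumes Python's `re` module as a stand-in for the archive searcher (which uses Perl syntax); the helper name `found` is hypothetical, and the case-insensitive flag is there only to imitate the searcher.

```python
import re

def found(pattern, subject):
    """Return True if pattern occurs anywhere in the subject line."""
    # re.IGNORECASE imitates the searcher's case-insensitive matching.
    return re.search(pattern, subject, re.IGNORECASE) is not None

print(found("gauss", "How to de-gauss the monitor"))  # True
print(found("oxy", "Boxy houses"))                    # True
print(found("top ten", "How to stop tension"))        # True (an unintended hit)
print(found("carbon", "Nitrogen fixation"))           # False
```

A plain word or phrase is matched wherever it occurs, even in the middle of other words, which is why "top ten" also turns up "stop tension".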
Metacharacters
Some characters have a special meaning to the searcher. These
characters are called metacharacters. Although they may seem confusing
at first, they add a great deal of flexibility and convenience to the
searcher.
The period (.) is a commonly used metacharacter. It matches exactly
one character, regardless of what the character is. For example, the
regular expression:
2,.-Dimethylbutane
will match "2,2-Dimethylbutane" and "2,3-Dimethylbutane".
Note that the period matches exactly one character; it will not match
a string of characters, nor will it match the null string. Thus,
"2,200-Dimethylbutane" and "2,-Dimethylbutane" will not be matched
by the above regular expression.
But what if you wanted to search for a string
containing a period? For
example, suppose we wished to search for references to pi. The
following
regular expression would not work:
3.14 (THIS IS WRONG!)
This would indeed match "3.14", but it would also match
"3514", "3f14",
or even "3+14". In short, any string of the form "3x14", where x is any
character, would be matched by the regular expression above.
To get around this, we introduce a second
metacharacter, the backslash
(\). The backslash can be used to indicate that the
character
immediately to its right is to be taken literally. Thus, to search for
the
string "3.14", we would use:
3\.14 (This will work.)
This is called "quoting". We would say that the period in
the regular
expression above has been quoted. In general, whenever the backslash is
placed before a metacharacter, the searcher treats the metacharacter
literally
rather than invoking its special meaning.
(Unfortunately, the backslash is used for other things
besides quoting
metacharacters. Many "normal" characters take on special meanings when
preceded by a backslash. The rule of thumb is, quoting a metacharacter
turns
it into a normal character, and quoting a normal character may
turn
it into a metacharacter.)
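To see the difference between the unquoted and the quoted period, here is a small sketch. It assumes Python's `re` module, whose handling of the period and the backslash agrees with Perl's for these patterns; the variable names are illustrative only.

```python
import re

unquoted = re.compile(r"3.14")   # the period matches any one character
quoted = re.compile(r"3\.14")    # the backslash makes the period literal

for text in ["3.14", "3514", "3f14", "3+14"]:
    print(text, bool(unquoted.search(text)), bool(quoted.search(text)))
# The unquoted pattern finds all four strings; the quoted one finds only "3.14".
```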
Let's look at some more common metacharacters. We consider first the
question mark (?). The question mark indicates that the character
immediately preceding it may occur either zero times or one time. Thus
m?ethane
would match either "ethane" or "methane". Similarly,
comm?a
would match either "coma" or "comma".
Another metacharacter is the star (*). This
indicates
that the character immediately to its
left may be repeated any number of times,
including zero. Thus
ab*c
would match "ac", "abc", "abbc", "abbbc", "abbbbbbbbc",
and any string that
starts with an "a", is followed by a sequence of "b"'s, and ends with a
"c".
The plus (+) metacharacter indicates that
the character
immediately preceding it may be repeated one or more times. It
is just like
the star metacharacter, except it doesn't match the null string. Thus
ab+c
would not match "ac", but it would
match "abc", "abbc",
"abbbc", "abbbbbbbbc" and so on.
Metacharacters may be combined. A common combination
includes the period and
star metacharacters, with the star immediately following the period.
This is
used to match
an arbitrary string of any length, including the null string.
For example:
cyclo.*ane
would match "cyclodecane", "cyclohexane" and even
"cyclones drive me insane."
Any string that starts with "cyclo", is followed by an arbitrary
string, and ends with "ane" will be matched. Note that the null string
will be matched by the period-star pair; thus, "cycloane" would be
matched by the above expression.
If you wanted to search for articles on cyclodecane
and cyclohexane, but
didn't want to match articles about how cyclones drive one insane, you
could
string together three periods, as follows:
cyclo...ane
This would match "cyclodecane" and "cyclohexane", but
would not match
"cyclones drive me insane." Only strings eleven characters long which
start
with "cyclo" and end with "ane" will be matched. (Note that
"cyclopentane"
would not be matched, however, since cyclopentane has twelve
characters, not
eleven.)
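The question mark, star, plus and period combinations above can be checked with a short sketch. It assumes Python's `re` module, which treats these metacharacters the same way; `found` (match anywhere) and `whole` (match the entire string) are hypothetical helper names.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere inside text."""
    return re.search(pattern, text) is not None

def whole(pattern, text):
    """True if the pattern matches text from start to end."""
    return re.fullmatch(pattern, text) is not None

# ? : zero or one of the preceding character
print(whole(r"m?ethane", "ethane"), whole(r"m?ethane", "methane"))  # True True
# * : zero or more; + : one or more
print(whole(r"ab*c", "ac"), whole(r"ab+c", "ac"))                   # True False
print(whole(r"ab*c", "abbbc"), whole(r"ab+c", "abbbc"))             # True True
# .* : any run of characters, including none
print(found(r"cyclo.*ane", "cyclones drive me insane"))             # True
print(found(r"cyclo.*ane", "cycloane"))                             # True
# ... : exactly three characters
print(found(r"cyclo...ane", "cyclohexane"))                         # True
print(found(r"cyclo...ane", "cyclones drive me insane"))            # False
print(found(r"cyclo...ane", "cyclopentane"))                        # False
```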
Here are some more examples. These involve the backslash. Note that
the placement of the backslash is important.
a\.*z
- Matches any string starting with "a", followed by
a series of periods (including the "series" of length zero), and
terminated by "z". Thus, "az", "a.z", "a..z", "a...z" and so forth are
all matched.
a.\*z
- (Note that the backslash and period are reversed
in this regular expression.)
Matches any string starting with an "a", followed
by one arbitrary character, and terminated with "*z". Thus, "ag*z",
"a5*z" and "a@*z" are all matched. Only strings of length four, where
the first character is "a", the third "*", and the fourth "z", are
matched.
a\++z
- Matches any string starting with "a", followed by
a series of plus signs, and terminated by "z". There must be at least
one plus sign between the "a" and the "z". Thus, "az" is not
matched, but "a+z", "a++z", "a+++z", etc. will be matched.
a\+\+z
- Matches only the string "a++z".
a+\+z
- Matches any string starting with a series of
"a"'s, followed by a single plus sign and ending with a "z". There must
be at least one "a" at the start of the string. Thus "a+z", "aa+z",
"aaa+z" and so on will match, but "+z" will not.
a.?e
- Matches "ace", "ale", "axe" and any other
three-character string beginning with "a" and ending with "e"; will
also match "ae".
a\.?e
- Matches "ae" and "a.e". No other string is matched.
a.\?e
- Matches any four-character string starting with
"a" and ending with "?e". Thus, "ad?e", "a1?e" and "a%?e" will all be
matched.
a\.\?e
- Matches only "a.?e" and nothing else.
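The effect of backslash placement in the examples above can be verified mechanically. The following is a sketch only, assuming Python's `re` module (which agrees with Perl on these patterns); the `cases` table and the `whole` helper are illustrative names, and whole-string matching is used so that the "only" claims can be tested.

```python
import re

def whole(pattern, text):
    """True if the pattern matches text from start to end."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected) triples taken from the examples above.
cases = [
    (r"a\.*z", "a..z", True),    # quoted period, repeated by the star
    (r"a\.*z", "abz", False),
    (r"a.\*z", "ag*z", True),    # any character, then a literal star
    (r"a.\*z", "az", False),
    (r"a\++z", "a++z", True),    # one or more literal plus signs
    (r"a\++z", "az", False),
    (r"a+\+z", "aa+z", True),    # one or more "a", then a literal plus
    (r"a+\+z", "+z", False),
    (r"a\.?e", "a.e", True),     # optional literal period
    (r"a\.?e", "ace", False),
    (r"a.\?e", "ad?e", True),    # any character, then a literal "?"
    (r"a\.\?e", "a.?e", True),   # everything quoted
]

for pattern, text, expected in cases:
    print(pattern, text, whole(pattern, text), "expected", expected)
```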
Earlier it was mentioned that the backslash can turn
ordinary characters into
metacharacters, as well as the other way around. One such use of this
is the
digit metacharacter, which is invoked by following a backslash with
a lower-case "d", like this: "\d". The "d" must be lower case, for
reasons explained below. The digit metacharacter matches exactly
one digit; that is, exactly one occurrence of "0", "1", "2", "3", "4",
"5", "6", "7", "8" or "9". For example, the regular expression:
2,\d-Dimethylbutane
would match "2,2-Dimethylbutane", "2,3-Dimethylbutane"
and so forth.
Similarly,
1\.\d\d\d\d\d
would match any six-digit floating-point number from
1.00000 to 1.99999
inclusive. We could combine the digit metacharacter with other
metacharacters;
for instance,
a\d+z
matches any string starting with "a", followed by one or more digits,
followed by a "z". (Note that the plus is used, and thus "az" is not
matched.)
The letter "d" in the string "\d"
must be lower-case. This is
because there is another metacharacter, the non-digit
metacharacter,
which uses the uppercase "D". The non-digit metacharacter looks like
"\D" and matches any character except a
digit. Thus,
a\Dz
would match "abz", "aTz" or "a%z", but would not
match "a2z", "a5z"
or "a9z". Similarly,
\D+
Matches any non-null string which contains no
numeric characters.
Notice that in changing the "d" from lower-case to
upper-case, we have
reversed the meaning of the digit metacharacter. This holds true for
most
other metacharacters of the format backslash-letter.
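The digit and non-digit metacharacters can be exercised with a brief sketch. It assumes Python's `re` module; the `re.ASCII` flag is an added assumption that limits "\d" to the ten digits listed above, and `whole` is a hypothetical helper.

```python
import re

def whole(pattern, text):
    """True if the pattern matches text from start to end."""
    # re.ASCII restricts \d to the ten digits "0" through "9".
    return re.fullmatch(pattern, text, re.ASCII) is not None

print(whole(r"2,\d-Dimethylbutane", "2,3-Dimethylbutane"))  # True
print(whole(r"1\.\d\d\d\d\d", "1.41421"))                   # True
print(whole(r"1\.\d\d\d\d\d", "1.5"))                       # False
print(whole(r"a\d+z", "a2024z"), whole(r"a\d+z", "az"))     # True False
print(whole(r"a\Dz", "a%z"), whole(r"a\Dz", "a5z"))         # True False
print(whole(r"\D+", "no digits here"))                      # True
print(whole(r"\D+", "route 66"))                            # False
```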
There are three other metacharacters in the backslash-letter format.
The first is the word metacharacter, which matches exactly one letter,
one number, or the underscore character (_). It is written as "\w".
Its opposite, "\W", matches any one character except a letter, a
number or the underscore. Thus,
a\wz
would match "abz", "aTz", "a5z", "a_z", or any
three-character string starting
with "a", ending with "z", and whose second character was either a
letter
(upper- or lower-case), a number, or the underscore. Similarly,
a\Wz
would not match "abz", "aTz", "a5z", or "a_z".
It would
match "a%z", "a{z", "a?z" or any three-character string starting with
"a" and
ending with "z" and whose second character was not a letter, number, or
underscore. (This means the second character must either be a symbol or
a
whitespace character.)
The whitespace metacharacter matches exactly one character of
whitespace. (Whitespace is defined as spaces, tabs, newlines, or any
character which would not use ink if printed on a printer.) The
whitespace metacharacter looks like this: "\s". Its opposite, which
matches any character that is not whitespace, looks like this: "\S".
Thus,
a\sz
would match any three-character string starting with "a"
and ending with "z"
and whose second character was a space, tab, or newline. Likewise,
a\Sz
would match any three-character string starting with "a"
and ending with "z"
whose second character was not a space, tab or newline.
(Thus, the
second character could be a letter, number or symbol.)
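The word and whitespace metacharacters, together with their upper-case opposites, behave as follows in a small sketch. It assumes Python's `re` module with the `re.ASCII` flag, so that "\w" means letters, digits and the underscore as described above; `whole` is a hypothetical helper.

```python
import re

def whole(pattern, text):
    """True if the pattern matches text from start to end."""
    return re.fullmatch(pattern, text, re.ASCII) is not None

# \w : letter, digit or underscore; \W : anything else
print(whole(r"a\wz", "a_z"), whole(r"a\wz", "a%z"))   # True False
print(whole(r"a\Wz", "a%z"), whole(r"a\Wz", "a z"))   # True True
print(whole(r"a\Wz", "a5z"))                          # False

# \s : space, tab or newline; \S : anything else
print(whole(r"a\sz", "a z"), whole(r"a\sz", "a\tz"))  # True True
print(whole(r"a\sz", "abz"))                          # False
print(whole(r"a\Sz", "abz"), whole(r"a\Sz", "a z"))   # True False
```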
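The word boundary metacharacter matches the boundaries of words; that
is, it matches the position between a word character and a non-word
character (such as whitespace or punctuation), as well as the very
beginning or end of the text when a word character is found there. It
matches a position rather than a character. It looks like "\b". Its
opposite, "\B", matches any position that is not a word boundary. Thus: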
\bcomput
will match "computer" or "computing", but not "supercomputer", since
there are no spaces or punctuation between "super" and "computer".
Similarly,
\Bcomput
will not match "computer" or "computing",
unless it is part of a
bigger word such as "supercomputer" or "recomputing".
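Note that the underscore (_) is considered a "word" character, so no
word boundary falls between a letter and an underscore. Thus,
super\b
will match "super computer" and "super-computer", but will not match
"super_computer".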
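The behavior of "\b" and "\B" is easier to see by experiment. The following is a hedged sketch using Python's `re` module, where both assertions work as in Perl; `found` is a hypothetical helper. Note that a boundary is a position between characters, so it never consumes any text itself.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere inside text."""
    return re.search(pattern, text, re.ASCII) is not None

print(found(r"\bcomput", "computing power"))   # True
print(found(r"\bcomput", "supercomputer"))     # False
print(found(r"\Bcomput", "supercomputer"))     # True
print(found(r"\Bcomput", "computing power"))   # False

# The boundary consumes no characters: only "comput" is returned.
print(re.search(r"\bcomput", "a computer").group())  # comput

# The underscore is a word character, so no boundary follows "super" here.
print(found(r"super\b", "super_computer"))     # False
print(found(r"super\b", "super-computer"))     # True

# A boundary cannot fall between two letters, so this never matches.
print(found(r"super\bcomputer", "supercomputer"))   # False
print(found(r"super\bcomputer", "super computer"))  # False
```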
There is one other metacharacter starting with a backslash, the octal
metacharacter. The octal metacharacter looks like this: "\nnn", where
each "n" is a digit from zero to seven. This is used for specifying
control characters that have no typed equivalent. For example,
\007
would find all subjects with an embedded ASCII "bell"
character. (The bell is
specified by an ASCII value of 7.) You will
rarely need to use the octal metacharacter.
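For completeness, a tiny sketch of the octal metacharacter, assuming Python's `re` module, which also reads a backslash followed by three octal digits as a character code. The name `bell` is illustrative.

```python
import re

bell = re.compile(r"\007")  # octal 007 is ASCII code 7, the bell character

print(bell.search("ding\x07dong") is not None)  # True
print(bell.search("ding dong") is not None)     # False
print(ord("\007"))                              # 7
```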
There are three other metacharacters that may be of use. The first is
the braces metacharacter. This metacharacter follows a normal character
and contains two numbers separated by a comma (,) and surrounded by
braces ({}). It is like the star metacharacter, except that the number
of repetitions it matches must be within the minimum and maximum
specified by the two numbers in braces. Thus,
ab{3,5}c
will match "abbbc", "abbbbc" or "abbbbbc". No other
string is matched.
Likewise,
.{3,5}pentane
will match "cyclopentane", "isopentane" or "neopentane",
but not "n-pentane",
since "n-" is only two characters long.
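The minimum and maximum counts of the braces metacharacter can be confirmed with a short sketch. It assumes Python's `re` module, which uses the same "{min,max}" notation; `whole` is a hypothetical helper that tests the entire string.

```python
import re

def whole(pattern, text):
    """True if the pattern matches text from start to end."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"ab{3,5}c", "abbbc"))     # True  (three b's)
print(whole(r"ab{3,5}c", "abbbbbc"))   # True  (five b's)
print(whole(r"ab{3,5}c", "abbc"))      # False (two b's: too few)
print(whole(r"ab{3,5}c", "abbbbbbc"))  # False (six b's: too many)

print(whole(r".{3,5}pentane", "cyclopentane"))  # True
print(whole(r".{3,5}pentane", "isopentane"))    # True
print(whole(r".{3,5}pentane", "n-pentane"))     # False
```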
The alternative metacharacter is represented by a vertical bar
(|). It
indicates an either/or behavior by separating two
or more possible choices. For example:
isopentane|cyclopentane
will match any subject containing the strings "isopentane" or
"cyclopentane" or both. However, it will not match "pentane" or
"n-pentane" or "neopentane."
The last metacharacter is the brackets metacharacter. The bracket
metacharacter matches one occurrence of any character inside the
brackets ([]). For example,
\s[cmt]an\s
will match "can", "man" and "tan" when they have whitespace on both
sides, but not "ban", "fan" or "pan". Similarly,
2,[23]-dimethylbutane
will match "2,2-dimethylbutane" or "2,3-dimethylbutane", but not
"2,4-dimethylbutane", "2,23-dimethylbutane" or "2,-dimethylbutane".
Ranges of
characters can be used by using the dash (-) within the
brackets. For example,
a[a-d]z
will match "aaz", "abz", "acz" or "adz", and nothing
else. Likewise,
textfile0[3-5]
will match "textfile03", "textfile04", or "textfile05"
and nothing else.
If you wish to include a dash within brackets as one
of the characters to
match, instead of to denote a range, put the dash immediately before
the
right bracket. Thus:
a[1234-]z
and
a[1-4-]z
both do the same thing. They both match "a1z", "a2z",
"a3z", "a4z" or "a-z",
and nothing else.
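Alternation and brackets both express a choice: the vertical bar chooses between whole strings, while brackets choose a single character. The sketch below contrasts the two and checks the range and trailing-dash examples above. It assumes Python's `re` module, which shares this syntax with Perl; `found` and `whole` are hypothetical helpers.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere inside text."""
    return re.search(pattern, text) is not None

def whole(pattern, text):
    """True if the pattern matches text from start to end."""
    return re.fullmatch(pattern, text) is not None

# Alternation chooses between whole strings.
print(found(r"isopentane|cyclopentane", "boiling point of cyclopentane"))  # True
print(found(r"isopentane|cyclopentane", "neopentane"))                     # False

# Brackets choose one character from a set or range.
print(whole(r"2,[23]-dimethylbutane", "2,3-dimethylbutane"))   # True
print(whole(r"2,[23]-dimethylbutane", "2,23-dimethylbutane"))  # False
print(whole(r"a[a-d]z", "acz"), whole(r"a[a-d]z", "aez"))      # True False
print(whole(r"textfile0[3-5]", "textfile04"))                  # True

# A dash just before the right bracket is a literal dash.
for pattern in (r"a[1234-]z", r"a[1-4-]z"):
    # Each pattern prints: True True False
    print(pattern, whole(pattern, "a-z"), whole(pattern, "a3z"), whole(pattern, "a5z"))
```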
The bracket metacharacter can also be inverted by
placing a caret
(^) immediately after
the left bracket. Thus,
textfile0[^02468]
matches any ten-character string starting with "textfile0" and ending
with any character except an even digit. Inversion and ranges can be
combined, so that
\W[^f-h]ood\W
matches any four-letter word ending in "ood" except for "food",
"good" or "hood". (Thus "mood" and "wood" would both be matched.)
Note that within brackets, ordinary quoting rules do
not apply and other
metacharacters are not available. The only characters that can be
quoted
in brackets are "[", "]", and "\".
Thus,
[\[\\\]]abc
matches any four-character string ending with "abc" and starting with
"[", "]", or "\".
Forbidden Characters
Because of the way the searcher works, the following metacharacters
should not be used, even though they are valid Perl metacharacters.
They are:
- ^ (allowed within brackets)
- $ (allowed within brackets)
- \n
- \r
- \t
- \f
- \b
- ( ) (allowed within brackets. Note
that if you wish to search for parentheses within text outside of
brackets, you should quote the parentheses.)
- \1, \2 ... \9
- \B
- :
- !
Things To Remember
Here are some other things you should know about regular expressions.
- The archive search software searches only subject
lines, and all
articles within the same thread will also be displayed.
- Regular expressions should be a last resort.
Because they are
complex, it can be more work mastering a search than just sifting
through a long list of matches (unless you're already familiar
with regular expressions).
- We limit the number of articles which can be shown to 200 or fewer.
This is to minimize load on our system.
- The search is case insensitive; thus
mopac
and
Mopac
and
MOPAC
all search for the same set of strings. Each will
match "mopac", "MOPAC", "Mopac", "mopaC", "MoPaC", "mOpAc" and so
forth. Thus you need not worry about capitalization. (Note, however,
that metacharacters must still have the proper case. This is especially
important for metacharacters whose case determines whether their
meaning is reversed or not.)
- Outside of the brackets metacharacter, you must
quote parentheses, brackets and braces to get the searcher to take them
literally.
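The points above about case and quoting can be illustrated with one last sketch. It assumes Python's `re` module with the `re.IGNORECASE` flag to imitate the searcher; the helper `found` and the sample subject lines are made up for illustration.

```python
import re

def found(pattern, subject):
    """True if the pattern occurs in the subject, ignoring letter case."""
    return re.search(pattern, subject, re.IGNORECASE) is not None

# Letter case in the search text does not matter...
print(found("mopac", "MoPaC 7 released"))   # True
print(found("MOPAC", "new mopac input"))    # True

# ...but the case of a backslash-letter metacharacter still does.
print(found(r"a\dz", "A5Z"))   # True:  \d is a digit
print(found(r"a\Dz", "A5Z"))   # False: \D is a non-digit

# Parentheses, brackets and braces are quoted to be taken literally.
print(found(r"\(R\)-limonene", "Synthesis of (R)-limonene"))  # True
print(found(r"\[2\.2\.2\]", "bicyclo[2.2.2]octane"))          # True
print(found(r"x\{2\}", "x{2}"))                               # True
```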
Copyright (c) Carl Franklin and Gary Wisniewski,
1994-1996. All rights reserved.