Pattern Matching with Regular Expressions
Sometimes you need to compare character strings to see whether they
fit certain characteristics, rather than to see whether they match
exact values. For example, you may want to identify strings that begin
with S or strings that have numbers in them. For this type of comparison,
you compare the string to a pattern. These patterns are called regular
expressions, nicknamed regex.
You have probably used some form of pattern matching in the past.
When you use an asterisk (*) as a wild card when searching for files
on your hard drive (dir ex*.doc, for example), you're pattern matching.
For example, ex*.txt is a pattern. Any string that begins with ex and
ends with the string .txt, with any characters in between the ex and the .txt,
matches the pattern. The strings exam.txt, ex33.txt, and ex3x4.txt
all match the pattern. Using regular expressions is just a more complicated
variation of using wild cards.
Pattern matching is used to check or validate the input from a Web
page form. If the information input doesn't match a specific pattern,
it may not be something you want to store in your database. For example, if the
user types a U.S. zip code into your form, you know the format needs
to be five digits or a zip + 4. So, you can check the input to see
if it fits the pattern. If it doesn't, you know it's not a valid zip
code, and you can ask the user to type in the correct information.
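As a rough sketch of that zip code check, the following PHP tests a value against a pattern for five digits, optionally followed by a hyphen and four more digits. The function name isValidZip and the sample values are made up for this illustration. The preg_match function and the special characters in the pattern are explained later in this section.

```php
<?php
// Hypothetical helper: returns true when $input looks like a U.S. zip code
// (five digits, optionally followed by a hyphen and four more digits).
function isValidZip(string $input): bool
{
    return preg_match('/^[0-9]{5}(-[0-9]{4})?$/', $input) === 1;
}

// Sample values standing in for what a user might type into the form.
foreach (['12345', '12345-6789', '1234', '12345-67', 'abcde'] as $zip) {
    echo $zip, ': ', isValidZip($zip) ? 'valid' : 'not valid', "\n";
}
```

A check like this only confirms the format. It can't tell you whether the zip code actually exists.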
Regular expressions are used to match strings to patterns. Many applications, such as text editors and word processors, allow searches using regular expressions. In particular, Dreamweaver can search your Web page files for strings using regex when you use Edit > Find and Replace.
PHP provides support for Perl-compatible regular expressions. The following section describes some basic Perl-compatible regular expressions. (Much more complex and powerful pattern matching is possible. See www.php.net/manual/en/reference.pcre.pattern.syntax.php for further explanation of Perl-compatible regular expressions.)
To create regular expressions, you generally use a combination of literal characters and special characters. For example, suppose you wanted to accept only users whose last name begins with S in your database. I know, it doesn't make too much sense, but just pretend with me. A regular expression that tests for this would be:
^S.*
which means S at the beginning of the string, followed by zero or more characters of any kind. You will see exactly what these special characters mean later in this section.
You match the regular expression with a string using a PHP function, as follows:
preg_match($regex,$string);
The function returns 1 if the regex matches the string and 0 if it doesn't (or false if an error occurs). It never returns more than 1, because the function stops searching after the first match.
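The regular expression must be enclosed by delimiting characters when used in the function. The delimiter can be any character that is not a letter, a digit, a backslash, or whitespace. It is simplest to choose one that does not appear in the regex itself. For example,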
preg_match("/^S.*/",$string)
In this case, / is used to enclose the regex, a common choice. However, if your regex contains a /, you must either escape it with a backslash (\/) or use a different character as a delimiter, such as # or &.
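For instance, a pattern that has to match a literal / is easier to read with # as the delimiter. The following is a small sketch; the variable names and the sample path are illustrative only.

```php
<?php
$path = 'docs/readme.txt'; // sample string, made up for this example

// With / as the delimiter, the / inside the pattern has to be escaped.
$withSlashes = preg_match('/^docs\/.*/', $path);

// With # as the delimiter, the / inside the pattern needs no escaping.
$withHashes = preg_match('#^docs/.*#', $path);

echo $withSlashes, ' ', $withHashes, "\n"; // both calls return 1
```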
OK, now let's look at this simple regex at work in an if statement:
$string = "Smith";
$regex = "/^S.*/";
if (preg_match($regex, $string))
{
    echo "Match";
}
else
{
    echo "No match";
}
You can write a script containing this code. When you run it, it will echo Match. You can try changing the value of $string to see what matches and what doesn't.
Before you can write useful regular expressions, you need to understand what the special characters mean and when to use each one. The rest of this section describes the most useful special characters. It includes many examples of regular expressions, showing what matches and what doesn't. You can follow along and test these examples using the code shown above. You can change the values of $regex and $string in your code and see whether they match. You may even find an error in my examples.
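If you would rather test several strings at once, one possible approach is a small helper like the one below. The function name tryRegex is hypothetical, not part of PHP. It simply calls preg_match on each string in turn.

```php
<?php
// Hypothetical helper for trying out the examples in this section:
// returns, for each test string, whether the regex matches it.
function tryRegex(string $regex, array $strings): array
{
    $results = [];
    foreach ($strings as $string) {
        // preg_match() returns 1 for a match and 0 for no match.
        $results[$string] = (preg_match($regex, $string) === 1);
    }
    return $results;
}

foreach (tryRegex('/^S.*/', ['Smith', 'Jones', 'Mr. Smith']) as $string => $matched) {
    echo $string, ': ', $matched ? 'Match' : 'No match', "\n";
}
```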
The following are the most useful special characters:
* Match a single character (.) (?)
You can match any single character with a dot (.). A dot means that
there must be a character in the string at that position. You can make
a single character optional by placing a question mark (?) after it.
| Regex | Match | Not a Match |
| .t | at, xt | ax, xx |
| m.x | mix | mx, miix |
| mi?x | mix, mx | miix |
| m.?x | mix, max, mx | miix, maax |
* Specify the location (^) ($)
You can specify that a string only matches when it occurs at the beginning of a line with a ^. You can specify that a string only matches when it occurs at the end of a line with $.
| Regex | Match | Not a Match |
| ^Sir | Sir, Sir John | Dear Sir |
| John$ | John, Sir John | John Smith |
| ^Sir$ | Sir | Dear Sir, Sir John |
| ^.$ | a (any line with only one character) | aa (any line with more or less than one character) |
| ^.?$ | a (any line with zero or one character) | aa (any line with more than one character) |
| ^$ | (any blank line) | (any non blank line) |
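To see the location characters at work on several strings at once, here is a short sketch using PHP's preg_grep function, which returns the entries of an array that match a pattern. The sample lines and variable names are made up for this example.

```php
<?php
// Sample lines, made up for this example.
$lines = ['Sir', 'Sir John', 'Dear Sir', 'John Smith', ''];

// preg_grep() keeps the entries that match the pattern;
// array_values() renumbers the result starting from 0.
$startsWithSir = array_values(preg_grep('/^Sir/', $lines));  // Sir, Sir John
$endsWithSir   = array_values(preg_grep('/Sir$/', $lines));  // Sir, Dear Sir
$exactlySir    = array_values(preg_grep('/^Sir$/', $lines)); // Sir
$blank         = array_values(preg_grep('/^$/', $lines));    // the empty string

print_r($startsWithSir);
```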
* Group Characters ( () )
You can group characters together with parentheses, so that the group is treated as a single unit. A special character that follows the group, such as ?, then applies to the whole group.
| Regex | Match | Not a Match |
| a(bc)?x | abcx, ax, lax | abx, acx |
| ^a.(bc)5 | axbc5, aqbc56 | axb5, aqc5, laxb |
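The sketch below shows why lax matches a(bc)?x while abx doesn't. It relies on the optional third argument of preg_match, which receives the text that was matched. The variable names are illustrative.

```php
<?php
// The optional third argument of preg_match() receives the matched text:
// $m[0] is the whole match and $m[1] is what the first group matched.
preg_match('/a(bc)?x/', 'abcx', $m);
$whole = $m[0];   // "abcx"
$group = $m[1];   // "bc"

// "lax" matches because the pattern is not anchored: it finds "ax" inside.
preg_match('/a(bc)?x/', 'lax', $m2);
$inside = $m2[0]; // "ax"

// "abx" does not match: the group is all-or-nothing, so "b" alone fails.
$partial = preg_match('/a(bc)?x/', 'abx'); // 0

echo $whole, ' ', $group, ' ', $inside, ' ', $partial, "\n";
```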
* Match one of a set of literal characters ( [ ] )
You can put a set of literal characters inside square brackets. The pattern matches if any one of the set of characters is found. You can indicate a range of characters within the brackets by using a hyphen (-). If you want to include a hyphen as a literal character, include it at the beginning or end of the set. Otherwise, it will be seen as indicating a range.
| Regex | Match | Not a Match |
| a[bcd]ef | abef, acef, adef | aef, abcef |
| a[b-d]ef | abef, acef, adef | aef, abcef |
| a[,.-]b | a,b, a.b, a-b | ab, axb, a,-b |
| file[1-2][0-9]? | file1, file15, file20 | file, file3, file40 |
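Two of the rows above can be checked with a few lines of PHP. This is only a sketch: the variable names are made up, and preg_grep is used to pick the matching strings out of each list.

```php
<?php
// A range plus an optional digit: file1 through file29 style names.
$names   = ['file1', 'file15', 'file20', 'file', 'file3', 'file40'];
$matched = array_values(preg_grep('/file[1-2][0-9]?/', $names));

// The hyphen comes last in the set, so it is a literal hyphen, not a range.
$candidates = ['a,b', 'a.b', 'a-b', 'ab', 'axb', 'a,-b'];
$separated  = array_values(preg_grep('/a[,.-]b/', $candidates));

echo implode(' ', $matched), "\n";   // file1 file15 file20
echo implode(' ', $separated), "\n"; // a,b a.b a-b
```

Because the first pattern is not anchored with $, a longer name such as file100 would also match it.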
* Exclude a set of literal characters ( [^ ] )
You can put a set of literal characters inside square brackets, preceded by a circumflex (^). The pattern then matches any single character that is not in the set. A character must still be present at that position.
| Regex | Match | Not a Match |
| a[^bcd]ef | axef, akef | abef, acef, adef, axkef |
| a[^b-d]ef | axef, akef | abef, acef, adef, axkef |
| file[^1-2][0-9]? | file3, file45 | file, file1, file10 |
* Match a string of characters ( + ) ( * ) ( {n} )
You can match a string of one or more characters by adding a plus (+) after the character. You can match a string of zero or more characters by adding an asterisk (*) after the character. You can match a string of a specified number of characters by adding curly brackets enclosing a number after the character, such as {3}, or a minimum and maximum separated by a comma, such as {2,3}.
| Regex | Match | Not a Match |
| ab+c | abc, abbbc, abbbbc | ac, axc |
| ab*c | ac, abc, abbbc | axc |
| ab{3}c | abbbc | abc, abbc, abbbbbc |
| ab{2,3}c | abbc, abbbc | ac, abc, abbbbbc |
| a[bc]+d | abd, acccd, abccbd | ad, axd |
| a(bc)*d | ad, abcd, abcbcbcd | abd, acd, axd |
| ^.+$ | a, aaa (any line with one or more characters in it) | (a blank line) |
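To compare the repetition characters side by side, you could run the same list of strings through each pattern, as in this sketch. The array names are made up, and the strings come from the examples above.

```php
<?php
// Strings to test, taken from the examples above.
$strings = ['ac', 'abc', 'abbc', 'abbbc', 'abbbbbc', 'axc'];

$patterns = [
    '/ab+c/'     => 'one or more b',
    '/ab*c/'     => 'zero or more b',
    '/ab{3}c/'   => 'exactly three b',
    '/ab{2,3}c/' => 'two or three b',
];

$results = [];
foreach ($patterns as $regex => $meaning) {
    // Collect the strings that match this pattern.
    $results[$regex] = array_values(preg_grep($regex, $strings));
    echo $regex, ' (', $meaning, '): ', implode(', ', $results[$regex]), "\n";
}
```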
* Match one of alternate literal strings ( ( | ) )
You can put a set of literal strings, each separated by a pipe bar (|), between parentheses. The string matches if it contains any one of the alternate literal strings.
| Regex | Match | Not a Match |
| a(bc|de|fg)x | abcx, adex, afgx | ax, abx, abcdex |
| I (love|hate) carrots | I love carrots | I like carrots |
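Here is a brief sketch of the carrots pattern in PHP. It also uses the optional third argument of preg_match to find out which of the alternate strings matched. The variable names are illustrative.

```php
<?php
$sentences = ['I love carrots', 'I hate carrots', 'I like carrots'];

$verbs = [];
foreach ($sentences as $sentence) {
    if (preg_match('/I (love|hate) carrots/', $sentence, $m)) {
        // $m[1] holds whichever alternative matched.
        $verbs[$sentence] = $m[1];
    } else {
        $verbs[$sentence] = null;
    }
    echo $sentence, ' => ', $verbs[$sentence] ?? 'no match', "\n";
}
```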
You can mix many special characters and literal characters together to build a regular expression. Really long, complex regular expressions can be built to match any conceivable string.
A special character is sometimes part of a literal string, such as a period at the end of a sentence. If you want to include a character in a literal string that can also be a special character, you need to escape the character, which means you insert a backslash ( \ ) in front of the special character. For example, look at the following two regular expressions:
^.$
^\.$
The first pattern matches a line that has one character on it—any
character. The second expression matches a line that has one dot
on it. The \ in front of the dot makes it into a literal dot, rather
than a special character that represents any one character. If you
need to include a \ as a literal character, you would include \\
in the regex.
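The difference between the two patterns can be sketched in PHP as follows. The example also shows preg_quote, a PHP function that adds the backslashes for you. The variable names are made up, and single-quoted strings are used so that PHP passes each backslash through to the regex unchanged.

```php
<?php
$anyChar = preg_match('/^.$/', 'a');   // 1: the dot matches any character
$dotOnly = preg_match('/^\.$/', 'a');  // 0: only a literal dot matches
$dot     = preg_match('/^\.$/', '.');  // 1

// preg_quote() escapes the special characters in a string for you.
$escaped = preg_quote('ex*.txt');      // ex\*\.txt
$literal = preg_match('/^' . $escaped . '$/', 'ex*.txt'); // 1

echo $anyChar, $dotOnly, $dot, ' ', $escaped, ' ', $literal, "\n";
```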
The following are some examples of regular expressions:
* Matches any word of normal text
[A-Za-z][a-z-]* 
The above expression ends with a single space, after the *, to indicate the end of the word. The space is part of the regex, even though it is easy to overlook.
* Matches phone numbers
^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$
^[0-9)( -]{7,20}$
The first expression matches phone numbers of the format (nnn) nnn-nnnn. Notice that the parentheses are escaped. The second expression is a more flexible expression that matches phone numbers in various formats. The string can contain numbers, parentheses, spaces, and hyphens. The string must be at least 7 characters long but not more than 20 characters long.
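To compare the two phone number expressions, you could run a few sample numbers through both, as in this sketch. The numbers and variable names are made up for the example.

```php
<?php
$strict   = '/^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$/';
$flexible = '/^[0-9)( -]{7,20}$/';

// Sample numbers, made up for this example.
$numbers = ['(555) 123-4567', '555-123-4567', '555 1234', '555.123.4567', '555-12'];

// preg_grep() keeps only the numbers that match each pattern.
$strictOk   = array_values(preg_grep($strict, $numbers));
$flexibleOk = array_values(preg_grep($flexible, $numbers));

echo 'Strict:   ', implode(' | ', $strictOk), "\n";
echo 'Flexible: ', implode(' | ', $flexibleOk), "\n";
```

Keep in mind that the flexible pattern checks only which characters appear and how many there are, so it also accepts a string such as ------- that is clearly not a phone number.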