Dynamic Web Development with PHP | Supplement

Pattern Matching with Regular Expressions

Sometimes you need to compare character strings to see whether they fit certain characteristics, rather than to see whether they match exact values. For example, you may want to identify strings that begin with S or strings that have numbers in them. For this type of comparison, you compare the string to a pattern. These patterns are called regular expressions, nicknamed regex.

You have probably used some form of pattern matching in the past. When you use an asterisk (*) as a wild card when searching for files on your hard drive (dir ex*.doc, for example), you're pattern matching. Similarly, ex*.txt is a pattern. Any string that begins with ex and ends with the string .txt, with any characters in between the ex and the .txt, matches the pattern. The strings exam.txt, ex33.txt, and ex3x4.txt all match the pattern. Using regular expressions is just a more complicated variation of using wild cards.

Pattern matching is used to check or validate the input from a Web page form. If the information input doesn't match a specific pattern, it may not be something you want to store in your database. For example, if the user types a U.S. zip code into your form, you know the format needs to be five digits or a zip + 4. So, you can check the input to see if it fits the pattern. If it doesn't, you know it's not a valid zip code, and you can ask the user to type in the correct information.
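As a rough sketch of this kind of check, the following code tests for five digits, optionally followed by a hyphen and four more digits. It uses the preg_match function and special characters that are explained later in this section. The function name isValidZip and the exact pattern are my own assumptions for illustration; a real form may need different rules.

```php
<?php
// Hypothetical helper: returns true when $input looks like a U.S. zip code
// (five digits, optionally followed by a hyphen and four more digits).
function isValidZip(string $input): bool
{
    return preg_match('/^[0-9]{5}(-[0-9]{4})?$/', $input) === 1;
}

foreach (['12345', '12345-6789', '1234', '12345-67', 'abcde'] as $zip) {
    echo $zip, ': ', isValidZip($zip) ? 'looks valid' : 'not valid', "\n";
}
```

If the function returns false, the script can redisplay the form and ask the user to correct the entry.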

Regular expressions are used to match strings to patterns. Many applications, such as text editors and word processors, allow searches using regular expressions. In particular, Dreamweaver can search your Web page files for strings using regex when you use Edit > Find and Replace.

PHP provides support for Perl-compatible regular expressions. The following section describes some basic Perl-compatible regular expressions. (Much more complex and powerful pattern matching is possible. See www.php.net/manual/en/reference.pcre.pattern.syntax.php for further explanation of Perl-compatible regular expressions.)

Regular expressions are combinations of the following:

 
 
  • Literal characters: Normal characters, with no special meaning. An e is an e, for example, with no meaning other than that it's one of 26 letters in the alphabet.

  • Special characters: Special characters, on the other hand, have special meaning in the pattern, such as the asterisk (*) when used as a wild card.

 
 

To create regular expressions, you generally use a combination of literal characters and special characters. For example, suppose you wanted to accept only users whose last name begins with S in your database. I know, it doesn't make too much sense, but just pretend with me. A regular expression that tests for this would be:

^S.*

which means S at the beginning of the string, followed by zero or more characters of any kind (not only letters). You will see exactly what these special characters mean later in this section.

You match the regular expression with a string using a PHP function, as follows:

preg_match($regex,$string); 

The function returns the number of times the regex matches the string. This will be either 0 or 1, because the function stops searching after the first match. If an error occurs, such as an invalid regex, the function returns false instead.

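The regular expression must be enclosed by delimiting characters when used in the function. The delimiter can be any character that is not a letter, a digit, a backslash, or whitespace; it is simplest to choose one that does not appear in the regex itself. For example,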

preg_match("/^S.*/",$string) 

In this case, slashes (/) enclose the regex, a common choice. However, if your regex contains a /, you must either escape it with a backslash (\/) or use a different character as a delimiter, such as # or &.
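For instance, a pattern that looks for a / in the string is easier to read with a different delimiter. The sketch below writes the same test twice, once with # as the delimiter and once with / as the delimiter plus an escaped slash. The sample string and variable names are made up for illustration.

```php
<?php
$string = "images/logo.png"; // sample value, chosen only for illustration

// '#' as the delimiter, so the slash inside the pattern needs no escaping.
$withHash = preg_match('#^images/#', $string);

// '/' as the delimiter: the slash inside the pattern is escaped as \/.
$withSlash = preg_match('/^images\//', $string);

echo $withHash, " ", $withSlash, "\n"; // both calls report a match: "1 1"
```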

OK, now let's look at this simple regex at work in an if statement:

$string = "Smith";
$regex = "/^S.*/";
if(preg_match($regex,$string))
{
     echo "Match";
}
else
{
     echo "No match";
} 

You can write a script containing this code. When you run it, it will echo Match. You can try changing the value of $string to see what matches and what doesn't.

Before you can write useful regular expressions, you need to understand what the special characters mean and when to use each one. The rest of this section describes the most useful special characters. It includes many examples of regular expressions, showing what matches and what doesn't. You can follow along and test these examples using the code shown above. You can change the values of $regex and $string in your code and see whether they match. You may even find an error in my examples.
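If changing $string by hand gets tedious, one possible shortcut is a small function that tries a regex against several strings at once. This is only a sketch; the function name tryRegex is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper for experimenting: reports, for each test string,
// whether it matches $regex (delimiters included).
function tryRegex(string $regex, array $strings): array
{
    $results = [];
    foreach ($strings as $string) {
        $results[$string] = preg_match($regex, $string) === 1;
    }
    return $results;
}

foreach (tryRegex('/^S.*/', ['Smith', 'Jones', 'smith']) as $string => $matched) {
    echo $string, ': ', $matched ? 'Match' : 'No match', "\n";
}
```

Note that smith does not match, because regular expressions are case sensitive unless you say otherwise.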

The following are the most useful special characters:

* Match a single character (.) (?)

You can match any single character with a dot (.). A dot means that there must be a character, of any kind, at that position in the string. You can make a single character optional by placing a question mark (?) after it.

Regex     Match            Not a Match
.t        at, xt           ax, xx
m.x       mix              mx, miix
mi?x      mix, mx          miix
m.?x      mix, max, mx     miix, maax
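To see the dot and the question mark at work in PHP, here is a small sketch. The pattern colou?r and the variable names are my own illustrations.

```php
<?php
// The u is optional, so both spellings match.
$optionalU = '/colou?r/';
// The dot stands for exactly one character of any kind.
$anyMiddle = '/m.x/';

$colorHit  = preg_match($optionalU, 'color');   // 1: the u is absent
$colourHit = preg_match($optionalU, 'colour');  // 1: the u is present
$mixHit    = preg_match($anyMiddle, 'mix');     // 1: i fills the dot
$mxHit     = preg_match($anyMiddle, 'mx');      // 0: nothing fills the dot

echo "$colorHit $colourHit $mixHit $mxHit\n"; // prints "1 1 1 0"
```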

* Specify the location (^) ($)

You can specify that a string only matches when it occurs at the beginning of a line with a ^. You can specify that a string only matches when it occurs at the end of a line with $.

Regex     Match                                       Not a Match
^Sir      Sir, Sir John                               Dear Sir
John$     John, Sir John                              John Smith
^Sir$     Sir                                         Dear Sir, Sir John
^.$       a (any line with only one character)        aa (any line with more or fewer than one character)
^.?$      a (any line with zero or one character)     aa (any line with more than one character)
^$        (any blank line)                            (any non-blank line)
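Because preg_match looks for the pattern anywhere in the string, the ^ and $ characters are what turn a "contains" test into a "starts with", "ends with", or "whole line" test. A brief sketch, using a sample string of my own choosing:

```php
<?php
$string = 'Dear Sir John';

$anywhere = preg_match('/Sir/', $string);    // 1: Sir occurs somewhere
$atStart  = preg_match('/^Sir/', $string);   // 0: the string starts with Dear
$atEnd    = preg_match('/John$/', $string);  // 1: the string ends with John
$whole    = preg_match('/^Sir$/', $string);  // 0: the string is more than Sir

echo "$anywhere $atStart $atEnd $whole\n"; // prints "1 0 1 0"
```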

* Group Characters ( () )

You can use parentheses to group characters together, so that they are treated as a single unit.

Regex        Match             Not a Match
a(bc)?x      abcx, ax, lax     abx, acx
^a.(bc)5     axbc5, aqbc56     axb5, aqc5, laxb
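The effect of the parentheses is easiest to see by comparing a pattern with and without them. In this sketch (the variable names are mine), the question mark applies only to the c in the first pattern, but to bc as a whole in the second.

```php
<?php
// Without parentheses, ? applies only to the c.
$noGroup = '/abc?x/';
// With parentheses, ? applies to bc as a unit.
$group   = '/a(bc)?x/';

foreach (['abcx', 'abx', 'ax'] as $string) {
    printf(
        "%-5s abc?x: %d  a(bc)?x: %d\n",
        $string,
        preg_match($noGroup, $string),
        preg_match($group, $string)
    );
}
```

Both patterns match abcx. Only the first matches abx, and only the second matches ax.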

* Match one of a set of literal characters ( [ ] )

You can put a set of literal characters inside square brackets. The pattern matches if any one of the set of characters is found at that position. You can indicate a range of characters within the brackets by using a hyphen (-). If you want to include a hyphen as a literal character, include it at the beginning or end of the set. Otherwise, it will be seen as indicating a range.

Regex               Match                     Not a Match
a[bcd]ef            abef, acef, adef          aef, abcef
a[b-d]ef            abef, acef, adef          aef, abcef
a[,.-]b             a,b, a.b, a-b             ab, axb, a,-b
file[1-2][0-9]?     file1, file15, file20     file, file3, file40
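The position of the hyphen inside the brackets matters, as this short sketch tries to show. The patterns and variable names are illustrative only.

```php
<?php
$range   = '/^a[b-d]e$/';  // hyphen between two characters: the range b, c, d
$literal = '/^a[bd-]e$/';  // hyphen at the end of the set: b, d, or a literal -

$rangeC    = preg_match($range, 'ace');    // 1: c is inside b-d
$rangeDash = preg_match($range, 'a-e');    // 0: - is not in the range
$litC      = preg_match($literal, 'ace');  // 0: c is not in the set
$litDash   = preg_match($literal, 'a-e');  // 1: - is in the set

echo "$rangeC $rangeDash $litC $litDash\n"; // prints "1 0 0 1"
```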

* Exclude a set of literal characters ( [^ ] )

You can put a set of literal characters inside square brackets, preceded by a circumflex (^). The pattern matches only if the character at that position is not one of the set. A character must still be present: the brackets never match nothing.

Regex                Match             Not a Match
a[^bcd]ef            axef, akef        abef, acef, adef, axkef
a[^b-d]ef            axef, akef        abef, acef, adef, axkef
file[^1-2][0-9]?     file3, file45     file, file1, file10
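One point that is easy to miss: an excluded set still needs some character to be there. The following sketch, with variable names of my own, shows why file on its own is not a match.

```php
<?php
$regex = '/file[^1-2]/';

$other  = preg_match($regex, 'file3');  // 1: 3 is not 1 or 2
$letter = preg_match($regex, 'filex');  // 1: any character outside the set will do
$one    = preg_match($regex, 'file1');  // 0: 1 is excluded
$none   = preg_match($regex, 'file');   // 0: there is no character after file

echo "$other $letter $one $none\n"; // prints "1 1 0 0"
```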

* Match a string of characters ( + ) ( * ) ( {n} )

You can match a string of one or more characters by adding a plus (+) after the character. You can match a string of zero or more characters by adding an asterisk (*) after the character. You can match a string of a specified number of characters by adding curly brackets enclosing a number after the character. Two numbers separated by a comma, such as {2,3}, give a minimum and a maximum.

Regex        Match                                                    Not a Match
ab+c         abc, abbbc, abbbbc                                       ac, axc
ab*c         ac, abc, abbbc                                           axc
ab{3}c       abbbc                                                    abc, abbc, abbbbbc
ab{2,3}c     abbc, abbbc                                              ac, abc, abbbbbc
a[bc]+d      abd, acccd, abccbd                                       ad, axd
a(bc)*d      ad, abcd, abcbcbcd                                       abd, acd, axd
^.+$         a, aaa (any line with one or more characters in it)      (a blank line)
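To compare the four ways of repeating the b side by side, you could run each pattern against the same list of strings. This is a sketch; the variable names and the list of test strings are assumptions of mine.

```php
<?php
$patterns = ['/ab+c/', '/ab*c/', '/ab{3}c/', '/ab{2,3}c/'];
$strings  = ['ac', 'abc', 'abbc', 'abbbc', 'abbbbc'];

// For each pattern, collect the test strings that it matches.
$results = [];
foreach ($patterns as $regex) {
    $results[$regex] = [];
    foreach ($strings as $string) {
        if (preg_match($regex, $string) === 1) {
            $results[$regex][] = $string;
        }
    }
    echo $regex, ' matches: ', implode(', ', $results[$regex]), "\n";
}
```

Only ab*c accepts ac, because it is the only one of the four that allows zero b characters.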

* Match one of alternate literal strings ( ( | ) )

You can put a set of literal strings, each separated by a pipe bar (|), between parentheses. The string matches if it contains any one of the alternate literal strings.

Regex                     Match                Not a Match
a(bc|de|fg)x              abcx, adex, afgx     ax, abx, abcdex
I (love|hate) carrots     I love carrots       I like carrots
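The parentheses matter here, because they limit how far each alternative reaches. In the sketch below (the sample string and variable names are mine), leaving them out turns the pattern into "starts with I love" or "ends with hate carrots", which is probably not what you want.

```php
<?php
$grouped   = '/^I (love|hate) carrots$/';
// Without parentheses the alternatives are "^I love" and "hate carrots$".
$ungrouped = '/^I love|hate carrots$/';

$string = 'I love broccoli';

$groupedHit   = preg_match($grouped, $string);    // 0: the string does not end in carrots
$ungroupedHit = preg_match($ungrouped, $string);  // 1: the string starts with "I love"

echo "$groupedHit $ungroupedHit\n"; // prints "0 1"
```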

You can mix many special characters and literal characters together to build a regular expression. Really long, complex regular expressions can be built to match any conceivable string.

A special character is sometimes part of a literal string, such as a period at the end of a sentence. If you want to include a character in a literal string that can also be a special character, you need to escape the character, which means you insert a backslash ( \ ) in front of the special character. For example, look at the following two regexes:

^.$
^\.$

The first pattern matches a line that has one character on it—any character. The second expression matches a line that has one dot on it. The \ in front of the dot makes it into a literal dot, rather than a special character that represents any one character. If you need to include a \ as a literal character, you would include \\ in the regex.
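The difference is easy to check in PHP. The sketch below also uses preg_quote, a built-in PHP function not otherwise covered in this section, which adds the backslashes for you. The sample values are made up, and the patterns are written in single quotes so that PHP passes the backslash through to the regex unchanged.

```php
<?php
$unescaped = '/^3.50$/';   // the dot matches any character
$escaped   = '/^3\.50$/';  // the dot matches only a literal dot

$a = preg_match($unescaped, '3x50');  // 1
$b = preg_match($escaped, '3x50');    // 0
$c = preg_match($escaped, '3.50');    // 1

// preg_quote() escapes the special characters in a literal string; the
// second argument names the delimiter so that it is escaped as well.
$quoted = preg_quote('3.50', '/');               // 3\.50
$d = preg_match('/^' . $quoted . '$/', '3.50');  // 1

echo "$a $b $c $d $quoted\n"; // prints "1 0 1 1 3\.50"
```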

The following are some examples of regular expressions:

* Matches any word of normal text

[A-Za-z][a-z-]* 

The above expression includes a space at the end, after the *, to indicate the end of the word.
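To see what the trailing space does, you can count the words the pattern finds in a sentence. This sketch uses preg_match_all, a built-in PHP function that, unlike preg_match, keeps searching after the first match. The sample sentence is my own.

```php
<?php
$text  = 'The quick brown fox jumps';
$regex = '/[A-Za-z][a-z-]* /';   // note the space before the closing delimiter

// preg_match_all() returns the number of matches and stores every
// matched substring in $matches[0].
$count = preg_match_all($regex, $text, $matches);

echo $count, "\n";  // prints 4: "jumps" has no space after it, so it is not counted
echo implode('|', $matches[0]), "\n";  // each matched word includes its trailing space
```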

* Matches phone numbers

^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$
^[0-9)( -]{7,20}$

The first expression matches phone numbers of the format (nnn) nnn-nnnn. Notice that the parentheses are escaped. The second expression is a more flexible expression that matches phone numbers in various formats. The string can contain digits, parentheses, spaces, and hyphens (but not dots, which are not in the set). The string must be at least 7 characters long but not more than 20 characters long.
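A quick way to compare the two expressions is to run both against a handful of sample numbers. The samples and variable names below are made up for illustration.

```php
<?php
$strict   = '/^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$/';
$flexible = '/^[0-9)( -]{7,20}$/';

$samples = ['(555) 123-4567', '555-123-4567', '555 1234', '555.123.4567', '12345'];

$results = [];
foreach ($samples as $phone) {
    $results[$phone] = [
        'strict'   => preg_match($strict, $phone),
        'flexible' => preg_match($flexible, $phone),
    ];
    printf(
        "%-15s strict: %d  flexible: %d\n",
        $phone,
        $results[$phone]['strict'],
        $results[$phone]['flexible']
    );
}
```

Only the first sample passes the strict expression. The flexible expression accepts the first three samples, but rejects 555.123.4567 because of the dots and 12345 because it is shorter than 7 characters.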