Sunday, September 30, 2007

Using Perl Compatible Regular Expressions with PHP

Intended Audience

This tutorial is intended for the PHP programmer interested in using Perl Compatible Regular Expressions (PCRE for short) to match or replace values within a target string.

A basic understanding of PHP is assumed, and an overview of Perl-style pattern syntax is given in this tutorial. Knowledge of the intended use of regular expressions will be helpful, although this tutorial should make that clear.

Readers interested in learning more about Regular Expressions and PHP before reading this tutorial are encouraged to do so by referring to:

Zend.com's PHP manual entry on PCRE
http://www.zend.com/manual/ref.pcre.php

Zend.com's PHP manual entry on preg_* functions
http://www.zend.com/manual/function.preg-replace.php

Overview

This tutorial will show you how to match, replace, and otherwise manipulate strings within a target by using regular expressions. Regular expressions are used for complex string manipulation in PHP and are handy when you need to validate a value, grab information from an outside source, locate information within a page, or replace every occurrence of one string or value with another. They can be a great timesaver, doing the "find and replace" work that you would otherwise have to do by hand. Regular expressions can also take care of much of the routine double-checking, for example making sure that e-mail addresses submitted to a listserv are well formed, or that URLs submitted for links are in the proper format.

Learning Objectives

In this tutorial, you will learn how to use the following PHP functions:

preg_replace()
preg_match()
preg_match_all()

Note that preg_replace() replaces every match by default, so PHP has no separate preg_replace_all() function.
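
As a quick orientation, here is a minimal sketch of how the three functions differ. The sample sentence and variable names are made up for illustration and are not part of the original tutorial.

```php
<?php
// Sample target string; the text is made up for illustration.
$target = 'I own a car, and my car is red.';

// preg_match() stops at the first match and returns 1 (found) or 0 (not found).
$found = preg_match('/car/', $target, $first);

// preg_match_all() collects every match and returns how many it found.
$count = preg_match_all('/car/', $target, $all);

// preg_replace() replaces every match unless a limit is passed.
$replaced = preg_replace('/car/', 'truck', $target);

echo $first[0], "\n";   // car
echo $count, "\n";      // 2
echo $replaced, "\n";   // I own a truck, and my truck is red.
```

In each case the pattern comes first and the target second; the match functions fill the array passed as the third argument.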

Definitions

Regular Expression: A pattern that describes a set of strings, used for complex string matching and manipulation in PHP.

PCRE: Perl Compatible Regular Expressions, which use the pattern syntax popularized by Perl. The alternative regular expression functions in PHP use POSIX extended syntax.

Target: The string that will be searched by the regular expression, for example the contents of a text file or web page.

Object: The entity within the target that will be searched for.

Background Information

When using regular expressions, one thing matters more than all the rest: syntax. The syntax sets the exact definition of what you are searching for; without the proper syntax, a regular expression can match far more than intended, or simply not work at all. A plan of attack and a bit of patience are usually the key to avoiding constant syntax problems. It is usually prudent, especially depending on the restrictions one has in place, to first over-specify the pattern and then simplify down. This scrutiny ensures that the object you are searching for is the only thing grabbed, instead of a large number of unnecessary and unwanted matches.

Prerequisites

As a prerequisite to understanding Perl Compatible Regular Expressions, it is recommended that you:

Review the links outlined above in the Intended Audience section.

Become familiar with, and regularly refer to, the Pattern Syntax section of the Zend manual when constructing future regular expressions:

http://www.zend.com/manual/reference.pcre.pattern.syntax.php

PCRE Syntax

The regular expressions covered in this tutorial are called Perl Compatible Regular Expressions, or PCRE for short. This means that the regular expressions used with these PHP functions closely follow the syntax of Perl regular expressions. This syntax is used with the preg functions in PHP, i.e. preg_replace, preg_match, preg_quote, preg_split, preg_grep, etc. These functions tend to be slightly faster than their POSIX-compatible relatives (eregi, ereg, ereg_replace, etc.), which were deprecated in PHP 5.3 and removed in PHP 7. The syntax can be confusing, but it allows a very specific set of search criteria. The best place to find a large portion of this syntax is the PHP manual:
http://www.zend.com/manual/reference.pcre.pattern.syntax.php

The first type of syntax to cover involves meta-characters. These are characters that stand in for something else and cause the expression to behave in a certain way. They are handy because regular expressions are used to find patterns, rather than fixed text, within a target. For example, if you were searching for an e-mail address in a Web page, you would look for an @ symbol followed, somewhere later, by a period and then one of a predictable set of endings, i.e. com, org, net, etc. A complete list of meta-characters can be found at the aforementioned link; a few of the most important ones are illustrated here.

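/ indicates a delimiter (marks the beginning and end of an expression; pattern modifiers follow the closing delimiter)
^ indicates the start of the target string to match
$ indicates the end of the target string to match
. matches any single character except a newline
* matches the preceding item zero or more times
+ matches the preceding item one or more times
[ ] encloses a character class (a set of characters, any one of which may match)
\ is used as a general escape character
{ } encloses a minimum, maximum value - used to indicate how many times the preceding item may repeat
( ) encloses a subpattern
| separates alternative patterns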
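
A few of these meta-characters can be tried in isolation. The following is only a sketch; the patterns, sample strings and variable names are invented for illustration.

```php
<?php
// ^ and $ anchor the pattern to the start and end of the target.
$exact   = preg_match('/^car$/', 'car');      // 1: the whole target is "car"
$partial = preg_match('/^car$/', 'my car');   // 0: the target does not start with "car"

// { } limits how many times the preceding item may repeat.
$inRange  = preg_match('/^a{2,4}$/', 'aaa');   // 1: three a's is within 2 to 4
$tooMany  = preg_match('/^a{2,4}$/', 'aaaaa'); // 0: five a's is too many

// \ escapes a meta-character so that it is matched literally.
$literalDot = preg_match('/^3\.5$/', '3.5');   // 1: \. matches only a period
$notADot    = preg_match('/^3\.5$/', '3x5');   // 0: "x" is not a period

echo $exact, $partial, $inRange, $tooMany, $literalDot, $notADot, "\n"; // 101010
```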


So, for example:

/car/ (note the beginning and closing / for delimiters, these enclose your expression)

indicates that the regular expression is looking for the letters "car". So, if a sentence looked like such:

I own a car now.

A match would be found, and the match would return "car".
Alternatively, if we had put the word carriage into the sentence instead of car, a match would still be made, but it would only return the word "car" since that is what was requested in the regular expression.

Try it yourself:

<?php
preg_match('/car/', 'I own a car now.', $output);
echo $output[0];
?>

So, for another example:

/car.*/

indicates that the regular expression is looking for the letters "car", followed by any characters that come after them. So, if a sentence looked like this:

I own a carriage now.

A match would be found, and the match would return "carriage now." (including the final period).
Alternatively, if we had used the first sentence, a match would have been made and "car now." would have been returned. This is because .* is greedy: the regular expression requests a match starting with "car" and every character after it on that line.

Try it yourself:

<?php
preg_match('/car.*/', 'I own a carriage now.', $output);
echo $output[0];
?>
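
If only the word itself is wanted, one option is to swap the . for \w, which matches letters, digits and underscores only. \w is not covered elsewhere in this tutorial, so treat the following as a sketch of one possible refinement rather than the only approach.

```php
<?php
$sentence = 'I own a carriage now.';

// .* is greedy: it keeps matching until the end of the line.
preg_match('/car.*/', $sentence, $greedy);

// \w* matches only letters, digits and underscores, so it stops at the space.
preg_match('/car\w*/', $sentence, $wordOnly);

echo $greedy[0], "\n";    // carriage now.
echo $wordOnly[0], "\n";  // carriage
```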

So, one more example:

/ca(r|nyon)/

indicates that the regular expression is looking for either the letters "car" OR the word "canyon". So, if a sentence looked like this:

I own a car now.

A match would be found, and the match would return "car". Alternatively, if the word car had been replaced with canyon, a match would have been made and the word "canyon" would have been returned. This is because the regular expression requested a match starting with "ca" and ending with either "r" or "nyon". Note that there is no | before the closing parenthesis: a trailing | would add an empty alternative, and the pattern would then also match a bare "ca".

Try it yourself:

<?php
preg_match('/ca(r|nyon)/', 'I own a car now.', $output);
echo $output[0];
?>
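
The parentheses do more than group the alternatives: preg_match() also stores whatever the subpattern matched in the next element of the output array. A minimal sketch, with a sample sentence invented for illustration:

```php
<?php
preg_match('/ca(r|nyon)/', 'We hiked the canyon today.', $output);

// $output[0] is the full match; $output[1] is the text matched by the first ( ) group.
echo $output[0], "\n";  // canyon
echo $output[1], "\n";  // nyon
```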

How the Scripts Work

In this tutorial, the following example is given:

E-mail Validation: uses a regular expression to make sure that an e-mail address is in a valid format. This can be used to validate e-mail addresses submitted via a form, or to search a target page for e-mail addresses and display them.

Script Overview

Having read the introduction, you should have an understanding of what regular expressions are. The following example illustrates their versatility. Remember, regular expressions are tools that can be used in a variety of ways, not just those illustrated here. They can make otherwise tedious and lengthy jobs a breeze. At the end of the tutorial, the regular expression is combined with PHP in context to show how it would be used.

E-mail Validation

The regular expression this tutorial covers is also one of the most popular uses of regular expressions. By validating an e-mail address, one can ensure that at least the format is correct (although it cannot confirm that the address actually exists). This can prevent accidental submissions of partial e-mail addresses to your database or form, it can ensure that an e-mail address submitted to a listserv is well formed, or it can be used to search for and replace all the contact e-mail addresses on your website with a different or updated e-mail address. All of these uses can employ a regular expression that makes checking e-mail addresses by hand obsolete.
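
The search and replace uses mentioned above could be sketched as follows. The page text and addresses are hypothetical, and the pattern is an unanchored variant of the validation pattern built up in the next section (no ^ or $, so it can match inside a longer string).

```php
<?php
// Hypothetical page text and addresses, used only for illustration.
$page = 'Contact sales@example.com or support@example.org for help.';

// Same character classes as the validation pattern, but without ^ and $.
$pattern = '/[A-Za-z0-9_\-]+@([A-Za-z0-9_\-]+\.)+[A-Za-z]{2,4}/';

// Find every address in the page; $matches[0] holds the full matches.
$count = preg_match_all($pattern, $page, $matches);
foreach ($matches[0] as $address) {
    echo $address, "\n";
}

// Replace one specific address with an updated one.
// preg_quote() escapes characters such as "." so the old address is matched literally.
$old = '/' . preg_quote('sales@example.com', '/') . '/';
$updated = preg_replace($old, 'info@example.com', $page);
echo $updated, "\n";
```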


Code Flow

Assign the result of the regular expression match to a variable ($okay).

Invoke the preg_match function to check the target against the desired validation pattern. preg_match returns 1 if the pattern matches and 0 if it does not.

<?php
$okay = preg_match('/^[A-Za-z0-9_\-]+[@]([A-Za-z0-9_\-]+[.])+[A-Za-z]{2,4}$/', $emailfield);
?>

Items to match:

Begin with a delimiter /

Then indicate the beginning of the target string with ^

Then, [A-Za-z0-9_\-] matches any one character from A-Z, a-z, 0-9, _ or - .

Then, indicate that this class repeats one or more times with the + symbol.

Then, just add a [@] after the plus to look for the @ symbol in the e-mail address (a dead giveaway for validation scripts).

Now all you need is to repeat your previous criteria for matching (letters between A and Z, the numbers 0 - 9, _ or -), again followed by +, for the text after the @.

Adding [.] after that text and enclosing both in ( ) creates a subpattern that matches one piece of the domain name together with the period that follows it. The + after the closing parenthesis allows one or more such pieces (example. or mail.example.).

Then [A-Za-z] with a minimum/maximum bracket {2,4} tells it to look for a final ending of two to four letters (i.e. com, net, de, au, etc.)

Finally, the $ indicates the end of the target string

And the expression is closed with our ending delimiter /
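
To see the steps above in action, the pattern can be run against a handful of sample addresses. The addresses below are invented for illustration. Note that the pattern, as described, does not allow a period before the @, so an address such as first.last@example.com is reported as invalid.

```php
<?php
$pattern = '/^[A-Za-z0-9_\-]+[@]([A-Za-z0-9_\-]+[.])+[A-Za-z]{2,4}$/';

// Invented sample addresses, used only for illustration.
$samples = array(
    'user@example.com',
    'user@mail.example.de',
    'user@example',
    'user example.com',
    '@example.com',
    'first.last@example.com',
);

$results = array();
foreach ($samples as $address) {
    // preg_match() returns 1 when the whole address fits the pattern.
    $results[$address] = (preg_match($pattern, $address) === 1);
    echo $address, ': ', ($results[$address] ? 'valid' : 'invalid'), "\n";
}
```

Only the first two addresses are reported as valid; the others lack a domain ending, an @ symbol, or a name before the @, or contain a character the pattern does not allow.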

Scripts

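E-mail Validation (email.php)

<?php
// Run the check only after the form has been submitted.
// $_POST is used because register_globals is no longer available in PHP.
if (isset($_POST['submit'])) {
    $emailfield = isset($_POST['emailfield']) ? $_POST['emailfield'] : '';
    $okay = preg_match(
        '/^[A-Za-z0-9_\-]+[@]([A-Za-z0-9_\-]+[.])+[A-Za-z]{2,4}$/',
        $emailfield
    );
    if ($okay) {
        echo "E-mail is validated";
    } else {
        echo "E-mail is incorrect";
    }
} else {
    // Minimal form (assumed markup): posts "emailfield" and "submit" back to this script.
?>

<form method="post" action="email.php">
E-mail address:
<input type="text" name="emailfield">
<input type="submit" name="submit" value="Submit">
</form>

<?php
}
?>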

As you can see, if an e-mail address is provided that fits our specified criteria, it is validated and the script prints "E-mail is validated".

When an invalid e-mail address is entered, one that does not fit our criteria, the script prints "E-mail is incorrect".

Conclusion


That's all there is to it. With this example, you can readily see how useful regular expressions can be. When implemented effectively, they are among the most powerful tools a programmer has. While PHP does offer the POSIX-compatible regular expressions, which generally have slightly easier syntax, the PCRE regular expressions are generally reputed to have widespread acceptance and powerful implementations, and they tend to be faster when used with large files or large strings.
