...

Using Regular Expressions in PHP

When I first started programming in PHP, I found regular expressions quite hard. They were complicated, looked ugly, were hard to figure out, and there wasn’t much in the way of documentation either.

What are Regular Expressions?

Regular expressions grew out of formal language theory and were popularized by early Unix tools such as ed and grep. They were designed to make it easier to find, replace and work with strings, and they have been in wide use across Unix-based operating systems ever since. They became a core feature of Perl, and were later implemented in PHP.

What can I use them for?

There are a few common uses for regular expressions. Perhaps the most useful is form validation. You could use regular expressions to check that an email address entered into a form uses the correct syntax. We’ll look at that particular use shortly.

You could also use them to carry out complex search and replace operations within a given body of text that would not be possible with PHP’s standard str_replace function. Yes, there are many possible uses for regular expressions!
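As a rough illustration of a replacement that str_replace cannot express, the sketch below reorders the parts of every date in a string. It uses PHP’s preg_replace function; the sample text and the date format are assumptions made up for this example.

```php
<?php
// Hypothetical sample text; the YYYY-MM-DD format is assumed for illustration.
$text = 'Released 2004-03-15, updated 2005-11-02.';

// Capture year, month and day in groups 1 to 3, then write them back
// in day/month/year order. str_replace() cannot do this because the
// text to find is different for every date.
$reformatted = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$3/$2/$1', $text);

echo $reformatted, "\n"; // Released 15/03/2004, updated 02/11/2005.
```

The pattern describes the shape of a date rather than one fixed string, which is what makes a single call handle both dates.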

How do I use them?

We will now take a look at how we might use a regular expression to check the syntax of an email address entered into a form that’s submitted to a PHP script.
There are two types of regular expression functions included in PHP:

the ereg functions, which use the POSIX extended regular expression syntax (deprecated in PHP 5.3 and removed in PHP 7.0)
the preg functions, which use a Perl-compatible regular expression syntax

For this example we will use the eregi function. The eregi function is used to match a string to a particular regular expression. The ‘i’ in the function name means ‘case insensitive’ — you can also use ereg if you want it to be case sensitive.
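Because the ereg family was removed in PHP 7.0, the sketch below shows the same case-sensitive versus case-insensitive distinction with preg_match instead. This is a swapped-in equivalent, not the function used in this article: the pattern is wrapped in / delimiters and a trailing i modifier takes the place of the ‘i’ in eregi. The sample string is made up.

```php
<?php
// Hypothetical subject string, for illustration only.
$subject = 'HELLO world';

// Without the "i" modifier the match is case sensitive (like ereg).
$caseSensitive = preg_match('/^hello/', $subject);

// With the "i" modifier the match ignores case (like eregi).
$caseInsensitive = preg_match('/^hello/i', $subject);

var_dump($caseSensitive);   // int(0)
var_dump($caseInsensitive); // int(1)
```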

As you probably know, email addresses are always in a particular format:

username @ domain . extension

This makes them an ideal candidate to be tested with a regular expression. So let’s take a look at an expression that can be used to check the validity of an email address. We’ll look at each section of the expression individually, and then I’ll include a syntax reference at the end to help make sense of it all. But first, here’s the expression itself:

eregi('^[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z]{2,4}$', $email)

If you’re anything like I was when I first started using regular expressions, this probably looks very confusing! Now we will split it up into sections and make sense of each part individually:

^[a-zA-Z0-9._-]+@

This part of the expression validates the ‘username’ section of the email address. The caret (^) at the beginning of the expression represents the start of the string. If we didn’t include this, then someone could key in anything they wanted before the email address and it would still validate.

The square brackets contain the characters we want to allow in this part of the email address. Here, we are allowing the letters a-z, A-Z, the numbers 0-9, and the symbols underscore (_), period (.), and dash (-). As you’ve probably noticed, we have included letters both in capitals and lower case. This isn’t strictly necessary, as we’re using the eregi (case insensitive) function. But we have included them here for completeness, and to show you how the functions work. The order of the character pairs within the brackets doesn’t matter.

The plus (+) sign after the square brackets indicates ‘one or more of the contents of the previous brackets’. So, in this case, we require one or more of any of the characters in the square brackets to be included in the email address in order for it to validate. At the end is the ‘@’ sign, which means that we require the presence of one @ sign immediately following the username.
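To see the username part on its own, here is a minimal sketch that tries it against a few made-up strings. It uses preg_match with / delimiters rather than eregi, so that it runs on current PHP versions; the variable name and sample addresses are assumptions for illustration.

```php
<?php
// The username part of the expression, wrapped in "/" delimiters.
$usernamePart = '/^[a-zA-Z0-9._-]+@/';

// Hypothetical inputs: a normal address, one with no username,
// and one with extra text in front of the address.
$samples = ['john.doe@example.com', '@example.com', 'spam! john@example.com'];

foreach ($samples as $sample) {
    $result = preg_match($usernamePart, $sample) === 1 ? 'matches' : 'no match';
    printf("%-24s %s\n", $sample, $result);
}
```

Only the first sample matches. The second fails because + requires at least one character before the @ sign, and the third fails because ^ pins the match to the start of the string.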

[a-zA-Z0-9._-]+\.

This part of the expression is very similar to the section we just looked at. It validates the domain name in the email address. As before, we have a series of characters in square brackets that we’ll allow in this part of the address, followed by a plus (+) sign, requiring one or more of those characters.

At the end of this section, there is a backslash followed by a period. This tells the expression that a period is required at this point (i.e. between the domain and extension). The backslash needs a little more explanation. In a regular expression, a period actually means ‘any character’. To make the expression take the period’s literal value rather than treat it as a wildcard, we need to ‘escape’ it by preceding it with a backslash. You may have come across this before if you use databases such as MySQL, as escaping characters is very important there too.
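The difference between an escaped and an unescaped period is easy to check. The sketch below uses two deliberately simplified patterns (assumptions for this example, not the full email expression) and preg_match, so that it runs on current PHP versions.

```php
<?php
// Simplified patterns for illustration only.
$unescaped = '/^[a-z]+.com$/';  // "." matches any single character
$escaped   = '/^[a-z]+\.com$/'; // "\." matches only a literal period

// Both patterns accept a real period...
var_dump(preg_match($unescaped, 'example.com')); // int(1)
var_dump(preg_match($escaped, 'example.com'));   // int(1)

// ...but only the unescaped one accepts an arbitrary character in its place.
var_dump(preg_match($unescaped, 'exampleXcom')); // int(1)
var_dump(preg_match($escaped, 'exampleXcom'));   // int(0)
```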

[a-zA-Z]{2,4}$

This is the final part of the expression. At the beginning is another set of characters enclosed in square brackets. This time, I have simply allowed the letters a-z and A-Z, because numbers and other characters are not valid in domain extensions.

Instead of the + sign we used before, here we have ‘{2,4}’ immediately following the square brackets. This means that we require between 2 and 4 of the characters from the square brackets to be included in the email address. So com, net, org, uk, au, etc. are all valid, but anything longer will not be accepted. Bear in mind that this also rejects longer extensions that do exist, such as museum.

The $ sign at the end of the expression signifies the end of the string. If we didn’t include this, then a user could type anything after the end of the email address and it would still validate.
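As a quick check of the {2,4} quantifier and the $ anchor, the sketch below tests only the extension part against some made-up domain strings. It uses preg_match so that it runs on current PHP versions, and the leading \. is included so the pattern has something to attach to; both choices are assumptions for this example.

```php
<?php
// Extension part only: a literal period, then 2 to 4 letters, then end of string.
$extension = '/\.[a-zA-Z]{2,4}$/';

// Hypothetical inputs chosen to hit each rule.
$samples = [
    'example.com',       // 3 letters: accepted
    'example.uk',        // 2 letters: accepted
    'example.c',         // 1 letter: too short
    'example.museum',    // 6 letters: too long
    'example.com extra', // trailing text: blocked by $
];

foreach ($samples as $sample) {
    $result = preg_match($extension, $sample) === 1 ? 'accepted' : 'rejected';
    printf("%-20s %s\n", $sample, $result);
}
```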

We could use this to check the email address that has been submitted by a form. This is a basic example, but gives you an idea of how regular expressions can be used.

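// Note: eregi() was removed in PHP 7.0, so this runs on older versions only.
if ($_REQUEST['action'] == 'validate') {
    // The backslash escapes the period so that it matches a literal dot.
    if (eregi('^[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.([a-zA-Z]{2,4})$', $_REQUEST['email'])) {
        echo 'Valid';
    } else {
        echo 'Invalid';
    }
}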
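On PHP 7.0 and later, where eregi is no longer available, the same check can be sketched with preg_match. This is a swapped-in equivalent rather than the code this article is built around. The helper name is_valid_email_syntax and the $request array standing in for the submitted form data are hypothetical.

```php
<?php
// Hypothetical helper: applies the article's pattern using preg_match().
// Both letter cases are already listed in the brackets, so no "i"
// modifier is needed here.
function is_valid_email_syntax(string $email): bool
{
    $pattern = '/^[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z]{2,4}$/';
    return preg_match($pattern, $email) === 1;
}

// Stand-in for the submitted form data (assumed field names and values).
$request = ['action' => 'validate', 'email' => 'user@example.com'];

if (($request['action'] ?? '') === 'validate') {
    echo is_valid_email_syntax($request['email'] ?? '') ? 'Valid' : 'Invalid';
    echo "\n";
}
```

Comparing the result with 1 matters because preg_match returns 0 for no match and false on error, and both should count as invalid here.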

Syntax Reference

This is a quick reference to some of the basic syntax. We have already seen much of it earlier on, but there are a few new things here that you may find useful.

^        start of string
$        end of string
[a-z]    any lower-case letter from a to z
[A-Z]    any upper-case letter from A to Z
[0-9]    any digit from 0 to 9
[^0-9]   any character that is not a digit from 0 to 9
?        zero or one of the preceding character, bracket set or group
*        zero or more of the preceding character, bracket set or group
+        one or more of the preceding character, bracket set or group
{2}      exactly 2 of the preceding character, bracket set or group
{2,}     2 or more of the preceding character, bracket set or group
{2,4}    between 2 and 4 of the preceding character, bracket set or group
.        any single character (use \. for a literal period)
(a|b)    a OR b
\s       a whitespace character (preg functions; with the ereg functions use [[:space:]])
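A few of the entries above can be tried out with the short sketch below. It uses preg_match so that it runs on current PHP versions, and the patterns and test strings are made up for illustration.

```php
<?php
// Each row: pattern, subject, and whether a match is expected.
$examples = [
    ['/^colou?r$/',   'color',  true],  // ? allows zero "u"
    ['/^colou?r$/',   'colour', true],  // ? allows one "u"
    ['/^ab*c$/',      'ac',     true],  // * allows zero "b"
    ['/^ab+c$/',      'ac',     false], // + needs at least one "b"
    ['/^[0-9]{2,}$/', '7',      false], // {2,} needs two or more digits
    ['/^[^0-9]+$/',   'abc1',   false], // [^0-9] excludes digits
    ['/^(cat|dog)$/', 'dog',    true],  // (a|b) alternation
    ['/^a\sb$/',      'a b',    true],  // \s matches the space
];

foreach ($examples as [$pattern, $subject, $expected]) {
    $matched = preg_match($pattern, $subject) === 1;
    printf("%-16s %-8s %s\n", $pattern, $subject, $matched ? 'match' : 'no match');
}
```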