Perl Regular Expressions - Do's and Don'ts

Introduction

Many of the more experienced users may read this article and think, "look how cute, he's stating the obvious."
That is fine. This article is written with regular expression newbies in mind. I often see stupid mistakes, and every time someone has to ask, "why do you use /g when you do not want to search globally?" Or, "why are you matching case-insensitively (/i) on an [A-Za-z] character class?"

I want to collect as many mistakes like this as possible and put them in this article, so that we have something to refer to when another stupid regex mistake is made.
Note: this is neither a regex tutorial nor a regex howto, and I will not explain how to do The Obvious. If you have any questions regarding regexen after reading this article, RTFM, read the books, read the tutorials, use the Super Search, use Google.

Also know that this article is not here to tell you what you should and shouldn't do. It represents my view on things - whether or not that view is based on the views of others. Where possible, I will also explain why I hold a particular view and give examples, counterexamples and benchmarks.

One more thing before I start. If you ever find this article anywhere other than perlmonks.(org|net|com), most of the links will not work (unless the copier has done his job right), because Perl Regular Expressions - Do's and Don'ts was written especially with Perl Monks in mind.

Jargon

Before I finally start off, let's set some terminology.
For example, it might be useful to know that "regular expression(s)", "regexp(s)", "regex(es/en)", and "RE('s)" all refer to the same thing: regular expression(s).

Seven Rules of Thumb

These are the most important do's and don'ts to keep in mind when regexing.

1. Do use strict; and do use warnings;. Do use diagnostics; when you don't understand error messages. Do use these pragmas just because it is a good habit.: I think you should know why.
2. Do use the -T switch (taint mode; for a simple overview, check the Perl Security (perlsec) manpage) in a responsible manner when input from external sources may be unsafe.: When a script runs in taint mode, values from external sources (user input, environment variables (and therefore CGI GET or POST values), input from files, etc.) are considered 'tainted'. This means you can't do certain things with them, like eval. Also, things like unlink, chmod and the like are prohibited on tainted data. This is very useful, because possibly unsafe data is much less likely to harm the system. Whenever you assign a tainted value to another variable, it is still considered tainted: not the variable, but the value is tainted. There are several ways to untaint data, which I am not about to mention here. You should check the above-mentioned Perl Security (perlsec) manpage.
The easiest way to make a perl CGI script run in taint mode, is by making this the first line:
#!/usr/bin/perl -T
The path to perl on your system may vary, but you get the idea. Optionally, you can also add the -w switch to enable warnings.
Note: some people say data from external sources is always unsafe. Personally, I don't agree, but it's worth mentioning.
3. Don't trust users, or the programs under their control. Some are ignorant, some are malicious. It's also possible people suffer from "Fat fingeritis" (© schodckwm) and accidentally enter something evil :).: Some users don't know how a system is meant to be used and use it wrong. They might walk through security holes in your script that you don't know about. So, make your scripts foolproof. Others might test the limits of your script and damage your system, whether as collateral damage or on purpose. Keep them away.
It is a wise idea to check whether users entered a correct value. For example, when you've written a simple menu where the user should choose a number from 1 to 5, inclusive, you should check that they really entered a valid number (this one is pretty obvious). Another example: say you have a website. Users are allowed to choose a nickname with a minimum of 3 and a maximum of 16 characters. Of course you could easily set a maximum length attribute on the input field. But a more experienced user is able to save the HTML file to disk and change the value of that attribute, thereby allowing a much longer nickname. Don't trust users.
4. Do know what you want to achieve. Do know how to achieve that. That is, do understand regular expressions.: If you want to paint something black, you will want to buy black paint. If you want to match a string for only lower case letters, you will not want to use the /i modifier.
5. Don't use regexes for formats without a definite syntax, like human language.: Regexes are good for pattern matching (and substitution), not for language analysis.
Note that regexes can be used to find, for example #include statements in a C++ file, but then realise regexes are not the right tool to parse a C++ file. Consider the use of Parse::RecDescent instead.
6. Do comment your code: When you re-read it, you will have forgotten what your code should do. Others won't know anyway. Also, see point four in the next chapter.
7. Do use CGI; (have a look at CPAN) when writing CGI scripts. That is, don't reinvent the wheel. That is, do use modules.: The CGI module offers you many functions to handle CGI data. The most important, in my view, is the param() function, which allows you to get HTTP parameter values, like filled-in form fields. You could of course write functions like this yourself, but most probably your function won't be as good as the one in the CGI module.
This is the case with many modules. You can do it, but it will cost you too much time to do it as well as the module's author. And besides, why would you want to reinvent the wheel?
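
The two checks described under rule 3 (a menu choice from 1 to 5, and a nickname of 3 to 16 characters) could look roughly like the sketch below. The helper names are made up for illustration, and the set of characters allowed in a nickname (\w) is an assumption; the text only fixes the length.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only a single digit from 1 to 5.
# \A and \z anchor the whole string, so "2\n" or "12" are rejected.
sub valid_menu_choice {
    my ($input) = @_;
    return (defined $input && $input =~ /\A[1-5]\z/) ? 1 : 0;
}

# Hypothetical helper: accept nicknames of 3 to 16 word characters.
# The character set is an assumption; only the length comes from the text.
sub valid_nickname {
    my ($input) = @_;
    return (defined $input && $input =~ /\A\w{3,16}\z/) ? 1 : 0;
}

for my $choice ('3', '6', 'two') {
    printf "menu choice %-5s => %s\n", $choice,
        valid_menu_choice($choice) ? 'ok' : 'rejected';
}
for my $nick ('ab', 'MUBA', 'x' x 17) {
    printf "nickname %-18s => %s\n", $nick,
        valid_nickname($nick) ? 'ok' : 'rejected';
}
```

The check runs on the server side, so it still holds when someone edits the HTML form.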

As you will see, you won't need the rest of this article if you apply these seven rules correctly.

What does this mean?

A general list with some more rules to keep in mind. These rules actually ensue from the Seven Rules of Thumb.

1. Don't use the /g (global matching) modifier when you don't want to search globally (that is, through the entire string).

Global matching searches the entire string, whereas non-global matching stops as soon as it finds what it's looking for (it will only reach the end of the string when it doesn't find the search pattern). Besides the fact that it is faster not to use /g when you don't need it, you wouldn't use a hammer to drive a screw into something, would you? Of course it is possible, but it is a bit much.

To prove my point, here's a benchmark. I'll run two regexes 1,000,000 times: one of them without global search, the other one with global search.

#!/usr/local/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);


cmpthese(1_000_000, {
    "Non-global" => sub { "Perl is cool" =~ m/Perl/ },
    "Global" => sub { "Perl is cool" =~ m/Perl/g }
    }
);

This is the result:

                Rate     Global Non-global
Global     1351351/s         --        -9%
Non-global 1492537/s        10%         --

As you can see, the non-global search runs about 1,492,537 times per second, whereas the global search runs only 1,351,351 times per second.
You might think, "so what? Both of them are that fast, you won't even notice the difference." True, but note that this is only a very simple regular expression and that the test script is very small. As you might find out one day, your regexes will grow more and more complicated, as will your scripts. And why would you want to give your CPU a hard time?
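
Speed aside, /g also changes what a match returns. The sketch below (the sample string and variable names are chosen only for illustration) contrasts a plain match, a /g match in list context, and a /g match in scalar context, where perl remembers the position between matches.

```perl
use strict;
use warnings;

my $text = "Perl is cool, and Perl is fun";

# Without /g: stop at the first match; in list context return its captures.
my ($first) = $text =~ m/(Perl)/;

# With /g in list context: keep going and return every match.
my @all = $text =~ m/Perl/g;

# With /g in scalar context the match position is remembered between calls,
# which is only useful when you intend to walk through the string.
my @offsets;
while ($text =~ m/Perl/g) {
    push @offsets, pos($text) - length('Perl');   # start offset of this match
}

print "first match:       $first\n";
print "number of matches: ", scalar(@all), "\n";
print "match offsets:     @offsets\n";
```

If all you need is "does it match?", the first form says so most clearly.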

2. Don't use the /i (case insensitive matching) modifier when case does matter. Don't use the /i modifier with regexes like m/[A-Za-z]/i.

For example, when there is a difference between "perl" (the executable) and "Perl" (the language), and you want to search a large file for all references to the language ("Perl"), don't check with m/Perl/i. You will find references to the executable as well, and you won't be able to tell them apart.
Do use the /i modifier when you don't know what you can expect. For example, don't expect all users to know the difference between "perl", "Perl", and "PERL" (which is wrong but many users seem not to know that). In that case, case insensitive checking is a good idea.
The regex m/[A-Za-z]/ already matches upper- and lowercase letters. Using a /i modifier here will only make your meaning unclear. "I am called MUBA, but you may also call me MUBA." Or: "I am called both uppercase and lowercase A-Z, but you may also call me both uppercase and lowercase A-Z."
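
A small sketch of the three situations above; the sample lines are invented for illustration.

```perl
use strict;
use warnings;

my @lines = (
    'Perl is a language.',
    'Run perl -v to see the version.',
    'PERL is not an acronym.',
);

# Case matters: only the language name, spelled "Perl".
my @language = grep { /Perl/ } @lines;

# Case does not matter: any spelling, because the input cannot be predicted.
my @any = grep { /perl/i } @lines;

# [A-Za-z] already covers both cases, so /i changes nothing here.
my $without_i = 'MUBA' =~ /\A[A-Za-z]+\z/  ? 1 : 0;
my $with_i    = 'MUBA' =~ /\A[A-Za-z]+\z/i ? 1 : 0;

print scalar(@language), " line(s) spell it 'Perl'\n";
print scalar(@any),      " line(s) mention it in any spelling\n";
print "[A-Za-z] with and without /i agree: ",
    ($without_i == $with_i ? 'yes' : 'no'), "\n";
```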

3. Do know that different people may write the same thing in different manners.

For example (and I will just ignore the last Don't from the next chapter): a zipcode like "1234 AB" may be written like "1234AB" or "1234 ab" or "1234ab" by others.
Don't throw an error message when "1234 AB" is written like "1234ab". Do just match case-insensitively (unless "1234 AB" actually is a different zipcode than "1234 ab") and do match the space zero or more times (\s*).
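
A minimal sketch of such a tolerant match, assuming the "four digits, two letters" format from the example. The helper name is hypothetical, and returning one canonical spelling is a design choice of this sketch, not something the text prescribes.

```perl
use strict;
use warnings;

# Hypothetical helper: accept "1234 AB" in any of its common spellings and
# return it in one canonical form, or undef when the input does not fit.
sub normalise_zipcode {
    my ($input) = @_;
    return unless defined $input;
    if ($input =~ /\A\s*([0-9]{4})\s*([A-Za-z]{2})\s*\z/) {
        return "$1 " . uc($2);   # digits, one space, upper-case letters
    }
    return;
}

for my $zip ('1234 AB', '1234AB', '1234 ab', '1234ab', '12345', 'AB 1234') {
    my $clean = normalise_zipcode($zip);
    printf "%-8s => %s\n", $zip, defined $clean ? $clean : 'rejected';
}
```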

4. Don't make regexes too complex. Although regexes are powerful, they might be easier to read and maintain when you split your single regex into multiple regexes. Or do use the /x modifier and give your regex a nice layout. Do use separate (but equally aligned) lines for search patterns and replacement expressions when using s/// and tr/// or y///.

s/
    (                           # Match and backreference to either
        (?:
            Perl|               # one or
            perl|               # another way
            PERL|               # to spell
            [Pp][Ee][Rr][Ll]    # perl
        )|                      # or
        (?:
            Java|               # one or
            java|               # another way
            JAVA|               # to spell
            [Jj][Aa][Vv][Aa]    # java
        )
    )
/                               # and replace it with
    lc($1) eq "perl" ?
        "$1 looks like a nice language to me"   # this when perl is found
    : lc($1) eq "java" ?
        "I don't know $1 very well"             # or this when java is found
    :   "I do not wish to consider $1"          # or this when something else is found.
/ex;

(Yes, I know it is redundant to first check all common possibilities (Perl, perl, PERL) and then check all possibilities, but I had to make it complex, right? Besides, I am not perfect either. And note: this regexp really sucks. Read on to see demerphq's comment on it.)
This way, you can easily see what you are looking for, how things are related to each other and how replacement is done.
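
For comparison, here is one possible simplification of the same substitution. It is only a sketch under its own assumptions (it is not demerphq's suggestion): the /i modifier replaces the spelled-out alternatives, and a lookup table with a made-up name replaces the chained conditional.

```perl
use strict;
use warnings;

# Hypothetical lookup table: lower-cased language name => sentence template.
my %opinion = (
    perl => '%s looks like a nice language to me',
    java => "I don't know %s very well",
);

# Hypothetical helper: replace the first language name found in the text.
sub comment_on_language {
    my ($text) = @_;
    $text =~ s{
        ( perl | java )                     # either name, in any capitalisation
    }{
        sprintf($opinion{ lc $1 }, $1)      # pick the sentence for that name
    }xei;
    return $text;
}

print comment_on_language('PeRl'), "\n";
print comment_on_language('JAVA'), "\n";
```

Because the pattern can only capture one of the two names, the "something else" branch is not needed.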

5. Do know what your regex really means. Do know about ^, $, variable interpolation, the matching rules as described by the Camel Book, modifiers, and the meaning of \n (newline) in combination with ., ^ and $.

In other words, RTFM. Get a driving license before driving a car. Know about politics before standing for president. Know where you want to get before entering an airplane. Check if you are entering the right airplane before you enter it.
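
As a quick illustration of how \n interacts with ., ^ and $, the sketch below runs a few matches on two invented strings. The variable names are only there to label each case.

```perl
use strict;
use warnings;

my $line  = "foo\n";   # typical input that has not been chomped
my $multi = "a\nb";    # two lines in one string

# $ also matches just before a newline at the very end of the string ...
my $dollar_lenient = $line =~ /^foo$/ ? 1 : 0;
# ... while \z only matches at the real end of the string.
my $z_strict       = $line =~ /\Afoo\z/ ? 1 : 0;

# Without /m, ^ and $ anchor to the string; with /m, to each line.
my $caret_plain = $multi =~ /^b$/  ? 1 : 0;
my $caret_m     = $multi =~ /^b$/m ? 1 : 0;

# Without /s, . does not match a newline; with /s it does.
my $dot_plain = $multi =~ /a.b/  ? 1 : 0;
my $dot_s     = $multi =~ /a.b/s ? 1 : 0;

print "foo\\n matches /^foo\$/:    $dollar_lenient\n";
print "foo\\n matches /\\Afoo\\z/:  $z_strict\n";
print "a\\nb  matches /^b\$/:      $caret_plain\n";
print "a\\nb  matches /^b\$/m:     $caret_m\n";
print "a\\nb  matches /a.b/:      $dot_plain\n";
print "a\\nb  matches /a.b/s:     $dot_s\n";
```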

6. Do know about precedence.

Do know the difference between m/^a|b$/ and m/^(?:a|b)$/ (or m/^(a|b)$/ for short).
Do know the difference between m/(ab)/ and m/(?:ab)/. m/^a|b$/ will match either an "a" at the beginning of the string, or a "b" at the end, and fail if both are absent. This is right if this is what you want. But say you want to match a string that consists of nothing but a single "a" or a single "b", so that the same letter is both the first and the last character: then you should use m/^(?:a|b)$/. The ?: makes sure no backreference (\1, \2, ..., $1, $2, ...) is made.
If, however, you want a string to begin with an "a" or a "b", followed by some (0..n) letters and end with the character it started with, use: m/^(a|b)[A-Za-z]*\1$/.
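
The three patterns above can be compared side by side. The sketch wraps each one in a small helper (the helper names and test words are invented) so that the difference shows up on the same inputs.

```perl
use strict;
use warnings;

# "a" at the start OR "b" at the end: the alternation splits the whole pattern.
sub loose   { return $_[0] =~ m/^a|b$/ ? 1 : 0 }

# Exactly "a" or exactly "b": the group limits what the | applies to.
sub strict_ab { return $_[0] =~ m/^(?:a|b)$/ ? 1 : 0 }

# Starts with "a" or "b", then letters, then the same character again.
sub bookend { return $_[0] =~ m/^(a|b)[A-Za-z]*\1$/ ? 1 : 0 }

for my $word (qw(a b abc cab xyz alpha bob banana)) {
    printf "%-7s loose=%d strict=%d bookend=%d\n",
        $word, loose($word), strict_ab($word), bookend($word);
}
```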

7. Do know what data to expect.

Besides Timtoady (TIMTOWTDI; for the uninitiated: There Is More Than One Way To Do It, which actually is the slogan of Perl), 'regexpect' is another nice keyword to remember: regular expression - expectation. Always keep in mind what data you expect while creating or modifying a regex.

8. Don't use modifiers for tr/// or y/// which are only useful for m// and/or s///. Also don't use modifiers for m// which are only useful for s///.

In other words, do know about modifiers. Do understand them. Do know how to use them. Do know the difference between tr///, y///, m// and s///. In other words, Do RTFM.
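
To make the difference concrete: tr/// (and its synonym y///) translates characters and has its own small set of modifiers, while s/// substitutes patterns. The sketch below uses an invented sample string; the /r modifier of newer perls is left out on purpose.

```perl
use strict;
use warnings;

my $text = 'Hello,   World';

# tr/// works on characters, not patterns; its classic modifiers are c, d and s.
(my $squeezed = $text) =~ tr/ //s;          # s: squeeze runs of spaces into one
(my $letters  = $text) =~ tr/A-Za-z//cd;    # c and d: delete all non-letters
my $vowels = ($text =~ tr/aeiouAEIOU//);    # no modifier: just count characters

# s/// works on patterns; modifiers like i, g, x and e belong here.
(my $replaced = $text) =~ s/world/Perl/i;

print "squeezed: $squeezed\n";
print "letters:  $letters\n";
print "vowels:   $vowels\n";
print "replaced: $replaced\n";
```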

9. Don't untaint tainted values with a statement like ($untainted) = $tainted =~ m/(.*)/;.

I see little reason in doing so. There might be a situation, though, that this is ok, for example if you're absolutely sure the data comes from a trusted source.
But in most cases, it's like pumping air into your bicycle tire, then riding straight through shattered glass. It would be a better idea to avoid the glass and ride around it; it would be better to be more careful. Well, the same goes for taint mode. First, you take the effort to be wise and use taint mode. Then, you just let insecurity slip into your program (unless you trust the source).
It's better to untaint only data you've checked. For example, when you expect a two character string starting with a vowel, make sure that is exactly what you capture and untaint: m/^([aeiou].)$/i. Now, malicious data is less able to slip in.
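
A minimal sketch of that idea, using the two character example from above. The helper name is hypothetical; the sketch also uses \A and \z instead of ^ and $ so that a trailing newline is not accepted, which is a stricter choice than the pattern in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: return the validated value (which perl treats as
# untainted when running under -T), or undef when the input is not a
# two character string starting with a vowel.
sub untaint_code {
    my ($input) = @_;
    return unless defined $input;
    # Only the captured part of a successful match is considered clean,
    # so the pattern must describe exactly what we are willing to accept.
    return $input =~ /\A([aeiou].)\z/i ? $1 : undef;
}

for my $candidate ('Ab', 'e9', 'xb', 'abc') {
    my $clean = untaint_code($candidate);
    printf "%-4s => %s\n", $candidate, defined $clean ? "accepted as $clean" : 'rejected';
}
```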

Validating addresses

1. Don't use regexes to validate e-mail addresses.: You can do so, only if you know the current specification very well and then only if you update your regex whenever the specification changes. Otherwise, users will try to enter their e-mail address (which is perfectly valid according to the current specification but not according to your regex) and will receive an error message.
There are regexes out there that do a fine job validating an e-mail address. They take quite some lines of code and even then they are not able to be entirely correct.
2. Don't use regexes to validate HTTP, FTP or other type of internet addresses.: You can do so, only if you know the current specification very well and then only if you update your regex whenever the specification changes. Otherwise, users will try to enter a home page address (which is perfectly valid according to the current specification but not according to your regex) and will receive an error message.
3. Don't use regexes to validate zipcodes when you accept international users.: Zip code syntax may vary in different countries.

Conclusion: because regexes are so powerful, people are tempted to use them to validate all types of addresses. Don't. You won't know the current specification of the particular address.
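
To see how quickly a home-made pattern goes wrong, the sketch below tries a deliberately naive e-mail regex (invented for this illustration, not taken from any module) on a few addresses that are syntactically valid. The addresses use reserved example domains.

```perl
use strict;
use warnings;

# A deliberately naive pattern of the kind people tend to write themselves.
my $naive = qr/\A\w+\@\w+\.\w+\z/;

# All of these are syntactically valid addresses.
my @valid = (
    'user@example.com',
    'first.last@example.com',
    'user+tag@mail.example.com',
    "o'brien\@example.org",
);

my @rejected = grep { $_ !~ $naive } @valid;

print "rejected although valid:\n";
print "  $_\n" for @rejected;
```

Every rejected address here is a user who gets an error message for typing their real address.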

Language checking

1. Don't use regexes to validate HTML, XHTML, XML and the like.: Others have already proven regexes are not up to the job.
Use one of the many HTML modules out there.
2. Don't use regexes to extract information from these types of files.: There are nice modules available that can do it for you, even better than you. Use them.
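
A short sketch of why this goes wrong: the pattern below (invented for this illustration) tries to pull the href out of a link and works on the simplest markup, but fails on two equally legal ways of writing the same link.

```perl
use strict;
use warnings;

# A naive attempt to extract the href attribute of an <a> tag.
my $naive = qr/<a\s[^>]*href="([^"]*)"/;

# Hypothetical helper: return the captured href, or undef when nothing matched.
sub href_of {
    my ($html) = @_;
    return $html =~ $naive ? $1 : undef;
}

my %sample = (
    simple        => '<a href="/home">Home</a>',
    gt_in_attr    => '<a title="1 > 0" href="/home">Home</a>',
    single_quotes => "<a href='/home'>Home</a>",
);

for my $name (sort keys %sample) {
    my $href = href_of($sample{$name});
    printf "%-14s => %s\n", $name, defined $href ? $href : 'not found';
}
```

A real HTML parser handles all three forms, which is the point of using a module.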

Credits

I think a word of thanks is in order here. For the warmly welcomed help they gave me and the time they spent reading and commenting on this article, I would like to thank the people who gave me useful advice and critical comments. Inspiration came from: users who made stupid unnecessary mistakes (yes, I am one of them).
I received comments from Gumpu, Dietz, sporty, BrowserUk and bart.
After that, I also got useful replies from nobull, demerphq and schodckwm.

"2b"||!"2b";$$_="the question"
