Regular Expressions Tutorial, the Basics (for BEGINNERS)

This node has been constructed to assist the new programmer in understanding the use of Regular Expressions.

Regular expressions are also called PATTERNS. Patterns are used to locate text strings within text lines. (Text lines are not the only thing you can search, but for this tutorial, we are searching text lines.)

In this node, I will deal specifically with some basic building blocks of Regular Expressions: Pattern Searches, Alternative (or Alternation) Search Patterns, Substitution Operators, Concatenation, and Pattern-Matching Character Classes.

Here is a basic example of searching for a pattern in a sentence:

# Example 1
#
$_ = "I logged in as brusimm and found that I had email.";  # Test string
if (/brus/) {                                               # Pattern: brus
    print "There, our pattern showed up in our string.\n";  # Tell user
} else {                                                    # or
    print "No match in our string for our pattern.\n";      # As you'll see...
}                                                           # end

The variable $_ contained what we were looking for, so the 'match' message printed. If you change (/brus/) to (/crus/) and rerun the script, crus is not there to be found, so the 'no match' message prints instead.

Notice the forward slashes around /brus/. The full form of the pattern match operator is m//, and with the m in front you can choose other delimiters, e.g. m{brus} or m,brus,. When the delimiters are forward slashes, the m is optional, so I’ve started you on the shortcut of typing less while getting more! You will mostly encounter pattern searches written as //, but if you ever see m//, you will know what it is.
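
As a small sketch of that point, the following compares four spellings of the same match. The helper name delimiter_matches is my own invention for illustration and is not part of the original examples.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original node): returns the
# delimiter styles whose pattern matched the given line.
sub delimiter_matches {
    my ($line) = @_;
    my @hits;
    push @hits, '/brus/'  if $line =~ /brus/;     # m omitted, slashes only
    push @hits, 'm/brus/' if $line =~ m/brus/;    # explicit m with slashes
    push @hits, 'm{brus}' if $line =~ m{brus};    # paired braces
    push @hits, 'm,brus,' if $line =~ m,brus,;    # commas
    return @hits;
}

my @hits = delimiter_matches("I logged in as brusimm and found that I had email.");
print scalar(@hits), " of 4 delimiter styles matched\n";
```

All four forms are the same pattern, so they either all match or all fail on a given line.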

In Example 1, our variable $_ is just a single text line. If you were searching a whole file, you would wrap the "if" in a "while" loop that reads one line at a time into $_, like this:

while (<>) {
    if (/brus/) {
        print $_;
    }
}

Another way to search for strings is with an Alternative (or Alternation) search pattern, which is written with the vertical bar (|) and can be read as "or". It is one way to look for multiple terms. If you replace /brus/ in Example 1 with /but/, you get the output corresponding to "No match". If you then replace /but/ with /but|brus/, as in the example below, you are rewarded with the 'success' print statement because 'brus' OR 'but' showed up in the string we searched.

I want to search for one or more terms

# Example 2
#
$_ = "I logged in as brusimm and found that I had email.";  # Test string
if (/but|brus/) {                                           # Patterns: but or brus
    print "There, our pattern showed up in our string.\n";  # Tell user
} else {
    print "No match in our string for our pattern.\n";      # As you'll see...
}                                                           # end

This worked great for me when I had a large list of names, and I went looking for each line where one of several names that I was interested in showed up!
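
Here is a rough sketch of that kind of name search, assuming the lines are already in an array. The sample data and the helper name lines_with_names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps only the lines that mention one of the names.
sub lines_with_names {
    my (@lines) = @_;
    return grep { /brusimm|muba/ } @lines;   # alternation: either name will do
}

# Made-up sample data for illustration.
my @log = (
    "brusimm logged in",
    "someone else logged in",
    "muba posted a node",
);
print "$_\n" for lines_with_names(@log);
```

Only the first and third lines are printed, because the middle one contains neither name.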

Let’s say I consumed way too much coffee & might have mistyped the test string:

# Example 3
#
$_ = "I logged in as bruuuusimm and found that I had email.";  # Our test sentence
if (/brus/) {                                                  # Our pattern of brus
    print "There, brus showed up.\n";                          # if pattern is found, print it!
}                                                              # end

My search pattern would not work. It’s too explicit. BUT if we add an asterisk after the "u", we tell Perl to accept zero or more u’s at that spot, so the number of u’s in the name no longer matters. So rather than
if (/brus/) {
the line could be written as:
if (/bru*s/) {

and then our print line shows up.
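
To see what the asterisk accepts, here is a small sketch that tries the pattern on a few words. The helper name matches_bru_star_s is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the text matches /bru*s/, else 0.
sub matches_bru_star_s {
    my ($text) = @_;
    return $text =~ /bru*s/ ? 1 : 0;
}

# "brs" has zero u's, "bras" has an 'a' where only u's are allowed.
for my $word (qw(brs brus bruuuusimm bras)) {
    printf "%-12s %s\n", $word, matches_bru_star_s($word) ? "match" : "no match";
}
```

Note that "brs" matches too, since zero u's is allowed; only "bras" fails.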

Substitution Operators

Let’s say we know there were many instances in a file, where bruuuusimm occurred and we know we need to fix it. Really, we need to fix it!

The instruction would look something like this example line:

if (s/bru*s/brus/) {

When this runs on a string containing bruuus, the first occurrence is replaced with brus; to replace every occurrence in the string, add the g modifier: s/bru*s/brus/g. Notice the "s" before the FIRST forward slash. It marks the substitution operator, s/PATTERN/REPLACEMENT/, in the same way that m marks a match. Check it out with this script:

# example 4
#
$_ = "I logged into bruuuusimm and saw I had email.\n";  # original error
print $_;                                                # printing proof of error
if (s/bru*s/brus/) {                                     # fixing it
    print $_;                                            # proving we fixed it.
}

In example 4, $_ started out holding the misspelled name, and we printed it to prove that. After running our replacement, we printed it again to see that the replacement actually happened.
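
One detail worth seeing in action: without the g modifier, s/// changes only the first match in a string. The sketch below compares the two forms; the helper names fix_first and fix_all and the sample sentence are my own.

```perl
use strict;
use warnings;

# Hypothetical helpers: each works on a copy and returns the result.
sub fix_first {
    my ($text) = @_;
    $text =~ s/bru*s/brus/;     # no /g: only the first match is replaced
    return $text;
}

sub fix_all {
    my ($text) = @_;
    $text =~ s/bru*s/brus/g;    # /g: every match is replaced
    return $text;
}

my $line = "bruuusimm wrote to bruuuusimm.";
print fix_first($line), "\n";   # second name is still misspelled
print fix_all($line), "\n";     # both names are fixed
```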

Something else we need to think about is unintended matches. Because u* also accepts zero u’s, if we were doing a replacement within a large body of text and the strings brs or abbrs were in it, those occurrences would also be changed (to brus and abbrus), so we need to be aware of that scenario.

MONKS: I'm looking for additional reference material to show how to lock in the pattern I am using here. (Bruce)
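
One possible way to tighten the pattern, offered as a sketch rather than the definitive answer: use + (one or more) instead of * so at least one "u" is required, and put \b (a word boundary) in front so the match must begin at the start of a word. Both are standard Perl regex features; the helper name fix_name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: repairs stretched-out u's only in words that
# start with "bru". Works on a copy and returns the result.
sub fix_name {
    my ($text) = @_;
    $text =~ s/\bbru+s/brus/g;   # \b = word boundary, u+ = one or more u's
    return $text;
}

# Only the first word is changed; the other three are left alone.
for my $word (qw(bruuuusimm brs abbrs abbruuus)) {
    print "$word -> ", fix_name($word), "\n";
}
```

Whether this is strict enough depends on the text being cleaned, so it is worth testing against real data before trusting it.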

If we wanted to match a single character, for example, an 'a', our pattern would be /a/.

# example 5
#
$_ = "I logged into brusimm and saw I had email.\n";
if (/a/) {
 print $_ ;
 }

The line printed. If you were to replace /a/ with /z/, the line would not print because there is no "z" (Corresponding match) in the line.

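A literal character like this matches wherever it appears. The newline caveat applies to the dot (.) wildcard, which matches any single character except a newline (\n), so be aware. (That's for another day)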

Additionally, if we want to combine two string values into a single string, we can do that with the concatenation operator, a period (.), in the code line.

# example 6
#
print "Hello" . ' ' . "world"; #Same as 'Hello world'

Pattern-Matching Character Classes

A pattern-matching character class is written as a pair of open and closed square brackets with a set of characters between them.

The important thing to remember is that only one of these characters needs to be present in the searched string for the pattern to match.

What that means is if you run the following code:

# example 7
#
$_ = "I logged into brusimm and saw I had FIVE email.\n";  
if (/[xyz]/) {
 print $_ ;
 }
#

Nothing will print from the IF query because there are no x’s, y’s or z’s in the sentence. But, if you replace /[xyz]/ with /[abc]/, the IF query prints because at least one of the listed characters is in the sentence.

NOW, let’s be careful here. I input lowercase letters. Had I input /[ABC]/, there would be no output to print, because case matters. Hmm.

So let’s try this: Instead of "abc", let's replace it with a lowercase "f", and run that example. As you see, nothing happens. If you replace the lowercase "f" with an uppercase "F", we should get a printout because in my test sentence, the number FIVE is spelled out, with an UPPERCASE "F".

Now if you were looking for "f" and were not sure of the case, one way to write the search would be /[fF]/. We are now saying: match either the upper or the lower case version of this letter.
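
A short sketch of the case checks on the example 7 sentence follows. The helper name count_f_patterns is hypothetical. The last line uses the /i modifier, a standard Perl way to ignore case that this node does not otherwise cover, included only for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: reports which of the four patterns match the text.
sub count_f_patterns {
    my ($text) = @_;
    my %result = (
        'f'    => ($text =~ /f/    ? 1 : 0),   # lowercase only
        'F'    => ($text =~ /F/    ? 1 : 0),   # uppercase only
        '[fF]' => ($text =~ /[fF]/ ? 1 : 0),   # either case, via a class
        'f/i'  => ($text =~ /f/i   ? 1 : 0),   # either case, via /i
    );
    return %result;
}

my %r = count_f_patterns("I logged into brusimm and saw I had FIVE email.\n");
print "$_: ", ($r{$_} ? "match" : "no match"), "\n" for sort keys %r;
```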

But wait, I do not want to type out the whole alphabet or a whole series of numbers to find something. My time is way too short to do that because I'm working on tutorials! Is there a shorter way, (Unlike this node) to do this?

Yep. Instead of /[abc]/ in example 7, you could put /[a-c]/. (Hmm, in this case that’s not less typing, but hopefully you get the point?)

So let’s look at this example:

# example 8
#
$_ = "I make a black pencil line.\n";
if (/[q-zQ-Z]/) {
 print $_ ;
 }

Here I am looking for both upper and lower case letters from q to z. But the IF statement does not print. Oh yeah, there are no letters like that in the sentence. Let’s replace the "q" with an "a", giving /[a-zQ-Z]/. There, that’s better. Ranges also work for single digits, for example /[0-9]/. I’ll let you try that on your own.
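
If you want something to compare your own attempt against, here is a minimal sketch of a digit range. The helper name has_digit and the second sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the text contains any digit 0-9.
sub has_digit {
    my ($text) = @_;
    return $text =~ /[0-9]/ ? 1 : 0;
}

for my $line ("I make a black pencil line.", "I make 2 black pencil lines.") {
    print has_digit($line) ? "digit found: " : "no digit:    ", $line, "\n";
}
```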

Now say you wanted to find a line containing a character that is NOT in a certain set. Humor me for a second: with the search pattern /[q-z]/ in example 8, you get no printout. BUT, if you modify your search pattern to /[^q-z]/, now it does print! Basically, a caret at the start of the class says: match any one character that IS NOT in this set. How’s them apples?! Be careful, though: /[^0-9]/ matches any line that has at least one non-digit character, so to find sentences with no numbers at all you would negate the whole match instead, as in !/[0-9]/. A class that starts with a caret is called a NEGATED CHARACTER CLASS.
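
A negated class still has to match one character, so /[^0-9]/ succeeds on any line that holds at least one non-digit. The sketch below sets that next to a negated match, which is one way to test that a line has no digits at all. The helper names and sample strings are my own.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 when at least one character is not a digit.
sub has_non_digit {
    my ($text) = @_;
    return $text =~ /[^0-9]/ ? 1 : 0;
}

# Hypothetical helper: 1 when the text contains no digits anywhere.
sub has_no_digits {
    my ($text) = @_;
    return $text !~ /[0-9]/ ? 1 : 0;
}

for my $text ("I have 2 pencils.", "12345", "no numbers here") {
    printf "%-20s non-digit present: %d   digit-free: %d\n",
        $text, has_non_digit($text), has_no_digits($text);
}
```

The first sample shows the difference: it contains a digit, yet /[^0-9]/ still matches because of the letters around it.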

You may also want to check out muba's node on Regexp's Do's & Dont's.

Other Sources for this subject:

pulling vowels from a sentence,
perldoc notes,
Perl.com, More in depth look
Perltut
and, CPAN

That concludes this tutorial. Source of my information is Learning Perl, 3rd & 4th Editions, by Schwartz, Phoenix & Foy.

END PROPOSED TUTORIAL
