PerlMonks  

Re: Simplifying regexes

by AppleFritter (Vicar)
on Oct 26, 2015 at 16:46 UTC


in reply to Simplifying regexes

As other monks have pointed out, Perl's regular expressions are indeed anything but regular in the formal sense. Formal regular expressions use only the following operations:

  • Concatenation: if a and b are regular expressions, then so is ab.
  • Alternation: if a and b are regular expressions, then so is (a|b).
  • The Kleene star: if a is a regular expression, then so is a*.
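As a rough illustration of the three operations above, here is a small Perl sketch. The pattern and the helper name in_language are made up for this example; the anchors \A and \z and the non-capturing group (?: ) are Perl conveniences rather than formal operations.

```perl
use strict;
use warnings;

# A pattern built only from the three formal operations:
# concatenation (ab), alternation (ab|c) and the Kleene star (...)*.
# \A and \z force a whole-string match; (?: ) is plain grouping.
my $formal = qr/\A(?:ab|c)*\z/;

# Hypothetical helper: returns 1 if $str belongs to the language, else 0.
sub in_language {
    my ($str) = @_;
    return $str =~ $formal ? 1 : 0;
}

for my $str ('', 'ab', 'cabc', 'ba') {
    printf "%-6s %s\n", "'$str'", in_language($str) ? 'accepted' : 'rejected';
}
```

The first three strings are accepted (zero or more repetitions of "ab" or "c"), while 'ba' is rejected.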

If you want to wrap your head around what these regular expressions really mean, though, it's best to start with formal languages, specifically regular languages. Regular expressions describe these languages in a very natural and obvious manner.

That said, while fascinating, none of this will help a lot with simplifying Perl regular expressions in a Perl program. For that, you'll need a good intuition of how regular expressions work in Perl, and for that you'll simply have to use them, time and again.

If you're not getting along with Mastering Regular Expressions, BTW, the chapter on regular expressions in Programming Perl is eminently readable and accessible.

Replies are listed 'Best First'.
Re^2: Simplifying regexes
by ExReg (Priest) on Oct 26, 2015 at 17:23 UTC

    Reading the replies above, I get the impression that looking at the theoretical Regular Expression material, while probably good to learn, is limited in its applicability to Perl regular expressions. It would appear that the path to enlightenment is through practice, and not pedagoguery.

      "Reading the replies above, I get the impression that looking at the theoretical Regular Expression material, while probably good to learn, is limited in its applicability to Perl regular expressions."
      That's pretty much what I wanted to say. Yes, it is probably interesting to look at theoretical regular expressions, but that will have little value for your practical problem, because our Perl "regexes" are anything but regular.
      "It would appear that the path to enlightenment is through practice, and not pedagoguery."

      This has been my experience. "Regular expressions," as they have evolved (into ir-regular expressions, as others have noted), are the most counterintuitive things I have encountered in programming.

      Here's my favorite example of this conceptual orneriness. What will the regex  /(b*)/ capture when run against the string  'aaaaabbbb' and where (i.e., at what character position offset)? Knowing as we do that  * matches "as much as possible" of something, I'd almost be willing to bet money that even the most regex-savvy will experience at least a minor knee-jerk twitch in the general direction of "it matches  'bbbb' at offset 5." But surprise, surprise:

      c:\@Work\Perl\monks>perl -wMstrict -le "my $s = 'aaaaabbbb'; print qq{matches '$1' at offset $-[1]} if $s =~ /(b*)/;"
      matches '' at offset 0
      It matches our old friend the empty string at a location as far away from any  'b' as it could possibly get.

      The answer to the puzzle is that "as much as possible" cheats: it leaves out the "leftmost" half, the equally important part of the "leftmost longest" incantation that should properly be used* to describe all regex matching. The engine tries the leftmost starting position first, and at offset 0 zero 'b's are enough to satisfy  b*, so it stops there. Bottom line: No amount of theoretic or pedagogic vaccination can build up your natural antibody resistance to this sort of thought-bug better than daily exposure to a wide variety of regex challenges. Good luck in your experimentation, and may you make many mistakes, for I know no better way to learn this stuff.

      * Old Whateley and his son Wilbur knew the dangers of incomplete and possibly maliciously altered incantations.
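To see the "leftmost" rule at work, a small sketch along the lines of the one-liner above may help. The helper first_match is a name invented for this example; it reports what the first capture group matched and where.

```perl
use strict;
use warnings;

# Hypothetical helper: what did capture group 1 match, and at what offset?
sub first_match {
    my ($str, $re) = @_;
    return $str =~ $re ? "'$1' at offset $-[1]" : 'no match';
}

my $s = 'aaaaabbbb';

print first_match($s, qr/(b*)/),   "\n";  # zero b's satisfy b* right at offset 0
print first_match($s, qr/(b+)/),   "\n";  # b+ needs at least one b, so it must move on
print first_match($s, qr/a*(b*)/), "\n";  # consume the a's first, then let b* be greedy
```

Only the first pattern captures the empty string at offset 0; the other two capture 'bbbb' at offset 5, because something in the pattern forces the match to reach the b's.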


      Give a man a fish:  <%-{-{-{-<

Re^2: Simplifying regexes
by sundialsvc4 (Abbot) on Oct 26, 2015 at 17:27 UTC

    Parse::RecDescent does take some time to get to know, and I wish that the tutorials were better.   But the key concept is that it is very useful when you have a complex, naturally structured input: one in which it is easy to describe the pieces of the greater whole in the form of regexes, but for which you lack a tool that will “piece them all together.”   A parser is such a tool.

    For example, consider the task of trying to build a single regex that will validate an arithmetic expression such as 1 * 2 + ( 3 * 4 ).   If you attempted to do this in one regex (and I have seen it done, e.g. as in RFC: Perl regex to validate arithmetic expressions, written four years ago), you quickly run into the problem that an arithmetic expression has a semantic structure.   It is not simply a stream of characters.   (For instance, 1 ) 2 * + 3 4 ( * is not a valid expression, even though it consists of the same nine so-called “tokens.”)

    A parser-driven approach would decompose the problem into two or more stages.   Regexes would be used to describe the individual tokens that make up the expression.   (There are nine tokens in this example.)   Then, the grammar would define how the tokens may legitimately appear together in a “valid” sequence.
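The two stages can be sketched by hand in plain Perl. This is not Parse::RecDescent and not the code from the RFC node mentioned above, just a minimal hand-written recursive-descent validator; the grammar and every sub name here are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Stage 1: a regex describes the individual tokens (integers, operators, parens).
# Returns an array ref of tokens, or undef if something is not a token.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while ($text =~ m{\G\s*(\d+|[-+*/()])}gc) {
        push @tokens, $1;
    }
    # Only trailing whitespace may remain after the last token.
    return $text =~ /\G\s*\z/gc ? \@tokens : undef;
}

# Stage 2: a grammar says which token sequences are valid.
#   expr   : term   ( ('+' | '-') term   )*
#   term   : factor ( ('*' | '/') factor )*
#   factor : NUMBER | '(' expr ')'
# Each sub consumes tokens from the front of @$t and returns true on success.
sub parse_expr {
    my ($t) = @_;
    parse_term($t) or return 0;
    while (@$t and $t->[0] =~ /\A[-+]\z/) {
        shift @$t;
        parse_term($t) or return 0;
    }
    return 1;
}

sub parse_term {
    my ($t) = @_;
    parse_factor($t) or return 0;
    while (@$t and $t->[0] =~ m{\A[*/]\z}) {
        shift @$t;
        parse_factor($t) or return 0;
    }
    return 1;
}

sub parse_factor {
    my ($t) = @_;
    my $tok = shift @$t;
    return 0 unless defined $tok;
    return 1 if $tok =~ /\A\d+\z/;
    return 0 unless $tok eq '(';
    parse_expr($t) or return 0;
    my $close = shift @$t;
    return (defined($close) && $close eq ')') ? 1 : 0;
}

# Valid only if the whole token list is consumed by one expr.
sub is_valid_expression {
    my ($text) = @_;
    my $tokens = tokenize($text) or return 0;
    return (parse_expr($tokens) && !@$tokens) ? 1 : 0;
}

print is_valid_expression('1 * 2 + ( 3 * 4 )') ? "valid\n" : "invalid\n";
print is_valid_expression('1 ) 2 * + 3 4 ( *') ? "valid\n" : "invalid\n";
```

The first call prints "valid" and the second "invalid", even though both inputs tokenize into the same nine tokens; only the grammar stage tells them apart.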

    Parse::RecDescent takes an input which consists of, among other things, a grammar for your language and source-code that is to be textually included into the parser subroutine.   This is used to create an executable Perl subroutine behind-the-scenes which becomes the complete recognizer, or parser, for your language.   So you get the efficiency of a lean-and-mean Pure Perl subroutine that you did not have to entirely write from scratch.

    Every language-processing system ultimately uses this multi-level, lexer/parser driven approach on its front-end.   Perl, for example, uses (I think ...) the YACC (Yet Another Compiler-Compiler) toolset to generate the parser that is the first thing it unleashes against your source-code.   At strategic points, the YACC-generated parser calls other routines within Perl that build the system’s “understanding” of what your source-code says.   This is Magickally Transformed into what ultimately drives the runtime language system ... which is (also) an automaton.   Structurally speaking, regex evaluation proceeds the same way, although the same tools are not typically used.
