Re: Simplifying regexes

As other monks have pointed out, Perl's regular expressions are indeed anything but regular in the formal sense. Starting from single symbols and the empty string, formal regular expressions use only the following operations:

Concatenation: if a and b are regular expressions, then so is ab.
Alternation: if a and b are regular expressions, then so is (a|b).
The Kleene star: if a is a regular expression, then so is a*.
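
To make the contrast concrete, here is a minimal sketch (the patterns and helper names are made up for illustration). The first pattern uses nothing but the three operations above, with anchors and `(?:...)` added only for grouping. The second uses a backreference, which falls outside the formal definition: it accepts some run of a's, a single b, then the same run of a's again, and that language is not regular.

```perl
use strict;
use warnings;

# Built from concatenation, alternation and the Kleene star only:
# any number of repetitions of "ab" or "c".
my $formal = qr/^(?:ab|c)*$/;

# Uses a backreference (\1), which is outside the formal definition:
# a run of a's, one "b", then exactly the same run of a's again.
my $with_backref = qr/^(a*)b\1$/;

# Hypothetical helpers that return 1 on a match and 0 otherwise.
sub is_formal_match  { return $_[0] =~ $formal       ? 1 : 0 }
sub is_backref_match { return $_[0] =~ $with_backref ? 1 : 0 }

for my $s ('abcab', 'aabaa', 'aaba') {
    printf "%-6s formal=%d backref=%d\n",
        $s, is_formal_match($s), is_backref_match($s);
}
```

Only `'abcab'` satisfies the first pattern, and only `'aabaa'` satisfies the second.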

If you want to wrap your head around what these regular expressions really mean, though, it's best to start with formal languages, specifically regular languages. Regular expressions describe these languages in a very natural and obvious manner.

That said, while fascinating, none of this will help a lot with simplifying Perl regular expressions in a Perl program. For that, you'll need a good intuition of how regular expressions work in Perl, and for that you'll simply have to use them, time and again.

If you're not getting along with Mastering Regular Expressions, BTW, the chapter on regular expressions in Programming Perl is eminently readable and accessible.

Re^2: Simplifying regexes by ExReg (Priest) on Oct 26, 2015 at 17:23 UTC
Reading the replies above, I get the impression that looking at the theoretical Regular Expression material, while probably good to learn, is limited in its applicability to Perl regular expressions. It would appear that the path to enlightenment is through practice, and not pedagoguery.
Re^3: Simplifying regexes by Laurent_R (Canon) on Oct 26, 2015 at 17:56 UTC
"Reading the replies above, I get the impression that looking at the theoretical Regular Expression material, while probably good to learn, is limited in its applicability to Perl regular expressions."
That's pretty much what I wanted to say. Yes, it is probably interesting to look at theoretical regular expressions, but that will have little value for your practical problem, because our Perl "regexes" are anything but regular.
Re^3: Simplifying regexes by AnomalousMonk (Archbishop) on Oct 27, 2015 at 08:55 UTC
"It would appear that the path to enlightenment is through practice, and not pedagoguery."
This has been my experience. "Regular expressions," as they have evolved (into ir-regular expressions, as others have noted), are the most counterintuitive things I have encountered in programming.

Here's my favorite example of this conceptual orneriness. What will the regex `/(b*)/` capture when run against the string `'aaaaabbbb'`, and where (i.e., at what character offset)? Knowing as we do that `*` matches "as much as possible" of something, I'd almost be willing to bet money that even the most regex-savvy will experience at least a minor knee-jerk twitch in the general direction of "it matches `'bbbb'` at offset 5." But surprise, surprise:

`c:\@Work\Perl\monks>perl -wMstrict -le "my $s = 'aaaaabbbb'; print qq{matches '$1' at offset $-[1]} if $s =~ /(b*)/;"`
`matches '' at offset 0`

It matches our old friend the empty string at a location as far away from any `'b'` as it could possibly get. The answer to the puzzle is that "as much as possible" cheats and leaves out the leftmost and equally important part of the "leftmost longest" incantation that should properly be used* to describe all regex matching.

Bottom line: No amount of theoretic or pedagogic vaccination can build up your natural antibody resistance to this sort of thought-bug better than daily exposure to a wide variety of regex challenges. Good luck in your experimentation, and may you make many mistakes, for I know no better way to learn this stuff.

* Old Whateley and his son Wilbur knew the dangers of incomplete and possibly maliciously altered incantations.

Give a man a fish: `<%-{-{-{-<`
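
For anyone who wants to poke at the puzzle in the reply above, here is a small sketch that runs `/(b*)/` and, for comparison, `/(b+)/` against the same string. The helper `first_capture` is a made-up name for illustration; `$1` and `@-` are the standard capture variables.

```perl
use strict;
use warnings;

# Return the text captured by group 1 and its starting offset,
# or an empty list if the pattern does not match at all.
sub first_capture {
    my ($string, $regex) = @_;
    return unless $string =~ $regex;
    my ($text, $offset) = ($1, $-[1]);
    return ($text, $offset);
}

my $s = 'aaaaabbbb';

# b* may match zero characters, so the leftmost match is the empty
# string at offset 0; the engine never needs to look further right.
my ($star_text, $star_pos) = first_capture($s, qr/(b*)/);

# b+ needs at least one 'b', so the leftmost possible start is offset 5;
# only then does greediness extend the match over all four b's.
my ($plus_text, $plus_pos) = first_capture($s, qr/(b+)/);

print "b* captured '$star_text' at offset $star_pos\n";
print "b+ captured '$plus_text' at offset $plus_pos\n";
```

This prints `b* captured '' at offset 0` and `b+ captured 'bbbb' at offset 5`: "leftmost" is settled first, and "as much as possible" only applies from that starting point.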
Re^2: Simplifying regexes by sundialsvc4 (Abbot) on Oct 26, 2015 at 17:27 UTC
Parse::RecDescent does take some time to get to know, and I wish that the tutorials were better. But the key concept is that it is very useful when you have a complex, naturally structured input: one in which it is easy to describe the pieces of the greater whole with regexes, but you lack a tool that will “piece them all together.” A parser is such a tool.

For example, consider the task of trying to build a single regex that will validate an arithmetic expression such as `1 * 2 + ( 3 * 4 )`. If you attempt to do this in one regex (and I have seen it done, e.g. in RFC: Perl regex to validate arithmetic expressions, written four years ago), you quickly run into the problem that an arithmetic expression has a semantic structure. It is not simply a stream of characters. (For instance, `1 ) 2 * + 3 4 (` is not a valid expression, even though it is built from the same so-called “tokens.”)

A parser-driven approach would decompose the problem into two or more stages. Regexes would be used to describe the individual tokens that make up the expression. (There are nine tokens in the valid example.) Then, the grammar would define how the tokens may legitimately appear together in a “valid” sequence.

Parse::RecDescent takes an input which consists of, among other things, a grammar for your language and source code that is to be textually included into the parser subroutine. This is used to create an executable Perl subroutine behind the scenes, which becomes the complete recognizer, or parser, for your language. So you get the efficiency of a lean-and-mean Pure Perl subroutine that you did not have to write entirely from scratch.

Most language-processing systems ultimately use this multi-level, lexer/parser-driven approach on their front end. Perl, for example, uses (I think ...) the YACC = Yet Another Compiler-Compiler toolset as the first thing that it unleashes against your source code. At strategic points, the YACC-generated parser calls other routines within Perl that build the system’s “understanding” of what your source code says. This is Magickally Transformed into what ultimately drives the runtime language system ... which is (also) an automaton. Structurally speaking, regex evaluation proceeds the same way, although the same tools are not typically used.

Useful pages:
http://biteresources.com/resources/computing/A2/regular_expressions.pdf
http://osteele.com/tools/reanimator/ (requires Flash)
http://www.tattvum.com/regular-expressions-and-compilers
and, certainly not least, https://en.wikipedia.org/wiki/Regular_expression, which has an entire section on implementations and running times.
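
The two-stage idea can be sketched in plain Perl. To be clear, this is a hand-rolled recursive-descent validator written for illustration, not Parse::RecDescent and not its generated code. The grammar (expr, term, factor) and every sub name here are assumptions: regexes recognize the tokens, and a handful of small subs check that the tokens appear in a legitimate order.

```perl
use strict;
use warnings;

# Stage 1: regexes describe the individual tokens.
# Returns an array ref of tokens, or nothing on an unknown character.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length($text)) {
        if    ($text =~ /\G\s+/gc)         { next }               # skip spaces
        elsif ($text =~ /\G(\d+)/gc)       { push @tokens, $1 }   # number
        elsif ($text =~ /\G([-+*\/()])/gc) { push @tokens, $1 }   # operator/paren
        else                               { return }             # unknown char
    }
    return \@tokens;
}

# Stage 2: a tiny assumed grammar decides which token orders are valid.
#   expr   : term   ( ('+' | '-') term   )*
#   term   : factor ( ('*' | '/') factor )*
#   factor : NUMBER | '(' expr ')'
# $t is the token list; $p is a reference to the index of the next token.
sub peek { my ($t, $p) = @_; return $$p < @$t ? $t->[$$p] : '' }

sub parse_factor {
    my ($t, $p) = @_;
    my $tok = peek($t, $p);
    if ($tok =~ /^\d+$/) { $$p++; return 1 }
    return 0 unless $tok eq '(';
    $$p++;
    return 0 unless parse_expr($t, $p) && peek($t, $p) eq ')';
    $$p++;
    return 1;
}

sub parse_term {
    my ($t, $p) = @_;
    return 0 unless parse_factor($t, $p);
    while (peek($t, $p) =~ /^[*\/]$/) {
        $$p++;
        return 0 unless parse_factor($t, $p);
    }
    return 1;
}

sub parse_expr {
    my ($t, $p) = @_;
    return 0 unless parse_term($t, $p);
    while (peek($t, $p) =~ /^[-+]$/) {
        $$p++;
        return 0 unless parse_term($t, $p);
    }
    return 1;
}

# Valid only if a single expression consumes every token.
sub is_valid_expression {
    my ($text) = @_;
    my $tokens = tokenize($text) or return 0;
    my $pos = 0;
    return (parse_expr($tokens, \$pos) && $pos == @$tokens) ? 1 : 0;
}

for my $text ('1 * 2 + ( 3 * 4 )', '1 ) 2 * + 3 4 (') {
    printf "%-20s %s\n", $text,
        is_valid_expression($text) ? 'valid' : 'invalid';
}
```

The first string is reported as valid and the second as invalid. Note that the token regexes never need to know about nesting; that knowledge lives entirely in the three grammar subs.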

