Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

More Power to your Regex

by robin (Chaplain)
on Apr 01, 2002 at 21:54 UTC ( [id://155869]=perlmeditation: print w/replies, xml ) Need Help??

I've just implemented an experimental regex extension, which allows you to do things like match balanced parentheses, without having to embed any code in your regex. The implementation isn't in perl (yet?), but this is still a firm Perl topic, and it raises questions about what features Perl's regexes ought to have.

I wrote a web page about it, but here's the rough idea. You can treat each parenthesised group in your regex as if it were a little subroutine, and make calls to it from within your regex. The power comes from the fact that you can make recursive calls, so you can write

/^([^()]|\((?1)*\))*$/
to match strings which have balanced parentheses, for example.

My favourite so far is a regex to detect palindromic sentences, ignoring spacing and punctuation.

/^\W*(?:((.)\W*(?1)\W*\2|)|((.)\W*(?3)\W*\4|\W*.\W*))\W*$/i

If you want to try it out, follow the instructions here. Let me know what you think.

Replies are listed 'Best First'.
Re: More Power to your Regex
by Juerd (Abbot) on Apr 02, 2002 at 00:28 UTC

    I really, really like this approach, and think it should be implemented in Perl 6, or if possible even sooner. This is a much cleaner solution than the clumsy variable evaluation.

    Here's my try to match simple xml-like data:

    % ^ \s* ( # <1> < \s* ([a-zA-Z:]+) # <2/> (?: \s*[a-zA-Z:]* \s* = \s* (?:'[^']*'|"[^"]*") )* \s* (/\s*)? # <3/> > (?:[^<>]* | (?1))* # Update: added * to (?:) (?(3)| <\s*/\s*\2\s*> ) ) # </1> \s* $ %x
    This is not at all xml-compliant, but at least handles simple data.
    <foo><bar></bar></foo> # Match <foo><bar></foo></bar> # No match <foo><bar/></foo> # No match (WRONG) <foo><bar></foo> # No match <foo bar=baz/> # No match <foo bar="baz"> # No match <foo bar="baz"/> # Match < fooo / > # Match <foo/>foo # No match foo<foo/> # No match <foo>foo</foo> # Match <foo><bar/>foo</foo> # No match (WRONG) <a><b><c></c></b></a> # No match (WRONG!!)
    Could it be that backreferences (like \2) in this case or conditionals (like (?(3)) don't work the way they should, when your patch is used?.

    UPDATE The new version of robin's patch does parse the deeper recursion correctly. See also: Re: Recursive Regex: Update.

    U28geW91IGNhbiBhbGwgcm90MTMgY
    W5kIHBhY2soKS4gQnV0IGRvIHlvdS
    ByZWNvZ25pc2UgQmFzZTY0IHdoZW4
    geW91IHNlZSBpdD8gIC0tIEp1ZXJk
    

      Very impressive! ++ I'm going to rebuild PCRE in debugging mode, and see if I can work out what's going wrong here. There might well be a bug in my patch; I've certainly never tested it with conditionals.

        I've certainly never tested it with conditionals.

        I think it's the conditional indeed, because it works smoothly when I re-write it to not use a conditional:

        % ^ \s* ( # <1> # Single tags like <foo/> < \s* [a-zA-Z:]+ (?: \s*[a-zA-Z:]* \s* = \s* (?:'[^']*'|"[^"]*") )* \s* /\s* > | # Tags in pairs like <foo>content</foo> < \s* ([a-zA-Z:]+) # <2/> (?: \s*[a-zA-Z:]* \s* = \s* (?:'[^']*'|"[^"]*") )* \s* > (?:[^<>]* | (?1))* <\s*/\s*\2\s*> ) # </1> \s* $ %x
        <foo><bar></bar></foo> # Match <foo><bar></foo></bar> # No match <foo><bar/></foo> # Match <foo><bar></foo> # No match <foo bar=baz/> # No match <foo bar="baz"> # No match <foo bar="baz"/> # Match < fooo / > # Match <foo/>foo # No match foo<foo/> # No match <foo>foo</foo> # Match <foo><bar/>foo</foo> # Match #<a><b><c></c></b></a> # No match (WRONG!!)
        Now, there's still the three-level-deep problem...

        U28geW91IGNhbiBhbGwgcm90MTMgY
        W5kIHBhY2soKS4gQnV0IGRvIHlvdS
        ByZWNvZ25pc2UgQmFzZTY0IHdoZW4
        geW91IHNlZSBpdD8gIC0tIEp1ZXJk
        

Re: More Power to your Regex
by blackflag (Novice) on Apr 01, 2002 at 23:48 UTC
    You should see if there is a way you can get these functions into Perl 6. I'm sure that there is space for interesting (and useful) regexes like these that don't require you having to type in something as confusing as /^\W*(?:((.)\W*(?1)\W*\2|)|((.)\W*(?3)\W*\4|\W*.\W*))\W*$/i

    Try emailing someone who is working on Perl 6 with your functions. Maybe you could write a module that includes all of these as subfunctions.

    Well, good luck with your work. I certainly enjoyed looking it over.

    Peace, BlackFlag.
      Well, I just typed up something that you might find interesting, its worth a look at least:

      print &quot;Type a word or phrase: &quot;; $line = &lt;STDIN&gt;; $line =~ s/\W//g; # removes space and nonalphanumerics $line =~ tr/A-Z/a-z/; # converts to lowercase. @letters = split //, $line; $reverse = join &quot;&quot;, reverse @letters; # gets the letter and reverse it if(@letters == 1){ print &quot;One letter palindrome, trivial!\n&quot;; } elsif($reverse eq $line){ print &quot;This is a palindrome.\n&quot;; } else{ print &quot;Not a palindrome.\n&quot;; }


      If you have any thoughts about this, be sure to /msg me. However, its much cooler to have a single regex do it than having all those lines of code. Peace... BlackFlag.
Re: More Power to your Regex
by belg4mit (Prior) on Apr 02, 2002 at 00:31 UTC
    *sigh* multi-tasking and never actually submitted my post...

    Excellent! When do we get to see it? I especially like the named parens bit, ja I know it's not your patch but it sounds like it'll likely come along for the ride?

    --
    perl -pe "s/\b;([st])/'\1/mg"

      You can see it now! Presumably you want to see it in Perl though, which is rather out of my hands. I posted to the perl6-language list last night, and it sounds as though Larry has some ideas of his own.

      It sounds as though we'll get something of equivalent power, one way or another. I eagerly await the next apocalypse.

Re: More Power to your Regex
by abstracts (Hermit) on Apr 02, 2002 at 03:43 UTC
    This is very interesting indeed, but makes me wonder: why are we still calling such expressions regular? I cannot think of a good name for it, but surely "regular" does not describe it.

    Can anybody think of some? (no cfgex for me --thank you very much :-)

      I've been wondering the same thing. I expect Kleene is turning in his grave :-)

      We've had the same problem ever since back referencing was invented, really. The computer scientist Alfred Aho (of dragon book fame) wrote an article in 1990 in which he tried to introduce the term rewbr for "regular expression with back referencing". Oddly, it doesn't seem to have caught on...

      Maybe we should just call them all patterns and be done with it. Or else stick with regex, and forget that it was ever an abbreviation,

        Or else stick with regex, and forget that it was ever an abbreviation,

        *Grin* All we have to do is pretend that regular refers to how often we use them than to their underlying mathematical properties...

        Yves / DeMerphq
        ---
        Writing a good benchmark isnt as easy as it might look.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://155869]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (5)
As of 2024-03-29 13:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found