PerlMonks  

Complex regex question

by Cody Fendant (Hermit)
on Sep 25, 2019 at 06:06 UTC ( [id://11106665]=perlquestion )

Cody Fendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to match a pattern that is roughly a-or-b followed by digits, or digits followed by a-or-b.

use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 0;

my $string_one = 'foo bar [a21] plus (b23) baz bax';
my @string_one_results = $string_one =~ m { (a|b) (\d+) }gx;
print "RESULTS: " , Dumper(\@string_one_results), $/;
# RESULTS: $VAR1 = ['a','21','b','23'];

So far so good for the first pattern. Here's the other:

my $string_two = 'foo bar [21a] plus (23b) baz bax';
my @string_two_results = $string_two =~ m{ (\d+) (a|b) }gx;
print "RESULTS: " , Dumper(\@string_two_results), $/;
# RESULTS: $VAR1 = ['21','a','23','b'];

The second regex matches fine for that other format. Here's what I've been struggling over for half an hour or so: what if the two formats are mixed up?

Shouldn't I be able to come up with a regex matching either pattern?

my $string_three = 'foo bar [21a] plus (b23) baz bax';
my @string_three_results = $string_three =~ m{ (\d+) (a|b) | (a|b) (\d+) }gx;
print "RESULTS: " , Dumper(\@string_three_results), $/;
# RESULTS: $VAR1 = ['21','a',undef,undef,undef,undef,'b','23'];

Please help me out, Monks. What am I doing wrong and what are the undefs in my @results?

Replies are listed 'Best First'.
Re: Complex regex question
by Athanasius (Archbishop) on Sep 25, 2019 at 06:39 UTC

    Hello Cody Fendant,

    Your regex is actually working correctly. In the alternation m{ (\d+) (a|b) | (a|b) (\d+) }gx, the four capture groups are numbered $1, $2, $3, and $4, but only one branch (and so only two of the four groups) can participate in any given match. In list context every match returns all four groups, so the two that did not participate come back as undef. To avoid this, you can use the branch-reset extended pattern (?|pattern), available from Perl 5.10.0 on, which restarts capture-group numbering in each alternative:

    use strict;
    use warnings;
    use Data::Dumper;
    $Data::Dumper::Indent = 0;

    my $string_three = 'foo bar [21a] plus (b23) baz bax';
    my @string_three_results = $string_three =~ / (?| (\d+) (a|b) | (a|b) (\d+) ) /gx;
    print 'RESULTS: ', Dumper(\@string_three_results), $/;

    Output:

    16:38 >perl 2020_SoPW.pl
    RESULTS: $VAR1 = ['21','a','b','23'];

    16:38 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,
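To see where the undefs in the original result come from, it may help to walk the matches one at a time instead of collecting them in list context. The following is only a minimal sketch, not part of the reply above; the @report and @groups names are made up for illustration. It runs the question's unmodified regex in a while loop and records which of the four groups took part in each match.

```perl
use strict;
use warnings;

my $string_three = 'foo bar [21a] plus (b23) baz bax';

# Scalar-context /g: each pass through the loop is one match.
# Record all four groups so the non-participating ones are visible.
my @report;
while ( $string_three =~ m{ (\d+) (a|b) | (a|b) (\d+) }gx ) {
    my @groups = ( $1, $2, $3, $4 );
    push @report, join ',', map { defined $_ ? $_ : 'undef' } @groups;
}

print "$_\n" for @report;
# prints:
# 21,a,undef,undef
# undef,undef,b,23
```

Each match fills only the groups of the branch that succeeded, which is why the list-context result has eight slots with four undefs in the middle.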

Re: Complex regex question
by AnomalousMonk (Archbishop) on Sep 25, 2019 at 06:47 UTC

    And of course, the other approach (if you don't have 5.10+) is to just grep away all undefined values:

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my $s = 'foo [21a] plus (b23) baz';
     ;;
     my @captures = grep defined, $s =~ m{ (\d+) (a|b) | (a|b) (\d+) }xmsg ;
     dd \@captures;
    "
    [21, "a", "b", 23]


    Give a man a fish:  <%-{-{-{-<

Re: Complex regex question
by tybalt89 (Monsignor) on Sep 25, 2019 at 14:16 UTC

    Maybe not the best solution :)

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11106665
    use warnings;

    my $string_three = 'foo bar [21a] plus (b23) baz bax';
    my @string_three_results = $string_three =~ m{
      ( [ab] | \d+ )
      ( (??{$1 lt 'a' ? '[ab]' : '\\d+' }) )
      }gx;

    use Data::Dump 'dd'; dd \@string_three_results;

    Outputs:

    [21, "a", "b", 23]
Re: Complex regex question
by AnomalousMonk (Archbishop) on Sep 25, 2019 at 16:22 UTC

    Yet another way. Pre-5.10 compatible; avoids explicit captures. Unfortunately, rather heavy on the look-arounds.

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my $s = qq{a aa 21a a 123 a b9 b bb aa1 1aa @ARGV};
     dd $s;
     ;;
     my @captures = $s =~ m{
         \b \d+ (?= (?: a|b) \b) | (?<= \b (?: a|b)) \d+ \b
         |
         \b (?: a|b) (?= \d) | (?<= \d) (?: a|b) \b
         }xmsg;
     dd \@captures;
    " "aa11 11aa [b42] (12a)"
    "a aa 21a a 123 a b9 b bb aa1 1aa aa11 11aa [b42] (12a)"
    [21, "a", "b", 9, "b", 42, 12, "a"]


    Give a man a fish:  <%-{-{-{-<

      One more variant:
      #!/usr/bin/perl

      # https://perlmonks.org/?node_id=11106665

      use strict;
      use warnings;
      use Data::Dumper;
      $Data::Dumper::Indent = 0;

      my $string_three = "a aa 21a a 123 a b9 b bb aa1 1aa aa11 11aa [b42] (12a)";
      my @captures = $string_three =~ m{
          \b ([ab])? (\d+) ([ab])? \b
          (?(?{ 1 != grep defined, $1, $3 }) (*FAIL) )
          }xmsg;
      print "RESULTS: ", ( Dumper \@captures ), "\n";

      RESULTS: $VAR1 = [undef,'21','a','b','9',undef,'b','42',undef,undef,'12','a'];

      It generates triplets with one undefined value ($1 or $3), but you can grep for the defined values as AnomalousMonk mentioned.

      EDIT:
      Sorry for the mistake. I changed the condition "2 == grep defined, $1, $3" to "1 != grep defined, $1, $3", because exactly one of $1 or $3 must be defined.


      ADDED:
      my @captures = $string_three =~ m{ \b ([ab])? (\d+) (?(1) | ([ab]) ) \b }xmsg;
      A similar approach using a conditional. It captures in the same way.

      ADDED-2:
      Also:
      my @captures = $string_three =~ m{ \b ([ab])? (\d+) (?(1) (*ACCEPT) ) ([ab]) \b }xmsg;
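The triplets produced by these variants can be flattened by dropping the undefined slot, as suggested earlier in the thread. Here is a small sketch, assuming the first conditional variant and the same test string; the @triplets name is introduced here for illustration only.

```perl
use strict;
use warnings;

my $string_three =
    "a aa 21a a 123 a b9 b bb aa1 1aa aa11 11aa [b42] (12a)";

# Conditional variant: the trailing letter (group 3) is only
# attempted when the leading letter (group 1) did not match.
my @triplets = $string_three =~ m{
    \b ([ab])? (\d+) (?(1) | ([ab]) ) \b
}xmsg;

# Each match contributes three slots, one of them undef; drop it.
my @captures = grep { defined } @triplets;

print join(',', @captures), "\n";
# prints: 21,a,b,9,b,42,12,a
```

The filtered list has the same contents as the look-around solution's output above, without needing look-arounds.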
Re: Complex regex question
by beech (Parson) on Sep 25, 2019 at 06:45 UTC

    Hi

    Word boundaries?

    $string_three =~ m{\b ( [ab] \d+ | \d+ [ab] ) \b }gx

    what are the undefs in my @results?

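    List context for match groups: with /g, the match returns every capture group for each match, and groups that did not participate come back as undef.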

      But Cody Fendant seems to want the alpha and numeric fields extracted separately. The m{\b ( [ab] \d+ | \d+ [ab] ) \b }gx regex requires a subsequent (albeit simple) step to achieve this:

      c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
      "my $string_three = 'foo bar [21a] plus (b23) baz bax';
       my @string_three_results = $string_three =~ m{\b ( [ab] \d+ | \d+ [ab] ) \b }gx;
       dd \@string_three_results;
      "
      ["21a", "b23"]


      Give a man a fish:  <%-{-{-{-<
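The "subsequent step" mentioned in the reply above is not spelled out in the thread. One possible way to do it, offered only as a sketch, is a second pass that splits each whole token into its digit run and its letter. The @tokens and @fields names and the second regex are assumptions for illustration, not taken from any of the replies.

```perl
use strict;
use warnings;

my $string_three = 'foo bar [21a] plus (b23) baz bax';

# First pass: the word-boundary regex grabs each token whole.
my @tokens = $string_three =~ m{\b ( [ab] \d+ | \d+ [ab] ) \b}gx;

# Second pass: pull the digit run and the letter out of each token,
# keeping them in the order in which they appeared.
my @fields = map { /(\d+|[ab])/g } @tokens;

print join(',', @fields), "\n";
# prints: 21,a,b,23
```

Because the second regex has a single capture group, no undefs appear and no filtering is needed.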
