PerlMonks  

Re: Empty pattern in regex [updated]

by jo37 (Deacon)
on Oct 25, 2023 at 20:11 UTC


in reply to Empty pattern in regex

I think it's a bug. It has nothing to do with the flip-flop operator; it seems to be caused by jumping out of a block. Consider this example, which emulates a flip-flop and uses goto instead of next.

#!/usr/bin/perl
use v5.24;
use warnings;

my ($first, $last);
while (<DATA>) {
    chomp;
    $first ||= /d/;
    undef($first) if $last ||= /h/;
    if ($first || $last) {
        undef $last;
        #goto ewhile unless //;
        goto eif unless //;
        say;
        eif:
    }
    ewhile:
}
__DATA__
c
d
e
f
g
h
i

goto eif: d h
goto ewhile: d f g h

Jumping to the end of the current block produces the expected result, while jumping to the end of the while loop reproduces choroba's strange results. The jump out of the block seems to clear the "last successful match" causing // to be taken as an always matching empty pattern.
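For contrast, here is a minimal sketch of the documented behaviour in straight-line code, with no blocks or jumps involved. The input strings and the names $reuse_hit, $reuse_miss and $genuine are made up for illustration. /(?:)/ serves as an explicit pattern that always matches, so it does not depend on the last successful match.

```perl
#!/usr/bin/perl
use v5.24;
use warnings;

# Seed the "last successful match" with /d/ (illustrative input string).
'abcd' =~ /d/ or die "seed match failed";

# With a prior successful match, // is documented to behave like /d/.
my $reuse_hit  = ('d' =~ //) ? 1 : 0;
my $reuse_miss = ('e' =~ //) ? 1 : 0;

# An explicit empty group has non-empty pattern text, so it is never
# replaced by the last successful pattern and matches any string.
my $genuine = ('e' =~ /(?:)/) ? 1 : 0;

say "'d' =~ //     : $reuse_hit";
say "'e' =~ //     : $reuse_miss";
say "'e' =~ /(?:)/ : $genuine";
```

If // ever has to mean "match anything", spelling it /(?:)/ removes the dependence on earlier matches altogether.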

However, I'd prefer to check the flip-flop's return value as this works in all circumstances, even for if(foo($_) .. bar($_)) {...}.
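A minimal sketch of that approach, assuming the same input lines c to i and the same /d/ and /h/ boundaries as above. In scalar context the flip-flop returns a sequence number starting at 1, with "E0" appended on the last line of the range (see perlop), so the boundary lines can be told apart without //. The names $seq, @border and @inner are illustrative only.

```perl
#!/usr/bin/perl
use v5.24;
use warnings;

my (@border, @inner);

for my $line (qw(c d e f g h i)) {
    # '' outside the range, 1 on the first line, N . 'E0' on the last line.
    my $seq = ($line =~ /d/) .. ($line =~ /h/);
    next unless $seq;

    # "5E0" is still numerically 5, so == 1 only hits the first line.
    if ($seq == 1 || $seq =~ /E0\z/) {
        push @border, $line;
    }
    else {
        push @inner, $line;
    }
}

say "border: @border";
say "inner:  @inner";
```

Because the decision rests on the operator's own return value, it does not matter what the two operands are or whether any regex matched at all.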

Update: 26.10.2023

Here is a much simpler example demonstrating the behaviour without any flip-flop behaviour. A jump out of a block transforms the empty pattern // from the last successful matching pattern to a true empty pattern.

#!/usr/bin/perl
use v5.24;
use warnings;

for my $label ('inner', 'outer') {
    say "goto $label";
    for ('c' .. 'g') {
        say "loop: $_";
        say "/d/ matched" if /d/;
        {
            goto $label unless //;
            say "// matched";
            inner:
        }
        outer:
    }
    say '';
}
__DATA__
goto inner
loop: c
// matched
loop: d
/d/ matched
// matched
loop: e
loop: f
loop: g

goto outer
loop: c
// matched
loop: d
/d/ matched
// matched
loop: e
loop: f
// matched
loop: g
// matched

Greetings,
-jo

$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

Re^2: Empty pattern in regex [updated]
by perlboy_emeritus (Scribe) on Oct 27, 2023 at 21:54 UTC

    Hello jo37,

    Do you still think it's a bug, even though we seem to be able to get what we want with some Perl trickery? You reproduced the issue choroba raised, but we did get the answer he expected. // has been around since at least 5.6: it is discussed in PP 3rd as well as 4th (5.14), unfortunately without examples. The example in perlop is clear but, to me, does not suggest a real use case. I'm still looking for a definitive use case, or at least a realistic one. I've come up with two. The first might be used by a grammarian or linguist doing comparative language research. The second extracts the string between HTML tags, although I also show how to do this with a much simpler plain-old regex. Methinks it's a stretch to use // when there are other ways to do a thing, but TIMTOWTDI. One of my examples parses a string while the other uses an array; if (/this/../that/) {... almost demands an array. I would really like to hear a war story or two about how // was used to solve some really gnarly problem. Here be my two examples:

    #!/usr/bin/env -S perl -w
    ##!/usr/bin/env -S perl -wd

    use v5.30.0;
    use strict;
    use List::AllUtils qw( reduce );

    my ($slurpee, $length, $sum);
    {
        local $/;
        ($slurpee) = <DATA>;
    }
    $length = length $slurpee;

    my @regexes = (
        [ qr/[A-Z]/, "uppercase characters",  0 ],
        [ qr/[a-z]/, "lowercase characters",  0 ],
        [ qr/\d/,    "digits",                0 ],
        [ qr/\s/,    "whitespace characters", 0 ],
        #
        # Note: $ must be \$, and - must be first to avoid range interpretation.
        #
        [ qr/[-~`!@#\$%^&*()_+={}\[\]|\\:;"'<>,.?\/]/, "punctuation characters", 0 ],
    );

    #for my $c (split //, $slurpee) { print $c; }

    for my $case (@regexes) {
        say "seeding // with: $case->[0]";
        "Aa5: " =~ $case->[0];    # seed the // iteration
        say "matched: '$&'" if $&;
        for (split //, $slurpee) {
            // and $case->[2]++;
        }
    }

    for my $case (@regexes) { printf("%4d %s\n", $case->[2], $case->[1]); }

    $sum = reduce { $a + $b } (map $_->[2], @regexes);
    printf(" sum and length: %3d and %3d\n", $sum, $length);

    say "\nNow extract the string between HTML tags with //...";
    my $str = "Before tag<i>between tags</i>after tag";
    say "\n$str";
    $str =~ s{ (?: (?<= \w) (?= <) | (?<= >) (?= \w) ) }{ }xg;    # insert whitespace
    say $str;

    my @tokens = split / /, $str;
    say "Tokens...\n";
    for (@tokens) { say };

    my $between;
    for (@tokens) {
        if (/<\w>/../<\/\w>/) {
            $between .= "$_ " unless // and $&;
        }
    }
    chop $between if $between;
    say "'$between'";

    $str = "\n'Before tag<i>between tags</i>after tag'";
    say $str;
    say "Parse it again with...";
    my $regex = qr/ (<\w+>) (.*) (<\/\w+>) /x;
    say $regex;
    $str =~ $regex;
    say "\$1: '$1'";
    say "\$2: '$2'";
    say "\$3: '$3'";

    exit(0);

    __END__
    Last night I dreamt I went to Manderley again. This will come as a surprise
    to Daphne since she did not write these lines. Here is a line containing
    stuff ,?- ! : that should/must be deleted/// ; : ! before using it as a
    one-time-pad.
    A one-time-pad should contain only characters, no punctuation, no
    parentheticals like (this is bogus) or [(this is bogus, too)], or {also this};
    no contractions, such as I'll or it's or digits such as 0, 123, -75 or
    8 P.M., and no numbers, such as $1,234.69. If you want to use numbers in
    your message, spell them out; one-hundred dollars and sixty-nine cents, or
    theeepm. These non-alpha characters in the one-time-pad will be discarded,
    but they must be entered eactly as represented in the book used as the pad.
    Let the encoding program decide what to use and what to skip.
    Some of the text is from "Rebecca", an out-of copyright but not out-of-print
    fictional work that can be freely downloaded as an eBook from Project
    Gutenberg. I use it as the raw source for one-time pads in a cryptologic
    research study; i.e., extract potential pad bits from somewhere in the text,
    randomly chosen with seek from EOF. Munge the characters, encrypt the
    message and delete the characters used for the pad. Since both encoder and
    decoder use the same seek expression, both pads are guaranteed to be
    identical, and since the characters used to create the pad are deleted,
    never to be seen again, the pad is guaranteed to be used exactly once. Does
    not scale for large organizations but works flawlessly for a small group of
    conspirators.

    O U T P U T
      seeding // with: (?^u:[A-Z])
      matched: 'A'
      seeding // with: (?^u:[a-z])
      matched: 'a'
      seeding // with: (?^u:\d)
      matched: '5'
      seeding // with: (?^u:\s)
      matched: ' '
      seeding // with: (?^u:[-~`!@#\$%^&*()_+={}\[\]|\\:;"'<>,.?/])
      matched: ':'
        26 uppercase characters
      1168 lowercase characters
        13 digits
       283 whitespace characters
        80 punctuation characters
       sum and length: 1570 and 1570
    
      Now extract the string between HTML tags with //...
    
      Before tag<i>between tags</i>after tag
      Before tag <i> between tags </i> after tag
    
      Tokens...
    
      Before
      tag
      <i>
      between
      tags
      </i>
      after
      tag
      'between tags'
    
      'Before tag<i>between tags</i>after tag'
      Parse it again with...
      (?^ux: (<\w+>) (.*) (</\w+>) )
      $1: '<i>'
      $2: 'between tags'
      $3: '</i>'
    

      Hello perlboy_emeritus,

      to be more explicit on this issue: I do not just think it's a bug, I am absolutely convinced it is. Some remarks:

      • Having a workaround for a bug in no way means it is not a bug.
      • Using $& in this scenario is dangerous, as it is affected by the very same bug. See the extended example below.
      • I cannot find anything in your code that would trigger the bug. This is fine, and TIMTOWTDI.
      • perlop is very precise in the section The empty pattern "//":
        If the *PATTERN* evaluates to the empty string, the last *successfully* matched regular expression is used instead. (...) If no match has previously succeeded, this will (silently) act instead as a genuine empty pattern (which will always match). (...)
        As you can see from my example, // does not behave as described if there has been a successful match and a jump then leaves an inner block in which // was applied. The jump clears $& and resets // to the genuine empty pattern.
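One way to sidestep both // and $& is to name the pattern and capture the matched text in a lexical. The sketch below is not code from this thread: $re, $hit, @seen and @kept are hypothetical names, and the loop shape only mirrors the examples here.

```perl
#!/usr/bin/perl
use v5.24;
use warnings;

my $re = qr/d/;    # hypothetical pattern, named so it can be reused
my (@seen, @kept);

ITEM: for my $item ('c' .. 'g') {
    # A list-context match returns the captures, or an empty list on
    # failure, so $hit is undef unless this very match succeeded.
    my ($hit) = $item =~ /($re)/;
    push @seen, $hit if defined $hit;
    {
        # Explicit pattern instead of //: no reliance on the "last
        # successful match", even when jumping out of the inner block.
        next ITEM unless $item =~ $re;
        push @kept, $item;
    }
}

say "seen: @seen";
say "kept: @kept";
```

The label on next matters here: a bare block counts as a loop that runs once, so an unlabelled next would only leave the inner block.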

      #!/usr/bin/perl
      use v5.24;
      use warnings;

      for my $label ('inner', 'outer') {
          say "goto $label";
          for ('c' .. 'g') {
              say "loop: $_";
              say "/d/ matched" if /d/;
              say "\$&: '$&'" if defined $&;
              {
                  goto $label unless //;
                  say "// matched";
                  inner:
              }
              outer:
          }
          say '';
      }
      __DATA__
      goto inner
      loop: c
      // matched
      loop: d
      /d/ matched
      $&: 'd'
      // matched
      loop: e
      $&: 'd'
      loop: f
      $&: 'd'
      loop: g
      $&: 'd'

      goto outer
      loop: c
      // matched
      loop: d
      /d/ matched
      $&: 'd'
      // matched
      loop: e
      $&: 'd'
      loop: f
      // matched
      loop: g
      // matched

      Greetings,
      -jo

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$