PerlMonks  

Weirdness (duplicated data) while building result during parsing using regex

by perlancar (Hermit)
on Sep 01, 2016 at 15:45 UTC ( [id://1170982]=perlquestion )

perlancar has asked for the wisdom of the Perl Monks concerning the following question:

This is code I've trimmed down from Data::CSel to demonstrate the problem I'm having:

    package CSelTest;

    use 5.020000;
    use strict;
    use warnings;

    our $RE =
        qr{
              (?&ATTR_SELECTOR) (?{ $_ = $^R->[1] })

              (?(DEFINE)
                  (?<ATTR_SELECTOR>
                      \[\s*
                      (?{ [$^R, []] })
                      (?&ATTR_SUBJECTS)
                      (?{
                          $^R->[0][1][0] = $^R->[1];
                          $^R->[0];
                      })

                      (?:
                          (
                              \s*=\s*|
                              #\s*!=\s*|
                              # and so on

                              \s+eq\s+
                              #\s+ne\s+
                              # and so on
                          )
                          (?{
                              my $op = $^N;
                              $op =~ s/^\s+//; $op =~ s/\s+$//;
                              $^R->[1][1] = $op;
                              $^R;
                          })

                          (?:
                              (?&LITERAL_NUMBER)
                              (?{
                                  $^R->[0][1][2] = $^R->[1];
                                  $^R->[0];
                              })
                          )
                      )?
                      \s*\]
                  )

                  (?<ATTR_NAME>
                      [A-Za-z_][A-Za-z0-9_]*
                  )

                  (?<ATTR_SUBJECT>
                      (?{ [$^R, []] })
                      ((?&ATTR_NAME))
                      (?{
                          push @{ $^R->[1] }, $^N;
                          $^R;
                      })
                      (?:
                          # attribute arguments
                          \s*\(\s*
                          (?{
                              $^R->[1][1] = [];
                              $^R;
                          })
                          (?:
                              (?&LITERAL_NUMBER)
                              (?{
                                  push @{ $^R->[0][1][1] }, $^R->[1];
                                  $^R->[0];
                              })
                              (?:
                                  \s*,\s*
                                  (?&LITERAL_NUMBER)
                                  (?{
                                      push @{ $^R->[0][1][1] }, $^R->[1];
                                      $^R->[0];
                                  })
                              )*
                          )?
                          \s*\)\s*
                      )?
                  )

                  (?<ATTR_SUBJECTS>
                      (?{ [$^R, []] })
                      (?&ATTR_SUBJECT)
                      (?{
                          push @{ $^R->[0][1] }, {
                              name => $^R->[1][0],
                              (args => $^R->[1][1]) x !!defined($^R->[1][1]),
                          };
                          $^R->[0];
                      })
                  )

                  (?<LITERAL_NUMBER>
                      (
                          -?
                          (?: 0 | [1-9]\d* )
                          (?: \. \d+ )?
                          (?: [eE] [-+]? \d+ )?
                      )
                      (?{ [$^R, 0+$^N] })
                  )

              ) # DEFINE
      }x;

    sub parse_csel {
        state $re = qr{\A\s*$RE\s*\z};

        local $_ = shift;
        local $^R;
        eval { $_ =~ $re } and return $_;
        die $@ if $@;
        return undef;
    }

    1;

This code tries to parse expressions like [attr], [attr=1], or [attr eq 1], which are similar to CSS attribute selectors.

    % perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{ [attr] }) )'
    [[{ name => "attr" }]]

    % perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{ [attr=1] }) )'
    [[{ name => "attr" }], "=", 1]

    % perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{ [attr eq 1] }) )'
    [[{ name => "attr" }], "eq", 1]

No problem so far. Now, this code also recognizes the forms [meth()], [meth(1,2,3)], and [meth(1,2,3) = 1], that is, an argument list after the attribute/method name. This is where the problem happens:

    % perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{ [attr()] }) )'
    [[{ args => [], name => "attr" }]]

    % perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{ [attr()=1] }) )'
    [[{ args => [], name => "attr" }], "=", 1]

    % perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{ [attr() eq 1] }) )'
    do {
      my $a = [
        [
          { args => [], name => "attr" }, # .[0]
          { args => 'fix', name => "attr" }, # .[1]
        ], # [0]
        "eq", # [1]
        1, # [2]
      ];
      $a[0][1]{args} = $a[0][0]{args};
      $a;
    }

As you can see, if I use the eq operator (recognized by the \s+eq\s+ part of the regex; notice \s+ rather than \s*) instead of the = operator (recognized by the \s*=\s* part; notice \s* rather than \s+), I get a duplicated entry in the result (marked by the # .[1] comment).

I'm using perl 5.22.1 but have tried 5.24.0 as well as 5.25.4, with the same results.

Any hints?

UPDATE 2016-09-10: I worked around this problem by setting and incrementing a counter variable in specific places to detect backtracking, and using a conditional to keep my code from being executed multiple times when backtracking occurs. Thanks to everyone who provided responses.
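The re-execution that the workaround guards against can be reproduced in isolation. The following is a minimal sketch, not the Data::CSel grammar: the pattern, the input string, and the @calls array are made up for illustration. It assumes only that a (?{ ... }) block runs again each time the engine re-enters it after backtracking, and that side effects on outer variables are not undone.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real grammar: optional trailing whitespace,
# a code block with a side effect, then an operator that needs a leading space.
my @calls;
my $matched = "attr() eq" =~ m{
    attr \( \) \s*               # greedy: takes the space before eq first
    (?{ push @calls, pos() })    # side effect on an outer array
    \s+ eq                       # needs one space, which forces backtracking
}x;

printf "matched=%d, code block ran %d time(s)\n",
    ($matched ? 1 : 0), scalar @calls;
```

Here \s* first consumes the space before eq, the code block runs, \s+ then fails, the engine backs \s* off to zero spaces, and the block runs a second time before the match succeeds. So @calls should end up with two entries for a single successful match, which mirrors the duplicated hash in the [attr() eq 1] output.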

Replies are listed 'Best First'.
Re: Weirdness (duplicated data) while building result during parsing using regex
by Krambambuli (Curate) on Sep 02, 2016 at 09:00 UTC
    If you add a 'use Data::Dump' to your module and insert lines like
    print "4 \$_: '$_'\n", "\$^N: ", dd( $^N ), "\$^R: ", dd( $^R );
    after your 'push' instructions, there is clearly a difference in behavior between your example cases.

    Maybe you can figure it out from there?
      Thanks for the suggestion, I indeed forgot to try "print"-debugging at every step. Apparently there's backtracking involved in the case of [attr() eq 1] but not in the case of [attr()=1]. I should've been suspicious of backtracking whenever some kind of "duplication" happens. Will debug this further.
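One way to make such code blocks tolerant of backtracking, sketched here as an assumption rather than as what Data::CSel ended up doing, is to avoid modifying shared structures in place and instead return a fresh value from each block. perlre documents that the assignment to $^R is localized, so the previous value is restored when the engine backtracks over a block. The pattern and the $result variable below are hypothetical.

```perl
use strict;
use warnings;

my $result;
my $ok = "attr() eq" =~ m{
    (?{ [] })                    # start from an empty list
    (attr) \( \) \s*
    (?{ [ @{ $^R }, $^N ] })     # build a new list instead of pushing in place
    \s+ eq
    (?{ $result = $^R })         # publish only once the whole pattern matched
}x;

print join(",", @$result), "\n" if $ok;
```

With push on a shared array the name would be recorded twice for this input. Here the retried block starts again from the restored, empty list, so $result should hold a single "attr".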
Re: Weirdness (duplicated data) while building result during parsing using regex
by kcott (Archbishop) on Sep 02, 2016 at 10:59 UTC

    G'day perlancar,

    Noting the "use 5.020000;" in your code, and the versions you used for testing, I attempted to replicate your results using a variety of versions. Unfortunately, other than "use 5.020000;" correctly barring a 5.18.0 test, pretty much everything else blew up in my face.

    My CSelTest.pm is a [download] copy of your posted code. I ran the tests like this:

    $ perl -MCSelTest -MData::Dump -E 'dd CSelTest::parse(q{ [attr] })'

    Here are the results:

    v5.18.0 : darwin-thread-multi-2level

    Successfully tested: use VERSION

    Perl v5.20.0 required--this is only v5.18.0, stopped at CSelTest.pm line 3.
    BEGIN failed--compilation aborted at CSelTest.pm line 3.
    Compilation failed in require.
    BEGIN failed--compilation aborted.

    v5.20.0 : darwin-thread-multi-2level
    v5.20.2 : darwin-thread-multi-2level

    Both tests unsuccessful. Result: PANIC!

    panic: memory wrap at CSelTest.pm line 107.
    Compilation failed in require.
    BEGIN failed--compilation aborted.

    v5.22.0 : darwin-thread-multi-2level
    v5.24.0 : darwin-thread-multi-2level

    Both tests unsuccessful. Result: FATAL!

    Switch (?(condition)... not terminated in regex; marked by <-- HERE in m/ (?&ATTR_SELECTOR) (?{ $_ = $^R->[1] }) (?(DEFINE) ... ) # DEFINE <-- HERE / at CSelTest.pm line 107.
    Compilation failed in require.
    BEGIN failed--compilation aborted.

    I suspect some sort of copy/paste error; please address.

    Here's another possible copy/paste error:

    % perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{ [attr=1] }) )'
    [[{ name => "attr" }], "eq", 1]

    That's the same output as the next test with [attr eq 1]. For [attr=1], I would've expected '"="', rather than '"eq"', in the output; i.e.

    % perl -I. -Ilib -MCSelTest -MData::Dump -E'dd( CSelTest::parse_csel(q{ [attr=1] }) )'
    [[{ name => "attr" }], "=", 1]

    — Ken

      Hi kcott,

      Thanks for commenting and testing. I've corrected the sample output as per what you said, my bad. But as for the code for CSelTest.pm itself, it looks correct. My diff -wu output comparing the downloaded code and the file on my filesystem is empty. In case you need to download from another source, I also put it on github:

      CSelTest.pm

      and (for comparison): CSelTest.pm-from-perlmonks.org

      About the "use 5.020000" pragma, I added it to Data::CSel to exclude perl 5.18.4 or earlier because CPAN Testers reported weird failures that look related to the regex engine and are something that I don't want to deal with at the moment. As far as I know, the regex-related constructs that I use (including (?{CODE}), (?&NAME), $^N, $^R, etc) are all supposed to be supported by 5.010 and up.

        I meant to comment on these in my first response.

        "About the "use 5.020000" pragma, ..."

        Seems like a good choice; however, the perldeltas (see below), for both v5.22 and v5.24, could have bug fixes, which may affect this choice. Having said that, you need to consider what versions of Perl are available to users of your code.

        [The use function documentation recommends "use 5.020_000;" (i.e. with underscore) as the preferred format, for reasons of backwards-compatibility. I also find it far easier to read.]

        "As far as I know, the regex-related constructs that I use (including (?{CODE}), (?&NAME), $^N, $^R, etc) are all supposed to be supported by 5.010 and up."

        Beyond being "supported", are you concerned with features being experimental? perlexperiment may be useful in this regard. It has "(?{code})": experimental in v5.006 (see perl56delta: Experimental features); accepted in v5.020 (compare perlre (v5.018_002) with perlre (v5.020_000)).

        The construct, "(?{code})", has been supported since v5.005_000 (see perl5005delta: Regular Expressions). In v5.018_000, perl5180delta: /(?{})/ and /(??{})/ have been heavily reworked.

        A quick way to gather this type of information, is to first find an @INC path with a pods subdirectory (I only found one: YMMV) and change to it:

        $ perl -E 'say for grep { -e && -d } map { $_ . q{/pods} } @INC'
        /six_dir_path/lib/5.24.0/pods
        $ cd /six_dir_path/lib/5.24.0/pods
        $

        Now search the *.pod files for the construct:

        $ grep -l '(?{' *.pod
        ... 13 delta pods; 14 other pods ...

        Update: Added " [Part 2 of 2]" to the title to differentiate this node, "Re^3: Weirdness (duplicated data) while building result during parsing using regex", from another of the same name, "Re^3: Weirdness (duplicated data) while building result during parsing using regex" (which will have " [Part 1 of 2]" appended).

        — Ken

        I repeated the same download procedure that I used originally and got different data.

        $ for i in md5 sha1; do openssl dgst -$i CSelTest.pm CSelTest.pm-20160902a_ORIGINAL_CODE; done
        MD5(CSelTest.pm)= 2a721b1bdc7e0ba01a9620c2cf61b171
        MD5(CSelTest.pm-20160902a_ORIGINAL_CODE)= 2fc9805a057076e10117e0fc710f6321
        SHA1(CSelTest.pm)= 092f86a1c1c03523c2bb9f459ef353a757dfa8b6
        SHA1(CSelTest.pm-20160902a_ORIGINAL_CODE)= 49bc50c3b30e817be950a9bcda9973baa5b85309

        The new code was four bytes shorter ...

        $ ls -l CSelTest.pm CSelTest.pm-20160902a_ORIGINAL_CODE
        -rw-r--r-- 1 ken staff 3224 3 Sep 00:16 CSelTest.pm
        -rw-r--r-- 1 ken staff 3228 2 Sep 19:14 CSelTest.pm-20160902a_ORIGINAL_CODE

        ... due to replacing "et ai", wherever that came from, with presumably the missing closing parenthesis:

        $ diff CSelTest.pm CSelTest.pm-20160902a_ORIGINAL_CODE
        43c43
        < )?
        ---
        > et ai?

        Anyway, I ran your six tests and got the same results except for the second one: correctly output '"="', instead of '"eq"'.

        $ perl -MCSelTest -MData::Dump -E 'dd CSelTest::parse_csel(q{ [attr=1] })'
        [[{ name => "attr" }], "=", 1]

        Update: Added " [Part 1 of 2]" to the title to differentiate this node, "Re^3: Weirdness (duplicated data) while building result during parsing using regex", from another of the same name, "Re^3: Weirdness (duplicated data) while building result during parsing using regex" (which will have " [Part 2 of 2]" appended).

        — Ken
