Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re^4: strange behavior of JSON parsing guru regex

by seki (Monk)
on Apr 22, 2020 at 09:58 UTC ( [id://11115885]=note: print w/replies, xml ) Need Help??


in reply to Re^3: strange behavior of JSON parsing guru regex
in thread strange behavior of JSON parsing guru regex

What perl version?
As stated, my current working in on 5.28.0 (mac), but I was previously with 5.22.0 (mac/linux), while my needs is to work at least starting with 5.16.3. I am using plenv to switch easily.

Mind your warnings?
I never seen such a warning previously, and not currently. Seems to be related to variables in closures? I will try to understand that point, but my experience shows the problem is not related to this.

Use more subroutines?
Yes, that seems to have an impact on the result: I understand intuitively that when reducing the amount of code inlined in the regex with (?{ ... }), it moves forward the bound of the segfault. Tested by removing all tracing code, moving the most of actions into subs: the parsed that segfaulted with the 10 000 values can now parse it, but it fails now with a 16500 values. (The input file is always a single-line json without whitespaces with an array of bjects. Each object has 60 attributes of string type or null). I am thinking on a way to anonymize values in order to publish test data.
Note that I tried the original parser by Randall Schwartz : it segfaults also on mac with the 16 500 values file.
I wonder if there could have some "hidden trick" that could help to test by increasing the amount of memory allowed to the regex engine, but I don't know the guts parts.

Use more "Possessive quantifiers"?
Why not, while hacking in Perl for some time now, I never used them. I seem to remind it allows to reduce backtracking, right?

Here is the clean version of the parser, closest to the Randall Schwartz version, only using helper subs instead of the uncomprehensible $^R stack hack:

use strict; use warnings; use feature 'say'; use utf8; use open ':std', ':encoding(UTF-8)'; use Data::Dumper; $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 0; # compact dump my $data = $ARGV[0] || '{"id":42}'; if (-f $data){ open my $f, '<', $data or die "cannot open $data: $@"; $data = do { local $/; <$f> }; close $f; } my $o = from_json($data); $Data::Dumper::Indent = 2; # fancy dump say Dumper $o; sub TRACE_JSON {0} sub eval_json_string { my $s = shift; $s =~ s/\\u([0-9A-Fa-f]{4})/\\x{$1}/g; $s =~ s/@/\\@/g; return eval $s; } my @eval_stack; sub dump_stack { say "stack is ",scalar(@eval_stack),' =>' , Dumper(\ +@eval_stack) } sub push_val { push @eval_stack, shift; } sub peek_val { my $idx = shift || -1; return $eval_stack[ $idx ]; } sub pop_val { return pop @eval_stack; } sub add_obj_val { my ($k,$v) = @_; $eval_stack[-1]->{$k} = $v; } sub add_arr_val { my $v = shift; push @{$eval_stack[-1]}, $v; } # Return a Perl structure corresponding to a json string sub from_json { my $i = '&#8901;'; # indent char my $l = 0; # indent level say "Initial stack is ", Dumper(\@eval_stack) if TRACE_JSON; my $rx = qr{ # NOTES: # this regex is a recusrive descent parser - see https://www.perlm +onks.org/?node_id=995856 # and chapter 1 "Recursive regular expressions" of Mastering Perl +(Brian d Foy) # # Inside the block (?(DEFINE) ...) (?<FOOBAR> ...) defines a name +d pattern FOOBAR # that can be called with (?&FOO +BAR) # (?{ ... }) is a block of Perl code that is evaluated at the time + we reach it while running the pattern # $^R is the value returned by the LAST runned (?{ }) block, so it + is overriden at each (?{ ... }) # if you want to run random code, remember to add $^R as last +statement to always keep the value # $^N is the last matched (non-anonymous) group (?&VALUE) (?{ $_ = pop_val() }) # <== entry point of the parser (?(DEFINE) # this does not try to match, it only defines a serie o +f named patterns (?<VALUE> \s*+ ( (?&STRING) | (?&NUMBER) | (?&OBJECT) | (?&ARRAY) | true (?{ push_val(1) }) | false (?{ push_val(0) }) | null (?{ push_val(undef) }) ) \s*+ ) (?<OBJECT> # will generate a Perl hash \{ # start of object (?{ push_val({}); }) # init structure \s*+ (?: (?&KV) # first pair (?{ my $v = pop_val(); my $k = pop_val(); add_obj_val($k, $ +v); }) (?: # additional pairs \s* , \s* (?&KV) (?{ my $v = pop_val(); my $k = pop_val(); add_obj_val($k, + $v); }) )* # additional pairs are optional )? # object may be empty \s*+ \} ) (?<KV> # tuple <key, value> (?&STRING) \s*+ : \s*+ (?&VALUE) (?{ }) ) (?<ARRAY> # will generate a Perl array \[ (?{ push_val([]); }) # init structure \s*+ (?: # first element (?&VALUE) (?{ my $v = pop_val(); add_arr_val( $v ) }) (?: # additional elements \s*+ , \s*+ (?&VALUE) (?{ my $v = pop_val(); add_arr_val( $v ) }) )* # additional elements are optional )? # array may be empty \s*+ \] # end of array ) (?<STRING> ( " (?: [^\\"]+ | \\ ["\\/bfnrt] | \\ u [0-9a-fA-f]{4} )*+ " ) (?{ my $v = eval_json_string($^N); push_val($v); }) ) (?<NUMBER> ( -? (?: 0 | [1-9]\d*+ ) (?: \. \d+ )? (?: [eE] [-+]? \d+ )? ) (?{ my $v = eval $^N; push_val($v); }) ) ) #End of DEFINE }xms; my $struct; { local $_ = shift; local $^R; eval { m{\A$rx\z}; } and $struct = $_; say "eval error: $@" if $@; } return $struct; }

The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11115885]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (10)
As of 2024-04-23 08:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found