http://qs321.pair.com?node_id=11120118

leszekdubiel has asked for the wisdom of the Perl Monks concerning the following question:

I parse a string using regexes with /gc and \G. My current solution for skipping whitespace and comments (from # to the end of the line) is:

$$a =~ /\G(?=(\s|#))(?:\s++|#.*+)++/gc;

The look-ahead assertion (?=...) is necessary because zero-length matches cause problems. Unfortunately, that solution gives this error:

Complex regular subexpression recursion limit (32766) exceeded

I have changed it to this:

while ($$a =~ /\G(?:\s++|#.*+)/gc) {};

but this is not elegant. I also changed "+" to "*" after "\s", and this solves the problem too, but I don't know why...

$$a =~ /\G(?=(\s|#))(?:\s*+|#.*+)++/gc;

Questions: (1) What is a better solution for skipping whitespace and comments? (2) Why does * instead of + solve the problem?
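The while-loop variant above can be hidden behind a small helper so the call site stays tidy. This is only a sketch: the name skip_blank is made up for illustration, and it assumes the argument is a reference to the string being parsed, as with $$a in the snippets above.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the original post).
# Advances pos($$ref) past any run of whitespace and "#" comments.
sub skip_blank {
    my ($ref) = @_;
    # Each successful iteration consumes at least one character, so no
    # zero-length match can occur; /c keeps pos() in place when a match fails.
    1 while $$ref =~ /\G(?:\s++|#.*+)/gc;
    return pos($$ref) // 0;    # pos() is undef if nothing was ever matched
}

my $text = "  # first comment\n\t# second comment\n  token rest";
skip_blank(\$text);
print substr($text, pos($text) // 0, 5), "\n";    # prints "token"
```

Because every iteration restarts the regex, no single quantifier has to repeat tens of thousands of times, which is presumably why this form avoids the recursion-limit message.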

Replies are listed 'Best First'.
Re: Simple way to skip spaces and # comments
by ikegami (Pope) on Jul 31, 2020 at 14:18 UTC

    Treat "#" and the following characters as a single whitespace character.

    /\G (?: \s | \# .* )++ /xgc

    More efficient?

    /\G (?: \s++ | \# .*+ )++ /xgc
      /\G (?: \s++ | \# .*+ )++ /xgc

      ^^^^ This triggers the recursion limit error...

      # for f in `seq 40123`; do echo " #alfa beta"; done | perl -e 'use strict; use warnings; undef $/; my $s = <STDIN>; print length $s, "\n"; $s =~ /\G (?: \s++ | \# .*+ )++ /xgc; print pos $s, "\n"; '
      481476
      Complex regular subexpression recursion limit (32766) exceeded at -e line 1, <STDIN> chunk 1.
      196597

        If certain things are expected to match more than 32766 times, you need to break it down.

        So if the following exceeds the limit,

        a+
        you have to use
        (?:a{1,32766})+
        So,
        /\G (?: \s++ | \# .*+ )++ /xgc
        becomes
        /\G (?: (?: \s++ | \# .*+ ){1,32766}+ )+ /xgc
        Or maybe even
        /\G (?: (?: (?: \s{1,32766}+ )++ | \# (?: .{1,32766}+ )*+ ){1,32766}+ )+ /xgc
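The first chunked pattern above can be tried on input of the same shape as the failing one-liner. This is a rough sketch rather than a benchmark: the trailing "token" and the variable names are assumptions added for illustration, and the exact limit may differ between Perl versions.

```perl
use strict;
use warnings;

# Same input shape as the failing one-liner: 40123 lines of " #alfa beta",
# followed by a made-up "token" so there is something left to parse.
my $lines = 40123;
my $s     = (" #alfa beta\n" x $lines) . "token";

# The inner group is capped at 32766 repetitions; the outer + repeats the
# capped group, so no single quantifier has to iterate past the limit.
my $matched = $s =~ /\G (?: (?: \s++ | \# .*+ ){1,32766}+ )+ /xgc;

print "matched: ", ($matched ? "yes" : "no"), "\n";
print "pos: ", pos($s), " of ", length($s), "\n";
print "next: ", substr($s, pos($s), 5), "\n";
```

If the pattern does what is intended, pos should land right after the last newline, at the start of "token".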
Re: Simple way to skip spaces and # comments
by perlfan (Vicar) on Jul 31, 2020 at 11:28 UTC
    I usually just s/// that kind of thing,
    my $line = 'valid text blah blah # delete me and everything after pound # also this';
    $line =~ s/#.*$//;
    print qq{"$line"\n};
    results in:
    "valid text blah blah "
    I don't have an answer for your #2 question at the end.

      To be more precise: $a is a reference to the string being parsed, and pos $$a shows the current parsing position. From that position I need to skip all whitespace and comments, which may extend over many lines of that string. A positive zero-length match should be avoided.
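The situation described here, a string reference whose pos marks the parse position, might look like the following minimal parse loop. It is a sketch under assumptions: $src stands in for the poster's $a, the sample text is invented, and the \w+ token rule is a placeholder for whatever the real grammar uses.

```perl
use strict;
use warnings;

my $text = "alpha  # trailing comment\n# full-line comment\n  beta\n\n# last\ngamma";
my $src  = \$text;    # plays the role of $a in the post (name is an assumption)

my @tokens;
pos($$src) = 0;
while (pos($$src) < length $$src) {
    # Skip whitespace and comments spanning any number of lines. The group
    # must consume at least one character, so a success is never zero-length;
    # on failure /c leaves pos() where it was.
    $$src =~ /\G(?:\s++|#.*+)++/gc;

    if ($$src =~ /\G(\w+)/gc) {    # placeholder token rule
        push @tokens, $1;
    }
    else {
        last if pos($$src) >= length $$src;    # only blanks were left
        die "unexpected character at offset " . pos($$src) . "\n";
    }
}
print join(",", @tokens), "\n";    # prints "alpha,beta,gamma"
```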

        >Questions: (1) what is the better solution to strip white chars

        Asked and answered. ¯\_(ツ)_/¯