When a regexp with /g needs to be run .. twice

by talexb (Canon)
on Mar 02, 2018 at 21:13 UTC ( #1210258=perlquestion )

talexb has asked for the wisdom of the Perl Monks concerning the following question:

This is interesting (to me). I'm cleaning up some JSON so that it's palatable to the API I'm using, and I'm discovering some unexpected behaviour.

$foo = ':{"'
$foo =~ s/([:,{])(.)/$1 $2/g;
print "'$foo'\n";

I'm expecting this regexp to insert spaces after all occurrences of left square bracket, colon and left brace. The result:
': {"'
If I run the regexp again, I finally get the result I'm expecting:
': { "'
That's counter-intuitive to me .. if I put 'g' at the end of a regexp, I expect it to run that expression repeatedly (flashback to COBOL .. REPEAT UNTIL DONE).

Thoughts?

Alex / talexb / Toronto

Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Replies are listed 'Best First'.
Re: When a regexp with /g needs to be run .. twice
by haukex (Bishop) on Mar 02, 2018 at 21:26 UTC
    $foo = ':{"'
    $foo =~ s/([:,{])(.)/$1 $2/g;

    The first match of the regex matches :{, and then the regex engine continues looking after that match, but the only thing there is ", which does not match your regex, which expects two characters.

    This works (although not yet tested on a lot of different inputs): s/(?<=[:,{])(?=.)/ /g (using Zero-Width Lookaround Assertions)
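The difference between the consuming pattern and the lookaround pattern can be seen side by side. This is a minimal sketch using the sample string from the question; the variable names `$consuming` and `$lookaround` are only illustrative.

```perl
use strict;
use warnings;

# Original pattern: (.) consumes the character after the delimiter,
# so a delimiter sitting directly after another one is never tested.
my $consuming = ':{"';
$consuming =~ s/([:,{])(.)/$1 $2/g;

# Lookaround pattern: the match is zero-width, so nothing is consumed
# and every delimiter that has a following character gets a space.
my $lookaround = ':{"';
$lookaround =~ s/(?<=[:,{])(?=.)/ /g;

print "'$consuming'\n";    # ': {"'
print "'$lookaround'\n";   # ': { "'
```

As the reply says, this has not been tried on a wide range of inputs; it only covers the case shown.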

    occurrences of left square bracket, colon and left brace

    Left square brackets, or commas? You've got [:,{]

    Of course, if you're cleaning up JSON, it might be better to just round trip it through a module that can pretty-print it?
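A round trip of that kind might look like the sketch below. It assumes the input is already valid JSON and uses the core module JSON::PP; the sample document in `$compact` is made up for illustration and is not the poster's data.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical compact input; the real data in the thread is not shown.
my $compact = '{"name":"talexb","tags":["perl","regex"],"count":2}';

# Parse into a Perl data structure ...
my $data = decode_json($compact);

# ... and write it back out with whitespace added by the encoder.
# canonical() sorts hash keys so the output order is stable.
my $pretty = JSON::PP->new->canonical->pretty->encode($data);

print $pretty;
```

Whether the whitespace that `pretty` produces is what the target API wants is a separate question; the encoder's `indent`, `space_before` and `space_after` options can be switched on individually if not.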

Re: When a regexp with /g needs to be run .. twice
by Laurent_R (Canon) on Mar 02, 2018 at 22:00 UTC
    The g modifier does not tell the regex engine to start matching again from the beginning, but to continue matching from where you've arrived.

    The following regex:

    $foo =~ s/([:,{])(.)/$1 $2/g;
    is matching two characters, i.e. :{, so that the regex engine will continue on the next character, i.e. ", which cannot be matched by the regex. So, in short, your regex is not applied a second time on your input string. It has gotten too far already in your input string.

    If you want the regex engine to repeatedly match each of the input characters, please remove the (.) part:

    $foo =~ s/([:,{])/$1 /g;
    For example (demonstrated under the Perl debugger):
    DB<1> $foo = ':{"';
    DB<2> $foo =~ s/([:,{])/$1 /g;
    DB<3> print $foo
    : { "
    Update: this is essentially the same solution as toolic's proposal; I did not notice until I reviewed the thread after posting my response.
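The "continue matching from where you've arrived" behaviour can be observed with `pos()`. The sketch below runs the two-character pattern as a plain match in scalar context and records where the engine would resume; the array name `@matches` is only illustrative.

```perl
use strict;
use warnings;

my $foo = ':{"';
my @matches;

# In scalar context, m//g resumes at pos($foo), the end of the previous match.
while ($foo =~ /([:,{])(.)/g) {
    push @matches, "matched '$1$2', next search starts at offset " . pos($foo);
}

print "$_\n" for @matches;
# matched ':{', next search starts at offset 2
```

Only one match is reported: at offset 2 the only remaining character is the double quote, which is not in the character class, so the brace that was consumed as `$2` is never considered as a delimiter.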
Re: When a regexp with /g needs to be run .. twice
by Athanasius (Archbishop) on Mar 03, 2018 at 09:25 UTC

    Hello talexb,

    From the end of perlop#Regexp-Quote-Like-Operators:

    Occasionally, you can't use just a /g to get all the changes to occur that you might want. Here are two common cases:

    # put commas in the right places in an integer
    1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/g;

    # expand tabs to 8-column spacing
    1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;

    (Note that the /g modifier in the first example isn’t actually needed.) You can adapt this approach to your requirements:

    my $foo = ':{,"';
    1 while $foo =~ s/([:,{])(?! )(.)/$1 $2/;
    print ">$foo<\n";

    Output:

    19:22 >perl 1872_SoPW.pl
    >: { , "<

    19:22 >

    Hope that’s of interest,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,
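The `1 while s///` idiom quoted from perlop re-scans the string from the start after every successful substitution, which is why it can handle changes that depend on earlier changes. Here is a small sketch of the comma example without `/g`, counting the passes; the counter `$passes` is added purely for illustration.

```perl
use strict;
use warnings;

my $n      = '1234567';
my $passes = 0;

# Each successful substitution inserts one comma; the loop then starts
# over from the beginning of the modified string until nothing matches.
$passes++ while $n =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;

print "$n after $passes passes\n";   # 1,234,567 after 2 passes
```

The first pass can only place the rightmost comma, because every earlier group of three digits is still followed by a digit. The second pass succeeds further left because the comma inserted by the first pass now satisfies `(?!\d)`.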

Re: When a regexp with /g needs to be run .. twice
by toolic (Bishop) on Mar 02, 2018 at 21:28 UTC
    Your regex captures the colon into $1 and the left curly brace into $2 the first time through. The g modifier makes it attempt to capture 2 more characters AFTER the left curly brace, but there is only one character remaining (double quote), so the match fails. Just capture one matching character:
    $foo =~ s/([:,{])/$1 /g;
Re: When a regexp with /g needs to be run .. twice
by tybalt89 (Prior) on Mar 02, 2018 at 21:26 UTC

    missing ';' on first line.

      Ugh .. thanks. Copy and pasted from the debugger.

      Alex / talexb / Toronto

Re: When a regexp with /g needs to be run .. twice (use re 'debug';)
by Anonymous Monk on Mar 02, 2018 at 23:21 UTC
    See re and use re 'debug'; it works best with short programs and regexes like this one.

    To step through a match, use rxrx.
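A minimal sketch of the `use re 'debug'` suggestion, applied to the substitution from the question. The pragma is lexically scoped, so it is confined to a bare block here; the trace it writes to STDERR is long and is not reproduced.

```perl
use strict;
use warnings;

my $foo = ':{"';

{
    # Regexes compiled inside this block print their compilation and
    # matching steps to STDERR.
    use re 'debug';
    $foo =~ s/([:,{])(.)/$1 $2/g;
}

print "'$foo'\n";   # ': {"'
```

Redirecting STDERR to a file makes the trace easier to read alongside the normal output.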

Re: When a regexp with /g needs to be run .. twice
by QM (Parson) on Mar 05, 2018 at 10:07 UTC
    $foo = ':{"';
    $foo =~ s/([:,{])(.)/$1 $2/g;
    print "'$foo'\n";

    If you just want a space after each of those, but not if it's the last char in a string:

    1 while $foo =~ s/([:,{])([^ ])/$1 $2/g; # [^ ] is "not space"

    If you want it after the last char in a string, you'll need something like:

    1 while $foo =~ s/([:,{])([^ ]|$)/$1 $2/g;

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      But no while-loop at all is needed (and no captures) if lookarounds are used. (I'm substituting  '.' rather than a space in the examples below for greater clarity — I hope.)

      Inserting something after each character in the class including when it's at the end of the string:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my @foos = (':X,{X', ':X,{');
       ;;
       for my $foo (@foos) {
         printf qq{'$foo' -> };
         $foo =~ s( (?<= [:,{]) ){.}xmsg;
         print qq{'$foo'};
         }
      "
      ':X,{X' -> ':.X,.{.X'
      ':X,{' -> ':.X,.{.'

      Inserting something after each character in the class except when it's at the end of the string:
      c:\@Work\Perl\monks>perl -wMstrict -le
      "my @foos = (':X,{X', ':X,{');
       ;;
       for my $foo (@foos) {
         printf qq{'$foo' -> };
         $foo =~ s( (?<= [:,{]) (?= .) ){.}xmsg;
         print qq{'$foo'};
         }
      "
      ':X,{X' -> ':.X,.{.X'
      ':X,{' -> ':.X,.{'

      (These examples work with Perl 5.8. Note that with the Perl 5.10+ regex  \K operator, the  (?<= [:,{]) terms in both examples can simplify to  [:,{] \K instead.)

      Update: Note also that in all cases  (?! \z) can perhaps better express the "except at the end of the string" requirement than  (?= .) — and may even be slightly faster!
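The `\K` and `(?! \z)` variants mentioned above can be combined into one substitution. This is a sketch only, inserting a real space as in the original question; the helper name `space_after_delims` is made up for the example, and `\K` needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;   # \K is available from Perl 5.10

# Hypothetical helper: insert a space after ':', ',' or '{' unless the
# delimiter is the last character of the string.
sub space_after_delims {
    my ($text) = @_;
    # \K keeps the delimiter out of the replaced part; (?!\z) rejects end of string.
    $text =~ s/[:,{]\K(?!\z)/ /g;
    return $text;
}

for my $original (':{"', ':X,{') {
    printf "'%s' -> '%s'\n", $original, space_after_delims($original);
}
# ':{"' -> ': { "'
# ':X,{' -> ': X, {'
```

Dropping the `(?!\z)` term gives the other behaviour described in the reply, where a space is also added after a delimiter at the very end of the string.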


      Give a man a fish:  <%-{-{-{-<

Re: When a regexp with /g needs to be run .. twice
by Anonymous Monk on Mar 05, 2018 at 03:43 UTC
    If you are "cleaning up JSON," can you maybe use an existing Perl JSON implementation to read the existing data successfully? If the library is forgiving about what it sees and interprets it correctly, you would have the correct data in memory. You could then use the same library to write out a new JSON file that does conform to the standard. If it works, this approach might be a great deal easier than the "cleaning up" you are now attempting. Worth a try ...
