PerlMonks  

Re: solution wanted for break-on-spaces (w/specifics)

by tybalt89 (Monsignor)
on Oct 24, 2021 at 18:44 UTC ( [id://11137981]=note )


in reply to solution wanted for break-on-spaces (w/specifics)

#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11137926
use warnings;
use Data::Dump 'dd';

my @tests =
  (
  # q{all '- and "-quotes properly balanced},
  [ q{This is simple.},                [ q{This}, q{is}, q{simple.} ] ],
  [ q{ This is simple. },              [ q{This}, q{is}, q{simple.} ] ],
  [ q{This is "so very simple".},      [ q{This}, q{is}, q{"so very simple".} ] ],
  [ q{This "is so" very simple.},      [ q{This}, q{"is so"}, q{very}, q{simple.} ] ],
  [ q{This 'isn\'t nice.'},            [ q{This}, q{'isn\'t nice.'} ] ],
  [ q{This "isn\"t nice."},            [ q{This}, q{"isn\"t nice."} ] ],
  [ q{This 'isn\\\\'t nice.'},         [ q{This}, q{'isn\\\\'t}, q{nice.'} ] ],
  [ q{This "isn\\\\"t nice."},         [ q{This}, q{"isn\\\\"t}, q{nice."} ] ],
  [ q{This 'is not unnice.'},          [ q{This}, q{'is not unnice.'} ] ],
  [ q{This "is not unnice."},          [ q{This}, q{"is not unnice."} ] ],
  [ q{a "bb cc" d},                    [ q{a}, q{"bb cc"}, q{d} ] ],
  # q{UNbalanced '- and "-quotes at absolute end of string},
  [ q{This is "so very simple},        [ q{This}, q{is}, q{"so very simple} ] ],
  [ q{This 'isn\'t nice.},             [ q{This}, q{'isn\'t nice.} ] ],
  [ q{This "isn\"t nice.},             [ q{This}, q{"isn\"t nice.} ] ],
  [ q{This 'isn\\\\'t nice.},          [ q{This}, q{'isn\\\\'t}, q{nice.} ] ],
  [ q{This "isn\\\\"t nice.},          [ q{This}, q{"isn\\\\"t}, q{nice.} ] ],
  [ q{This 'is not unnice.},           [ q{This}, q{'is not unnice.} ] ],
  [ q{This "is not unnice.},           [ q{This}, q{"is not unnice.} ] ],
  # 'what about these questionable cases?',
  [ q{is this"really so"simple now?},  [ q{is}, q{this"really so"simple}, q{now?} ] ],
  [ q{is this"really so" now?},        [ q{is}, q{this"really so"}, q{now?} ] ],
  [ q{is "really so"simple now?},      [ q{is}, q{"really so"simple}, q{now?} ] ],
  [ q{is this'really so'simple now?},  [ q{is}, q{this'really so'simple}, q{now?} ] ],
  [ q{is this'really so' now?},        [ q{is}, q{this'really so'}, q{now?} ] ],
  [ q{is 'really so'simple now?},      [ q{is}, q{'really so'simple}, q{now?} ] ],
  [ q{is really\\ so\\ simple now?},   [ q{is}, q{really\\ so\\ simple}, q{now?} ] ],
  );

my $regex = qr/(?:
  '(?: \\. | [^'\\] )*'   # single quoted string
  |
  "(?: \\. | [^"\\] )*"   # double quoted string
  |
  ['"].*                  # unmatched quote
  |
  \\.                     # escaped character
  |
  \S                      # single non-space character
  )+/x;

my $passcount = 0;
for ( @tests )
  {
  my ( $string, $want ) = @$_;
  my @out = $string =~ /$regex/g;
  local $" = "\0"x5; # just some array element boundary separator
  "@$want" eq "@out" ? $passcount++ :
    dd "$string => FAILED got", \@out, ' wanted ', $want;
  }
print "$passcount of @{[scalar @tests]} passed\n";

Outputs:

25 of 25 passed

Replies are listed 'Best First'.
Re^2: solution wanted for break-on-spaces (w/specifics) (?>...)
by LanX (Saint) on Oct 24, 2021 at 23:03 UTC
      BTW, on the no-backtracking -- that was a later addition, one of about 10-15 alterations to the statement that I tried over time.
Re^2: solution wanted for break-on-spaces (w/specifics)
by perl-diddler (Chaplain) on Oct 26, 2021 at 16:25 UTC
    Your regex was perfect. FWIW, I put it in my original prog (with some bugs in the prog fixed) as the 2nd regex in the regex array. The reason I had the regexes and their outputs in arrays was to compare several REs, but I ended up with just the one, as it passed the most cases. So here are the lines for cases 3 and 4 (with 4+5 being the two that didn't pass with the regex I originally posted):
    ResByLn:{ln=>3, wanted=>4, got=>[4, 4]},[" p ", " p "]
    ResByLn:{ln=>4, wanted=>2, got=>[3, 2]},["FAIL:<4>", " p "]
    The 'got' values are the counts I got from the regexes, with your RE in the 2nd position. The last brackets contain the pass/fail for each regex against that statement. So yours was 'p' straight down the 2nd column. Thanks. I had spaces in the earlier revisions of the REs, but I wasn't sure I had the 'x' flag applied to the sub-REs that needed it.

    I guess each outer layer of the RE's flags get propagated to inner RE's.
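    On the flag question, a small sketch may help; it is not from either poster's code, and the variable names are made up for illustration. As far as I understand it, groups written inside a single pattern all share that pattern's flags, but a separately compiled qr// object keeps the flags it was compiled with when it is interpolated, while a plain string takes on the flags of the pattern it lands in:

```perl
use strict;
use warnings;

# Hypothetical sub-patterns, both meant to be read under the x flag.
my $inner_qr  = qr/a b/;   # compiled WITHOUT x, so the space is literal
my $inner_str = 'a b';     # plain string, carries no flags of its own

# Interpolate each one into an outer pattern that does have x.
my $qr_result  = 'ab' =~ /$inner_qr/x  ? 'match' : 'no match';
my $str_result = 'ab' =~ /$inner_str/x ? 'match' : 'no match';

print "qr inner under outer x:     $qr_result\n";   # no match
print "string inner under outer x: $str_result\n";  # match
```

    So if the sub-REs are built with qr//, each one that contains layout whitespace would need its own x flag.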

    I'm not sure if you were asking a question about your third group above, where you wrote: " 'what about these questionable cases?',". I'm not sure what is questionable about them. In my use case, neither 'q{}' nor '?' has special meaning. Only the quotes and backslash are meta chars. So in the first line, I see 3 fields in both the input and the expected output:

    [ q{is this"really so"simple now?}, [ q{is}, q{this"really so"simple}, q{now?} ] ],
          ^                     ^               ^                         ^
    Both expressions had 2 breaks -- yielding 3 parts in each. Does that make sense?

    One rule I forgot to list (though at least your example handled it as expected) was what to do with overlapping quotes: a quote of a different type should not have 'meta' properties inside a quoted section. I.e.:

    this "is a' test" of weird' stuff
    I may be wrong, but I don't think most here would split that into 3 parts, as most of us are used to the meta-properties of characters being disabled or modified within quotes, so the single quote above wouldn't start a quoted sub-expression overlapping with the double-quoted part. That would effectively make "is a' test" of weird' all 1 "word", as all the spaces are between quotes of some type. While that would be "a" way of interpreting overlapping quoted sections, I don't know how expected or useful it would be. I need to study your example and some others, but wanted to make some response. It's just that about 3-4 other things cropped up and needed attention just after I posted this...
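    For reference, here is a minimal sketch that runs the regex posted at the top of this thread against the mixed-quote string above. The regex alternatives are copied from that post; the variable names $input and @tokens are only for illustration. The single quote inside the double-quoted section stays inert, and the later lone single quote falls under the "unmatched quote" alternative, which takes the rest of the string:

```perl
use strict;
use warnings;

# Same alternatives as the regex in the parent post, repeated so this runs alone.
my $regex = qr/(?:
    '(?: \\. | [^'\\] )*'   # single quoted string
  | "(?: \\. | [^"\\] )*"   # double quoted string
  | ['"].*                  # unmatched quote, consumes the rest of the string
  | \\.                     # escaped character
  | \S                      # single non-space character
  )+/x;

my $input  = q{this "is a' test" of weird' stuff};
my @tokens = $input =~ /$regex/g;   # list context: every match, no captures

print "[$_]\n" for @tokens;
# [this]
# ["is a' test"]
# [of]
# [weird' stuff]
```

    Whether the last field should instead break at the space (giving weird' and stuff) depends on how an unbalanced quote in mid-string is meant to be treated; the test cases in the parent post only cover unbalanced quotes that run to the end of the string.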
