
Re: solution wanted for break-on-spaces (w/specifics)

by kcott (Bishop)
on Oct 24, 2021 at 05:00 UTC (#11137935)

in reply to solution wanted for break-on-spaces (w/specifics)

G'day perl-diddler,

Testing for the number of elements is a weak test; you really need qualitative tests as well. In addition, that would have told us what you expected (and allowed better answers).

Your title has "break-on-spaces" (plural) but all your tests only use single spaces. In my code below, I added an additional test to show that q{This is simple.} and q{     This  is   simple. } both produce the same output. I guessed that is what you would've wanted; if not, you'll need to advise us.

Writing code for purely academic reasons is absolutely fine; I do it myself. Having said that, the regex you presented is unwieldy, difficult to read, and maintenance would, I suspect, be an error-prone nightmare. I've provided an alternative solution below which mostly just uses Perl's string handling functions. When you have a working regex solution, I'd be interested to see a benchmark.

You indicated that you'd encountered problems with lines 4-7, and later amended that to just 6-7. I suspect you may have run into problems with escaping, particularly \\ and \\\\. Take a look at my "ok N" lines 7-10: I've just made a guess at what I thought you wanted.
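The escaping point can be checked directly. The following is a minimal sketch, using only core Perl, of how q{} treats backslashes: a backslash stays literal unless it precedes another backslash or the delimiter, so q{isn\'t} and q{isn\\'t} are the same six-character string. That would explain why "ok 5" and "ok 7" (and likewise "ok 6" and "ok 8") print identical descriptions. The hash and its key names are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Inside q{}, a backslash stays literal unless it precedes another
# backslash or the delimiter, so "\\" collapses to a single backslash.
my %sample = (
    one_backslash    => q{isn\'t},      # i s n \ ' t
    two_backslashes  => q{isn\\'t},     # same six characters as above
    four_backslashes => q{isn\\\\'t},   # i s n \ \ ' t
);

for my $name (sort keys %sample) {
    printf "%-17s length %d: %s\n", $name, length $sample{$name}, $sample{$name};
}

print "q{isn\\'t} and q{isn\\\\'t} are the same string\n"
    if $sample{one_backslash} eq $sample{two_backslashes};
```

If the test data is meant to contain two literal backslashes, it has to be written with four inside q{}, as in tests 9 and 10.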

I've included most of your tests; you can, of course, add the remainder yourself. I didn't see the benefit of tests 8 and 9, and I thought that tests 10-15 potentially had issues with escaped backslashes, so it's perhaps best to wait for clarification from you on that score.

Here's the code:

#!/usr/bin/env perl

use strict;
use warnings;

use Test::More;

my @tests = (
    [q{This is simple.}, [q{This}, q{is}, q{simple.}]],
    [q{     This  is   simple. }, [q{This}, q{is}, q{simple.}]],
    [q{This is "so very simple".}, [q{This}, q{is}, q{"so very simple".}]],
    [q{This "is so" very simple.}, [q{This}, q{"is so"}, q{very}, q{simple.}]],
    [q{This 'isn\'t nice.'}, [q{This}, q{'isn\'t nice.'}]],
    [q{This "isn\"t nice."}, [q{This}, q{"isn\"t nice."}]],
    [q{This 'isn\\'t nice.'}, [q{This}, q{'isn\\'t nice.'}]],
    [q{This "isn\\"t nice."}, [q{This}, q{"isn\\"t nice."}]],
    [q{This 'isn\\\\'t nice.'}, [q{This}, q{'isn\\\\'t}, q{nice.'}]],
    [q{This "isn\\\\"t nice."}, [q{This}, q{"isn\\\\"t}, q{nice."}]],
    [q{This 'is not unnice.'}, [q{This}, q{'is not unnice.'}]],
    [q{This "is not unnice."}, [q{This}, q{"is not unnice."}]],
    [q{a "bb cc" d}, [q{a}, q{"bb cc"}, q{d}]],
);

plan tests => 0+@tests;

for my $test (@tests) {
    my ($raw_str, $exp) = @$test;
    my $str = ($raw_str =~ /^\s*(.*?)\s*$/)[0];
    my $got = [];
    my $str_len = length $str;
    my ($unbroken, $in_quote, $escape, $in_space) = ('', '', 0, 0);
    my $quote_re = qr{(['"])};

    for my $str_index (0 .. $str_len - 1) {
        my $char = substr $str, $str_index, 1;

        if ($escape) {
            $unbroken .= $char;
            $escape = 0;
            next;
        }

        if ($char eq qq{\\}) {
            $escape = 1;
            $unbroken .= $char;
            next;
        }

        if ($char =~ $quote_re) {
            my $quote = $char;
            if ($in_quote) {
                $in_quote = '' if $in_quote eq $quote;
            }
            else {
                $in_quote = $quote;
            }
            $unbroken .= $char;
            next;
        }

        if ($char eq ' ') {
            next if $in_space;
            if ($in_quote) {
                $unbroken .= $char;
            }
            else {
                $in_space = 1;
            }
        }
        else {
            $unbroken .= $char;
            $in_space = 0;
            next;
        }

        if ($in_space) {
            push @$got, $unbroken;
            $unbroken = '';
        }
    }

    push @$got, $unbroken;

    is_deeply($got, $exp, qq{<$raw_str>: } . join('|', @$exp));
}

Here's the output:

$ ./
1..13
ok 1 - <This is simple.>: This|is|simple.
ok 2 - <     This  is   simple. >: This|is|simple.
ok 3 - <This is "so very simple".>: This|is|"so very simple".
ok 4 - <This "is so" very simple.>: This|"is so"|very|simple.
ok 5 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
ok 6 - <This "isn\"t nice.">: This|"isn\"t nice."
ok 7 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
ok 8 - <This "isn\"t nice.">: This|"isn\"t nice."
ok 9 - <This 'isn\\'t nice.'>: This|'isn\\'t|nice.'
ok 10 - <This "isn\\"t nice.">: This|"isn\\"t|nice."
ok 11 - <This 'is not unnice.'>: This|'is not unnice.'
ok 12 - <This "is not unnice.">: This|"is not unnice."
ok 13 - <a "bb cc" d>: a|"bb cc"|d

— Ken

Replies are listed 'Best First'.
Re^2: solution wanted for break-on-spaces (w/specifics)
by LanX (Sage) on Oct 24, 2021 at 09:44 UTC
    > Testing for the number of elements is a weak test; you really need qualitative tests as well.

    > I've included most of your tests;

    I think the best way to test this is to create these strings by joining @expected arrays.

    By generating these arrays one can make sure to cover all edge cases.

    As a by-product you'll define a formal grammar, answering questions like:

    • how are unpaired quotes to be handled?
    • what about multiple whitespaces in a row?
    • what about multi-line input?
    • what about whitespace at start and end of string?

    It would also help to test sub-regexes individually.

    Crafting the strings by hand is error-prone, because there are far too many cases to handle.
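A minimal sketch of generating test inputs from expected arrays might look like this. It assumes the only variation of interest is the separator between tokens and the padding around the string; make_cases and its parameters are hypothetical names, not part of anything posted in this thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: build [input, expected] pairs from one token list by
# joining the tokens with each separator and wrapping the result in each
# padding string. The expected list is copied so cases don't share a reference.
sub make_cases {
    my ($tokens, $separators, $paddings) = @_;
    my @cases;
    for my $sep (@$separators) {
        for my $pad (@$paddings) {
            my $input = $pad . join($sep, @$tokens) . $pad;
            push @cases, [$input, [@$tokens]];
        }
    }
    return @cases;
}

my @expected = (
    [q{This}, q{is}, q{simple.}],
    [q{This}, q{"is so"}, q{very}, q{simple.}],
    [q{a}, q{"bb cc"}, q{d}],
);

# One space or three spaces between tokens; no padding or two-space padding.
my @cases = map { make_cases($_, [q{ }, q{   }], [q{}, q{  }]) } @expected;

for my $case (@cases) {
    my ($input, $exp) = @$case;
    printf "<%s> => %s\n", $input, join('|', @$exp);
}
```

Each generated pair could then be fed to is_deeply in the same way as the hand-written tests. Further dimensions (tabs, newlines, unpaired quotes) would be added as extra separator or token lists once their intended handling is decided.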

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Why do I need qualitative tests? I just wanted to know if the REs broke the line into the expected number of sections. The original test strings were read from a data file, which held several pared-down representations of what one might find as attr-value fields after an initial XML or HTML element.

      How are unpaired quotes handled? That's really a bit undefined, but I thought terminating them at the end of the "string" would be most forgiving. For multi-whitespace, I would assume shell semantics. As for multi-line input: in some larger, more general case, LF and CR are both types of whitespace, but I didn't want to clutter my question and test cases. As for whitespace prefixes and suffixes: in both cases, there is no "non-whitespace" before or after (respectively) those, so they make no difference in the final answer.
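The semantics described above could be captured in a single regex along the following lines. This is only a sketch under those stated assumptions, not the regex from the original post: $field_re and split_fields are hypothetical names, only ASCII quotes are handled, quotes and backslashes are kept in the fields, and \s means LF and CR also act as separators.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sketch: a field is a run of escaped characters, quoted
# sections, and other non-space characters. An unpaired quote runs to
# the end of the string.
my $field_re = qr{
    (?:
        \\ .?                                    # backslash plus the next character
      | " (?: [^"\\] | \\ .? )* (?: " | \z )     # double-quoted section
      | ' (?: [^'\\] | \\ .? )* (?: ' | \z )     # single-quoted section
      | [^\s"'\\]                                # any other non-space character
    )+
}xs;

sub split_fields {
    my ($line) = @_;
    my @fields = $line =~ /($field_re)/g;
    return @fields;
}

for my $line (
    q{  This  is   simple. },
    q{This "is so" very simple.},
    q{say "no closing quote},
) {
    printf "<%s> => %s\n", $line, join('|', split_fields($line));
}
```

Leading and trailing whitespace never match any alternative, so they drop out without a separate trimming step, and runs of whitespace between fields are skipped in the same way.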

      As I tried to stress, the program wasn't really important; it was just something I threw together over a few hours that grew by "whim", to test the regexes against the input lines in the test-data.txt file. It wasn't meant as a formal test harness.

        > Why do I need qualitative tests?

        Because it's far from simple. Even defining the edge cases isn't trivial.

        Cheers Rolf
