G'day perl-diddler,
Testing for the number of elements is a weak test; you really need qualitative tests as well.
In addition, that would have told us what you expected (and allowed better answers).
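As a minimal sketch of why a count-only test is weak: the two arrays below are made-up data for illustration (they are not taken from your tests), and `eq_array` is just one way to compare contents.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More tests => 2;

# Hypothetical results for the input: This "is so" very
my @expected = (q{This}, q{"is so"}, q{very});
my @wrong    = (q{This}, q{"is},     q{so" very});

# A count-only check cannot tell the two results apart ...
is(scalar @wrong, scalar @expected,
    'count-only check passes even for a wrong split');

# ... whereas comparing the contents does.
ok(!eq_array(\@wrong, \@expected),
    'content comparison detects the wrong split');
```

Both results have three elements, so only the second check says anything about whether the split happened in the right places.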
Your title has "break-on-spaces" (plural) but all your tests only use single spaces.
In my code below, I added an additional test to show that q{This is simple.} and q{ This is simple. }
both produce the same output.
I guessed that is what you would've wanted; if not, you'll need to advise us.
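For comparison only, and assuming the input contains no quotes: Perl's built-in `split ' '` (a single-space string, not a regex) already ignores leading whitespace and treats runs of whitespace as one separator. It knows nothing about quoting, so this is a sketch of the whitespace behaviour I guessed at, not a solution to your problem.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# split with the string ' ' is the special "awk mode":
# leading whitespace is skipped and runs of whitespace count as one separator.
my @plain  = split ' ', q{This is simple.};
my @padded = split ' ', q{  This   is simple.  };

print join('|', @plain),  "\n";
print join('|', @padded), "\n";
```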
Writing code for purely academic reasons is absolutely fine; I do it myself.
Having said that, the regex you presented is unwieldy, difficult to read,
and maintenance would, I suspect, be an error-prone nightmare.
I've provided an alternative solution below which mostly just uses Perl's string handling functions.
When you have a working regex solution, I'd be interested to see a benchmark.
You indicated that you'd encountered problems with lines 4-7; and later amended that to just 6-7.
I suspect you may have run into problems with escaping, particularly \\ and \\\\.
Take a look at the "ok 7" to "ok 10" lines in my output: I've just made a guess at what I thought you wanted.
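A small sketch of the escaping rule I suspect is involved: inside `q{}` (as inside single quotes), `\\` collapses to one backslash, while `\'` and `\"` are kept exactly as written. The variable names are only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $one  = q{isn\'t};      # backslash kept: 6 characters
my $two  = q{isn\\'t};     # \\ becomes a single backslash: same 6 characters
my $four = q{isn\\\\'t};   # \\\\ becomes two backslashes: 7 characters

printf "%-8s length %d\n", $_, length $_ for $one, $two, $four;
print "first two are identical\n" if $one eq $two;
```

This is why test strings written with `\'` and with `\\'` end up as the same input, and why their lines in the test output look the same.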
I've included most of your tests; you can, of course, add the remainder yourself.
I didn't see the benefit of tests 8 and 9;
and I thought that tests 10-15 potentially had issues with escaped backslashes
so it's perhaps best to wait for clarification from you on that score.
Here's the code:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my @tests = (
    [q{This is simple.}, [q{This}, q{is}, q{simple.}]],
    [q{ This is simple. }, [q{This}, q{is}, q{simple.}]],
    [q{This is "so very simple".}, [q{This}, q{is}, q{"so very simple".}]],
    [q{This "is so" very simple.}, [q{This}, q{"is so"}, q{very}, q{simple.}]],
    [q{This 'isn\'t nice.'}, [q{This}, q{'isn\'t nice.'}]],
    [q{This "isn\"t nice."}, [q{This}, q{"isn\"t nice."}]],
    [q{This 'isn\\'t nice.'}, [q{This}, q{'isn\\'t nice.'}]],
    [q{This "isn\\"t nice."}, [q{This}, q{"isn\\"t nice."}]],
    [q{This 'isn\\\\'t nice.'}, [q{This}, q{'isn\\\\'t}, q{nice.'}]],
    [q{This "isn\\\\"t nice."}, [q{This}, q{"isn\\\\"t}, q{nice."}]],
    [q{This 'is not unnice.'}, [q{This}, q{'is not unnice.'}]],
    [q{This "is not unnice."}, [q{This}, q{"is not unnice."}]],
    [q{a "bb cc" d}, [q{a}, q{"bb cc"}, q{d}]],
);
plan tests => 0+@tests;
for my $test (@tests) {
    my ($raw_str, $exp) = @$test;
    # Trim leading and trailing whitespace before parsing.
    my $str = ($raw_str =~ /^\s*(.*?)\s*$/)[0];
    my $got = [];
    my $str_len = length $str;
    my ($unbroken, $in_quote, $escape, $in_space)
        = ('', '', 0, 0);
    my $quote_re = qr{(['"])};
    for my $str_index (0 .. $str_len - 1) {
        my $char = substr $str, $str_index, 1;
        # The character after a backslash is always taken literally.
        if ($escape) {
            $unbroken .= $char;
            $escape = 0;
            next;
        }
        if ($char eq qq{\\}) {
            $escape = 1;
            $unbroken .= $char;
            next;
        }
        # A quote opens a quoted run, or closes one opened by the same quote.
        if ($char =~ $quote_re) {
            my $quote = $char;
            if ($in_quote) {
                $in_quote = '' if $in_quote eq $quote;
            }
            else {
                $in_quote = $quote;
            }
            $unbroken .= $char;
            next;
        }
        # Spaces inside quotes are kept; outside quotes they end the token.
        if ($char eq ' ') {
            next if $in_space;
            if ($in_quote) {
                $unbroken .= $char;
            }
            else {
                $in_space = 1;
            }
        }
        else {
            $unbroken .= $char;
            $in_space = 0;
            next;
        }
        if ($in_space) {
            push @$got, $unbroken;
            $unbroken = '';
        }
    }
    # Whatever is left after the last separator is the final token.
    push @$got, $unbroken;
    is_deeply($got, $exp, qq{<$raw_str>: } . join('|', @$exp));
}
Here's the output:
$ ./pm_11137926_str_parse.pl
1..13
ok 1 - <This is simple.>: This|is|simple.
ok 2 - < This is simple. >: This|is|simple.
ok 3 - <This is "so very simple".>: This|is|"so very simple".
ok 4 - <This "is so" very simple.>: This|"is so"|very|simple.
ok 5 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
ok 6 - <This "isn\"t nice.">: This|"isn\"t nice."
ok 7 - <This 'isn\'t nice.'>: This|'isn\'t nice.'
ok 8 - <This "isn\"t nice.">: This|"isn\"t nice."
ok 9 - <This 'isn\\'t nice.'>: This|'isn\\'t|nice.'
ok 10 - <This "isn\\"t nice.">: This|"isn\\"t|nice."
ok 11 - <This 'is not unnice.'>: This|'is not unnice.'
ok 12 - <This "is not unnice.">: This|"is not unnice."
ok 13 - <a "bb cc" d>: a|"bb cc"|d