Checking for a space followed by a word character is simple, though a few similar patterns might serve your needs better:
$string =~ / \w/;   # a space followed by a word character
$string =~ /\s\w/;  # any whitespace character followed by a word character
$string =~ /\s\S/;  # any whitespace character followed by a non-whitespace character
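To make the difference between those three patterns concrete, here is a small sketch. The sample strings and the helper name classify are my own illustrations, not something from the original question:

```perl
#!/usr/bin/env perl
use strict; use warnings;

# classify() is a hypothetical helper: it returns a three-character string
# of 1s and 0s, one per pattern, in the order / \w/, /\s\w/, /\s\S/.
sub classify {
    my ($s) = @_;
    my $out = '';
    for my $re (qr/ \w/, qr/\s\w/, qr/\s\S/) {
        $out .= ($s =~ $re) ? 1 : 0;
    }
    return $out;
}

# Illustrative inputs chosen so that the patterns disagree.
my @samples = ("John", "John ", "John Doe", "John\tDoe", "John (Doe)");

for my $s (@samples) {
    (my $shown = $s) =~ s/\t/\\t/g;    # make the tab visible when printing
    printf "%-14s %s\n", "'$shown'", classify($s);
}
```

Only 'John Doe' matches all three. The tab-separated name matches the two \s patterns but not the literal space, 'John (Doe)' matches only /\s\S/, and a trailing space matches none of them.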
However, since you're applying a regex anyway, it might be just as efficient to go ahead and do the split and then check whether it actually split anything. That takes a bit more time on single-word lines, but less time on lines with multiple words:
#!/usr/bin/env perl
use 5.010; use strict; use warnings;
my @s = ('John', 'John ', 'John Doe', 'John P. Doe'); # last 2 should match
for (@s){
    my @v = split /\s+\b/;  # split on whitespace followed by a word boundary
    if(@v > 1 ){            # if the split did any splitting
        say;                # do stuff with the line or elements
    }
}
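To see what that split actually returns, here is a quick sketch. The helper name fields and the last two sample lines are mine, added to show edge cases. Note in particular that a line with leading whitespace gets an empty leading field, so a single word with a leading space would also count as "split":

```perl
#!/usr/bin/env perl
use strict; use warnings;

# fields() is a hypothetical wrapper around the same split used above.
sub fields {
    my ($line) = @_;
    my @v = split /\s+\b/, $line;
    return @v;
}

# The first four lines are the samples from above; the last two are
# extra edge cases (leading whitespace, a run of two spaces).
my @lines = ('John', 'John ', 'John Doe', 'John P. Doe', ' John', 'John  Doe');

for my $line (@lines) {
    my @v = fields($line);
    printf "%-14s %d field(s): %s\n", "'$line'", scalar(@v), join('|', @v);
}
```

If leading whitespace can occur in your data, you may want to trim it before splitting or checking.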
Update: I thought I'd benchmark it (code below), and found that if 50% of the values needed to be split as in the example above, the two methods were equally fast:
                 Rate split and check check and split
split and check 145/s              --             -1%
check and split 146/s              1%              --
But when I made it so 75% of the values needed to be split, the "split everything and then check for a second element" method was the clear winner:
                 Rate check and split split and check
check and split 112/s              --            -17%
split and check 136/s             21%              --
So it looks like if fewer than half of your lines will need to be split, you should check first, then split the ones that matched. If more than half will end up being split, just split them all, check for a second element in the resulting array, and go from there. (Incidentally, checking for the second element ($v[1]) was also a gain over checking the number of elements (@v>1) as I originally did.) Here's the benchmarking code:
#!/usr/bin/env perl
use 5.010; use strict; use warnings;
use Benchmark qw(:all);
# my @s = ('John', 'John ', 'John Doe', 'John P. Doe') x 1000; # big array 50% need split
my @s = ('John', 'John Poe', 'John Doe', 'John P. Doe') x 1000; # big array 75% need split
cmpthese( 1000, {
    'split and check' => \&one,
    'check and split' => \&two,
});
sub one {
    for (@s){
        my @v = split /\s+\b/;  # split on whitespace followed by a word boundary
        if($v[1] ){             # if the split did any splitting
            # do stuff with the line or elements
        }
    }
}
sub two {
    for (@s){
        if (/\s\b/){                # if the line would be split
            my @v = split /\s+\b/;  # split it
            # do stuff with the line or elements
        }
    }
}
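One caveat about the $v[1] test: it is not quite equivalent to @v > 1, because it checks truthiness, so a second element of "0" would be treated as "no split". The sketch below shows the difference. The helper names and the 'Agent 0' sample are made up for illustration, and the defined variant is my suggestion; I did not benchmark it:

```perl
#!/usr/bin/env perl
use strict; use warnings;

# Three hypothetical ways to ask "did the split produce a second element?"
sub second_is_true {
    my @v = split /\s+\b/, $_[0];
    return $v[1] ? 1 : 0;            # false when the second element is "0"
}
sub second_is_defined {
    my @v = split /\s+\b/, $_[0];
    return defined $v[1] ? 1 : 0;    # only asks whether the element exists
}
sub has_two_or_more {
    my @v = split /\s+\b/, $_[0];
    return @v > 1 ? 1 : 0;           # the original element-count check
}

for my $line ('John Doe', 'Agent 0', 'John') {
    printf "%-12s true=%d defined=%d count=%d\n", "'$line'",
        second_is_true($line), second_is_defined($line), has_two_or_more($line);
}
```

If your data can never have "0" as the second word, the plain $v[1] test is fine.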
Aaron B.
Available for small or large Perl jobs and *nix system administration; see my home node.