Check for Spaces in a String

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Check for Spaces in a String by toolic (Bishop) on Jun 15, 2015 at 18:25 UTC
```perl
use warnings;
use strict;

while (<DATA>) {
    if (/\w\s+\w/) {
        print "yes\n";
    }
    else {
        print "no\n";
    }
}

__DATA__
John Doe Joe
John D
John 
John
```

Outputs:

```
yes
yes
no
no
```
Re^2: Check for Spaces in a String by Anonymous Monk on Jun 15, 2015 at 18:29 UTC
What about this: `$name=~/\s+\w+/`
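For comparison, here is a small sketch that runs both patterns over the same strings. The helper names and the leading-space input `' John'` are assumptions added for illustration; they are not from the original question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
# Whitespace with a word character on both sides.
sub has_inner_space     { return $_[0] =~ /\w\s+\w/ ? 1 : 0 }
# Whitespace followed by word characters; nothing is required before it.
sub has_space_then_word { return $_[0] =~ /\s+\w+/ ? 1 : 0 }

# ' John' (leading space) is an assumed extra case.
for my $name ('John Doe Joe', 'John D', 'John ', 'John', ' John') {
    printf "%-14s inner=%d space_then_word=%d\n",
        "'$name'", has_inner_space($name), has_space_then_word($name);
}
```

The two patterns agree on the four sample names and differ only when whitespace comes before the first word, as in `' John'`.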
Re: Check for Spaces in a String by aaron_baugher (Curate) on Jun 15, 2015 at 20:52 UTC
To check for a space followed by a word character is simple, though there are a few similar patterns that might serve your needs better:

```perl
$string =~ / \w/;   # a space followed by a word character
$string =~ /\s\w/;  # any whitespace character followed by a word character
$string =~ /\s\S/;  # any whitespace character followed by a non-whitespace character
```

However, since you're applying a regex here anyway, it might be just as efficient to go ahead and do the split and then see whether it split anything. That would take a bit more time on the lines that are a single word, but less time on the ones with multiple words:

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

my @s = ('John', 'John ', 'John Doe', 'John P. Doe'); # last 2 should match

for (@s) {
    my @v = split /\s+\b/;  # split on whitespace followed by a word boundary
    if (@v > 1) {           # if the split did any splitting
        say;                # do stuff with the line or elements
    }
}
```

Update: I thought I'd benchmark it (code below), and found that if 50% of the values needed to be split, as in the example above, the two methods were equally fast:

```
                 Rate split and check check and split
split and check 145/s              --             -1%
check and split 146/s              1%              --
```

But when I made it so 75% of the values needed to be split, the "split everything and then check for a second element" method was the clear winner:

```
                 Rate check and split split and check
check and split 112/s              --            -17%
split and check 136/s             21%              --
```

So it looks like if fewer than half of your lines will need to be split, check first, then split the ones that matched. If more than half will end up being split, just split them all, check for a second element in the resulting array, and go from there. (Incidentally, checking for the second element `($v[1])` was also a gain over checking the number of elements `(@v>1)` as I originally did.)
Here's the benchmarking code:

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use Benchmark qw(:all);

# my @s = ('John', 'John ', 'John Doe', 'John P. Doe') x 1000;   # big array, 50% need split
my @s = ('John', 'John Poe', 'John Doe', 'John P. Doe') x 1000;  # big array, 75% need split

cmpthese( 1000, {
    'split and check' => \&one,
    'check and split' => \&two,
});

sub one {
    for (@s) {
        my @v = split /\s+\b/;  # split on whitespace followed by a word boundary
        if ($v[1]) {            # if the split did any splitting
            # do stuff with the line or elements
        }
    }
}

sub two {
    for (@s) {
        if (/\s\b/) {               # if the line would be split
            my @v = split /\s+\b/;  # split it
            # do stuff with the line or elements
        }
    }
}
```

Aaron B. Available for small or large Perl jobs and *nix system administration; see my home node.
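One caveat with testing `$v[1]` for truth rather than counting elements: a second field of `0` is false in Perl. The following is a minimal sketch of that edge case; the helper names and the `'Agent 0'` input are hypothetical and not part of the post above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Split on whitespace followed by a word boundary, as in the post above.
sub fields { return split /\s+\b/, $_[0] }

# Truthiness test on the second element (the form used in the benchmark).
sub second_is_true { my @v = fields($_[0]); return $v[1] ? 1 : 0 }

# Count-based test (the form used in the first example).
sub has_second { my @v = fields($_[0]); return @v > 1 ? 1 : 0 }

# 'Agent 0' is an assumed input whose second field is the false value '0'.
for my $s ('John', 'John ', 'John Doe', 'John P. Doe', 'Agent 0') {
    my @f = fields($s);
    printf "%-14s fields=%d true=%d count=%d\n",
        "'$s'", scalar @f, second_is_true($s), has_second($s);
}
```

If fields like `0` can occur in your data, `defined $v[1]` should keep the cheap index lookup while avoiding that edge case.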
Re: Check for Spaces in a String by kcott (Archbishop) on Jun 16, 2015 at 13:20 UTC
You may be better off doing the initial check without using the regex engine, only using it with split where necessary. As you can see from ++aaron_baugher's analysis, your results will depend on your real data. Furthermore, if your volume of data is small, your choice of solution may make little difference (in terms of runtime).

Here's a solution using substr, rindex and length for the initial check. As a proof-of-concept to show that these functions work on characters (as opposed to bytes), I've included single-byte and multi-byte characters in the data.

```perl
#!/usr/bin/env perl -l

use strict;
use warnings;
use utf8;
use open OUT => ':utf8', ':std';

my @strings = (
    'A B C', 'D E', 'F ', 'G', 'H ', 'I ',
    "\N{MERCURY} \N{FEMALE SIGN} \N{EARTH}", "\N{MALE SIGN} \N{JUPITER}",
    "\N{SATURN} ", "\N{URANUS}", "\N{NEPTUNE} ", "\N{PLUTO} ",
);

for (@strings) {
    # Skip strings that end in a space or have no space before the last character.
    next if substr($_, -1, 1) eq ' ' or rindex($_, ' ', length() - 2) == -1;
    print;
}
```

Output:

```
A B C
D E
☿ ♀ ♁
♂ ♃
```

[The Unicode range of characters labelled "Astrological symbols" is `0x263d` to `0x2647`. There is no charname for "`VENUS`" or "`MARS`"; the charnames "`FEMALE SIGN`" and "`MALE SIGN`" are defined for these symbols, respectively.]

Here's a benchmark test. This uses my sample data. If you choose this route, you should benchmark with representative samples of your data.
```perl
#!/usr/bin/env perl -l

use strict;
use warnings;

use Benchmark qw{cmpthese};

my @strings = (
    'A B C', 'D E', 'F ', 'G', 'H ', 'I ',
    "\N{MERCURY} \N{FEMALE SIGN} \N{EARTH}", "\N{MALE SIGN} \N{JUPITER}",
    "\N{SATURN} ", "\N{URANUS}", "\N{NEPTUNE} ", "\N{PLUTO} ",
);

my $re = qr{\s+\b};

cmpthese -1 => {
    no_re_check_and_split => \&no_re_check_and_split,
    re_check_and_split    => \&re_check_and_split,
    split_and_re_check    => \&split_and_re_check,
};

sub no_re_check_and_split {
    for (@strings) {
        next if substr($_, -1, 1) eq ' ' or rindex($_, ' ', length() - 2) == -1;
        my @parts = split /$re/;
    }
}

sub re_check_and_split {
    for (@strings) {
        next unless /$re/;
        my @parts = split /$re/;
    }
}

sub split_and_re_check {
    for (@strings) {
        my @parts = split /$re/;
        next if @parts > 1;
    }
}
```

Here are three sample runs:

```
                         Rate split_and_re_check re_check_and_split no_re_check_and_split
split_and_re_check    50243/s                 --               -18%                  -29%
re_check_and_split    61265/s                22%                 --                  -13%
no_re_check_and_split 70274/s                40%                15%                    --

                         Rate split_and_re_check re_check_and_split no_re_check_and_split
split_and_re_check    53593/s                 --               -19%                  -27%
re_check_and_split    66370/s                24%                 --                  -10%
no_re_check_and_split 73770/s                38%                11%                    --

                         Rate split_and_re_check re_check_and_split no_re_check_and_split
split_and_re_check    53096/s                 --               -20%                  -27%
re_check_and_split    66369/s                25%                 --                   -8%
no_re_check_and_split 72404/s                36%                 9%                    --
```

With my sample data, doing the initial check without a regex appears faster. Again, I'll stress, you'll need to check with your data.

-- Ken
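Note that the non-regex check looks only for a literal space, while `\s` also matches tabs and other whitespace, so the two checks are not strictly equivalent. A minimal sketch of the difference follows; the function names and the tab-separated input are assumptions for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Non-regex check modelled on the post above: true when the string does not
# end in a space and contains a space somewhere before its last character.
sub needs_split_no_re {
    my ($s) = @_;
    return 0 if substr($s, -1, 1) eq ' ';
    return rindex($s, ' ', length($s) - 2) == -1 ? 0 : 1;
}

# Regex check using the same pattern as the benchmark.
sub needs_split_re { return $_[0] =~ /\s+\b/ ? 1 : 0 }

# "A\tB" (tab-separated) is an assumed input, not part of the sample data.
for my $s ('A B', 'A ', 'A', "A\tB") {
    (my $shown = $s) =~ s/\t/\\t/g;   # make the tab visible when printing
    printf "%-7s no_re=%d re=%d\n",
        "'$shown'", needs_split_no_re($s), needs_split_re($s);
}
```

If your data can contain tabs or other whitespace between words, the regex check (or an extra `index` call per separator character) may be the safer choice.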
Re: Check for Spaces in a String by talexb (Chancellor) on Jun 16, 2015 at 18:36 UTC
In my opinion, you should only be using a regular expression when the simpler solutions can't handle the problem. In this case, you can manage by just using `split`. Here's how:

```perl
#!/usr/bin/perl

use strict;
use warnings;

{
    # Each entry: [ name, whether we expect it to split into more than one element ]
    my @names = (
        [ 'John Doe Joe', 1 ],
        [ 'John D',       1 ],
        [ 'John ',        0 ],
        [ 'John',         0 ],
    );

    foreach my $lr (@names) {
        my ( $name, $success ) = @{$lr};
        my @result = split( /\s/, $name );

        if (   ( @result > 1 && $success )
            || ( @result == 1 && !$success ) )
        {
            print "$name split correctly, list has "
              . ( scalar @result )
              . " elements. - ";
        }
        else {
            print "$name split incorrectly, list has "
              . ( scalar @result )
              . " elements. - ";
        }
        print join( '|', @result ) . "\n";
    }
}
```

Rather than only using `split` after you've gone through a regexp, I'd just use `split` and look at the result you get. Running this gives me the following useful output:

```
$ perl -w 1130518.pl
John Doe Joe split correctly, list has 3 elements. - John|Doe|Joe
John D split correctly, list has 2 elements. - John|D
John  split correctly, list has 1 elements. - John
John split correctly, list has 1 elements. - John
$
```

Alex / talexb / Toronto

Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
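One thing to watch with `split /\s/`: `split` drops trailing empty fields but keeps leading and interior ones, so leading or doubled whitespace changes the element count. The special pattern `' '` (a single space string) skips leading whitespace and treats runs of whitespace as one separator. A minimal sketch, with hypothetical helper names and assumed extra inputs:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Number of fields from a single-whitespace regex split.
sub count_regex_split { my @f = split /\s/, $_[0]; return scalar @f }

# Number of fields from the special ' ' (awk-style) split.
sub count_awk_split { my @f = split ' ', $_[0]; return scalar @f }

# ' John' and the doubled-space name are assumed inputs for illustration.
my $doubled = 'John' . ( ' ' x 2 ) . 'Doe';
for my $name ( 'John Doe', 'John ', ' John', $doubled ) {
    printf "%-12s /\\s/=%d ' '=%d\n",
        "'$name'", count_regex_split($name), count_awk_split($name);
}
```

With `/\s/`, `' John'` yields two elements (an empty string and `John`), so a test like `@result > 1` would report a split even though there is only one word.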

