
Check for Spaces in a String

by Anonymous Monk
on Jun 15, 2015 at 18:14 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I am trying to check a string for spaces and only do the further "split" if a word is followed by a space and then another letter/word in the string. A string like "John " (trailing space with nothing after it) should not match.
I hope the sample code shows what I am trying to do:
...
my $name = "John Doe Joe"; # should match
my $name = "John D";       # should match
my $name = "John ";        # should not match
my $name = "John";         # should not match

# Match only if after a space it find a letter
if($name=~/[\s+.]+/) {
    @values = split /[\s+.]+/, $name;
    print "\n@values\n";
}else{
    print "\n No spaces: $name\n";
}
...

Thanks for looking!

Replies are listed 'Best First'.
Re: Check for Spaces in a String
by toolic (Bishop) on Jun 15, 2015 at 18:25 UTC
    use warnings;
    use strict;

    while (<DATA>) {
        if (/\w\s+\w/) {
            print "yes\n";
        }
        else {
            print "no\n";
        }
    }

    __DATA__
    John Doe Joe
    John D
    John 
    John

    Outputs:

    yes
    yes
    no
    no
      What about this:

      $name=~/\s+\w+/
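
For comparison, here is a small sketch (not from the thread; the helper name `classify` and the extra test string ' John' are made up for illustration) that runs the pattern from the question and the two suggested patterns over the sample names. Inside a bracketed character class, `+` and `.` are literal characters, so `[\s+.]+` matches any run of whitespace, plus signs or dots, which is why it also matches a trailing space.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of the three patterns match a string.
sub classify {
    my ($name) = @_;
    return +{
        original => ( $name =~ /[\s+.]+/ ? 1 : 0 ),  # pattern from the question
        toolic   => ( $name =~ /\w\s+\w/ ? 1 : 0 ),  # word char, whitespace, word char
        followup => ( $name =~ /\s+\w+/  ? 1 : 0 ),  # whitespace, then word chars
    };
}

for my $name ( 'John Doe Joe', 'John D', 'John ', 'John', ' John' ) {
    my $result = classify($name);
    printf "%-14s original=%d toolic=%d followup=%d\n",
        "'$name'", @{$result}{qw(original toolic followup)};
}
```

The two suggestions agree on the four names from the question. They differ on a string with leading whitespace such as ' John', which `/\s+\w+/` matches and `/\w\s+\w/` does not.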

Re: Check for Spaces in a String
by aaron_baugher (Curate) on Jun 15, 2015 at 20:52 UTC

    To check for a space followed by a word character is simple, though there are a few similar patterns that might serve your needs best:

    $string =~ / \w/;   # a space followed by a word character
    $string =~ /\s\w/;  # any whitespace character followed by a word character
    $string =~ /\s\S/;  # any whitespace character followed by a non-whitespace character
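
To see where those three patterns part ways, here is a minimal sketch. The labels, the helper name `matching_patterns` and the sample strings are assumptions added for illustration, not part of the original reply.

```perl
use strict;
use warnings;

# The three patterns from the reply above, keyed by a short label.
my %pattern = (
    'space+word'       => qr/ \w/,
    'whitespace+word'  => qr/\s\w/,
    'whitespace+nonws' => qr/\s\S/,
);

# Hypothetical helper: return the sorted labels of the patterns that match.
# Call it in list context.
sub matching_patterns {
    my ($string) = @_;
    my @labels = sort grep { $string =~ $pattern{$_} } keys %pattern;
    return @labels;
}

for my $string ( "John Doe", "John\tDoe", "John -", "John " ) {
    ( my $shown = $string ) =~ s/\t/\\t/;    # make the tab visible in output
    printf "%-12s %s\n", "'$shown'",
        join( ', ', matching_patterns($string) ) || 'none';
}
```

A tab separator is only caught by the two `\s` patterns, and a punctuation character after the space is only caught by `/\s\S/`, so the choice depends on what counts as a "word" in your data.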

    However, since you're applying a regex here, it might be just as efficient to go ahead and do the split and then see whether it split anything. That would take a bit more time on the lines that are a single word, but less time on the ones with multiple words:

    #!/usr/bin/env perl
    use 5.010;
    use strict;
    use warnings;

    my @s = ('John', 'John ', 'John Doe', 'John P. Doe'); # last 2 should match

    for (@s){
        my @v = split /\s+\b/;  # split on whitespace followed by a word boundary
        if(@v > 1 ){            # if the split did any splitting
            say;                # do stuff with the line or elements
        }
    }

    Update: I thought I'd benchmark it (code below), and found that if 50% of the values needed to be split as in the example above, the two methods were equally fast:

                     Rate split and check check and split
    split and check 145/s              --             -1%
    check and split 146/s              1%              --

    But when I made it so 75% of the values needed to be split, the "split everything and then check for a second element" method was the clear winner:

                     Rate check and split split and check
    check and split 112/s              --            -17%
    split and check 136/s             21%              --

    So it looks like if less than half your lines will need to be split, check first, then split the ones that matched. If more than half will end up being split, just split them all and check for a second element in the resulting array, and go from there. (Incidentally, checking for the second element ($v[1]) was also a gain over checking the number of elements (@v>1) as I originally did.) Here's the benchmarking code:

    #!/usr/bin/env perl
    use 5.010;
    use strict;
    use warnings;
    use Benchmark qw(:all);
    use Data::Printer;

    # my @s = ('John', 'John ', 'John Doe', 'John P. Doe') x 1000; # big array 50% need split
    my @s = ('John', 'John Poe', 'John Doe', 'John P. Doe') x 1000; # big array 75% need split

    cmpthese( 1000, {
        'split and check' => \&one,
        'check and split' => \&two,
    });

    sub one {
        for (@s){
            my @v = split /\s+\b/;  # split on a space followed by a word boundary
            if($v[1] ){             # if the split did any splitting
                # do stuff with the line or elements
            }
        }
    }

    sub two {
        for (@s){
            if (/\s\b/){                # if the line would be split
                my @v = split /\s+\b/;  # split it
                # do stuff with the line or elements
            }
        }
    }
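
One caveat about the `$v[1]` shortcut: it tests whether the second field is true, not whether it exists, so a second field of "0" would look as if nothing had been split. The sketch below compares the three checks; the helper names and the 'John 0' sample are hypothetical, added only to illustrate the difference.

```perl
use strict;
use warnings;

# Hypothetical helpers: three ways to ask "did the split produce a second
# field?", all using the same pattern as the benchmark.
sub second_is_true    { my @v = split /\s+\b/, shift; return $v[1] ? 1 : 0 }
sub second_is_defined { my @v = split /\s+\b/, shift; return defined $v[1] ? 1 : 0 }
sub more_than_one     { my @v = split /\s+\b/, shift; return @v > 1 ? 1 : 0 }

for my $line ( 'John Doe', 'John 0', 'John ', 'John' ) {
    printf "%-10s true=%d defined=%d count=%d\n", "'$line'",
        second_is_true($line), second_is_defined($line), more_than_one($line);
}
```

If a field of "0" or an empty string can occur in your data, `defined $v[1]` or `@v > 1` is the safer test. Whether it is as fast as the plain truth test would need its own benchmark.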

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

Re: Check for Spaces in a String
by kcott (Archbishop) on Jun 16, 2015 at 13:20 UTC

    You may be better off doing the initial check without using the regex engine; only using it with split where necessary.

    As you can see from ++aaron_baugher's analysis, your results will depend on your real data. Furthermore, if your volume of data is small, your choice of solution may make little difference (in terms of runtime).

    Here's a solution using substr, rindex and length for the initial check. As a proof-of-concept to show that these functions work on characters (as opposed to bytes), I've included single-byte and multi-byte characters in the data.

    #!/usr/bin/env perl -l

    use strict;
    use warnings;
    use utf8;
    use open OUT => ':utf8', ':std';

    my @strings = (
        'A B C', 'D E', 'F ', 'G', 'H ', 'I ',
        "\N{MERCURY} \N{FEMALE SIGN} \N{EARTH}", "\N{MALE SIGN} \N{JUPITER}",
        "\N{SATURN} ", "\N{URANUS}", "\N{NEPTUNE} ", "\N{PLUTO} ",
    );

    for (@strings) {
        next if substr($_, -1, 1) eq ' ' or rindex($_, ' ', length() - 2) == -1;
        print;
    }

    Output:

    A B C
    D E
    ☿ ♀ ♁
    ♂ ♃
    

    [The Unicode range of characters labelled "Astrological symbols" is 0x263d to 0x2647. There is no charname for "VENUS" or "MARS"; the charnames "FEMALE SIGN" and "MALE SIGN" are defined for these symbols, respectively.]
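
The check above can be wrapped in a small function to probe its edge cases. This is only a sketch: the name `has_inner_space` and the up-front length guard are my additions, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substr/rindex check: true when the string
# does not end in a space and contains a space before its last character.
sub has_inner_space {
    my ($s) = @_;
    return 0 if length($s) < 2;               # added guard: too short to hold "space + char"
    return 0 if substr( $s, -1, 1 ) eq ' ';   # ends in a space
    return rindex( $s, ' ', length($s) - 2 ) == -1 ? 0 : 1;
}

for my $s ( 'A B C', 'F ', 'G', ' G', "A\tB" ) {
    ( my $shown = $s ) =~ s/\t/\\t/;          # make the tab visible in output
    printf "%-8s %s\n", "'$shown'", has_inner_space($s) ? 'split' : 'skip';
}
```

Two differences from the regex versions show up here. Only a literal space is recognised, so a tab-separated string is skipped, and a leading space (' G') counts as a reason to split. Both may or may not matter for your data.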

    Here's a benchmark test. This uses my sample data. If you choose this route, you should benchmark with representative samples of your data.

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    use Benchmark qw{cmpthese};

    my @strings = (
        'A B C', 'D E', 'F ', 'G', 'H ', 'I ',
        "\N{MERCURY} \N{FEMALE SIGN} \N{EARTH}", "\N{MALE SIGN} \N{JUPITER}",
        "\N{SATURN} ", "\N{URANUS}", "\N{NEPTUNE} ", "\N{PLUTO} ",
    );

    my $re = qr{\s+\b};

    cmpthese -1 => {
        no_re_check_and_split => \&no_re_check_and_split,
        re_check_and_split => \&re_check_and_split,
        split_and_re_check => \&split_and_re_check,
    };

    sub no_re_check_and_split {
        for (@strings) {
            next if substr($_, -1, 1) eq ' ' or rindex($_, ' ', length() - 2) == -1;
            my @parts = split /$re/;
        }
    }

    sub re_check_and_split {
        for (@strings) {
            next unless /$re/;
            my @parts = split /$re/;
        }
    }

    sub split_and_re_check {
        for (@strings) {
            my @parts = split /$re/;
            next if @parts > 1;
        }
    }

    Here are three sample runs:

                             Rate split_and_re_check re_check_and_split no_re_check_and_split
    split_and_re_check    50243/s                 --               -18%                  -29%
    re_check_and_split    61265/s                22%                 --                  -13%
    no_re_check_and_split 70274/s                40%                15%                    --

                             Rate split_and_re_check re_check_and_split no_re_check_and_split
    split_and_re_check    53593/s                 --               -19%                  -27%
    re_check_and_split    66370/s                24%                 --                  -10%
    no_re_check_and_split 73770/s                38%                11%                    --

                             Rate split_and_re_check re_check_and_split no_re_check_and_split
    split_and_re_check    53096/s                 --               -20%                  -27%
    re_check_and_split    66369/s                25%                 --                   -8%
    no_re_check_and_split 72404/s                36%                 9%                    --

    With my sample data, doing the initial check without a regex appears faster. Again, I'll stress, you'll need to check with your data.

    -- Ken

Re: Check for Spaces in a String
by talexb (Chancellor) on Jun 16, 2015 at 18:36 UTC

    In my opinion, you should only be using a regular expression when the simpler solutions can't handle the problem. In this case, you can manage by just using split. Here's how:

    #!/usr/bin/perl

    use strict;
    use warnings;

    {
        my @names = (
            [ 'John Doe Joe', 1 ],
            [ 'John D',       1 ],
            [ 'John ',        0 ],
            [ 'John',         0 ],
        );
        foreach my $lr (@names) {
            my ( $name, $success ) = @{$lr};
            my @result = split( /\s/, $name );
            if (   ( @result > 1 && $success )
                || ( @result == 1 && !$success ) )
            {
                print "$name split correctly, list has "
                  . ( scalar @result )
                  . " elements. - ";
            }
            else {
                print "$name split incorrectly, list has "
                  . ( scalar @result )
                  . " elements. - ";
            }
            print join('|',@result) . "\n";
        }
    }

    Rather than only using split after you've gone through a regexp, I'd just use split and look at the result you get. Running this gives me the following useful output:

    $ perl -w 1130518.pl
    John Doe Joe split correctly, list has 3 elements. - John|Doe|Joe
    John D split correctly, list has 2 elements. - John|D
    John  split correctly, list has 1 elements. - John
    John split correctly, list has 1 elements. - John
    $
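
A note on the split pattern, as a sketch with hypothetical helper names and sample strings that are not in the post above: `split /\s/` breaks on every single whitespace character and keeps a leading empty field, whereas the special pattern `' '` skips leading whitespace and treats runs of whitespace as one separator. The element count can therefore differ for input with doubled or leading spaces.

```perl
use strict;
use warnings;

# Hypothetical helpers: count the fields each split style produces.
sub count_regex_split { my @f = split /\s/, shift; return scalar @f }
sub count_awk_split   { my @f = split ' ',  shift; return scalar @f }

for my $name ( 'John Doe', 'John  Doe', ' John', 'John ' ) {
    printf "%-12s /\\s/ => %d   ' ' => %d\n", "'$name'",
        count_regex_split($name), count_awk_split($name);
}
```

With `/\s/`, a name with a leading space such as ' John' yields two elements (an empty string and 'John'), so a test on the element count alone would treat it as having been split. Using `' '` avoids that, if it matches what you want from your data.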

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
