PerlMonks  

Re: Regex for Differentiating Underscore and Whitespace

by mwah (Hermit)
on Nov 03, 2007 at 18:03 UTC


in reply to Regex for Differentiating Underscore and Whitespace

neversaint
what's wrong with my script above such that it prints no underscore instead of with underscore

moritz and CountZero have already given correcting hints. From analyzing your code, jasonk pointed out your misconception about /x and whitespace, which means your code would work as intended if you remove the regex modifiers:

    ...
    if ( $str =~ / / ) {
        print "no underscore\n";
    }
    else {
        print "with underscore\n";
    }
    ...

The /x modifier makes the regex ignore the literal space (as has been said), and /m and /s aren't needed here: /m only changes how ^ and $ match, /s only changes how . matches, so they don't do anything for this pattern.
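To make the /x point concrete, here is a minimal sketch (the helper name space_checks is made up for illustration, it is not from the original script). It assumes a Perl 5 without the newer /xx modifier in play; under plain /x, an escaped space or a space inside a bracketed character class is still significant, while a bare space is dropped from the pattern:

```perl
use strict;
use warnings;

# Hypothetical helper: shows which ways of writing "a space" still
# match a literal space once /x is in effect.
sub space_checks {
    my ($str) = @_;
    return {
        x_bare    => ( $str =~ / /x   ? 1 : 0 ),  # bare space is ignored under /x
        plain     => ( $str =~ / /    ? 1 : 0 ),  # literal space, no modifiers
        x_escaped => ( $str =~ /\ /x  ? 1 : 0 ),  # backslashed space survives /x
        x_class   => ( $str =~ /[ ]/x ? 1 : 0 ),  # space in a bracketed class survives /x
    };
}

for my $str ( '|78187980|ref|NM_0', '|78187980|ref|NM 0' ) {
    my $r = space_checks($str);
    print "'$str': ", join( ', ', map { "$_=$r->{$_}" } sort keys %$r ), "\n";
}
```

With the bare space under /x nothing is left in the pattern, so it matches every string, which is why the original script always took the same branch.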

In another response, grinder scrutinized your problem solution and offered a more efficient solution based on the index() function without any regular expressions.
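grinder's actual code is not reproduced in this node, so the following is only a sketch of what an index()-based check might look like. The sub name classify and its return strings are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: classify a string by the separator it contains,
# using index() only. index() returns -1 when the substring is absent.
sub classify {
    my ($str) = @_;
    return 'with underscore' if index( $str, '_' ) >= 0;
    return 'with whitespace' if index( $str, ' ' ) >= 0;
    return 'neither';
}

print classify('|78187980|ref|NM_0'), "\n";    # prints "with underscore"
print classify('|78187980|ref|NM 0'), "\n";    # prints "with whitespace"
```

Only a plain space is tested here; tabs or newlines would need their own index() calls or a different approach.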

In addition to these hints, davido tackles the problem with an important feature of the tr// (transliteration) operator: it counts occurrences of characters very efficiently. This would reduce your problem to the following expression:

    ...
    my $str = $ARGV[0] || '|78187980|ref|NM_0';          # original string
    my $cnt = $str =~ tr/_//;                            # count the number of underscores
    print 'with ' . ($cnt || 'no') . " underscore(s)\n"; # print result depending on count
    ...

after which you may decide on the 'count' of the character in question.
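As a small extension of the same idea, the counts for both characters from the thread title can be taken side by side. This is just a sketch; the sub name count_separators is invented here. With an empty replacement list and no modifiers, tr// leaves the string unchanged and only returns the count:

```perl
use strict;
use warnings;

# Hypothetical helper: count underscores and plain spaces with tr//.
# An empty replacement list means the string itself is not modified.
sub count_separators {
    my ($str) = @_;
    my $underscores = ( $str =~ tr/_// );
    my $spaces      = ( $str =~ tr/ // );
    return ( $underscores, $spaces );
}

my ( $u, $s ) = count_separators('ref|NM_0 some_text here');
print "underscores: $u, spaces: $s\n";    # prints "underscores: 2, spaces: 2"
```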

Regards

mwa

Re^2: Regex for Differentiating Underscore and Whitespace
by blazar (Canon) on Nov 04, 2007 at 11:15 UTC
    In another response, grinder scrutinized your problem solution and offered a more efficient solution based on the index() function without any regular expressions.

    I personally believe that the claim about efficiency is not correct, since that kind of regex should get optimized to index anyway - and often regexen have a more immediately readable syntax. For a Perl programmer that is...

    I hope that the following minimal benchmark can shed some light:

        #!/usr/bin/perl

        use strict;
        use warnings;
        use Benchmark qw/cmpthese :hireswallclock/;

        my @a = do {
            my @chr = (grep /\w/, map chr, 1..255);
            map {
                local $_ = join '', map $chr[rand @chr], 1..1000;
                tr/_/ / if .5 < rand;
                $_;
            } 1..1000;
        };

        cmpthese 5000 => {
            Regex => sub () { grep !/_/, @a },
            Index => sub () { grep index($_, '_') < 0, @a }
        };

        __END__

    I get e.g.

        C:\temp>perl index.pl
               Rate Index Regex
        Index 891/s    --   -0%
        Regex 891/s    0%    --

    and

        blazar@perlmonk ~ $ perl index.pl
               Rate Index Regex
        Index 261/s    --   -0%
        Regex 262/s    0%    --

    on two different systems.

    Now, is this test flawed? I easily tend to get these kinda things wrong, I must admit...
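    One cheap sanity check before trusting timings like these is to confirm that both candidates select the same strings. A minimal sketch, with a made-up sub name count_both and a tiny hand-written data set instead of the random strings above:

```perl
use strict;
use warnings;

# Hypothetical helper: count the strings WITHOUT an underscore using
# both approaches from the benchmark. grep in scalar context returns
# the number of matching elements.
sub count_both {
    my @strings = @_;
    my $by_regex = grep { !/_/ } @strings;
    my $by_index = grep { index( $_, '_' ) < 0 } @strings;
    return ( $by_regex, $by_index );
}

my ( $re, $ix ) = count_both( 'a_b', 'a b', 'ab', '_' );
print "regex: $re, index: $ix\n";    # prints "regex: 2, index: 2"
```

    This says nothing about speed, only that the two subs being compared do equivalent work.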

      blazar
      Now, is this test flawed?

      You are basically correct here. I was too zealous in advertising the advantages of index() and tr//. They have their place elsewhere, but not in this special case. Thanks for pointing this out.

      I abused your benchmark code (of course) to find out how good the index() optimization in Perl5 really is ;-)

          ...
          use Benchmark qw/cmpthese :hireswallclock/;

          my @a = map {
              my $s = 'PM is cool, ' x 10_000;
              substr($s, rand(length $s), 1, '_');
              $s
          } 1..1000;

          cmpthese -3 => {
              C_Idx => sub () { grep C_Idx($_, '_') < 0, @a },
              Index => sub () { grep index($_, '_') < 0, @a },
              Regex => sub () { grep ! /_/, @a },
              Tr    => sub () { grep ! tr/_//, @a }
          };

          use Inline C => qq[
              int C_Idx(SV* src, SV* chr)
              {
                  STRLEN srclen, chrlen;
                  char *ssrc = SvPV(src, srclen), *schr = SvPV(chr, chrlen);
                  char *p = ssrc;
                  if( chrlen != 1 ) croak("single characters only for now!");
                  return (p=memchr(p, *schr, srclen)) != NULL ? p-ssrc : -1;
              }
          ];
          ...

      On my system, somewhere above 60-70K characters per string, index() falls behind the C library function for finding a character (memchr). For the strings above:

                  Rate    Tr Regex Index C_Idx
          Tr    3.17/s    --  -74%  -74%  -87%
          Regex 12.2/s  284%    --   -0%  -52%
          Index 12.2/s  285%    0%    --  -52%
          C_Idx 25.2/s  696%  107%  107%    --

      I personally believe it'd be much better if I reread my own posts and thought about their assumptions much more thoroughly next time ;-)

      Regards

      mwa
