Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Testing a string for a range of characters

by BoredByPolitics (Scribe)
on Jan 24, 2001 at 19:53 UTC ( [id://54003]=perlquestion: print w/replies, xml ) Need Help??

BoredByPolitics has asked for the wisdom of the Perl Monks concerning the following question:

I asked a question in the CB earlier today regarding testing a string to check that it a) consisted of just one characeter, and b) that character was in the range A to Z.

I'd already come up with the regex /^[A..Z]$/ but I realised that there was probably a much more efficient way of doing it (especially seeing as it'll be happening 1000's of times in the code I am writing).

Corion suggested if (($str ge "A") && ($str le "Z")), while tilly suggested setting up a hash containing the values which I wanted to evaluate as true, then test with if (exists $chrs{$item}).

I decided this was a good time to learn how to use the Benchmark module, so this is what I ran to test all 3 -

use strict; my %tran_types; foreach ('A'..'Z') { $tran_types{$_} = "" }; my $test_char; timethese(10000000, { 'regex' => sub { $test_char = chr( int( rand 256 ) ); if ($test_char =~ /^[A..Z]$/) {} else {} }, 'comp' => sub { $test_char = chr( int( rand 256 ) ); if (($test_char ge "A") && ($test_char le "Z")) {} + else {} }, 'exist' => sub { $test_char = chr( int( rand 256 ) ); if (exists $tran_types{$test_char}) {} else {} }, });

I've no idea whether the above use of rand was a good idea or not, or if I needed to add the else on to the if (it'll be there in the production code). Here are the results -

Benchmark: timing 10000000 iterations of comp, exist, regex... comp: 56 wallclock secs (55.39 usr + 0.00 sys = 55.39 CPU) exist: 50 wallclock secs (49.34 usr + 0.00 sys = 49.34 CPU) regex: 79 wallclock secs (79.32 usr + 0.00 sys = 79.32 CPU)

Clearly tilly's suggestion is the winner, and I get to see first hand why regex's aren't always the best solution, even if they are quicker to type :-)

I don't suppose there's an even quicker solution ;-) ?

Pete - who's very grateful to tilly and Corion

Replies are listed 'Best First'.
Re: Testing a string for a range of characters
by chipmunk (Parson) on Jan 24, 2001 at 21:53 UTC
    I notice that you're trying to use the range operator, .., inside your character class. [A..Z] won't do what you think; it will match an A, or a period, or a Z. The way to specify a range in a character class is with a dash; [A-Z] will match any capital letter.
     

    The part we really want to benchmark is just the code that does the check; the creation of the test character and the if block can be factored out. Note that, except for comp, the actual character doesn't matter, just whether or not it matches.

    Due to your use of rand 256 to generate the characters, your benchmark is heavily weighted towards non-matching instances. I've split each sub into matching and non-matching versions (comp required three non-matching versions).

    I shortened the names of the subs just so the chart would be more compact with all the extra subs I added. :)

    #!perl -w use strict; use Benchmark qw/cmpthese/; my %tran_types; foreach ('A'..'Z') { $tran_types{$_} = "" }; my $char_y = 'J'; # yes my $char_n = '_'; # no my $char_n_l = 'AA'; # no; too long my $char_n_A = '0'; # no; lt 'A' my $char_n_Z = 'a'; # no; gt 'Z' cmpthese(-(shift), { 're_Y' => sub { $char_y =~ /^[A-Z]\z/ }, 're_n' => sub { $char_n =~ /^[A-Z]\z/ }, 'co_Y' => sub { length $char_y == 1 and $char_y ge "A" and $char_y le "Z" }, 'co_nA' => sub { length $char_n_A == 1 and $char_n_A ge "A" and $char_n_A le "Z" }, 'co_nZ' => sub { length $char_n_Z == 1 and $char_n_Z ge "A" and $char_n_Z le "Z" }, 'co_nl' => sub { length $char_n_l == 1 and $char_n_l ge "A" and $char_n_l le "Z" }, 'ex_Y' => sub { exists $tran_types{$char_y} }, 'ex_n' => sub { exists $tran_types{$char_n} }, });
    And the results... I first tried running each snippet for 3 seconds, but the results varied too much from run to run. I forced myself to be patient and run each snippet for 15 seconds. :)
    Benchmark: running co_Y, co_nA, co_nZ, co_nl, ex_Y, ex_n, re_Y, re_n, +each for at least 15 CPU seconds... co_Y: 18 wallclock secs (14.95 usr + 0.05 sys = 15.00 CPU) @ 26 +6935.00/s (n=4004025) co_nA: 15 wallclock secs (15.63 usr + -0.03 sys = 15.60 CPU) @ 35 +5186.35/s (n=5540907) co_nZ: 18 wallclock secs (15.71 usr + 0.10 sys = 15.81 CPU) @ 26 +2582.80/s (n=4151434) co_nl: 17 wallclock secs (16.12 usr + -0.09 sys = 16.03 CPU) @ 72 +0797.75/s (n=11554388) ex_Y: 14 wallclock secs (16.11 usr + -0.02 sys = 16.09 CPU) @ 71 +2203.05/s (n=11459347) ex_n: 14 wallclock secs (16.72 usr + -0.62 sys = 16.10 CPU) @ 86 +2264.35/s (n=13882456) re_Y: 16 wallclock secs (15.18 usr + 0.12 sys = 15.30 CPU) @ 20 +2467.06/s (n=3097746) re_n: 17 wallclock secs (15.97 usr + -0.02 sys = 15.95 CPU) @ 24 +3489.40/s (n=3883656) Rate re_Y re_n co_nZ co_Y co_nA ex_Y co_nl ex_n re_Y 202467/s -- -17% -23% -24% -43% -72% -72% -77% re_n 243489/s 20% -- -7% -9% -31% -66% -66% -72% co_nZ 262583/s 30% 8% -- -2% -26% -63% -64% -70% co_Y 266935/s 32% 10% 2% -- -25% -63% -63% -69% co_nA 355186/s 75% 46% 35% 33% -- -50% -51% -59% ex_Y 712203/s 252% 192% 171% 167% 101% -- -1% -17% co_nl 720798/s 256% 196% 175% 170% 103% 1% -- -16% ex_n 862264/s 326% 254% 228% 223% 143% 21% 20% --
    No surprises here; generally the same results as in your benchmark. The results do show the value of considering matching and non-matching cases. exists is still the winner, executing quickly whether the string matches or not. The revised regex solution is still slower than the others. comp is actually quite fast when the string is too long, but if it has to execute all three comparisons it's much slower.
      Wow - that's an extremely complete writeup :-)

      Due to your use of rand 256 to generate the characters, your benchmark is heavily weighted towards non-matching instances. I've split each sub into matching and non-matching versions (comp required three non-matching versions).

      I used rand to try and avoid the optimisation that I've read Perl will do to what is in effect a literal; but your method seems to take that into account.

      It never ceases to amaze me how much you can learn here :-)

      Pete

Re: Testing a string for a range of characters
by Fastolfe (Vicar) on Jan 24, 2001 at 20:36 UTC
    In addition, if you're really worried this much about performance, your hash creation could be simplified:
    my %hash; @hash{'A'..'Z'} = (); ... if (exists $hash{$whatever}) { ... }
      Aha! That's what tilly had suggested, but without the my %hash; first, and I was getting confused when trying to combine the my and assignment in the same statment! Thanks Fastolfe, I'd not seen that particular usage before :-)

      As regards your other reply, currently I'm shifting off the first element of the array, which tells me what the record type is. I then join the array with a delimiter for writing out to a flat file, minus the transaction type (or rather the transaction type get's built into the key I insert with the join).

      Pete

Re: Testing a string for a range of characters
by Albannach (Monsignor) on Jan 24, 2001 at 21:12 UTC
    Just another WTDI:
    # predefine $testlist as a simple string of the set of acceptable char +acters then 'index' => sub { $test_char = chr( int( rand 256 ) ); if (index($testlist, $test_char) > -1 && length($te +st_char)==1) {} else {} },
    which gives the surprisingly (to me) competitive result:
    Benchmark: timing 10000000 iterations of comp, exist, index, index2, r +egex, regex2, regex3... comp: 44 wallclock secs (41.64 usr + 0.00 sys = 41.64 CPU) @ 24 +0147.93/s (n=10000000) exist: 41 wallclock secs (38.30 usr + 0.00 sys = 38.30 CPU) @ 26 +1117.06/s (n=10000000) index: 42 wallclock secs (41.30 usr + 0.00 sys = 41.30 CPU) @ 24 +2148.34/s (n=10000000) index2: 40 wallclock secs (39.48 usr + 0.00 sys = 39.48 CPU) @ 25 +3260.73/s (n=10000000) regex: 55 wallclock secs (53.28 usr + 0.00 sys = 53.28 CPU) @ 18 +7684.17/s (n=10000000) regex2: 53 wallclock secs (52.34 usr + 0.01 sys = 52.36 CPU) @ 19 +0989.13/s (n=10000000) regex3: 53 wallclock secs (50.97 usr + 0.00 sys = 50.97 CPU) @ 19 +6197.69/s (n=10000000)
    I also tested a couple of variations:
    regex2 - used [A-Z] in place of [A..Z] as I've never used the latter before and can't even figure what it would do at the moment
    regex3 - omitted the ^ and $, which violated the spec but I wanted to know
    index2 - similarly omitted the length() call

    Update: Benchmarks run on NT4 PIII-500 w/256MB RAM, Perl 5.6.0, ActiveState build 618

    --
    I'd like to be able to assign to an luser

      regex2 - used A-Z in place of A..Z as I've never used the latter before and can't even figure what it would do at the moment

      Not what I'd thought it would do! OK, I made a wrong assumption (getting mixed up between array ranges and regex ranges) - you're quite correct, /$[A-Z]^/ is functionally correct.

      I tried your suggestion along side the (now correct) regex and exist, with the following results -

      Benchmark: timing 10000000 iterations of exist, index, regex... exist: 50 wallclock secs (49.39 usr + 0.01 sys = 49.40 CPU) index: 61 wallclock secs (60.64 usr + 0.00 sys = 60.64 CPU) regex: 85 wallclock secs (83.18 usr + 0.03 sys = 83.21 CPU)

      It's interesting that I get a bigger difference between the exist and index versions. I wonder what OS & hardware you're running on? Mine is linux v2.2.18 on a Pentium II 400Mhz with 128Mb RAM.

      Pete

Re: Testing a string for a range of characters
by Fastolfe (Vicar) on Jan 24, 2001 at 20:10 UTC
    It might also pay to ask yourself, "How many times will this test be repeated?" If it's just once, I might suggest that you move your hash creation into your benchmark test, as I suspect the expense involved in the construction of the hash will make the other methods a bit more appealing. If your test will be repeated dozens or more times on the same range, a hash is clearly the better way to approach it.
      The test will be repeated possibly thousands of times, as records are read in from an array, but I agree, if it was only going to be a one off, then I'd probably have left it as a regex.

      Pete

        If you're reading these in from an array, I wonder if a test like this might be faster:
        my @array = get_block_of_data; data_is_bad if join("", @array) =~ /[^A-Z]/;
Re: Testing a string for a range of characters
by I0 (Priest) on Jan 24, 2001 at 21:01 UTC
    Given chr( int( rand 256 ) )
    ($test_char le "Z") && ($test_char ge "A")
    would be faster on average than
    ($test_char ge "A") && ($test_char le "Z")
    but both would accept "AA", contrary to your specification a)
Re: Testing a string for a range of characters
by chromatic (Archbishop) on Jan 25, 2001 at 09:46 UTC
    I can't believe no one mentioned transliteration:

    $test_char =~ tr/A-Z//c;

    Talk about an underappreciated tool.

      I'm at home today, with a much slower machine (P166 64Mb, linux 2.2.18pre21, Perl 5.005_03), so I've reduced the number of iterations to a million.

      These benchmarks are also with the compl routine checking the length() of the test character first, and with the regex fixed -

      Benchmark: timing 1000000 iterations of compl, exist, regex, tr... compl: 17 wallclock secs (17.35 usr + 0.02 sys = 17.37 CPU) exist: 15 wallclock secs (14.02 usr + 0.01 sys = 14.03 CPU) regex: 18 wallclock secs (18.62 usr + 0.00 sys = 18.62 CPU) tr: 16 wallclock secs (16.67 usr + 0.01 sys = 16.68 CPU)

      So, faster than the regex, but still not as fast as the exist() method. Thanks for the suggestion though :-)

      Pete

      Here are two ways to use tr/// to get a complete solution to the problem, checking both the length and the contents of the string: length $test_char == 1 and $test_char =~ tr/A-Z//; $test_char =~ tr/A-Z// == 1 and $test_char =~ tr/A-Z//c == 0;
Could someone explain this ?
by arhuman (Vicar) on Jan 24, 2001 at 20:50 UTC
    ***update***
    OOPS It tooks me too long to type this post, what I said here was said above while I was typing this...
    ***update***

    In another ridiculous attempt to enhance things...
    (Why am I always missing something ?)
    I've tried a to 'enhance' the regex using the o modificator

    So I added :
    'oregex' => sub { $test_char = chr( int( rand 256 ) ); if ($test_char =~ /^[A..Z]$/o) {} else {} },
    to your code...

    The result was disapointing for me...

    comp: 51 wallclock secs (50.38 usr + 0.00 sys = 50.38 CPU) exist: 50 wallclock secs (48.57 usr + 0.00 sys = 48.57 CPU) oregex: 55 wallclock secs (54.11 usr + 0.00 sys = 54.11 CPU) regex: 54 wallclock secs (54.10 usr + 0.00 sys = 54.10 CPU)

    Could someone explain me why the optimization didn't optimize anything ?

    And why my result are so different from yours ? (especially the ratio comp/regex or exist/regex ?)

    I must add that I run the test several time to be sure...ANd the results were consistent...

Re: Testing a string for a range of characters
by dsb (Chaplain) on Jan 24, 2001 at 20:29 UTC
    One of the reasons that a regular expression might be inefficient is that it would compile every time that it was checked. To save the time of doing that you might want to try the 'o' modifier. This modifier will compile the pattern only once. Since you have no variable interpolation you can get away with this. So the regex would look like this:
    $test_char =~ /^[A-Z]$/o;
    Try using that as the regex and then see how fast it is. - kel -
      This is incorrect. The regex is compiled only once so long as its contents have no interpolated variables. The /o flag does nothing in this case.
        Ummmm. It most certainly does :D. There is no variable interpolation going on this Regex. Not the one I put up there anyway. Point it out if you don't mind. - kel -
      Hmm, good suggestion, but these are the results I get -

      Benchmark: timing 10000000 iterations of comp, exist, regex, regexo... comp: 53 wallclock secs (53.96 usr + 0.00 sys = 53.96 CPU) exist: 49 wallclock secs (49.23 usr + 0.00 sys = 49.23 CPU) regex: 78 wallclock secs (77.40 usr + 0.00 sys = 77.40 CPU) regexo: 78 wallclock secs (77.12 usr + 0.00 sys = 77.12 CPU)

      As you can see regexo isn't showing much difference. Could this be because I'm using /^[A..Z]$/o instead of your suggested /^[A-Z]$/o, or are they functionally equivalent?

      Pete

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://54003]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (7)
As of 2024-04-19 08:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found