Testing a string for a range of characters

BoredByPolitics has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Testing a string for a range of characters by chipmunk (Parson) on Jan 24, 2001 at 21:53 UTC
I notice that you're trying to use the range operator, `..`, inside your character class. `[A..Z]` won't do what you think; it will match an A, or a period, or a Z. The way to specify a range in a character class is with a dash; `[A-Z]` will match any capital letter. The part we really want to benchmark is just the code that does the check; the creation of the test character and the if block can be factored out. Note that, except for comp, the actual character doesn't matter, just whether or not it matches. Due to your use of `rand 256` to generate the characters, your benchmark is heavily weighted towards non-matching instances. I've split each sub into matching and non-matching versions (comp required three non-matching versions). I shortened the names of the subs just so the chart would be more compact with all the extra subs I added. :) #!perl -w use strict; use Benchmark qw/cmpthese/; my %tran_types; foreach ('A'..'Z') { $tran_types{$_} = "" }; my $char_y = 'J'; # yes my $char_n = '_'; # no my $char_n_l = 'AA'; # no; too long my $char_n_A = '0'; # no; lt 'A' my $char_n_Z = 'a'; # no; gt 'Z' cmpthese(-(shift), { 're_Y' => sub { $char_y =~ /^[A-Z]\z/ }, 're_n' => sub { $char_n =~ /^[A-Z]\z/ }, 'co_Y' => sub { length $char_y == 1 and $char_y ge "A" and $char_y le "Z" }, 'co_nA' => sub { length $char_n_A == 1 and $char_n_A ge "A" and $char_n_A le "Z" }, 'co_nZ' => sub { length $char_n_Z == 1 and $char_n_Z ge "A" and $char_n_Z le "Z" }, 'co_nl' => sub { length $char_n_l == 1 and $char_n_l ge "A" and $char_n_l le "Z" }, 'ex_Y' => sub { exists $tran_types{$char_y} }, 'ex_n' => sub { exists $tran_types{$char_n} }, }); [download] And the results... I first tried running each snippet for 3 seconds, but the results varied too much from run to run. I forced myself to be patient and run each snippet for 15 seconds. :) Benchmark: running co_Y, co_nA, co_nZ, co_nl, ex_Y, ex_n, re_Y, re_n, +each for at least 15 CPU seconds... co_Y: 18 wallclock secs (14.95 usr + 0.05 sys = 15.00 CPU) @ 26 +6935.00/s (n=4004025) co_nA: 15 wallclock secs (15.63 usr + -0.03 sys = 15.60 CPU) @ 35 +5186.35/s (n=5540907) co_nZ: 18 wallclock secs (15.71 usr + 0.10 sys = 15.81 CPU) @ 26 +2582.80/s (n=4151434) co_nl: 17 wallclock secs (16.12 usr + -0.09 sys = 16.03 CPU) @ 72 +0797.75/s (n=11554388) ex_Y: 14 wallclock secs (16.11 usr + -0.02 sys = 16.09 CPU) @ 71 +2203.05/s (n=11459347) ex_n: 14 wallclock secs (16.72 usr + -0.62 sys = 16.10 CPU) @ 86 +2264.35/s (n=13882456) re_Y: 16 wallclock secs (15.18 usr + 0.12 sys = 15.30 CPU) @ 20 +2467.06/s (n=3097746) re_n: 17 wallclock secs (15.97 usr + -0.02 sys = 15.95 CPU) @ 24 +3489.40/s (n=3883656) Rate re_Y re_n co_nZ co_Y co_nA ex_Y co_nl ex_n re_Y 202467/s -- -17% -23% -24% -43% -72% -72% -77% re_n 243489/s 20% -- -7% -9% -31% -66% -66% -72% co_nZ 262583/s 30% 8% -- -2% -26% -63% -64% -70% co_Y 266935/s 32% 10% 2% -- -25% -63% -63% -69% co_nA 355186/s 75% 46% 35% 33% -- -50% -51% -59% ex_Y 712203/s 252% 192% 171% 167% 101% -- -1% -17% co_nl 720798/s 256% 196% 175% 170% 103% 1% -- -16% ex_n 862264/s 326% 254% 228% 223% 143% 21% 20% -- [download] No surprises here; generally the same results as in your benchmark. The results do show the value of considering matching and non-matching cases. exists is still the winner, executing quickly whether the string matches or not. The revised regex solution is still slower than the others. comp is actually quite fast when the string is too long, but if it has to execute all three comparisons it's much slower.	[reply] [d/l] [select]
Re: Re: Testing a string for a range of characters by BoredByPolitics (Scribe) on Jan 24, 2001 at 22:14 UTC
Wow - that's an extremely complete writeup :-) Due to your use of rand 256 to generate the characters, your benchmark is heavily weighted towards non-matching instances. I've split each sub into matching and non-matching versions (comp required three non-matching versions). I used `rand` to try and avoid the optimisation that I've read Perl will do to what is in effect a literal; but your method seems to take that into account. It never ceases to amaze me how much you can learn here :-) Pete	[reply] [d/l]
Re: Testing a string for a range of characters by Fastolfe (Vicar) on Jan 24, 2001 at 20:36 UTC
In addition, if you're really worried this much about performance, your hash creation could be simplified: `my %hash; @hash{'A'..'Z'} = (); ... if (exists $hash{$whatever}) { ... }` [download]	[reply] [d/l]
Re: Re: Testing a string for a range of characters by BoredByPolitics (Scribe) on Jan 24, 2001 at 21:02 UTC
Aha! That's what tilly had suggested, but without the `my %hash;` first, and I was getting confused when trying to combine the my and assignment in the same statment! Thanks Fastolfe, I'd not seen that particular usage before :-) As regards your other reply, currently I'm `shift`ing off the first element of the array, which tells me what the record type is. I then join the array with a delimiter for writing out to a flat file, minus the transaction type (or rather the transaction type get's built into the key I insert with the join). Pete	[reply] [d/l] [select]
Re: Testing a string for a range of characters by Albannach (Monsignor) on Jan 24, 2001 at 21:12 UTC
Just another WTDI: `# predefine $testlist as a simple string of the set of acceptable char +acters then 'index' => sub { $test_char = chr( int( rand 256 ) ); if (index($testlist, $test_char) > -1 && length($te +st_char)==1) {} else {} },` [download] which gives the surprisingly (to me) competitive result: Benchmark: timing 10000000 iterations of comp, exist, index, index2, r +egex, regex2, regex3... comp: 44 wallclock secs (41.64 usr + 0.00 sys = 41.64 CPU) @ 24 +0147.93/s (n=10000000) exist: 41 wallclock secs (38.30 usr + 0.00 sys = 38.30 CPU) @ 26 +1117.06/s (n=10000000) index: 42 wallclock secs (41.30 usr + 0.00 sys = 41.30 CPU) @ 24 +2148.34/s (n=10000000) index2: 40 wallclock secs (39.48 usr + 0.00 sys = 39.48 CPU) @ 25 +3260.73/s (n=10000000) regex: 55 wallclock secs (53.28 usr + 0.00 sys = 53.28 CPU) @ 18 +7684.17/s (n=10000000) regex2: 53 wallclock secs (52.34 usr + 0.01 sys = 52.36 CPU) @ 19 +0989.13/s (n=10000000) regex3: 53 wallclock secs (50.97 usr + 0.00 sys = 50.97 CPU) @ 19 +6197.69/s (n=10000000) [download] I also tested a couple of variations: regex2 - used `[A-Z]` in place of `[A..Z]` as I've never used the latter before and can't even figure what it would do at the moment regex3 - omitted the ^ and $, which violated the spec but I wanted to know index2 - similarly omitted the length() call Update: Benchmarks run on NT4 PIII-500 w/256MB RAM, Perl 5.6.0, ActiveState build 618 -- I'd like to be able to assign to an luser	[reply] [d/l] [select]
Re: Re: Testing a string for a range of characters by BoredByPolitics (Scribe) on Jan 24, 2001 at 21:44 UTC
regex2 - used A-Z in place of A..Z as I've never used the latter before and can't even figure what it would do at the moment Not what I'd thought it would do! OK, I made a wrong assumption (getting mixed up between array ranges and regex ranges) - you're quite correct, `/$[A-Z]^/` is functionally correct. I tried your suggestion along side the (now correct) regex and exist, with the following results - `Benchmark: timing 10000000 iterations of exist, index, regex... exist: 50 wallclock secs (49.39 usr + 0.01 sys = 49.40 CPU) index: 61 wallclock secs (60.64 usr + 0.00 sys = 60.64 CPU) regex: 85 wallclock secs (83.18 usr + 0.03 sys = 83.21 CPU)` [download] It's interesting that I get a bigger difference between the exist and index versions. I wonder what OS & hardware you're running on? Mine is linux v2.2.18 on a Pentium II 400Mhz with 128Mb RAM. Pete	[reply] [d/l] [select]
Re: Testing a string for a range of characters by Fastolfe (Vicar) on Jan 24, 2001 at 20:10 UTC
It might also pay to ask yourself, "How many times will this test be repeated?" If it's just once, I might suggest that you move your hash creation into your benchmark test, as I suspect the expense involved in the construction of the hash will make the other methods a bit more appealing. If your test will be repeated dozens or more times on the same range, a hash is clearly the better way to approach it.	[reply]
Re: Re: Testing a string for a range of characters by BoredByPolitics (Scribe) on Jan 24, 2001 at 20:25 UTC
The test will be repeated possibly thousands of times, as records are read in from an array, but I agree, if it was only going to be a one off, then I'd probably have left it as a regex. Pete	[reply]
Re: Re: Re: Testing a string for a range of characters by Fastolfe (Vicar) on Jan 24, 2001 at 20:31 UTC
If you're reading these in from an array, I wonder if a test like this might be faster: `my @array = get_block_of_data; data_is_bad if join("", @array) =~ /[^A-Z]/;` [download]	[reply] [d/l]
Re: Testing a string for a range of characters by I0 (Priest) on Jan 24, 2001 at 21:01 UTC
Given `chr( int( rand 256 ) )` `($test_char le "Z") && ($test_char ge "A")` would be faster on average than `($test_char ge "A") && ($test_char le "Z")` but both would accept "AA", contrary to your specification a)	[reply] [d/l] [select]
Re: Testing a string for a range of characters by chromatic (Archbishop) on Jan 25, 2001 at 09:46 UTC
I can't believe no one mentioned transliteration: `$test_char =~ tr/A-Z//c;` Talk about an underappreciated tool.	[reply] [d/l]
Re: Re: Testing a string for a range of characters by BoredByPolitics (Scribe) on Jan 25, 2001 at 14:28 UTC
I'm at home today, with a much slower machine (P166 64Mb, linux 2.2.18pre21, Perl 5.005_03), so I've reduced the number of iterations to a million. These benchmarks are also with the compl routine checking the length() of the test character first, and with the regex fixed - `Benchmark: timing 1000000 iterations of compl, exist, regex, tr... compl: 17 wallclock secs (17.35 usr + 0.02 sys = 17.37 CPU) exist: 15 wallclock secs (14.02 usr + 0.01 sys = 14.03 CPU) regex: 18 wallclock secs (18.62 usr + 0.00 sys = 18.62 CPU) tr: 16 wallclock secs (16.67 usr + 0.01 sys = 16.68 CPU)` [download] So, faster than the regex, but still not as fast as the exist() method. Thanks for the suggestion though :-) Pete	[reply] [d/l]
Re: Re: Testing a string for a range of characters by chipmunk (Parson) on Jan 25, 2001 at 21:02 UTC
Here are two ways to use tr/// to get a complete solution to the problem, checking both the length and the contents of the string: `length $test_char == 1 and $test_char =~ tr/A-Z//;` `$test_char =~ tr/A-Z// == 1 and $test_char =~ tr/A-Z//c == 0;`	[reply] [d/l] [select]
Could someone explain this ? by arhuman (Vicar) on Jan 24, 2001 at 20:50 UTC
*update* OOPS It tooks me too long to type this post, what I said here was said above while I was typing this... *update* In another ridiculous attempt to enhance things... (Why am I always missing something ?) I've tried a to 'enhance' the regex using the o modificator So I added : `'oregex' => sub { $test_char = chr( int( rand 256 ) ); if ($test_char =~ /^[A..Z]$/o) {} else {} },` [download] to your code... The result was disapointing for me... `comp: 51 wallclock secs (50.38 usr + 0.00 sys = 50.38 CPU) exist: 50 wallclock secs (48.57 usr + 0.00 sys = 48.57 CPU) oregex: 55 wallclock secs (54.11 usr + 0.00 sys = 54.11 CPU) regex: 54 wallclock secs (54.10 usr + 0.00 sys = 54.10 CPU)` [download] Could someone explain me why the optimization didn't optimize anything ? And why my result are so different from yours ? (especially the ratio comp/regex or exist/regex ?) I must add that I run the test several time to be sure...ANd the results were consistent...	[reply] [d/l] [select]
Re: Testing a string for a range of characters by dsb (Chaplain) on Jan 24, 2001 at 20:29 UTC
One of the reasons that a regular expression might be inefficient is that it would compile every time that it was checked. To save the time of doing that you might want to try the 'o' modifier. This modifier will compile the pattern only once. Since you have no variable interpolation you can get away with this. So the regex would look like this: `$test_char =~ /^[A-Z]$/o;` [download] Try using that as the regex and then see how fast it is. - kel -	[reply] [d/l]
Re: Re: Testing a string for a range of characters by Fastolfe (Vicar) on Jan 24, 2001 at 20:35 UTC
This is incorrect. The regex is compiled only once so long as its contents have no interpolated variables. The /o flag does nothing in this case.	[reply]
Re: Re: Re: Testing a string for a range of characters by dsb (Chaplain) on Jan 26, 2001 at 21:29 UTC
Ummmm. It most certainly does :D. There is no variable interpolation going on this Regex. Not the one I put up there anyway. Point it out if you don't mind. - kel -	[reply]
Re: Re: Re: Re: Testing a string for a range of characters by Fastolfe (Vicar) on Jan 26, 2001 at 22:29 UTC
Re: Re: Re: Re: Re: Testing a string for a range of characters by dsb (Chaplain) on Jan 26, 2001 at 22:54 UTC
Re: Re: Testing a string for a range of characters by BoredByPolitics (Scribe) on Jan 24, 2001 at 20:49 UTC
Hmm, good suggestion, but these are the results I get - `Benchmark: timing 10000000 iterations of comp, exist, regex, regexo... comp: 53 wallclock secs (53.96 usr + 0.00 sys = 53.96 CPU) exist: 49 wallclock secs (49.23 usr + 0.00 sys = 49.23 CPU) regex: 78 wallclock secs (77.40 usr + 0.00 sys = 77.40 CPU) regexo: 78 wallclock secs (77.12 usr + 0.00 sys = 77.12 CPU)` [download] As you can see `regexo` isn't showing much difference. Could this be because I'm using `/^[A..Z]$/o` instead of your suggested `/^[A-Z]$/o`, or are they functionally equivalent? Pete	[reply] [d/l] [select]


Pathologically Eclectic Rubbish Lister
	PerlMonks