Matching data against non-consecutive range

Popcorn Dave has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Matching data against non-consecutive range by Mr. Muskrat (Canon) on Jan 27, 2005 at 05:25 UTC
I do not think that I would ever use a regex to match specific numbers in a given range. I would use a lookup table instead. `#!/usr/bin/perl use strict; # Zone 8 my %z8; $z8{$_}++ for ( 10..374, 376..379, 382..385, 388..499, 530..534, 541.. +543, 618, 619, 700..704, 707..709 ); my $num = "34"; print "Found it!\n" if $z8{$num};` [download] Another option would be to pull the zone information using Business::UPS. Store it in a database or use Storable's freeze and thaw to save and load your data. You'll probably also want to determine how often to automatically freshen your information.	[reply] [d/l]
Re^2: Matching data against non-consecutive range by bart (Canon) on Jan 27, 2005 at 22:07 UTC
Using Regex::PreSuf, you can convert that list of numbers into a regex (or at least, a string you can use as a regex). Like this: `use Regex::PreSuf; my $re = presuf(10..374, 376..379, 382..385, 388..499, 530..534, 541.. +543, 618, 619, 700..704, 707..709); print "Regex: /$re/\n";` [download] This prints: Regex: /(?:1(?:0[0123456789]\|1[0123456789]\|2[0123456789]\|3[0123456789] +\|4[0123456789]\|5[0123456789]\|6[0123456789]\|7[0123456789]\|8[0123456789 +]\|9[0123456789]\|[0123456789])\|2(?:0[0123456789]\|1[0123456789]\|2[01234 +56789]\|3[0123456789]\|4[0123456789]\|5[0123456789]\|6[0123456789]\|7[0123 +456789]\|8[0123456789]\|9[0123456789]\|[0123456789])\|3(?:0[0123456789]\|1 +[0123456789]\|2[0123456789]\|3[0123456789]\|4[0123456789]\|5[0123456789]\| +6[0123456789]\|7[012346789]\|8[234589]\|9[0123456789]\|[0123456789])\|4(?: +0[0123456789]\|1[0123456789]\|2[0123456789]\|3[0123456789]\|4[0123456789] +\|5[0123456789]\|6[0123456789]\|7[0123456789]\|8[0123456789]\|9[0123456789 +]\|[0123456789])\|5(?:3[01234]\|4[123]\|[0123456789])\|6(?:1[89]\|[01234567 +89])\|7(?:0[01234789]\|[0123456789])\|8[0123456789]\|9[0123456789])/ [download] Wow, that's big: length($re) is 746 bytes. It won't hurt though, you can just use it to match: `$zip = '34'; if($zip =~ /\b$re\b/o) { print "Got a match for $zip\n"; }` [download] which prints: `Got a match for 34` [download] Considered by jmanning2k - Break Long Line Unconsidered by castaway - Keep/Edit/Delete: 7/7/0 - Get a working browser, see Re^2: Monastery Gates page is too wide (causes) Note by bart: I use Firefox as my standard browser, and it renders correctly for me. I asked on the Chatterbox, and both castaway and corion said it looked normal to them, too. They suggested it could be caused by a difference in site settings for wrapping... shrug I don't know any more.	[reply] [d/l] [select]
Re^3: Matching data against non-consecutive range by demerphq (Chancellor) on Jan 28, 2005 at 14:17 UTC
IMO its a good idea to attach "don't do this" notices to a solution like this. Its hopelessly inefficient in comparison to the hash lookup. --- demerphq	[reply]
Re^4: Matching data against non-consecutive range by bart (Canon) on Jan 30, 2005 at 11:04 UTC
Re^4: Matching data against non-consecutive range by Spooner (Acolyte) on Jan 30, 2005 at 09:32 UTC
Re: Matching data against non-consecutive range by BrowserUk (Patriarch) on Jan 27, 2005 at 12:12 UTC
The memory consumption of this is fixed to the total size of the domain regardless of the number of subranges it contains. Construction and addition of ranges is trivial. Lookup is O(1). If the total range is somewhere in the bounds of sanity, then this is quick and easy and will often use less memory than the related hash or tree structure. If the range is large, then you can use vec and reduce the memory requirement to 1/8 th. It doesn't work well for real numbers--though it can approximate them. #! perl -slw use strict; use List::Util qw[ reduce ]; $a = $b; my $zones = reduce { my( $first, $last ) = split '-', $b; $last \|\|= $first; ## 1; corrected per [bmann]'s post below. substr( $a, $first, $last - $first + 1 ) = 'x' x ( $last - $first ++ 1 ); $a } chr(0) x 1000, split ',', '10-374,376-379,382-385,388-499,530-534,541-543,618,619,700-704,707-70 +9'; print $zones =~ m[^.{$_}x] ? "Found $_!" : "$_ isn't there!" for 9, 374, 375, 376, 999; __END__ [12:03:42.74] P:\test>425464 9 isn't there! Found 374! 375 isn't there! Found 376! 999 isn't there! [download] Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.	[reply] [d/l]
Re^2: Matching data against non-consecutive range by bmann (Priest) on Jan 27, 2005 at 22:29 UTC
Slick! I wouldn't have thought of using reduce for this. I'll ++ you tomorrow when I get more votes ;) There is a small bug in the implementation, though: single number ranges fail. Add 618 or 619 to the test data, you get "618 isn't there!". Here's a patch to fix it: `# $last \|\|= 1; # needs to be $last \|\|= $first;` [download] And here's another patch that disposes of pre-allocating the 1000 character string for `$zones`. I'm sure it's slower, but it'll work for an arbitrary range (ie, > 1000): `my $zones = reduce { my( $first, $last ) = split '-', $b; $last \|\|= $first; #allocate $a here $a .= chr(0) x ( $last ); substr( $a, $first, $last - $first + 1 ) = 'x' x ( $last - $first ++ 1 ); $a } '', split ',', '10-374,376-379,382-385,388-499,530-534,541-543,618,619,700-704,707-70 +9';` [download] Update: added second patch	[reply] [d/l] [select]
Re^3: Matching data against non-consecutive range by BrowserUk (Patriarch) on Jan 27, 2005 at 22:32 UTC
Patch applied. Many thanks++ Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.	[reply] [d/l]
Re^4: Matching data against non-consecutive range by bmann (Priest) on Jan 27, 2005 at 22:36 UTC
Re: Matching data against non-consecutive range by Anonymous Monk on Jan 27, 2005 at 11:37 UTC
The ranges in `/[]/` are character ranges. Not numbers. What's between `[]` always matches one character. Never zero. Never more than one. So, if you have `/[10-450]/`, you are matching either a 1, a character between 0 and 4 (inclusive), a 5 or a 0. Which basically means 0, 1, 2, 3, 4 or 5. How to best match in ranges like this depends. It depends on how many matches you are doing (the more, the more time you can spend on doing preprocessing), the number of ranges, and the size of ranges. A couple of techniques (some already been presented): Put all the different numbers in a hash. This gives fast lookup times, but if the ranges are large, this takes a lot of preprocessing time and memory. Doesn't work if both the endpoints and querypoints are reals instead of integers. Iterate over the ranges, testing whether the querypoint falls between the endpoints. This gives a simple algorithm, but the querytime is linear in the number of ranges. Not so good if you have a lot of ranges and a lot of queries to perform. If you have just one query to do, this is the way to go. Binary search over the endpoints. Reasonably fast lookup times (O(log R), with R the number of ranges), so if you have a lot of lookups, this maybe the way to go. Tricky to get right if there are overlapping ranges. Doesn't allow for easy adding or deleting ranges. Segment tree. Basically a binary search tree on the ranges. More code, but same efficiency as a normal binary search. Allows for efficient adding/deleting ranges. Useful if you have a lot of ranges, do a lot of queries, and if your set of ranges isn't static.	[reply] [d/l] [select]
Re: Matching data against non-consecutive range by bass_warrior (Beadle) on Jan 27, 2005 at 13:33 UTC
How about using Set::IntSpan `use strict; use Set::IntSpan; my $z8Set = Set::IntSpan->new("10-374,376-379,382-385,388-499,530-534, +541-543,618,619,700-704,707-709"); my $num = 8; print "Found\n" if $z8Set->number($num);` [download]	[reply] [d/l]
Re: Matching data against non-consecutive range by lidden (Curate) on Jan 27, 2005 at 05:48 UTC
`[]` does not do what you think it does in a regexp. I would do a sub to do what you want done. `# Zone 8 my $z8 = "10-374,376-379,382-385,388-499,530-534,541-543,618,619,700-7 +04,707-709"; my $num = "375"; print "Found it!\n" if in_range($z8, $num); #$num =~/[$z8]/; sub in_range{ my $numbers = shift; my $num = shift; my @ranges = split ',', $numbers; foreach my $r (@ranges){ if (my ($small, $big) = $r =~ /^(\d+)-(\d+)$/){ return 1 if $small <= $num and $big >= $num; } elsif( my ($n) = $r =~ /^(\d+)$/){ return 1 if $n == $num; } else{ die 'Oh noo:-('; } } return 0; }` [download] Update: Fixed bugs	[reply] [d/l] [select]
Re: Matching data against non-consecutive range by fglock (Vicar) on Jan 27, 2005 at 15:03 UTC
Yet another way to do it: `#!/usr/bin/perl use strict; use Set::Infinite; my $z8 = Set::Infinite->new( [10,374], [376,379], [382,385], [388,499], [530,534], [541,543], [618,619], [700,704], [707,709] ); print "Found it!\n" if $z8->contains( 34 );` [download]	[reply] [d/l]
Re: Matching data against non-consecutive range by Popcorn Dave (Abbot) on Jan 28, 2005 at 03:22 UTC
Thanks to everyone who replied! As BrowserUK pointed out, the data should be in the sane range, and it is. Since UPS only looks at the first 3 digits of a zip code, I would be looking at a little less than 1000 elements using a hash structure. However I'm intrigued by the Set::Infinite and Set::IntSpan modules, so I think I may try those before resorting to the hash. Thanks again all! Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.	[reply]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks