Update: If anyone would care to explain why this post rates a downvote, I'd be pleased to learn.
If it would help, here is a badly named and almost undocumented module that may see light of day sometime.
It will allow you to create an on-the-fly lookup table of your 30,000,000 12-digit numbers using under 256MB of ram.
The performance isn't too bad at an average of 40 usecs per lookup in my test application below.
package Algorithm::BLT;
use strict;
use warnings;
use List::Util qw[ reduce ];
our $VERSION = '0.01';
sub new{
my( %opts ) = @_;
my( $cvtFmt, $keyFmt, $tailFmt ) =
delete @opts{ qw[ convert key tail ] };
die "Usage: " . __PACKAGE__ .
"::new( convert => fmt, key => fmt, tail => fmt );\n"
unless 3 == grep{ defined } $cvtFmt, $keyFmt, $tailFmt;
warn "Ignoring extraneous options: ", %opts if %opts;
my %lookup;
my $seen = sub {
use bytes;
my $n = pack( $cvtFmt, shift );
my $key = unpack $keyFmt, $n;
my $tail = join '', unpack $tailFmt, $n;
my $len = length $tail;
if( exists $lookup{ $key } ) {
no warnings;
my $ref = \$lookup{ $key };
return 1 if $$ref =~ m[(?<=^.{$len})*\Q$tail\E];
}
$lookup{ $key } .= $tail;
return 0;
};
return $seen unless wantarray;
my $stats = sub{
return sprintf 'keys: %d bytes: %d',
scalar keys %lookup,
reduce{ $a += length $b } 0, values %lookup;
};
my $hRef = sub { return \%lookup };
return $seen, $stats, $hRef;
}
1;
__END__
=head1 NAME
Algorithm::BLT - Perl extension for creating very large, fast lookup t
+ables.
(BLT derives from Big Lookup Tables.)
=head1 SYNOPSIS
use Algorithm::BLT;
my $seen = Algorith::BLT::new(
convert => 'd',
key => 'x4 a2',
tail => 'a4 x2 a4'
);
for my $thingsToLookUp ( ... ) {
if( $seen->( $thingsToLookUp ) ) {
# Seen it already
}
else{
# We got a new one
}
}
=head1 DESCRIPTION
Typically, perl programs use hashes to effect fast loopup tables.
Whilst this is very simply to code and very fast to use, the trade-off
+
is that they use large volumes of memory.
This module attempts to address the case where the program only requir
+es
the lookup facility without all the other features that a hash provide
+s,
like values associated with the keys or the ability to iterate the val
+ues,
and which are therefore just overhead for the pure lookup table usage.
=head2 PARAMETERS
Currently 3 named parameters are possible and required. Each takes the
+
form of pack/unpack format strings.
=over 4 convert
convert - this format is used to convert input values to their binary
form for reduce storage footprint. For example, if the values to be st
+ored
in the lookup are 12 digit numbers, these can be reduce to 8 bytes by
converting them to their 'double' binary floating point representation
+
using 'd' for this parameter.
=over 4 key
key - this format is used to extract the key from the converted values
+.
This should be chosen to give the best spread of values across the
'buckets' (hash elements) that are used as the first pass selection
mechanism. In the case of double floats, the 5th and 6th bytes do
this better than any others bytes, so the format used is 'x4 a2';
=over 4 tail
Tail - this format is used to extract the rest of the data (the tail)
+
from the converted value for storage within the buckets. In the case
of the double float, the tail is split into 2 parts, the first 4 bytes
and the last two. The output of the format will be concatenated to for
+m
the secondary key. The format to use in the double float case is
therefore 'a4 x2 a2';
=head2 EXPORT
None.
=head1 AUTHOR
BrowserUk
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2004 by BrowserUk
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.3 or,
at your option, any later version of Perl 5 you may have available.
=cut
My testcase.
#! perl -slw
use strict;
use List::Util qw[ first reduce ];
use Math::Random::MT qw[ rand ];
use Benchmark::Timer;
use Algorithm::BLT;
$| = 1;
our $MAX ||= 1000000;
my( $new, $exists ) = (0) x 2;
my $seen = Algorithm::BLT::new(
convert => 'd',
key => 'x4 a2',
tail => 'a4 x2 a2'
);
my $T = new Benchmark::Timer;
$T->start( 'test' );
for( 1 .. $MAX ) {
my $n = int( rand 1_000_000_000_000 );
$seen->( $n ) ? $exists++ : $new++;
printf "\r%10d : %10d", $exists, $new
unless $_ %1000;
}
print $/;
$T->stop( 'test' );
$T->report;
printf 'Record mem. usage? '; <STDIN>;
__END__
P:\test>346619 -MAX=30000000
104190 : 29895810
1 trial of test (1,252s total)
Record mem. usage? 215,412 KB
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail