Re: Re: Re: Bloom::Filter Usage

Update: If anyone would care to explain why this post rates a downvote, I'd be pleased to learn.

If it would help, here is a badly named and almost undocumented module that may see light of day sometime.

It will allow you to create an on-the-fly lookup table of your 30,000,000 12-digit numbers using under 256MB of ram.

The performance isn't too bad at an average of 40 usecs per lookup in my test application below.

package Algorithm::BLT;

use strict;
use warnings;
use List::Util qw[ reduce ];

our $VERSION = '0.01';

sub new{
    my( %opts ) = @_;
    
    my( $cvtFmt, $keyFmt, $tailFmt ) = 
        delete @opts{ qw[ convert key tail ] };

    die "Usage: " . __PACKAGE__ . 
        "::new( convert => fmt, key => fmt, tail => fmt );\n"
        unless 3 == grep{ defined } $cvtFmt, $keyFmt, $tailFmt;

    warn "Ignoring extraneous options: ", %opts if %opts;
    
    my %lookup;
    my $seen = sub { 
        use bytes;
        my $n = pack( $cvtFmt, shift );
        my $key = unpack $keyFmt, $n;
        my $tail = join '', unpack $tailFmt, $n;
        my $len = length $tail;
        if( exists $lookup{ $key } ) {
            no warnings;
            my $ref = \$lookup{ $key };
            return 1 if $$ref =~ m[(?<=^.{$len})*\Q$tail\E];
        }
        $lookup{ $key } .= $tail;
        return 0;   
    };
    return $seen unless wantarray;
    
    my $stats = sub{
        return sprintf 'keys: %d bytes: %d', 
            scalar keys %lookup, 
            reduce{ $a += length $b } 0, values %lookup;
    };
    my $hRef = sub { return \%lookup };
    return $seen, $stats, $hRef;
}


1;
__END__
=head1 NAME 

Algorithm::BLT - Perl extension for creating very large, fast lookup t
+ables.

(BLT derives from Big Lookup Tables.)

=head1 SYNOPSIS

  use Algorithm::BLT;
  
  my $seen = Algorith::BLT::new( 
      convert => 'd', 
      key       => 'x4 a2',
      tail      => 'a4 x2 a4'
  );

  for my $thingsToLookUp ( ... ) {
      if( $seen->( $thingsToLookUp ) ) {
          # Seen it already
      }
      else{ 
          # We got a new one
      }
  }

=head1 DESCRIPTION

Typically, perl programs use hashes to effect fast loopup tables. 
Whilst this is very simply to code and very fast to use, the trade-off
+ 
is that they use large volumes of memory. 

This module attempts to address the case where the program only requir
+es 
the lookup facility without all the other features that a hash provide
+s, 
like values associated with the keys or the ability to iterate the val
+ues,
and which are therefore just overhead for the pure lookup table usage.

=head2 PARAMETERS

Currently 3 named parameters are possible and required. Each takes the
+ 
form of pack/unpack format strings. 

=over 4 convert

convert - this format is used to convert input values to their binary 
form for reduce storage footprint. For example, if the values to be st
+ored 
in the lookup are 12 digit numbers, these can be reduce to 8 bytes by 
converting them to their 'double' binary floating point representation
+ 
using 'd' for this parameter.

=over 4 key

key - this format is used to extract the key from the converted values
+.
This should be chosen to give the best spread of values across the 
'buckets' (hash elements) that are used as the first pass selection 
mechanism. In the case of double floats, the 5th and 6th bytes do 
this better than any others bytes, so the format used is 'x4 a2';

=over 4 tail

Tail -  this format is used to extract the rest of the data (the tail)
+ 
from the converted value for storage within the buckets. In the case 
of the double float, the tail is split into 2 parts, the first 4 bytes
and the last two. The output of the format will be concatenated to for
+m
the secondary key. The format to use in the double float case is 
therefore 'a4 x2 a2';

=head2 EXPORT

None.

=head1 AUTHOR

BrowserUk

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2004 by BrowserUk

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.3 or,
at your option, any later version of Perl 5 you may have available.

=cut
[download]

My testcase.

#! perl -slw
use strict;
use List::Util qw[ first reduce ];
use Math::Random::MT qw[ rand ];
use Benchmark::Timer;
use Algorithm::BLT;

$| = 1;

our $MAX ||= 1000000;

my( $new, $exists ) = (0) x 2;
my $seen = Algorithm::BLT::new( 
    convert => 'd', 
    key => 'x4 a2', 
    tail => 'a4 x2 a2' 
);

my $T = new Benchmark::Timer;
$T->start( 'test' );

for( 1 .. $MAX ) {
    my $n = int( rand 1_000_000_000_000 );
    $seen->( $n ) ? $exists++ : $new++;
    printf "\r%10d : %10d", $exists, $new 
          unless $_ %1000;
}
print $/;

$T->stop( 'test' );
$T->report;

printf 'Record mem. usage? '; <STDIN>;

__END__
P:\test>346619 -MAX=30000000
    104190 :   29895810

1 trial of test (1,252s total)

Record mem. usage? 215,412 KB
[download]

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Comment on Re: Re: Re: Bloom::Filter Usage Select or Download Code


good chemistry is complicated, and a little bit messy -LW
	PerlMonks