Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

From the code, I think any string will do as a salt. All it's doing is passing it as the second argument to sha1(), and all that does is append it to the first argument before hashing. The reason it talks about salts plural is that instead of using different hashing algorithms, it gets its vector coordinates by reusing the same algorithm with a different addendum. Throw some arbitrary but consistent four letter words at it and see what happens.

More generally, it seems like a neat approach but will require at least two passes to really work. Your second pass will have to weed out the false positives by rescanning the data set with a simple hash-based tally restricted to the output of the first pass. In that case you can choose the accuracy based on the number of false positives you can afford to screen for, and even 0.1% would give you a fairly manageable hash of 30,000 keys.

Incidentally, I'm no pack() leader, but if I were you I would think about rolling your own using only some bits of Bloom::Filter. It's biased towards working with email addresses, which are very heterogeneous, and there may well be regularities in your data set (they are IDs, after all) that make other algorithms much more efficient and discriminatory than just applying and reapplying the same hashing routine.

update: according to the perl.com article, with 30million keys and 0.1% accuracy, you need 10 hash functions to get the most memory-efficient filter. For 0.01% it's 13 functions and 30% more memory use, and so on:

my ( $m, $k ) = optimise(30000000, 0.0001); print <<""; memory usage = $m hash functions = $k sub optimise { my ( $num_keys, $error_rate ) = @_; my $lowest_m; my $best_k = 1; foreach my $k ( 1..100 ) { my $m = (-1 * $k * $num_keys) / ( log( 1 - ($error_rate ** (1/$k)))); if ( !defined $lowest_m or ($m < $lowest_m) ) { $lowest_m = $m; $best_k = $k; } } return ( $lowest_m, $best_k ); }

In reply to Re: Bloom::Filter Usage by thpfft
in thread Bloom::Filter Usage by jreades

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (9)
As of 2024-04-23 10:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found