Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Bloom Filter or other mehod to store URL's?

by Jaap (Curate)
on Apr 14, 2005 at 13:50 UTC ( [id://447781]=perlquestion: print w/replies, xml ) Need Help??

Jaap has asked for the wisdom of the Perl Monks concerning the following question:

For fun & excercise, i am writing an application that recursively checks the links in an HTML document on a webserver.
If the URLs it has encountered are stored, a link only has to be checked once so i want to store the URLs for the sole purpose of seeing whether i encountered them before.

This could be done using a sorted array (lots of mem, binary search, O(log N)), a hash (lots of mem, O(1)) or a Bloom Filter (low mem, false positives, O(1)).

I think the Bloom Filter is the best option here. Does anyone have a different/better option?
  • Comment on Bloom Filter or other mehod to store URL's?

Replies are listed 'Best First'.
Re: Bloom Filter or other mehod to store URL's?
by dmorelli (Scribe) on Apr 14, 2005 at 14:32 UTC
    I just did some reading at the page you linked to (Bloom Filter) and the discussion of the math that's linked from there.

    I was never that fabulous with math, but I think I get it. Seems like a good plan for a very large set. I say do it.

    I had never seen the Bloom Filter, it was cool to read about. Thank you!

    PS: I found the discussion of the math page to be easier to grasp. It has diagrams.

    Update: I was wondering, with the Bloom thing, how do you come up with suitable functions to generate the hashes?

    Update: I found this at perl.com: Using Bloom Filters

      Stuff like MD5 can be used for suitable hashes. The things that you normally use to encrypt passwords one way.
Re: Bloom Filter or other mehod to store URL's?
by adrianh (Chancellor) on Apr 14, 2005 at 15:14 UTC
    I think the Bloom Filter is the best option here. Does anyone have a different/better option?

    One of those times I'd reach for a database. SQLite should cope with a few million rows quite happily.

Re: Bloom Filter or other mehod to store URL's?
by Random_Walk (Prior) on Apr 14, 2005 at 19:40 UTC

    This got me thinking about how much redundancy there is in urls, 90% start with www, and are .com, plenty of index.htmls out there, cgi-bin etc. If you split the url up and use each part as a key in a multilevel hash perl only stores each unique key once. Searching is fast and you will get no false possitives. Let the code speak ...

    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; my %store; while (<DATA>) { next if /^\s*$/; chomp; s !//!/!g; s !:!!g; my @bits = split /\.|\//; my $key = ( join "}{", @bits); my $do_this = '$store{' . $key . '}++'; eval $do_this; } print Dumper(\%store); __DATA__ http://www.foo.com/this.html http://www.foo.com/index.html http://www.foo.com/that.html http://www.foo.com/cgi-bin/that.cgi http://www.foo.com/cgi-bin/user.cgi http://www.fred.com/index.html http://www.foo.org/index.html # output $VAR1 = { 'http' => { 'www' => { 'foo' => { 'org' => { 'index' => { + 'html' => +1 } }, 'com' => { 'index' => { 'h +tml' => 1 }, 'that' => { 'ht +ml' => 1 }, 'this' => { 'ht +ml' => 1 } } }, 'fred' => { 'com' => { 'index' => { ' +html' => 1 } } } } } };

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: Bloom Filter or other mehod to store URL's?
by BrowserUk (Patriarch) on Apr 14, 2005 at 18:42 UTC

    If you use the (binary) md5 of the url as the hash key and don't assign anything as the value (ie. use  undef $hash{ md5( $url ) }; to autovivify the key), then storing 10 million urls will require around 1 GB of ram.

    If you preallocated enough buckets (keys %hash =  2 **24;), then it runs pretty quickly too.

    There is the rare possibility that you will get a false positive by finding two urls that hash to the same md5, but the chances are less than with a bloom filter and if you are using md5 for your Bloom::Filter solution, that would be possible anyway.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
Re: Bloom Filter or other mehod to store URL's?
by Anonymous Monk on Apr 14, 2005 at 15:37 UTC
    So, you have 10 million web pages, but you don't have 10 Gb of disk space to spare? I dunno how much you are costing your company, but I'd be surprised you can come up with a good enough Bloom filter in such a short time that it costs less to implement that, than it costs to buy an extra disk. (It's just going to be scratch space, doesn't need to backup, so the costs of the extra disk space aren't much more than just the disk).
      I know i should not feed the trolls, but i cannot resist so here it comes:
      You assume the following:
      • The solution that takes 10GB is done in 0 time or time much less than the Bloom Filter
      • I do this for a boss who pays me to do it and wants me to work as cheap as possible
      • It will take a lot of time to make the Bloom Filter solution
      These assumptions are incorrect because of the following:
      • Typing use DBI; takes about as much time as typing use Bloom::Filter
      • As stated in the OP, i do this for fun & excercise
      • using Bloom::Filter, implementing this will be a snap
Re: Bloom Filter or other mehod to store URL's?
by jdporter (Paladin) on Apr 14, 2005 at 14:10 UTC
    Hash, definitely. I wouldn't even think about any other alternative, unless and until the memory consumption becomes an issue. Even then, there are probably tied hash classes that store their data on disk rather than in memory.
      Hmmm... our intranet server hosts about 10 million pages. If i do a no-brain hash, it quickly fills my 1 GB of mem. Even on disk a 10 GB hash file doesn't sound nice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://447781]
Approved by ghenry
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (2)
As of 2024-04-25 20:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found