
Re^3: Advice on Efficient Large-scale Web Crawling

by matija (Priest)
on Dec 19, 2005 at 14:57 UTC (#517740=note)

in reply to Re^2: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling

With a single hex digit per directory name you get an average of 15625 files per directory, which is still too many (IMHO). It might work if the filesystem uses hashed directory lookups, but I can't remember offhand which filesystems have them and which don't.

I suggest you simply change that to two hex digits per directory name.


That should reduce the average number of files per directory to a much more reasonable 60 and change.
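A minimal sketch of such a layout, in Python for illustration. It assumes file names come from a hex digest of the URL (SHA-256 here, purely as a stand-in) and that there are two directory levels. The function names, the `cache` root, and the total of 4,000,000 files are hypothetical: the total and the two levels are only inferred from the 15625 and "60 and change" figures above, not stated in this post.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(url, root="cache", levels=2, width=2):
    """Build a path like root/ab/cd/<digest> from a hex digest of the URL.

    `width` is the number of hex digits per directory name and `levels`
    is the number of nested directories (both are assumptions).
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, digest)


def files_per_dir(total_files, levels, width):
    """Average files per leaf directory, assuming evenly spread digests."""
    return total_files / 16 ** (levels * width)


if __name__ == "__main__":
    print(shard_path("http://example.com/"))
    # Assumed total of 4,000,000 files, inferred from the figures above.
    print(files_per_dir(4_000_000, levels=2, width=1))  # one hex digit per name
    print(files_per_dir(4_000_000, levels=2, width=2))  # two hex digits per name
```

Under those assumptions, one hex digit per name gives 256 leaf directories and two hex digits give 65536, which is where the drop from 15625 to roughly 61 files per directory comes from.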

And yes, benchmarking (lots and lots of benchmarking) and tweaking seem to be the best way to tackle this kind of problem.
