Re^3: Advice on Efficient Large-scale Web Crawling

by matija (Priest)
on Dec 19, 2005 at 14:57 UTC ( [id://517740] )


in reply to Re^2: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling

With a single hex digit at each directory level you get an average of 15625 files per directory, which is still too many (IMHO). It might work if the filesystem has hashed directory lookups, but I can't remember offhand which filesystems do and which don't.

I suggest you simply change that to two hex digits per directory name, e.g.

pool/todo/a6/86/a6869c08bcaa2bb6f878de99491efec4f16d0d69

That should reduce the average number of files per directory to a much more reasonable 60 and change: two digits at each of the two levels gives 16^4 = 65536 directories instead of 16^2 = 256, so each one holds 1/256th as many files.
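For illustration, something like the following could generate and populate that layout; a minimal sketch, assuming Digest::SHA for the hashing and a pool/todo root as in your example:

    use strict;
    use warnings;
    use Digest::SHA qw(sha1_hex);
    use File::Path  qw(make_path);

    # Map a URL to its spool path: pool/todo/<d1d2>/<d3d4>/<digest>.
    # SHA-1 via Digest::SHA is an assumption for this sketch.
    sub todo_path {
        my ($url)  = @_;
        my $digest = sha1_hex($url);
        return join '/', 'pool', 'todo',
            substr($digest, 0, 2), substr($digest, 2, 2), $digest;
    }

    my $path = todo_path('http://example.com/');
    (my $dir = $path) =~ s{/[^/]+\z}{};    # strip the filename
    make_path($dir);                       # create pool/todo/xx/yy/ as needed

    open my $fh, '>', $path or die "Can't write $path: $!";
    print {$fh} "http://example.com/\n";
    close $fh;

A nice side effect of taking the directory names from the leading digest characters is that a file's location is computable from its name alone, with no separate index to maintain.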

And yes, benchmarking (lots and lots of benchmarking) and tweaking seem to be the best way to tackle this kind of problem.
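If it helps, here is one way such a benchmark could look, using the core Benchmark module; the file count, URLs, and the two layouts being compared are all invented for the sketch:

    use strict;
    use warnings;
    use Benchmark   qw(cmpthese);
    use Digest::SHA qw(sha1_hex);
    use File::Path  qw(make_path remove_tree);
    use File::Temp  qw(tempdir);

    # Invented workload: 10_000 fake URLs hashed up front.
    # (A real test would use file counts closer to your actual pool.)
    my @digests = map { sha1_hex("http://example.com/page/$_") } 1 .. 10_000;

    # Create, then remove, one spool file per digest, taking $width
    # hex digits for each of the two directory levels.
    sub populate {
        my ($width) = @_;
        my $root = tempdir();
        for my $digest (@digests) {
            my $dir = join '/', $root,
                substr($digest, 0, $width), substr($digest, $width, $width);
            make_path($dir);
            open my $fh, '>', "$dir/$digest" or die "Can't write: $!";
            close $fh;
        }
        remove_tree($root);    # unlink cost in big directories matters too
    }

    cmpthese( 5, {
        one_digit  => sub { populate(1) },
        two_digits => sub { populate(2) },
    } );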

