Re^3: Advice on Efficient Large-scale Web Crawling

by matija (Priest)
on Dec 19, 2005 at 14:57 UTC ( [id://517740] )


in reply to Re^2: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling

With a single hex digit at each directory level you get an average of 15625 files per directory, which is still too many (IMHO). It might work if the filesystem has hashed directory lookups, but I can't remember offhand which filesystems do and which don't.

I suggest you simply change that to two hex digits per directory name, e.g.

pool/todo/a6/86/a6869c08bcaa2bb6f878de99491efec4f16d0d69

That should reduce the average number of files per directory to a much more reasonable 60 and change: two digits at each of the two levels gives 16^4 = 65536 directories instead of 16^2 = 256, so each one holds 1/256th as many files.
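For illustration, something like the following could generate and populate that layout; a minimal sketch, assuming Digest::SHA for the hashing and a pool/todo root as in your example:

    use strict;
    use warnings;
    use Digest::SHA qw(sha1_hex);
    use File::Path  qw(make_path);

    # Map a URL to its spool path: pool/todo/<d1d2>/<d3d4>/<digest>.
    # SHA-1 via Digest::SHA is an assumption for this sketch.
    sub todo_path {
        my ($url)  = @_;
        my $digest = sha1_hex($url);
        return join '/', 'pool', 'todo',
            substr($digest, 0, 2), substr($digest, 2, 2), $digest;
    }

    my $path = todo_path('http://example.com/');
    (my $dir = $path) =~ s{/[^/]+\z}{};    # strip the filename
    make_path($dir);                       # create pool/todo/xx/yy/ as needed

    open my $fh, '>', $path or die "Can't write $path: $!";
    print {$fh} "http://example.com/\n";
    close $fh;

A nice side effect of taking the directory names from the leading digest characters is that a file's location is computable from its name alone, with no separate index to maintain.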

And yes, benchmarking (lots and lots of benchmarking) and tweaking seem to be the best way to tackle this kind of problem.
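If it helps, here is one way such a benchmark could look, using the core Benchmark module; the file count, URLs, and the two layouts being compared are all invented for the sketch:

    use strict;
    use warnings;
    use Benchmark   qw(cmpthese);
    use Digest::SHA qw(sha1_hex);
    use File::Path  qw(make_path remove_tree);
    use File::Temp  qw(tempdir);

    # Invented workload: 10_000 fake URLs hashed up front.
    # (A real test would use file counts closer to your actual pool.)
    my @digests = map { sha1_hex("http://example.com/page/$_") } 1 .. 10_000;

    # Create, then remove, one spool file per digest, taking $width
    # hex digits for each of the two directory levels.
    sub populate {
        my ($width) = @_;
        my $root = tempdir();
        for my $digest (@digests) {
            my $dir = join '/', $root,
                substr($digest, 0, $width), substr($digest, $width, $width);
            make_path($dir);
            open my $fh, '>', "$dir/$digest" or die "Can't write: $!";
            close $fh;
        }
        remove_tree($root);    # unlink cost in big directories matters too
    }

    cmpthese( 5, {
        one_digit  => sub { populate(1) },
        two_digits => sub { populate(2) },
    } );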

