http://qs321.pair.com?node_id=517740


in reply to Re^2: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling

With a single hex digit per directory level you get an average of 15625 files per directory, which is still too many (IMHO). It might work if the filesystem has hashed directory lookups, but offhand I can't remember which filesystems do and which don't.

I suggest you simply change that to two hex digits per directory name, e.g.

pool/todo/a6/86/a6869c08bcaa2bb6f878de99491efec4f16d0d69

Each extra pair of hex digits multiplies the bucket count by 256, so that should reduce the average number of files per directory to a much more reasonable 15625/256, i.e. about 61.
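For concreteness, here's a minimal sketch of that layout in Python (the original thread is Perl, but the idea is language-independent). It assumes the filenames are SHA-1 hex digests of the URL, which is what the 40-character name in the example path looks like; the function name and `pool/todo` root are just taken from the example.

```python
import hashlib

def shard_path(url, root="pool/todo"):
    """Map a URL to a two-level sharded path under `root`.

    The first two pairs of hex digits of the SHA-1 digest become
    the directory levels, giving 256 * 256 = 65536 buckets.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{root}/{digest[:2]}/{digest[2:4]}/{digest}"
```

A URL whose digest starts with `a686...` would then land in `pool/todo/a6/86/`, matching the example above.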

And yes, benchmarking (lots and lots of benchmarking) and tweaking seem to be the best way to tackle this kind of problem.