http://qs321.pair.com?node_id=1107565


in reply to Trying to Understand the Discouragement of Threads

I find that threads are often misunderstood, and therefore misused. People often suppose that threads somehow multiply the CPU resource, when in fact they merely divide it. They also tend to discount the additional load placed on the I/O subsystem, which, in spite of good cache buffering, can still only accomplish so many reads and writes per second. Probably the most common abuse is what I've dubbed the "flaming arrows" approach: each time another request comes in, you shoot another flaming arrow into the air and just hope for the best. Of course, this leads to an un-throttled number of threads, all competing for the same resources ... in a word, "thrashing."
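As a minimal sketch of that anti-pattern in Perl (next_request() and handle_request() are hypothetical stand-ins for your own code):

    use threads;

    # "Flaming arrows": one new thread per request, with no upper bound.
    # Under load, the thread count (and the contention) grows without limit.
    while (defined(my $req = next_request())) {      # next_request() is hypothetical
        threads->create(\&handle_request, $req)->detach;   # fire and forget
    }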

The classic "thrash curve" is elbow-shaped, and the point at which the elbow turns straight up is called "hitting the wall." The sweet spot is just where the elbow starts to curve upward, but you need some kind of governor mechanism to hold the system there. The simplest way to accomplish this is with the Unix xargs command and its -P max-procs option, using it to run a limited number of worker processes that need not themselves be thread-aware. The key concept is that the number of workers is not the same as the number of units of work the pool is given to process, and that the number of workers can therefore be regulated independently.
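For instance, with GNU or BSD xargs (fetch_one here is a hypothetical script that scrapes a single URL):

    # Run at most 8 fetch_one processes at a time, one URL per invocation.
    xargs -n 1 -P 8 ./fetch_one < urls.txt

A thousand URLs in urls.txt still means only eight processes in flight; the queue gets longer, but the load stays at the sweet spot.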

Your particular application is an ideal candidate for threading, if you can properly control the number of threads versus the number of web pages that need to be scraped. Each thread is (probably) communicating with a different server, along a different Internet network route, so there will be a rich "mix" of completion times for each request, with a moderate amount of local disk I/O needed to file away each result and a negligible RAM footprint. Because of the random completion times, serious competition for disk-drive time is unlikely. A pool of threads can take advantage of that naturally self-regulating workload ... especially if you distribute the workload with some consideration as to which URLs are being hit. (A "pseudo-random pick" from a moderately sized pool of URLs-to-be-scraped would be a simple strategy here.) Like a field of hitters swatting balls toward the outfield, you will naturally have many balls in the air at once; because some are pop flies and others are line drives, the outfielders can catch them all easily.
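Here is a minimal worker-pool sketch along those lines, using the core threads and Thread::Queue modules; the HTTP::Tiny fetch is just a stub you would replace with your own scraping and filing-away code:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use HTTP::Tiny;

    my $NUM_WORKERS = 8;              # the governor: tune this to stay below the "wall"
    my $queue = Thread::Queue->new;   # create the queue before spawning threads

    # A fixed pool of workers: the thread count never exceeds $NUM_WORKERS,
    # no matter how many URLs are queued.
    my @workers = map {
        threads->create(sub {
            my $http = HTTP::Tiny->new;
            while (defined(my $url = $queue->dequeue)) {
                my $resp = $http->get($url);
                # File away the result here (write to disk, database, etc.)
                warn "failed: $url\n" unless $resp->{success};
            }
        });
    } 1 .. $NUM_WORKERS;

    # Units of work vastly outnumber workers; that is the point.
    $queue->enqueue($_) for @ARGV;

    # One undef per worker tells each thread to exit its dequeue loop.
    $queue->enqueue(undef) for 1 .. $NUM_WORKERS;
    $_->join for @workers;

The queue is what regulates everything: handing it a thousand URLs just makes the queue longer, while the number of requests actually in flight stays pinned at $NUM_WORKERS.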