I find that threads are often misunderstood, and therefore misused. People often suppose that threads somehow multiply the CPU resource, when in fact they just divide it. They also often discount the additional load that might be placed upon the I/O subsystem, which, in spite of good cache buffering, can still only accomplish so many reads/writes per second. Probably the most common abuse is what I’ve dubbed the “flaming arrows approach.” Each time another request comes in, you shoot another flaming arrow into the air and then just hope for the best. Of course, this can lead to an unthrottled number of threads, all competing for the same resource ... “thrashing.”

The classic “thrash curve” is elbow-shaped, and the point at which the elbow goes straight up is called “hitting the wall.” The sweet spot is just at the point where the elbow starts to curve up, but you have to have some kind of governor mechanism to hold it there. The simplest way to accomplish this is with the Unix xargs command and its -P maxprocs option (capital P; lowercase -p merely prompts before each command), using this to run a limited number of processes that need not be thread-aware themselves. The key concept is that the number of worker threads is not identical to the number of units of work that the pool of workers is given to process, and that the number of threads can be regulated.

Your particular application is an ideal candidate for threading, if you can properly control the number of threads vs. the number of web pages that need to be scraped. You know that each thread is communicating (probably) with a different server, along a different Internet network route, so there will be a rich “mix” of completion times for each request, with a moderate amount of local disk I/O needed to file away each response and a negligible RAM footprint. Because of the random completion times, serious competition for disk-drive time is unlikely. A pool of threads will be able to take advantage of that naturally self-regulating workload ... especially if you distribute the workload with some consideration as to which URLs are being hit. (A “pseudo-random pick” from a moderately sized pool of URLs-to-be-scraped would be a simple strategy to use here.) Like a field of hitters swatting balls toward the outfield, you will naturally have many balls in the air at once. Because some are pop flies and others are line drives, the outfielders can catch them all easily.
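The “pseudo-random pick” idea might look something like the following Python sketch. It is a simplification under stated assumptions: it shuffles the whole URL list once rather than picking from a rolling pool, and scrape_all, fetch, max_workers, and seed are hypothetical names, with fetch standing in for whatever actually downloads and files away a page.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(urls, fetch, max_workers=8, seed=None):
    """Fetch every URL in shuffled order with a capped number of threads.

    `fetch` is a hypothetical callable taking one URL; it is an assumption,
    not part of any real scraper discussed in the thread.
    """
    order = list(urls)
    # Shuffle so consecutive requests are unlikely to hit the same server.
    random.Random(seed).shuffle(order)

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in order}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole run.
                results[url] = exc
    return results

if __name__ == "__main__":
    import time

    def fake_fetch(url):
        # Stand-in for a network call with a variable completion time.
        time.sleep(random.uniform(0.0, 0.02))
        return len(url)

    pages = scrape_all([f"http://example.com/{i}" for i in range(20)],
                       fake_fetch, max_workers=4, seed=1)
    print(len(pages))
```

Because the slow and fast requests interleave, a modest max_workers usually keeps every thread busy without piling requests onto one host.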

In reply to Re: Trying to Understand the Discouragement of Threads by sundialsvc4
in thread Trying to Understand the Discouragement of Threads by benwills
