PerlMonks  

Re: Trying to Understand the Discouragement of Threads

by sundialsvc4 (Abbot)
on Nov 18, 2014 at 13:07 UTC ( [id://1107565] )


in reply to Trying to Understand the Discouragement of Threads

I find that threads are often misunderstood, and therefore misused. People often suppose that threads somehow multiply the CPU resource, when in fact they just divide it. They also often discount the additional load that might be placed upon the I/O subsystem, which, in spite of good cache buffering, can still only accomplish so many reads/writes per second. Probably the most common abuse is what I’ve dubbed the “flaming arrows approach.” Each time another request comes in, you shoot another flaming arrow into the air and then just hope for the best. Of course, this can lead to an unthrottled number of threads, all competing for the same resource ... “thrashing.”

The classic “thrash curve” is elbow-shaped, and the point at which the elbow goes straight up is called “hitting the wall.” The sweet spot is just at the point where the elbow starts to curve up, but you have to have some kind of governor mechanism to hold it there. The simplest way to accomplish this is with the Unix xargs command and its -P maxprocs option, using it to run a limited number of processes, none of which needs to be thread-aware. The key concept is that the number of worker threads is not identical to the number of units of work that the pool of workers is given to process, and that the number of threads can be regulated.
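
As a rough sketch of that governor idea, assuming a GNU or BSD xargs (the -P option is not part of POSIX), the snippet below feeds twelve units of work to at most four concurrent processes. The `run_bounded` name, the `MAXPROCS` value, and the `echo` placeholder job are illustrative assumptions, not anything taken from this discussion.

```shell
#!/bin/sh
# Assumption: xargs supports -P (GNU findutils and BSD do; POSIX does not require it).
MAXPROCS=4   # illustrative cap on simultaneous worker processes

run_bounded() {
    # Reads one unit of work per line on stdin.
    # -n 1 passes a single item to each job; -P caps how many jobs run at once.
    # The job itself is a placeholder that only reports the item it received.
    xargs -n 1 -P "$MAXPROCS" sh -c 'echo "done: $1"' _
}

# Twelve units of work, never more than four processes alive at a time.
seq 1 12 | run_bounded
```

The order of the output lines can differ from run to run, because each job finishes independently of the others. The number of items (twelve) and the number of workers (four) are set separately, which is the point being made above.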

Your particular application is an ideal application for threading, if you can properly control the number of threads vs. the number of web pages that need to be scraped. You know that each thread is (probably) communicating with a different server, along a different Internet network route, so there will be a rich “mix” of completion times for each request, with a moderate amount of local disk I/O needed to file away each response and a negligible RAM footprint. Because of the random completion times, serious competition for disk-drive time is unlikely. A pool of threads will be able to take advantage of that naturally self-regulating workload ... especially if you distribute the workload with some consideration as to which URLs are being pinged. (A “pseudo-random pick” from a moderately sized pool of URLs-to-be-scraped would be a simple strategy to use here.) Like a field of hitters swatting balls toward the outfield, you will naturally have many balls in the air at once. Because some are pop flies and others are line drives, the outfielders can catch them all easily.
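
One possible way to sketch the “pseudo-random pick” with a bounded set of workers: shuffle the URL list first, so that neighbouring requests tend to go to different hosts, then hand the shuffled list to a fixed number of processes. This uses processes rather than threads, and everything named here (`shuffle_lines`, `fetch_all`, the example.com-style URLs, the `echo` stand-in for a real fetch) is hypothetical. It again assumes an xargs that supports -P.

```shell
#!/bin/sh
# Assumption: xargs supports -P (GNU/BSD extension, not POSIX).
MAXPROCS=4   # illustrative cap on simultaneous fetchers

shuffle_lines() {
    # Prefix every line with a pseudo-random key, sort on the key, drop the key.
    # srand() seeds from the clock, so two runs in the same second may match.
    awk 'BEGIN { srand() } { printf "%.9f\t%s\n", rand(), $0 }' |
        LC_ALL=C sort |
        cut -f 2-
}

fetch_all() {
    # Placeholder worker: prints the URL instead of fetching it.
    # Substitute a real fetch command for the echo to do actual work.
    shuffle_lines | xargs -n 1 -P "$MAXPROCS" sh -c 'echo "would fetch: $1"' _
}

# Hypothetical URL pool with several pages per host.
printf '%s\n' \
    'http://a.example/1' 'http://a.example/2' \
    'http://b.example/1' 'http://b.example/2' \
    'http://c.example/1' | fetch_all
```

Shuffling does not guarantee that two requests to the same host never overlap; it only makes long runs against a single host less likely than they would be with a sorted list.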

Re^2: Trying to Understand the Discouragement of Threads
by BrowserUk (Patriarch) on Nov 18, 2014 at 13:49 UTC

    You have proved time and time again that you don't understand threading; indeed, everything you've ever posted on the subject -- which has never included a single line of code -- has been proven wrong.

    So, just stop talking; before you make your already totally tattered reputation even worse.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      And such a vitriolic comment profited this discussion how, exactly?   Honestly, just cast your obligatory down-vote, as you customarily do, and leave it at that.   Please keep your personal opinions to yourself.   (On the other hand, I happen to strongly agree with the positive compliment that you were very rightly paid in the OP.)

      Threads are a misunderstood and thus often-misused feature, no matter what language is being discussed. The OP hit upon a textbook example of where threading is particularly well-suited, and obtained great results. All of his program’s requests were being served by another well-designed application of threading ... Apache. But how many times have we seen, even right here, situations where people fired off “one thread per request, regardless of transaction volume,” and wondered (publicly) why their server was being brought to its knees? A good design in a suitable situation works consistently well, whereas one that is permitted to “hit the wall” is disastrously bad (and negatively impacts the system as a whole). (In some cases it is literally a “fork bomb.”)

        And such a vitriolic comment profited this discussion how, exactly?

        By warning other readers who haven't yet seen enough of your posts (i.e. fewer than 5) to have worked out for themselves what a total waste of space, and dangerous waste of mindspace, your utterings are; that just about everything you post is little more than the first vaguely related garbage that springs into your indiscriminate mind; and not worth the energy required to carry it from your fingertips to this place.

        It is a public service I feel it my duty to perform.

        As long as you continue to post garbage on subjects we've proven time and again that you have no understanding of, I will continue to: a) downvote them; b) attach warning labels to them as quickly as I can.


