PerlMonks  

Re^2: What is the fastest way to download a bunch of web pages?

by tphyahoo (Vicar)
on Mar 03, 2005 at 13:38 UTC (#436199)


in reply to Re: What is the fastest way to download a bunch of web pages?
in thread What is the fastest way to download a bunch of web pages?

6 seconds, definitely faster. I'm almost done following up on inman's tip, then I'll report whether his way was faster on my box. The difference seems to be that you restricted yourself to three threads, whereas he had no restriction.

Anyway, thanks.


Replies are listed 'Best First'.
Re^3: What is the fastest way to download a bunch of web pages?
by BrowserUk (Patriarch) on Mar 03, 2005 at 13:45 UTC
    "The difference seems to be that you restricted yourself to three threads,"

    Just add -THREADS=10 to the command line.

    Try varying the number (2/3/5/10) and see what works best for you. With my connection, the throughput is purely down to the download speed, but if you are on broadband, network latency may come into play. Choosing the right balance of simultaneous requests versus bandwidth is a suck-it-and-see equation. It will depend on a lot of things, including time of day, locations, etc.

    You can also use -PATH=tmp/ to tell it where to put the files.

    You really need to be doing more than 10 sites for a reasonable test anyway.
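The suck-it-and-see tuning described above can be sketched as a small timing sweep. This is a Python illustration only, not the Perl script under discussion: `fake_download`, `time_batch`, and `sweep` are hypothetical names, the example.com URLs are placeholders, and the "download" is a fixed sleep standing in for network latency, so the numbers say nothing about a real connection.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.05):
    """Stand-in for a real HTTP fetch (assumption): sleep to mimic latency."""
    time.sleep(delay)
    return url


def time_batch(urls, threads):
    """Fetch every URL with a pool of `threads` workers; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(fake_download, urls))
    assert len(results) == len(urls)
    return time.perf_counter() - start


def sweep(urls, thread_counts=(2, 3, 5, 10)):
    """Time the same batch once per thread count, as in 'try 2/3/5/10'."""
    return {n: time_batch(urls, n) for n in thread_counts}


if __name__ == "__main__":
    # More than 10 URLs, per the advice above about a reasonable test size.
    urls = [f"http://example.com/page{i}" for i in range(30)]
    for n, secs in sweep(urls).items():
        print(f"{n:>2} threads: {secs:.2f}s")
```

With a real fetch in place of the sleep, the best thread count would depend on bandwidth, latency, and the remote servers, which is why it has to be measured rather than guessed.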


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
Re^3: What is the fastest way to download a bunch of web pages?
by inman (Curate) on Mar 03, 2005 at 17:11 UTC
    "he had no restriction" was due to personal laziness rather than an optimised answer. BrowserUk's solution is better engineered, since it allocates a thread pool (with a variable number of threads) and therefore manages the total amount of traffic being generated at any one time.

    Let's say, for example, that you were trying to download 100 pages from the same website. My solution would batter the machine at the other end and effectively be a denial of service attack. The managed thread pool approach allows you to tune your network use.
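
The contrast between one thread per page and a managed pool can be illustrated by counting how many requests are in flight at once. The following is a hedged Python sketch, not either poster's Perl code: `ConcurrencyMeter`, `peak_unbounded`, and `peak_pooled` are hypothetical names, and the fetch is simulated with a sleep so no real server is contacted.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyMeter:
    """Records the highest number of simulated requests in flight at once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def fetch(self, url):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.05)  # simulated network wait (assumption)
        with self._lock:
            self._active -= 1


def peak_unbounded(urls):
    """One thread per URL: nothing limits simultaneous requests."""
    meter = ConcurrencyMeter()
    workers = [threading.Thread(target=meter.fetch, args=(u,)) for u in urls]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return meter.peak


def peak_pooled(urls, threads):
    """Fixed-size pool: at most `threads` requests are in flight."""
    meter = ConcurrencyMeter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(meter.fetch, urls))
    return meter.peak


if __name__ == "__main__":
    urls = [f"http://example.com/page{i}" for i in range(100)]
    print("unbounded peak:", peak_unbounded(urls))
    print("pooled peak (5 threads):", peak_pooled(urls, 5))
```

The pooled peak can never exceed the pool size, which is the knob that lets you tune how hard you hit the remote machine.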

    There's more than one way to do it (and the other guy did it better!)
