Speeding things up

by AlwaysSurprised (Novice)
on Apr 14, 2018 at 23:42 UTC

AlwaysSurprised has asked for the wisdom of the Perl Monks concerning the following question:

"I'll give you 10000 contiguous pages from Argos. You tell me the sum of the products" was the jist of the conversation.

Ok honoured Monks. I confess I haven't programmed in perl since the last century and things have moved on. Quills are no longer needed in my editor of choice.

So my thinking has been

my $req = HTTP::Request->new(GET => $targetURL);
my $res = $ua->request($req);

to snort the response code; if it's a 200, then

my $website_content = get($targetURL);

and regex the info out of the page I'm after and add it to a file.
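
Put together, the idea is roughly this (a sketch only; the page range, URL pattern, regex and output file below are made-up placeholders, not the real ones):

use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple qw(get);
use HTTP::Request;

my $ua = LWP::UserAgent->new(timeout => 10);

open my $out, '>>', 'results.txt' or die "results.txt: $!";

for my $page (1000 .. 1099) {                       # placeholder page range
    my $targetURL = "http://www.example.com/page/$page";

    my $req = HTTP::Request->new(GET => $targetURL);
    my $res = $ua->request($req);
    next unless $res->code == 200;

    my $website_content = get($targetURL);          # fetches the same page a second time

    # placeholder regex for whatever is being extracted
    if ($website_content =~ m{<span class="price">([^<]+)</span>}) {
        print {$out} "$targetURL\t$1\n";
    }
}
close $out;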

Did it, but it's taking about a minute to process 100 pages.

Oh wise ones, point out the errors in my thinking. I'm on a fast connection. Is that the bottleneck? Can I get it going significantly faster?

Replies are listed 'Best First'.
Re: Speeding things up
by huck (Prior) on Apr 15, 2018 at 01:30 UTC

    You have a "double pump" going on there.

    my $req = HTTP::Request->new(GET => $targetURL);
    my $res = $ua->request($req);
    if ($res->is_success) {
        my $website_content = $res->decoded_content;
        ...
    }

      Yes, I know, but I didn't know of another way of pulling down a URL, checking the response code and grabbing the content in a oner.

Re: Speeding things up
by marto (Cardinal) on Apr 15, 2018 at 09:12 UTC

    "I haven't programmed in perl since the last century and things have moved on"

    They sure have. You could consider using something like this (Mojo::UserAgent/Mojo::DOM based). Also, you'll want to ensure you're not violating the terms of service/use. Sites the size/nature of Argos will likely automatically block/throttle you for such rapid-fire requests.
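
    For a flavour of that route, a minimal Mojo::UserAgent/Mojo::DOM fetch might look like the sketch below (the URL and CSS selector are placeholders, not the code behind marto's link):

    use strict;
    use warnings;
    use feature 'say';
    use Mojo::UserAgent;

    my $ua = Mojo::UserAgent->new(max_redirects => 3);

    # placeholder URL; result() returns the response or dies on connection errors
    my $res = $ua->get('http://www.example.com/page/1000')->result;

    if ($res->is_success) {
        # CSS selectors via Mojo::DOM instead of hand-rolled regexes
        say $res->dom->at('title')->text;
    }
    else {
        warn 'fetch failed: ' . $res->code . ' ' . $res->message . "\n";
    }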

      Thanks. I'll explore the Mojo thingy to minimise thumps.

      I'd hope that 100 hits a minute to a site that size wouldn't trigger any DDoS alarms, but you're right, I should have a look at their robot policy.

        I find the Mojo route quick and easy to code and maintain; selectors make this sort of thing trivial. Post a reply if you have problems and I'll take a look.

Re: Speeding things up -- LWP::Parallel
by Discipulus (Canon) on Apr 15, 2018 at 16:31 UTC
    Hello AlwaysSurprised and welcome to the monastery and (back!) to the wonderful world of perl!

    Do you want speed? Parallelize your program! While I'd invite you to take a look at MCE, unfortunately it is not useful in this case because LWP::* modules are not thread safe.

    But there is LWP::Parallel or LWP::Parallel::UserAgent; if your connection is fast you should notice a big improvement.
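
    A minimal LWP::Parallel::UserAgent sketch, assuming its usual register/wait interface (the URL list, request limit and timeout here are placeholders):

    use strict;
    use warnings;
    use HTTP::Request;
    use LWP::Parallel::UserAgent;

    my @urls = map { "http://www.example.com/page/$_" } 1000 .. 1009;

    my $pua = LWP::Parallel::UserAgent->new;
    $pua->max_req(10);    # max simultaneous requests per host

    $pua->register(HTTP::Request->new(GET => $_)) for @urls;

    # wait() blocks until every registered request has finished (30 s timeout)
    my $entries = $pua->wait(30);

    for my $key (keys %$entries) {
        my $res = $entries->{$key}->response;
        printf "%s -> %s\n", $res->request->uri, $res->code;
    }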

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Hi Discipulus,

      ... While I'd invite you to take a look at MCE, unfortunately it is not useful in this case because LWP::* modules are not thread safe ...

      An event-type module is typically preferred for this use case. However, I took a look and tracked this down. See Re: Crash with ForkManager on Windows. When running LWP::* in parallel, it is essential to load IO::Handle and a couple of Net::* modules before spawning workers. The latest MCE and MCE::Shared (MCE::Hobo) updates do this automatically if LWP::UserAgent is present.

      use LWP::Simple;

      # Pre-load essential modules for extra stability.
      if ( $INC{'LWP/UserAgent.pm'} && !$INC{'Net/HTTP.pm'} ) {
          require IO::Handle;
          require Net::HTTP;
          require Net::HTTPS;
      }
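
      For example, a hedged MCE::Loop sketch along those lines (the worker count, URLs and title regex are placeholders, not code from this thread):

      use strict;
      use warnings;
      use LWP::UserAgent;

      # Pre-load the extra modules so spawned workers inherit them, as described above.
      require IO::Handle;
      require Net::HTTP;
      require Net::HTTPS;

      use MCE::Loop max_workers => 8, chunk_size => 1;

      my @urls = map { "http://www.example.com/page/$_" } 1000 .. 1099;

      mce_loop {
          my ($mce, $chunk_ref, $chunk_id) = @_;
          my $url = $chunk_ref->[0];
          my $ua  = LWP::UserAgent->new(timeout => 10);
          my $res = $ua->get($url);
          if ($res->is_success) {
              my ($title) = $res->decoded_content =~ m{<title>(.*?)</title>}si;
              MCE->say("$url : " . ($title // 'no title'));
          }
          else {
              MCE->say("$url : " . $res->status_line);
          }
      } @urls;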

      Regards, Mario

      Some fiddling has been done. LWP & HTTP replaced with Mojo. I've now gone from ~50 sec to examine 100 pages to 16 sec, aka 2 URL/s to 6 URL/s.

      I wonder how fast you have to hit Argos before it starts to think it's a DDoS? Anyone got any ideas how I can find out? And I don't mean bash it hard until it squeaks. That's just rude.

      I don't have much experience with Mojo::UserAgent, but the Mojo Cookbook has examples for both non-blocking and blocking concurrency. For a simple test I got blocking concurrency with promises working. I built a small test server to avoid any webmaster complaints and easily read and did a little parsing on about 100 fetched URLs per second.

      I put a 1 second delay into page delivery and performance seemed to depend on web server configuration. To get good performance I configured for 100 workers restricted to 1 connection each. So with a server named dos-serve-sp.pl I ran:

      ./dos-serve-sp.pl prefork -w 100 -c 1

      Test server:

      #!/usr/bin/env perl
      use Modern::Perl;
      use Mojolicious::Lite;
      use Mojo::UserAgent;

      my $ua = Mojo::UserAgent->new;

      # cache page to echo
      my $res = $ua->get('www.software-path.com')->result;
      if    ($res->is_error)       { say $res->message }
      elsif (not $res->is_success) { die "Unknown response from Mojo::UserAgent" }

      $res->dom->at('head')->append_content(
          '<base href="http://www.software-path.com">'
      );

      get '/' => sub {
          my ($c) = @_;
          sleep 1;
          $c->render(text => $res->dom->to_string, format => 'html');
      };

      app->start;

      Test client with blocking concurrency from promises:

      #!/usr/bin/env perl
      use Modern::Perl;
      use Mojo::UserAgent;

      my @all_url         = ('http://127.0.0.1:3000/') x 200;
      my $concurrent_load = 100;
      my $ua              = Mojo::UserAgent->new;

      while (@all_url) {
          my @concurrent_read = map {
              $ua->get_p($_)->then(sub {
                  my $tx     = shift;
                  my $result = $tx->result;
                  if ($result->is_success) {
                      say $result->dom->at('title')->text;
                  }
                  else {
                      say $result->is_error
                          ? $result->message
                          : "Unknown response from Mojo::UserAgent";
                  }
              })    # end ->then sub
          } splice @all_url, 0, $concurrent_load;

          Mojo::Promise->all(@concurrent_read)->wait;
      }
      Ron

        Although 100/sec sounds fun, I rather think it would look like some sort of feeble DDoS attack and get my IP blocked. I've read 5/sec is considered high by some spider writers. Apparently you can register your site with Google and set a parameter to limit its strike rate, though sometimes the Google spider just ignores it.

        I don't run a web server, but I bet the logs are just stuffed full of bots gathering pages.

        I do wonder what a polite rate is, though; fast enough that the results are still timely but slow enough not to be annoying.
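
        One hedged option for staying polite: LWP::RobotUA honours robots.txt and enforces a minimum delay between requests to the same host. In this sketch the agent name, contact address and 2-second delay are placeholders:

        use strict;
        use warnings;
        use LWP::RobotUA;

        my $ua = LWP::RobotUA->new(
            agent => 'MyPriceBot/0.1',     # placeholder bot name
            from  => 'me@example.com',     # placeholder contact address
        );
        $ua->delay(2 / 60);    # delay() takes minutes, so 2/60 min = 2 seconds

        my $res = $ua->get('http://www.example.com/page/1000');
        print $res->is_success ? "fetched OK\n" : $res->status_line . "\n";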

Re: Speeding things up
by karlgoethebier (Abbot) on Apr 16, 2018 at 11:46 UTC
    "...I'm on a fast connection. Is that the bottle neck? Can I get it going significantly faster?"

    You might take a look at Yet another example to get URLs in parallel for some further inspiration.
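
    That node isn't reproduced here, but one common shape for fetching URLs in parallel is Parallel::ForkManager plus LWP, roughly as in this sketch (worker count and URLs are placeholders, not Karl's linked code):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Parallel::ForkManager;

    my @urls = map { "http://www.example.com/page/$_" } 1000 .. 1099;

    my $pm = Parallel::ForkManager->new(8);    # up to 8 child processes

    URL:
    for my $url (@urls) {
        $pm->start and next URL;               # parent: fork and move on

        my $ua  = LWP::UserAgent->new(timeout => 10);
        my $res = $ua->get($url);
        print "$url -> ", $res->code, "\n";

        $pm->finish;                           # child exits here
    }
    $pm->wait_all_children;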

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'
