Speeding things up

by AlwaysSurprised (Novice)
on Apr 14, 2018 at 23:42 UTC

AlwaysSurprised has asked for the wisdom of the Perl Monks concerning the following question:

"I'll give you 10000 contiguous pages from Argos. You tell me the sum of the products" was the jist of the conversation.

Ok honoured Monks. I confess I haven't programmed in perl since the last century and things have moved on. Quills are no longer needed in my editor of choice.

So my thinking has been

my $req = HTTP::Request->new(GET => $targetURL);
my $res = $ua->request($req);

to snort the response code; if it's a 200, then

my $website_content = get($targetURL);

and regex the info out of the page I'm after and add it to a file.
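
Put together, the idea is roughly this (a sketch only; the page range, URL pattern, regex and output file below are made-up placeholders, not the real ones):

use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple qw(get);
use HTTP::Request;

my $ua = LWP::UserAgent->new(timeout => 10);

open my $out, '>>', 'results.txt' or die "results.txt: $!";

for my $page (1000 .. 1099) {                       # placeholder page range
    my $targetURL = "http://www.example.com/page/$page";

    my $req = HTTP::Request->new(GET => $targetURL);
    my $res = $ua->request($req);
    next unless $res->code == 200;

    my $website_content = get($targetURL);          # fetches the same page a second time

    # placeholder regex for whatever is being extracted
    if ($website_content =~ m{<span class="price">([^<]+)</span>}) {
        print {$out} "$targetURL\t$1\n";
    }
}
close $out;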

Did it, but it's taking about a minute to process 100 pages.

Oh wise ones, point out the errors in my thinking. I'm on a fast connection. Is that the bottleneck? Can I get it going significantly faster?

Replies are listed 'Best First'.
Re: Speeding things up
by huck (Prior) on Apr 15, 2018 at 01:30 UTC

    You have a "double pump" going on there.

    my $req = HTTP::Request->new(GET => $targetURL);
    my $res = $ua->request($req);
    if ($res->is_success) {
        my $website_content = $res->decoded_content;
        ...
    }

      Yes, I know, but I didn't know of another way of pulling down a URL, checking the response code and grabbing the content in a oner.

Re: Speeding things up
by marto (Cardinal) on Apr 15, 2018 at 09:12 UTC

    "I haven't programmed in perl since the last century and things have moved on"

    They sure have. You could consider using something like this (Mojo::UserAgent/Mojo::DOM based). Also, you'll want to ensure you're not violating the terms of service/use. Sites the size/nature of Argos will likely automatically block/throttle you for such rapid-fire requests.
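
    For a flavour of that route, a minimal Mojo::UserAgent/Mojo::DOM fetch might look like the sketch below (the URL and CSS selector are placeholders, not the code behind marto's link):

    use strict;
    use warnings;
    use feature 'say';
    use Mojo::UserAgent;

    my $ua = Mojo::UserAgent->new(max_redirects => 3);

    # placeholder URL; result() returns the response or dies on connection errors
    my $res = $ua->get('http://www.example.com/page/1000')->result;

    if ($res->is_success) {
        # CSS selectors via Mojo::DOM instead of hand-rolled regexes
        say $res->dom->at('title')->text;
    }
    else {
        warn 'fetch failed: ' . $res->code . ' ' . $res->message . "\n";
    }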

      Thanks. I'll explore the Mojo thingy to minimise thumps.

      I'd hope that 100 hits a minute to a site that size wouldn't trigger any DDoS alarms, but you're right, I should have a look at their robot policy.

        I find the Mojo route quick and easy to code and maintain; selectors make this sort of thing trivial. Post a reply if you have problems and I'll take a look.

Re: Speeding things up -- LWP::Parallel
by Discipulus (Canon) on Apr 15, 2018 at 16:31 UTC
    Hello AlwaysSurprised and welcome to the monastery and (back!) to the wonderful world of perl!

    Do you want speed? Parallelize your program! While I'd invite you to take a look at MCE, unfortunately it is not useful in this case because LWP::* modules are not thread safe.

    But there is LWP::Parallel or LWP::Parallel::UserAgent; if your connection is fast you should notice a big improvement.
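
    A minimal LWP::Parallel::UserAgent sketch, assuming its usual register/wait interface (the URL list, request limit and timeout here are placeholders):

    use strict;
    use warnings;
    use HTTP::Request;
    use LWP::Parallel::UserAgent;

    my @urls = map { "http://www.example.com/page/$_" } 1000 .. 1009;

    my $pua = LWP::Parallel::UserAgent->new;
    $pua->max_req(10);    # max simultaneous requests per host

    $pua->register(HTTP::Request->new(GET => $_)) for @urls;

    # wait() blocks until every registered request has finished (30 s timeout)
    my $entries = $pua->wait(30);

    for my $key (keys %$entries) {
        my $res = $entries->{$key}->response;
        printf "%s -> %s\n", $res->request->uri, $res->code;
    }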

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Hi Discipulus,

      ... While I'd invite you to take a look at MCE, unfortunately it is not useful in this case because LWP::* modules are not thread safe ...

      An event-type module is typically preferred for this use case. However, I took a look and tracked this down. See Re: Crash with ForkManager on Windows. When running LWP::* in parallel, it is essential to load IO::Handle and a couple of Net::* modules before spawning workers. The latest MCE and MCE::Shared (MCE::Hobo) updates do this automatically if LWP::UserAgent is present.

      use LWP::Simple;

      # Pre-load essential modules for extra stability.
      if ( $INC{'LWP/UserAgent.pm'} && !$INC{'Net/HTTP.pm'} ) {
          require IO::Handle;
          require Net::HTTP;
          require Net::HTTPS;
      }
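
      For example, a hedged MCE::Loop sketch along those lines (the worker count, URLs and title regex are placeholders, not code from this thread):

      use strict;
      use warnings;
      use LWP::UserAgent;

      # Pre-load the extra modules so spawned workers inherit them, as described above.
      require IO::Handle;
      require Net::HTTP;
      require Net::HTTPS;

      use MCE::Loop max_workers => 8, chunk_size => 1;

      my @urls = map { "http://www.example.com/page/$_" } 1000 .. 1099;

      mce_loop {
          my ($mce, $chunk_ref, $chunk_id) = @_;
          my $url = $chunk_ref->[0];
          my $ua  = LWP::UserAgent->new(timeout => 10);
          my $res = $ua->get($url);
          if ($res->is_success) {
              my ($title) = $res->decoded_content =~ m{<title>(.*?)</title>}si;
              MCE->say("$url : " . ($title // 'no title'));
          }
          else {
              MCE->say("$url : " . $res->status_line);
          }
      } @urls;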

      Regards, Mario

      Some fiddling has been done. LWP & HTTP replaced with Mojo. I've now gone from ~50 sec to examine 100 pages to 16 sec, aka 2 URL/s to 6 URL/s.

      I wonder how fast you have to hit Argos before it starts to think it's a DDoS? Anyone got any ideas how I can find out? And I don't mean bash it hard until it squeaks. That's just rude.

      I don't have much experience with Mojo::UserAgent, but the Mojo Cookbook has examples for both non-blocking and blocking concurrency. For a simple test I got blocking concurrency with promises working. I built a small test server to avoid any webmaster complaints and easily read and did a little parsing on about 100 fetched URLs per second.

      I put a 1 second delay into page delivery and performance seemed to depend on web server configuration. To get good performance I configured for 100 workers restricted to 1 connection each. So with a server named dos-serve-sp.pl I ran:

      ./dos-serve-sp.pl prefork -w 100 -c 1

      Test server:

      #!/usr/bin/env perl
      use Modern::Perl;
      use Mojolicious::Lite;
      use Mojo::UserAgent;

      my $ua = Mojo::UserAgent->new;

      # cache page to echo
      my $res = $ua->get('www.software-path.com')->result;
      if    ($res->is_error)       { say $res->message }
      elsif (not $res->is_success) { die "Unknown response from Mojo::UserAgent" }

      $res->dom->at('head')->append_content(
          '<base href="http://www.software-path.com">'
      );

      get '/' => sub {
          my ($c) = @_;
          sleep 1;
          $c->render(text => $res->dom->to_string, format => 'html');
      };

      app->start;

      Test client with blocking concurrency from promises:

      #!/usr/bin/env perl
      use Modern::Perl;
      use Mojo::UserAgent;

      my @all_url         = ('http://127.0.0.1:3000/') x 200;
      my $concurrent_load = 100;
      my $ua              = Mojo::UserAgent->new;

      while (@all_url) {
          my @concurrent_read = map {
              $ua->get_p($_)->then(sub {
                  my $tx     = shift;
                  my $result = $tx->result;
                  if ($result->is_success) {
                      say $result->dom->at('title')->text;
                  }
                  else {
                      say $result->is_error
                          ? $result->message
                          : "Unknown response from Mojo::UserAgent";
                  }
              })    # end ->then sub
          } splice @all_url, 0, $concurrent_load;

          Mojo::Promise->all(@concurrent_read)->wait;
      }
      Ron

        Although 100/sec sounds fun, I rather think it would look like some sort of feeble DDoS attack and get my IP blocked. I've read 5/sec is considered high by some spider writers. Apparently you can register your site with Google and set a parameter to limit its strike rate, though sometimes the Google spider just ignores it.

        I don't run a web server, but I bet the logs are just stuffed full of bots gathering pages.

        I do wonder what a polite rate is, though; fast enough that the results are still timely but slow enough not to be annoying.
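
        One hedged option for staying polite: LWP::RobotUA honours robots.txt and enforces a minimum delay between requests to the same host. In this sketch the agent name, contact address and 2-second delay are placeholders:

        use strict;
        use warnings;
        use LWP::RobotUA;

        my $ua = LWP::RobotUA->new(
            agent => 'MyPriceBot/0.1',     # placeholder bot name
            from  => 'me@example.com',     # placeholder contact address
        );
        $ua->delay(2 / 60);    # delay() takes minutes, so 2/60 min = 2 seconds

        my $res = $ua->get('http://www.example.com/page/1000');
        print $res->is_success ? "fetched OK\n" : $res->status_line . "\n";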

Re: Speeding things up
by karlgoethebier (Abbot) on Apr 16, 2018 at 11:46 UTC
    "...I'm on a fast connection. Is that the bottle neck? Can I get it going significantly faster?"

    You might take a look at Yet another example to get URLs in parallel for some further inspiration.
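
    That node isn't reproduced here, but one common shape for fetching URLs in parallel is Parallel::ForkManager plus LWP, roughly as in this sketch (worker count and URLs are placeholders, not Karl's linked code):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Parallel::ForkManager;

    my @urls = map { "http://www.example.com/page/$_" } 1000 .. 1099;

    my $pm = Parallel::ForkManager->new(8);    # up to 8 child processes

    URL:
    for my $url (@urls) {
        $pm->start and next URL;               # parent: fork and move on

        my $ua  = LWP::UserAgent->new(timeout => 10);
        my $res = $ua->get($url);
        print "$url -> ", $res->code, "\n";

        $pm->finish;                           # child exits here
    }
    $pm->wait_all_children;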

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'
