Yet another example to get URLs in parallel

The role

1.12

Please note that this version contains some annoying ~~errors~~ mistakes. Use 1.17 instead. See the explanations from marioroy below in this thread.

package MyRole;

#     $Id: MyRole.pm,v 1.12 2017/06/17 14:00:17 karl Exp karl $

use Role::Tiny;
use threads;
use MCE::Loop;
use MCE::Shared;
use MCE::Mutex;
use WWW::Curl::Easy;
use Config::Tiny;

my $cfg = Config::Tiny->read(q(MyRole.cfg));

MCE::Loop::init {
    max_workers => $cfg->{params}->{workers},
    chunk_size  => 1,
    interval    => $cfg->{params}->{interval},
};

my $fetch = sub {
    my $curl = WWW::Curl::Easy->new;
    my ( $header, $body );
    $curl->setopt( CURLOPT_URL,            shift );
    $curl->setopt( CURLOPT_WRITEHEADER,    \$header );
    $curl->setopt( CURLOPT_WRITEDATA,      \$body );
    $curl->setopt( CURLOPT_FOLLOWLOCATION, $cfg->{params}->{followlocation} );
    $curl->setopt( CURLOPT_TIMEOUT,        $cfg->{params}->{timeout} );
    $curl->perform;
    {
        header => $header,
        body   => $body,
        info   => $curl->getinfo(CURLINFO_HTTP_CODE),
        error  => $curl->errbuf,
    };
};

sub uagent {
    my $urls   = $_[1];
    my $shared = MCE::Shared->hash;
    my $mutex  = MCE::Mutex->new;
    mce_loop {
        MCE->yield;
        $mutex->enter( $shared->set( $_ => $fetch->($_) ) );
    }
    $urls;
    my $iter = $shared->iterator();
    my $result;
    while ( my ( $url, $data ) = $iter->() ) {
        $result->{$url} = $data;
    }
    $result;
}

1;

__END__

1.17

package MyRole;

#     $Id: MyRole.pm,v 1.17 2017/06/18 08:45:19 karl Exp karl $

use Role::Tiny;
use threads;
use MCE::Loop;
use MCE::Shared;
use WWW::Curl::Easy;
use Config::Tiny;

my $cfg = Config::Tiny->read(q(MyRole.cfg));

MCE::Loop::init {
    max_workers => $cfg->{params}->{workers},
    chunk_size  => 1,
    interval    => $cfg->{params}->{interval},
};

my $fetch = sub {
    my $curl = WWW::Curl::Easy->new;
    my ( $header, $body );
    $curl->setopt( CURLOPT_URL,            shift );
    $curl->setopt( CURLOPT_WRITEHEADER,    \$header );
    $curl->setopt( CURLOPT_WRITEDATA,      \$body );
    $curl->setopt( CURLOPT_FOLLOWLOCATION, $cfg->{params}->{followlocation} );
    $curl->setopt( CURLOPT_TIMEOUT,        $cfg->{params}->{timeout} );
    $curl->perform;
    {
        header => $header,
        body   => $body,
        info   => $curl->getinfo(CURLINFO_HTTP_CODE),
        error  => $curl->errbuf,
    };
};

sub uagent {
    my $urls   = $_[1];
    my $shared = MCE::Shared->hash;
    mce_loop {
        MCE->yield;
        $shared->set( $_ => $fetch->($_) );
    }
    $urls;
    $shared->export;
}

1;

__END__

The config file

#    $Id: MyRole.cfg,v 1.4 2017/06/17 13:48:19 karl Exp karl $    

[params]
timeout=10
followlocation=1
interval=0.005
workers=auto
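
To show how the role sees these settings, here is a minimal sketch that parses the same values with Config::Tiny. The config text is inlined through read_string (instead of reading MyRole.cfg from disk) only to keep the sketch self-contained; that inlining is an assumption of the example, not what the role does.

```perl
use strict;
use warnings;
use Config::Tiny;

# Same contents as MyRole.cfg, inlined so the sketch runs on its own.
my $ini = <<'END_INI';
[params]
timeout=10
followlocation=1
interval=0.005
workers=auto
END_INI

my $cfg = Config::Tiny->read_string($ini)
    or die 'Cannot parse config: ' . Config::Tiny->errstr;

# Each [section] becomes a hash key; each name=value pair a nested key.
my $params = $cfg->{params};
printf "%-15s %s\n", $_, $params->{$_} for sort keys %$params;
```

All values arrive as plain strings, so workers=auto is handed to MCE::Loop::init unchanged.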

The class

#     $Id: MyClass.pm,v 1.5 2017/06/16 15:35:32 karl Exp karl $
package MyClass;
use Class::Tiny;
use Role::Tiny::With;
with qw(MyRole);
1;

__END__

The app

#!/usr/bin/env perl
#     $Id: run.pl,v 1.14 2017/06/17 14:43:57 karl Exp karl $

use strict;
use warnings;
use MyClass;
use Data::Dump;
use HTML::Strip::Whitespace qw(html_strip_whitespace);
use feature qw(say);

chomp( my @urls = <DATA> );

# Drop blank lines; filter after chomp so that a bare newline counts as empty.
@urls = grep { $_ ne "" } @urls;

my $object = MyClass->new;

my $result = $object->uagent( \@urls );

# dd $result;

while ( my ( $url, $data ) = each %$result ) {
    say qq($url);
    say $data->{header};
#     my $html;
#     html_strip_whitespace(
#         'source' => \$data->{body},
#         'out'    => \$html
#     );
#     say $html;
}

__DATA__
http://fantasy.xecu.net
http://perlmonks.org
http://stackoverflow.com
http://www.trumptowerny.com
http://www.maralagoclub.com
http://www.sundialservices.com

Update: Fixed mistakes. Thank you marioroy.

Update2: Deleted unused module.

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

Furthermore I consider that Donald Trump must be impeached as soon as possible


Replies are listed 'Best First'.
Re: Yet another example to get URLs in parallel by marioroy (Prior) on Jun 17, 2017 at 18:08 UTC
Hi karlgoethebier,

Let's consider, for a minute, the following statement.

    $mutex->enter( $shared->set( $_ => $fetch->($_) ) );

    1. the worker enters a mutex, meaning one worker runs solo while inside the mutex
    2. then does a fetch on the given URL
    3. then stores the result into a shared hash
    4. finally, leaves the mutex

The statement above is causing MCE workers to run serially, not in parallel. I've gone back to your earlier example here and that looks fine. However, for this thread, maybe running solo is what karlgoethebier intended, and I respect his decision to do so. Surely, he wanted the code to run in parallel ;-).

    mce_loop {
        MCE->yield;

        # run parallel
        my $url = $_;
        my $result = $fetch->($url);

        # run solo to store the result
        $mutex->enter( $shared->set( $url => $result ) );

        # am back to running parallel
        # ...
    }

A mutex isn't needed when IPC involves a single trip, which is typical for the OO interface.

    mce_loop {
        MCE->yield;

        # run parallel, without a mutex
        $shared->set( $_ => $fetch->($_) );
    }

A mutex is often necessary for a shared hash when it is constructed via the TIE interface.

    tie my %hash, 'MCE::Shared';
    my $shared = MCE::Shared->hash();
    my $mutex  = MCE::Mutex->new();

    $hash{number} = 0;              # 1 trip, store
    $shared->set( number => 0 );    # 1 trip

    # 2 trips fetch and store, needs a mutex
    $mutex->enter( $hash{number} += 2 );

    # 1 trip via the OO interface
    $shared->incrby( number => 2 );

Regards, Mario
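
The one-trip versus two-trip distinction can be tried out with a small counter, sketched below. It assumes the code-reference form of enter that the MCE::Mutex documentation describes, `$mutex->enter( sub { ... } )`; the worker count and the 1 .. 20 input list are made up for the illustration.

```perl
use strict;
use warnings;
use MCE::Loop;
use MCE::Mutex;
use MCE::Shared;

my $mutex = MCE::Mutex->new;

# TIE interface: "+=" is a fetch followed by a store (two IPC trips).
tie my %count, 'MCE::Shared';
$count{number} = 0;

# OO interface: incrby does the whole update in one IPC trip.
my $shared = MCE::Shared->hash( number => 0 );

# Illustrative settings, not taken from MyRole.cfg.
MCE::Loop::init { max_workers => 4, chunk_size => 1 };

mce_loop {
    # Two trips: guard the read-modify-write so no update is lost.
    $mutex->enter( sub { $count{number} += 2 } );

    # One trip: no mutex needed.
    $shared->incrby( number => 2 );
} 1 .. 20;

MCE::Loop::finish;

print "tied: $count{number}\n";
print "oo:   ", $shared->get('number'), "\n";
```

Only the increment sits inside the mutex, so any slow work placed before it (such as a fetch) would still run in parallel.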
Re: Yet another example to get URLs in parallel by marioroy (Prior) on Jun 17, 2017 at 17:16 UTC
Hi karlgoethebier,

I want to share an optimization for extracting the results from the shared-manager. Iterating over a shared hash and fetching keys individually is not necessary after running in parallel.

    my $iter = $shared->iterator();
    my $result;
    while ( my ( $url, $data ) = $iter->() ) {
        $result->{$url} = $data;
    }
    $result;

All that IPC behind the scenes may be reduced to a single call.

    # export to a non-shared MCE::Shared::Hash object
    my $result = $shared->export( );

    # or simply return an unblessed hash
    return $shared->export( { unbless => 1 } );

    # or export-destroy the shared object from the shared-manager
    # because the shared hash isn't needed once the parallel run is completed
    return $shared->destroy( { unbless => 1 } );

Our fellow brother 1nickt is the one who requested the unbless option. Thank you, 1nickt.

Regards, Mario
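
A minimal sketch of the single-call export, with the real fetch swapped for a trivial stand-in so it runs without network access. The input words and the worker count are made up for the illustration.

```perl
use strict;
use warnings;
use MCE::Loop;
use MCE::Shared;

# Illustrative settings, not taken from MyRole.cfg.
MCE::Loop::init { max_workers => 2, chunk_size => 1 };

my $shared = MCE::Shared->hash;

mce_loop {
    # Stand-in for $fetch->($_): store the length of each input item.
    $shared->set( $_ => length $_ );
} qw(alpha beta gamma);

MCE::Loop::finish;

# One call copies everything out of the shared-manager as a plain hash ref.
my $result = $shared->export( { unbless => 1 } );

print "$_ => $result->{$_}\n" for sort keys %$result;
```

The returned value is an ordinary hash reference, so the caller can use each, keys, and friends without further IPC.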
