Parallel-processing the code

by rajaman (Sexton)
on May 16, 2018 at 18:36 UTC ( [id://1214682] )

rajaman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am processing a text file (code shown below) chunk by chunk. For each text chunk, I do some processing, shown in the while loop in the code below, and collect the results in a hash (%hashunique). How can I do parallel processing on this code? For example, run 10 instances of the while loop in parallel, each processing different chunks of text from the input file, with all results saved in %hashunique at the end.

I checked some modules, but could not figure out how to apply them to the code below.

Thanks a lot!

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper qw(Dumper);
use re::engine::RE2;
use List::MoreUtils qw(uniq);
use Sort::Naturally;

#This program reads abstract sentence file and produces output with the following format:
#

if ($#ARGV != 1) {
    print "usage: program arguments\n";
}

my $inputfile1 = $ARGV[0];
my $outputfile = $ARGV[1];
my %hashunique = ();

open(RF, "$inputfile1") or die "Can't open < $inputfile1: $!";
open(WF, ">$outputfile");    #open for output

$/ = '';    #this sets the delimiter for an empty line

while (<RF>) {
    my @one = split /\n/, $_;
    my ($indexofdashinarray) = grep { $one[$_] =~ /\-\-/ } 0..$#one;
    for (my $i = 0; $i <= $#one; $i++) {
        next if $i == 0;
        next if $one[$i] =~ /^\-\-$/;
        while ($one[$i] =~ m/(\b)D\*(.*?)\*(.*?)\*D(\b)/g) {
            unless ($hashunique{"D$2"}) {
                $hashunique{"D$2"} = "$3";
            } else {
                $hashunique{"D$2"} = $hashunique{"D$2"}.'|'."$3";
            }
        }
    }
}

foreach my $i (nsort keys %hashunique) {
    $hashunique{$i} = join("\|", uniq split /\|/, $hashunique{$i});
    print WF "$i=>$hashunique{$i}\n";
}

close (RF);
close (WF);

Replies are listed 'Best First'.
Re: Parallel-processing the code
by ikegami (Patriarch) on May 17, 2018 at 00:52 UTC

    As previously mentioned, multi-tasking won't help. This program is I/O-bound, and merging the results of the threads would be as expensive as building the results in the first place.
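
    To illustrate that point, here is a rough, hypothetical sketch of the merge step the manager would have to run over per-worker results (merge_partials and the hash layout are made up for illustration). Note that it visits every extracted key/value pair once more, which is essentially the same work the single-process loop already does:

    use strict;
    use warnings;

    # Fold per-worker hashes of the form
    # { "Dxxx" => { name1 => 1, name2 => 1 } } into one combined hash.
    sub merge_partials {
        my ($combined, @partials) = @_;
        for my $partial (@partials) {
            for my $id (keys %$partial) {
                # Every pair a worker extracted is touched again here.
                $combined->{$id}{$_} = 1 for keys %{ $partial->{$id} };
            }
        }
        return $combined;
    }

    # Toy usage with two made-up worker results.
    my %combined;
    merge_partials(\%combined,
        { DID1 => { 'Spore1 game' => 1 } },
        { DID1 => { 'Spore1' => 1 }, DID2 => { 'Spore2' => 1 } },
    );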

    I just wanted to provide a cleaned up version of your code (with a few micro-optimizations).

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw( say );
    use Sort::Naturally qw( nsort );

    local $/ = '';   # Paragraph mode reads until a blank line.

    my %grouped;
    while (<>) {
        my @lines = split /\n/, $_;
        for (@lines[1..$#lines]) {
            next if $_ eq '--';   # Omit if rarely true.
            ++$grouped{"D$1"}{$2} while /\bD\*([^*]*)\*([^*]*)\*D\b/g;
        }
    }

    for my $k (nsort keys %grouped) {
        say "$k=>" . join("|", keys %{ $grouped{$k} });
    }
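
    One note on running it: unlike the original, this version reads from standard input or from file names given on the command line and writes to standard output, so (assuming it were saved as, say, cleaned.pl) it would be invoked along the lines of:

        perl cleaned.pl inputfile > outputfile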
      Thank you Ikegami. New thing learned there.
Re: Parallel-processing the code
by marioroy (Prior) on May 17, 2018 at 04:23 UTC

    Hi rajaman,

    Hello :) Unfortunately, life is getting shorter and I have learned to skip threads like this one whenever test data is omitted, simply for lack of time. Sorry. That said, the demonstration that follows is not tested.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper qw(Dumper);
    use re::engine::RE2;
    use List::MoreUtils qw(uniq);
    use Sort::Naturally qw(nsort);
    use MCE;

    # This program reads an abstract sentence file and produces
    # output with the following format ...

    if ($#ARGV != 1) {
        print "usage: $0 <inputfile> <outputfile>\n";
    }

    my $inputfile1 = $ARGV[0];
    my $outputfile = $ARGV[1];

    unless (-e $inputfile1) {
        die "Can't open $inputfile1: No such file or directory";
    }

    # Make gather routine for the manager process. It returns a
    # closure block for preserving append-order as if processing
    # serially.

    my %hashunique;

    sub make_gather {
        my ($order_id, %tmp) = (1);

        return sub {
            my ($chunk_id, $hashref) = @_;
            $tmp{$chunk_id} = $hashref;

            while (exists $tmp{$order_id}) {
                $hashref = delete $tmp{$order_id};
                for my $k (keys %{ $hashref }) {
                    unless (exists $hashunique{$k}) {
                        $hashunique{$k} = $hashref->{$k};
                    } else {
                        $hashunique{$k} = $hashunique{$k}.'|'.$hashref->{$k};
                    }
                }
                $order_id++;
            }
        }
    }

    # The user function for MCE workers. Workers open a file handle to
    # a scalar ref due to using MCE option use_slurpio => 1.

    sub user_func {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        my %localunique;

        open RF, '<', $slurp_ref;

        # A shared-hash is not necessary. The gist of it all is batching
        # to a local hash. Otherwise, a shared-hash inside a loop involves
        # high IPC overhead.

        local $/ = '';   # blank line, paragraph break

        # in the event worker receives 2 or more records
        while (<RF>) {
            my @one = split /\n/, $_;
            my ($indexofdashinarray) = grep { $one[$_] =~ /\-\-/ } 0..$#one;

            for my $i (1..$#one) {
                next if $one[$i] =~ /^\-\-$/;
                while ($one[$i] =~ m/(\b)D\*(.*?)\*(.*?)\*D(\b)/g) {
                    unless (exists $localunique{"D$2"}) {
                        $localunique{"D$2"} = "$3";
                    } else {
                        $localunique{"D$2"} = $localunique{"D$2"}.'|'."$3";
                    }
                }
            }
        }

        close RF;

        # Each worker must call gather one time when preserving order
        # is desired which is the case for this demonstration.

        MCE->gather($chunk_id, \%localunique);
    }

    # Am using the core MCE API. Workers read the input file directly and
    # sequentially, one worker at a time.

    my $mce = MCE->new(
        max_workers => 3,
        input_data  => $inputfile1,
        chunk_size  => 2 * 1024 * 1024,   # 2 MiB
        RS          => '',                # important, blank line, paragraph break
        gather      => make_gather(),
        user_func   => \&user_func,
        use_slurpio => 1
    );

    $mce->run();

    # Results.

    open WF, ">", $outputfile or die "Can't open $outputfile: $!";

    foreach my $k (nsort keys %hashunique) {
        $hashunique{$k} = join ("\|", uniq split /\|/ , $hashunique{$k});
        print WF "$k=>$hashunique{$k}\n";
    }

    close WF;

    Regards, Mario

      Thanks very much Mario and others for your valuable input.

      I tried running your code, but it is generating blank output.

      I am appending the input and output file formats below. In the input file there are over 1,000,000 chunks of sentences (e.g. user reviews), with chunks separated by a blank line (shown below). I am trying to extract some pre-tagged patterns from the sentences: for example, extract D*ID1*Spore1 game*D from a sentence and then separate the ID of the game from its name; all names are later concatenated as shown in the output format below (a minimal check of the pattern is sketched after the formats).

      Please let me know how your MCE-based code needs to be modified.

      Thanks once again.

      
      Input file format:
      1
      --
      A new DVD with both the PC and Mac release for EA's D*ID1*Spore1 game*D.
      D*ID2*Spore2*D is not that type of game.
      That is why I gave D*ID1*Spore1*D a 3 star.
      
      2
      --
      D*ID2*Spore2*D is a wonderful game.
      A new DVD with both the PC and Mac release for EA's D*ID1*Spore1*D.
      
      3
      --
      Once you get the D*ID1*spore1*D cursor on your screen, click command-Q.
      .
      .
      
      Output format:
      ID1=>Spore1 game|Spore1|spore1 #case sensitive unique names only in hash value
      ID2=>Spore2
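
      For reference, here is a minimal, hypothetical check of what the tag pattern captures on one of the sample lines above. The pattern is the one from ikegami's reply; the tiny script itself is made up purely for illustration:

      use strict;
      use warnings;

      # One sample sentence from the input format above.
      my $sentence =
          "A new DVD with both the PC and Mac release for EA's D*ID1*Spore1 game*D.";

      # Capture the ID and the name from each D*...*...*D tag.
      while ($sentence =~ /\bD\*([^*]*)\*([^*]*)\*D\b/g) {
          print "$1 => $2\n";    # prints: ID1 => Spore1 game
      }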
      
      

        Hi rajaman,

        I am appending below input and output file formats...

        Great! I made two demonstrations, both entirely hash-key driven (two levels). The serial code, based on ikegami's demonstration, may be fast enough for your use case. The parallel demonstration may run two times faster or more. Gather order is not necessary. Be sure to have Sereal installed for maximum performance.

        Both demonstrations produce the same output.

        Serial Code

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Sort::Naturally qw(nsort);

        # This program reads an abstract sentence file and produces
        # output with the following format ...

        if ($#ARGV != 1) {
            print "usage: $0 <inputfile> <outputfile>\n";
        }

        my $inputfile1 = $ARGV[0];
        my $outputfile = $ARGV[1];

        my %hashunique;

        open RF, "<", $inputfile1 or die "Can't open $inputfile1: $!";

        local $/ = '';   # blank line, paragraph break

        while (<RF>) {
            my @lines = split /\n/, $_;
            # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

            for my $i (1..$#lines) {
                next if $lines[$i] eq '--';
                while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
                    $hashunique{"D$1"}{$2} = undef;
                }
            }
        }

        close RF;

        # Results.

        open WF, ">", $outputfile or die "Can't open $outputfile: $!";

        foreach my $k (nsort keys %hashunique) {
            $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
            print WF "$k=>$hashunique{$k}\n";
        }

        close WF;

        Parallel Code

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Sort::Naturally qw(nsort);
        use MCE;

        # This program reads an abstract sentence file and produces
        # output with the following format ...

        if ($#ARGV != 1) {
            print "usage: $0 <inputfile> <outputfile>\n";
        }

        my $inputfile1 = $ARGV[0];
        my $outputfile = $ARGV[1];

        unless (-e $inputfile1) {
            die "Can't open $inputfile1: No such file or directory";
        }

        # Gather routine for the manager process.

        my %hashunique;

        sub gather {
            my ($hashref) = @_;
            for my $k1 (keys %{$hashref}) {
                for my $k2 (keys %{$hashref->{$k1}}) {
                    $hashunique{$k1}{$k2} = undef;
                }
            }
        }

        # The user function for MCE workers. Workers open a file handle to
        # a scalar ref due to using MCE option use_slurpio => 1.

        sub user_func {
            my ($mce, $slurp_ref, $chunk_id) = @_;
            my %localunique;

            open RF, '<', $slurp_ref;

            # A shared-hash is not necessary. The gist of it all is batching
            # to a local hash. Otherwise, a shared-hash inside a loop involves
            # high IPC overhead.

            local $/ = '';   # blank line, paragraph break

            # in the event worker receives 2 or more records
            while (<RF>) {
                my @lines = split /\n/, $_;
                # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

                for my $i (1..$#lines) {
                    next if $lines[$i] eq '--';
                    while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
                        $localunique{"D$1"}{$2} = undef;
                    }
                }
            }

            close RF;

            # Call gather outside the loop.
            MCE->gather(\%localunique);
        }

        # Am using the core MCE API. Workers read the input file directly and
        # sequentially, one worker at a time.

        my $mce = MCE->new(
            max_workers => 4,
            input_data  => $inputfile1,
            chunk_size  => 1 * 1024 * 1024,   # 1 MiB
            RS          => '',                # important, blank line, paragraph break
            gather      => \&gather,
            user_func   => \&user_func,
            use_slurpio => 1
        );

        $mce->run();

        # Results.

        open WF, ">", $outputfile or die "Can't open $outputfile: $!";

        foreach my $k (nsort keys %hashunique) {
            $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
            print WF "$k=>$hashunique{$k}\n";
        }

        close WF;

        Regards, Mario

Re: Parallel-processing the code
by Marshall (Canon) on May 17, 2018 at 00:17 UTC
    Your application doesn't appear to be well suited for parallel processing. That's because you have a single input stream, what appears to be minimal processing, and a single output DB that each thread or process would have to interact with heavily.

    I get from your question that the root problem is "I want my application to run faster". You assume that multi-processing is the answer for that and you are asking us how to do it.

    Let's back up a bit. Without sample data, it is a bit hard for me to figure out exactly what you are doing. How long does this app take? How big is the input file? What triggers a "new run"? It could be that if, say, 90% of the data is the same between runs and only 10% is new, a strategy that saves the results from the 90% that was the same last time would get you the complete results faster. Anyway, before jumping to a "solution", I'd like to understand a bit more about what you are doing...
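
    For illustration only, here is a rough, untested sketch of that caching idea, assuming the paragraph-per-record format shown later in this thread. It uses Digest::MD5 and Storable; the file names, cache layout and variable names are made up:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use Storable qw(retrieve nstore);

    die "usage: $0 inputfile cachefile\n" unless @ARGV == 2;
    my ($inputfile, $cachefile) = @ARGV;

    # %$cache maps md5-of-paragraph => { "Dxxx" => { name => undef } },
    # so paragraphs already seen in a previous run are not re-parsed.
    my $cache = -e $cachefile ? retrieve($cachefile) : {};

    my %hashunique;
    local $/ = '';                               # paragraph mode

    open my $rf, '<', $inputfile or die "Can't open $inputfile: $!";
    while (my $para = <$rf>) {
        my $key = md5_hex($para);
        unless (exists $cache->{$key}) {         # only new/changed chunks
            my %found;
            while ($para =~ /\bD\*([^*]*)\*([^*]*)\*D\b/g) {
                $found{"D$1"}{$2} = undef;
            }
            $cache->{$key} = \%found;
        }
        # Merge the cached or freshly extracted pairs into the result.
        my $found = $cache->{$key};
        for my $id (keys %$found) {
            $hashunique{$id}{$_} = undef for keys %{ $found->{$id} };
        }
    }
    close $rf;

    nstore($cache, $cachefile);                  # save for the next run
    # ... then write %hashunique out exactly as in the original script.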

    It could also be that some relatively minor tweaks to your code could provide some performance increase, although I doubt anything truly dramatic.

    But for example, I have one application that calculates some results hourly in a very efficient manner. At the end of the fiscal year, we recalculate everything and it takes about 6 hours. Nobody cares that one year's worth of data takes 6 hours to reprocess, because the normal work is spread out in very small increments on an hourly basis throughout the year. Update: we do the complete reprocessing as a "double check" on the incremental process. In theory the results are the same.
