Parallel-processing the code

by rajaman (Sexton)
on May 16, 2018 at 18:36 UTC ( [id://1214682] )

rajaman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am processing a text file (code shown below) chunk by chunk. For each text chunk, I do some processing, shown in the while loop in the code below, and collect the results in a hash (%hashunique). How can I do parallel processing on this code? For example, run 10 instances of the while loop in parallel, each processing different chunks of text from the input file, with all results saved in %hashunique at the end.

I checked some modules, but could not figure out how to apply them to the code below.

Thanks a lot!

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper qw(Dumper);
use re::engine::RE2;
use List::MoreUtils qw(uniq);
use Sort::Naturally;

#This program reads abstract sentence file and produces output with the following format:
#

if ($#ARGV != 1) {
    print "usage: program arguments\n";
}

my $inputfile1 = $ARGV[0];
my $outputfile = $ARGV[1];
my %hashunique = ();

open(RF, "$inputfile1") or die "Can't open < $inputfile1: $!";
open(WF, ">$outputfile");    #open for output

$/ = '';    #this sets the delimiter for an empty line

while (<RF>) {
    my @one = split /\n/, $_;
    my ($indexofdashinarray) = grep { $one[$_] =~ /\-\-/ } 0..$#one;
    for (my $i = 0; $i <= $#one; $i++) {
        next if $i == 0;
        next if $one[$i] =~ /^\-\-$/;
        while ($one[$i] =~ m/(\b)D\*(.*?)\*(.*?)\*D(\b)/g) {
            unless ($hashunique{"D$2"}) {
                $hashunique{"D$2"} = "$3";
            } else {
                $hashunique{"D$2"} = $hashunique{"D$2"}.'|'."$3";
            }
        }
    }
}

foreach my $i (nsort keys %hashunique) {
    $hashunique{$i} = join("\|", uniq split /\|/, $hashunique{$i});
    print WF "$i=>$hashunique{$i}\n";
}

close (RF);
close (WF);

Replies are listed 'Best First'.
Re: Parallel-processing the code
by ikegami (Patriarch) on May 17, 2018 at 00:52 UTC

    As previously mentioned, multi-tasking won't help. This program is I/O-bound, and merging the results of the threads would be as expensive as building the results in the first place.
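
    To illustrate that point, here is a rough, hypothetical sketch of the merge step the manager would have to run over per-worker results (merge_partials and the hash layout are made up for illustration). Note that it visits every extracted key/value pair once more, which is essentially the same work the single-process loop already does:

    use strict;
    use warnings;

    # Fold per-worker hashes of the form
    # { "Dxxx" => { name1 => 1, name2 => 1 } } into one combined hash.
    sub merge_partials {
        my ($combined, @partials) = @_;
        for my $partial (@partials) {
            for my $id (keys %$partial) {
                # Every pair a worker extracted is touched again here.
                $combined->{$id}{$_} = 1 for keys %{ $partial->{$id} };
            }
        }
        return $combined;
    }

    # Toy usage with two made-up worker results.
    my %combined;
    merge_partials(\%combined,
        { DID1 => { 'Spore1 game' => 1 } },
        { DID1 => { 'Spore1' => 1 }, DID2 => { 'Spore2' => 1 } },
    );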

    I just wanted to provide a cleaned up version of your code (with a few micro-optimizations).

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw( say );
    use Sort::Naturally qw( nsort );

    local $/ = '';   # Paragraph mode reads until a blank line.

    my %grouped;
    while (<>) {
        my @lines = split /\n/, $_;
        for (@lines[1..$#lines]) {
            next if $_ eq '--';   # Omit if rarely true.
            ++$grouped{"D$1"}{$2} while /\bD\*([^*]*)\*([^*]*)\*D\b/g;
        }
    }

    for my $k (nsort keys %grouped) {
        say "$k=>" . join("|", keys %{ $grouped{$k} });
    }
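
    One note on running it: unlike the original, this version reads from standard input or from file names given on the command line and writes to standard output, so (assuming it were saved as, say, cleaned.pl) it would be invoked along the lines of:

        perl cleaned.pl inputfile > outputfile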
      Thank you Ikegami. New thing learned there.
Re: Parallel-processing the code
by marioroy (Prior) on May 17, 2018 at 04:23 UTC

    Hi rajaman,

    Hello :) Unfortunately, life is getting shorter and I have learned to skip threads like this one whenever test data is omitted, simply for lack of time. Sorry. That said, the demonstration that follows is not tested.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper qw(Dumper);
    use re::engine::RE2;
    use List::MoreUtils qw(uniq);
    use Sort::Naturally qw(nsort);
    use MCE;

    # This program reads an abstract sentence file and produces
    # output with the following format ...

    if ($#ARGV != 1) {
        print "usage: $0 <inputfile> <outputfile>\n";
    }

    my $inputfile1 = $ARGV[0];
    my $outputfile = $ARGV[1];

    unless (-e $inputfile1) {
        die "Can't open $inputfile1: No such file or directory";
    }

    # Make gather routine for the manager process. It returns a
    # closure block for preserving append-order as if processing
    # serially.

    my %hashunique;

    sub make_gather {
        my ($order_id, %tmp) = (1);

        return sub {
            my ($chunk_id, $hashref) = @_;
            $tmp{$chunk_id} = $hashref;

            while (exists $tmp{$order_id}) {
                $hashref = delete $tmp{$order_id};
                for my $k (keys %{ $hashref }) {
                    unless (exists $hashunique{$k}) {
                        $hashunique{$k} = $hashref->{$k};
                    } else {
                        $hashunique{$k} = $hashunique{$k}.'|'.$hashref->{$k};
                    }
                }
                $order_id++;
            }
        }
    }

    # The user function for MCE workers. Workers open a file handle to
    # a scalar ref due to using MCE option use_slurpio => 1.

    sub user_func {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        my %localunique;

        open RF, '<', $slurp_ref;

        # A shared-hash is not necessary. The gist of it all is batching
        # to a local hash. Otherwise, a shared-hash inside a loop involves
        # high IPC overhead.

        local $/ = '';   # blank line, paragraph break

        # in the event worker receives 2 or more records
        while (<RF>) {
            my @one = split /\n/, $_;
            my ($indexofdashinarray) = grep { $one[$_] =~ /\-\-/ } 0..$#one;

            for my $i (1..$#one) {
                next if $one[$i] =~ /^\-\-$/;
                while ($one[$i] =~ m/(\b)D\*(.*?)\*(.*?)\*D(\b)/g) {
                    unless (exists $localunique{"D$2"}) {
                        $localunique{"D$2"} = "$3";
                    } else {
                        $localunique{"D$2"} = $localunique{"D$2"}.'|'."$3";
                    }
                }
            }
        }

        close RF;

        # Each worker must call gather one time when preserving order
        # is desired which is the case for this demonstration.

        MCE->gather($chunk_id, \%localunique);
    }

    # Am using the core MCE API. Workers read the input file directly and
    # sequentially, one worker at a time.

    my $mce = MCE->new(
        max_workers => 3,
        input_data  => $inputfile1,
        chunk_size  => 2 * 1024 * 1024,   # 2 MiB
        RS          => '',                # important, blank line, paragraph break
        gather      => make_gather(),
        user_func   => \&user_func,
        use_slurpio => 1
    );

    $mce->run();

    # Results.

    open WF, ">", $outputfile or die "Can't open $outputfile: $!";

    foreach my $k (nsort keys %hashunique) {
        $hashunique{$k} = join ("\|", uniq split /\|/ , $hashunique{$k});
        print WF "$k=>$hashunique{$k}\n";
    }

    close WF;

    Regards, Mario

      Thanks very much Mario and others for your valuable input.

      I tried running your code, but it is generating blank output.

      I am appending the input and output file formats below. In the input file there are over 1,000,000 chunks of sentences (e.g. user reviews), with chunks separated by a blank line (shown below). I am trying to extract some pre-tagged patterns from the sentences: for example, extract D*ID1*Spore1 game*D from a sentence and then separate the ID of the game from its name; all names are later concatenated as shown in the output format below (a minimal check of the pattern is sketched after the formats).

      Please let me know how your MCE-based code needs to be modified.

      Thanks once again.

      
      Input file format:
      1
      --
      A new DVD with both the PC and Mac release for EA's D*ID1*Spore1 game*D.
      D*ID2*Spore2*D is not that type of game.
      That is why I gave D*ID1*Spore1*D a 3 star.
      
      2
      --
      D*ID2*Spore2*D is a wonderful game.
      A new DVD with both the PC and Mac release for EA's D*ID1*Spore1*D.
      
      3
      --
      Once you get the D*ID1*spore1*D cursor on your screen, click command-Q.
      .
      .
      
      Output format:
      ID1=>Spore1 game|Spore1|spore1 #case sensitive unique names only in hash value
      ID2=>Spore2
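
      For reference, here is a minimal, hypothetical check of what the tag pattern captures on one of the sample lines above. The pattern is the one from ikegami's reply; the tiny script itself is made up purely for illustration:

      use strict;
      use warnings;

      # One sample sentence from the input format above.
      my $sentence =
          "A new DVD with both the PC and Mac release for EA's D*ID1*Spore1 game*D.";

      # Capture the ID and the name from each D*...*...*D tag.
      while ($sentence =~ /\bD\*([^*]*)\*([^*]*)\*D\b/g) {
          print "$1 => $2\n";    # prints: ID1 => Spore1 game
      }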
      
      

        Hi rajaman,

        I am appending below input and output file formats...

        Great! I made two demonstrations, both entirely hash-key driven (two levels). The serial code, based on ikegami's demonstration, may be fast enough for your use case. The parallel demonstration may run two times faster or more. Gather order is not necessary. Be sure to have Sereal installed for maximum performance.

        Both demonstrations produce the same output.

        Serial Code

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Sort::Naturally qw(nsort);

        # This program reads an abstract sentence file and produces
        # output with the following format ...

        if ($#ARGV != 1) {
            print "usage: $0 <inputfile> <outputfile>\n";
        }

        my $inputfile1 = $ARGV[0];
        my $outputfile = $ARGV[1];

        my %hashunique;

        open RF, "<", $inputfile1 or die "Can't open $inputfile1: $!";

        local $/ = '';   # blank line, paragraph break

        while (<RF>) {
            my @lines = split /\n/, $_;
            # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

            for my $i (1..$#lines) {
                next if $lines[$i] eq '--';
                while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
                    $hashunique{"D$1"}{$2} = undef;
                }
            }
        }

        close RF;

        # Results.

        open WF, ">", $outputfile or die "Can't open $outputfile: $!";

        foreach my $k (nsort keys %hashunique) {
            $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
            print WF "$k=>$hashunique{$k}\n";
        }

        close WF;

        Parallel Code

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Sort::Naturally qw(nsort);
        use MCE;

        # This program reads an abstract sentence file and produces
        # output with the following format ...

        if ($#ARGV != 1) {
            print "usage: $0 <inputfile> <outputfile>\n";
        }

        my $inputfile1 = $ARGV[0];
        my $outputfile = $ARGV[1];

        unless (-e $inputfile1) {
            die "Can't open $inputfile1: No such file or directory";
        }

        # Gather routine for the manager process.

        my %hashunique;

        sub gather {
            my ($hashref) = @_;
            for my $k1 (keys %{$hashref}) {
                for my $k2 (keys %{$hashref->{$k1}}) {
                    $hashunique{$k1}{$k2} = undef;
                }
            }
        }

        # The user function for MCE workers. Workers open a file handle to
        # a scalar ref due to using MCE option use_slurpio => 1.

        sub user_func {
            my ($mce, $slurp_ref, $chunk_id) = @_;
            my %localunique;

            open RF, '<', $slurp_ref;

            # A shared-hash is not necessary. The gist of it all is batching
            # to a local hash. Otherwise, a shared-hash inside a loop involves
            # high IPC overhead.

            local $/ = '';   # blank line, paragraph break

            # in the event worker receives 2 or more records
            while (<RF>) {
                my @lines = split /\n/, $_;
                # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

                for my $i (1..$#lines) {
                    next if $lines[$i] eq '--';
                    while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
                        $localunique{"D$1"}{$2} = undef;
                    }
                }
            }

            close RF;

            # Call gather outside the loop.
            MCE->gather(\%localunique);
        }

        # Am using the core MCE API. Workers read the input file directly and
        # sequentially, one worker at a time.

        my $mce = MCE->new(
            max_workers => 4,
            input_data  => $inputfile1,
            chunk_size  => 1 * 1024 * 1024,   # 1 MiB
            RS          => '',                # important, blank line, paragraph break
            gather      => \&gather,
            user_func   => \&user_func,
            use_slurpio => 1
        );

        $mce->run();

        # Results.

        open WF, ">", $outputfile or die "Can't open $outputfile: $!";

        foreach my $k (nsort keys %hashunique) {
            $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
            print WF "$k=>$hashunique{$k}\n";
        }

        close WF;

        Regards, Mario

Re: Parallel-processing the code
by Marshall (Canon) on May 17, 2018 at 00:17 UTC
    Your application doesn't appear to be well suited for parallel processing. That's because you have a single input stream, what appears to be minimal processing, and a single output DB that each thread or process would have to interact with heavily.

    I get from your question that the root problem is "I want my application to run faster". You assume that multi-processing is the answer for that and you are asking us how to do it.

    Let's back up a bit. Without sample data, it is a bit hard for me to figure out exactly what you are doing. How long does this app take? How big is the input file? What triggers a "new run"? It could be that if, say, 90% of the data is the same between runs and only 10% is new, a strategy that saves the results from the 90% that was the same last time would get you the complete results faster. Anyway, before jumping to a "solution", I'd like to understand a bit more about what you are doing...
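
    For illustration only, here is a rough, untested sketch of that caching idea, assuming the paragraph-per-record format shown later in this thread. It uses Digest::MD5 and Storable; the file names, cache layout and variable names are made up:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use Storable qw(retrieve nstore);

    die "usage: $0 inputfile cachefile\n" unless @ARGV == 2;
    my ($inputfile, $cachefile) = @ARGV;

    # %$cache maps md5-of-paragraph => { "Dxxx" => { name => undef } },
    # so paragraphs already seen in a previous run are not re-parsed.
    my $cache = -e $cachefile ? retrieve($cachefile) : {};

    my %hashunique;
    local $/ = '';                               # paragraph mode

    open my $rf, '<', $inputfile or die "Can't open $inputfile: $!";
    while (my $para = <$rf>) {
        my $key = md5_hex($para);
        unless (exists $cache->{$key}) {         # only new/changed chunks
            my %found;
            while ($para =~ /\bD\*([^*]*)\*([^*]*)\*D\b/g) {
                $found{"D$1"}{$2} = undef;
            }
            $cache->{$key} = \%found;
        }
        # Merge the cached or freshly extracted pairs into the result.
        my $found = $cache->{$key};
        for my $id (keys %$found) {
            $hashunique{$id}{$_} = undef for keys %{ $found->{$id} };
        }
    }
    close $rf;

    nstore($cache, $cachefile);                  # save for the next run
    # ... then write %hashunique out exactly as in the original script.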

    It could also be that some relatively minor tweaks to your code could provide some performance increase, although I doubt anything truly dramatic.

    But for example, I have one application that calculates some results hourly in a very efficient manner. At the end of the fiscal year, we recalculate everything and it takes about 6 hours. Nobody cares that one year's worth of data takes 6 hours to reprocess, because the normal work is spread out in very small increments on an hourly basis throughout the year. Update: we do the complete reprocessing as a "double check" on the incremental process. In theory the results are the same.
