Re^5: Need to speed up many regex substitutions and somehow make them a here-doc list

by marioroy (Prior)
on Oct 06, 2022 at 03:12 UTC


in reply to Re^4: Need to speed up many regex substitutions and somehow make them a here-doc list
in thread Need to speed up many regex substitutions and somehow make them a here-doc list

The following is a parallel demonstration based on Marshall's implementation.

#!/usr/bin/env perl
# https://www.perlmonks.org/?node_id=11147200

use strict;
use warnings;
use MCE;
use Time::HiRes 'time';

die "usage: $0 infile1.txt [ infile2.txt ... ]\n" unless @ARGV;

# substitute whole word only
my %W1 = qw{
   going go getting get goes go knew know trying try
   tried try told tell coming come saying say men man
   women woman took take lying lie dying die made make
};

# substitute on prefix
my %W2 = qw{
   need need talk talk tak take used use using use
};

# substitute on substring
my %W3 = qw{
   mean mean work work read read allow allow gave give
   bought buy want want hear hear came come destr destroy
   paid pay selve self cities city fight fight creat create
   makin make includ include
};

my $RE1 = qr{\b(@{[ join '|', reverse sort keys %W1 ]})\b}i;
my $RE2 = qr{\b(@{[ join '|', reverse sort keys %W2 ]})\w*}i;
my $RE3 = qr{\b\w*?(@{[ join '|', reverse sort keys %W3 ]})\w*}i;

my $OUT_FH;   # output file-handle used by workers

# Spawn worker pool.

my $mce = MCE->new(
   max_workers => MCE::Util::get_ncpu(),
   chunk_size  => '64K',
   init_relay  => 0,   # specifying init_relay loads MCE::Relay
   use_slurpio => 1,   # enable slurpio

   user_begin => sub {
      # worker begin routine per each file to be processed
      my ($outfile) = @{ MCE->user_args() };
      open $OUT_FH, '>>', $outfile;
   },

   user_end => sub {
      # worker end routine per each file to be processed
      close $OUT_FH if defined $OUT_FH;
   },

   user_func => sub {
      # worker chunk routine
      my ($mce, $chunk_ref, $chunk_id) = @_;

      # process entire chunk versus line-by-line
      $$chunk_ref =~ tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
      $$chunk_ref =~ s/w(as|ere)/be/gi;
      $$chunk_ref =~ s/$RE1/ $W1{lc $1} /g;
      $$chunk_ref =~ s/$RE2/ $W2{lc $1} /g;
      $$chunk_ref =~ s/$RE3/ $W3{lc $1} /g;

      # Output orderly and serially.
      MCE->relay_lock;
      print $OUT_FH $$chunk_ref;
      $OUT_FH->flush;
      MCE->relay_unlock;
   }

)->spawn;

# Process file(s).

my $status = 0;

while (my $infile = shift @ARGV) {
   if (-d $infile) {
      warn "WARN: '$infile': Is a directory, skipped\n";
      $status = 1;
   }
   elsif (! -f $infile) {
      warn "WARN: '$infile': No such file, skipped\n";
      $status = 1;
   }
   else {
      my $outfile = $infile;
      $outfile =~ s/\.txt$/.dat/;
      if ($outfile eq $infile) {
         warn "WARN: '$outfile': matches input name, skipped\n";
         $status = 1;
         next;
      }

      # truncate output file
      open my $fh, '>', $outfile or do {
         warn "WARN: '$outfile': $!, skipped\n";
         $status = 1;
         next;
      };
      close $fh;

      # process file; pass argument(s) to workers
      my $start = time();
      $mce->process($infile, { user_args => [ $outfile ] });
      my $total_seconds = time() - $start;

      my $minutes = int($total_seconds / 60);
      my $seconds = $total_seconds - ($minutes * 60);
      printf "file: $infile mins: $minutes secs: %0.3f\n", $seconds;
   }
}

$mce->shutdown;   # reap workers
exit $status;

Re^6: Need to speed up many regex substitutions and somehow make them a here-doc list
by Marshall (Canon) on Oct 08, 2022 at 23:24 UTC
    I benchmarked your code with same input file that I was using for other tests with the following results:
    .../ManySubstitutions>perl mceparallel.pl nightfall.txt
    file: nightfall.txt mins: 0 secs: 5.763
    file: nightfall.txt mins: 0 secs: 4.290
    file: nightfall.txt mins: 0 secs: 4.179
    file: nightfall.txt mins: 0 secs: 4.293
    So this is about 3x the speed of my previous single-threaded version (exec time was about 12 sec with that one). The slightly longer time for the first run is to be expected, as this was after a clean reboot and Perl was not yet in memory, etc. I have 16 GB RAM on this Win10 machine, and after one run a lot of the disk data winds up in the read cache.

    From what I gather, your MCE code processes things in 64K chunks. How does it go about determining the boundary for a 64K chunk? Is there a chance that a word or a line could get split between chunks? I guess that there is some slight inefficiency because MCE has to enforce sequential finishing, and again, I wasn't really sure how that sequencing is done/guaranteed.

    In the OP's situation, he/she says that there could be millions of files to process. We are just using a single file for benchmarking purposes. In practice, I would think that a scheme that processes, say, X streams of files simultaneously would be a simpler processing model, where X is the number of cores/CPUs. My machine has 4 cores, so I split the big input file into 4 smaller ones of approximately the same size and show some forking code below.

    I had expected this to be faster than the MCE version because there are no conflicts between forks (or actually virtual forks, i.e. threads on Windows). However, that turned out not to be the case. I am a bit disappointed in the performance, as I have seen other applications where I can get much closer to the theoretical (but not reachable) limit of 4x better. But ~3x isn't bad. This fork business is weird on Windows, and this code may work a lot better on Unix, which can do "real" forks.

    use strict;
    use warnings;
    use Fcntl qw(:flock);
    use File::Copy 'move';
    use POSIX "sys_wait_h";   # for waitpid FLAGS
    use Time::HiRes 'time';

    $| = 1;

    my @in_files    = qw(nightfall1.txt nightfall2.txt nightfall3.txt nightfall4.txt);
    my $start_epoch = time();

    # Fire off a number of child processes equal to the
    # number of files in @in_files.
    # Then the parent who started these little guys
    # goes into a blocking wait until they all finish.
    # In this simple example, they will finish at about the same time
    # because all the input files are roughly identical.
    # Each child can return a status code via exit($code_number).
    # Each child writes its own output file, so the only real "bottleneck"
    # is max avg throughput of the file system.

    ############
    ## Common code for all forked processes
    #
    # substitute whole word only
    my %w1 = qw{ going go getting get goes go knew know trying try
       tried try told tell coming come saying say men man women woman
       took take lying lie dying die made make };

    # substitute on prefix
    my %w2 = qw{ need need talk talk tak take used use using use };

    # substitute on substring
    my %w3 = qw{ mean mean work work read read allow allow gave give
       bought buy want want hear hear came come destr destroy paid pay
       selve self cities city fight fight creat create makin make
       includ include };

    my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;
    my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;
    my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;
    #my $re3 = qr{\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;   # half speed of \b version

    #########
    ## Fork off the children
    #
    $SIG{CHLD} = 'IGNORE';

    open(my $fh_log, '>>', "Alogfile.txt") or die "unable to open Alogfile.txt $!";
    #$fh_log->autoflush;   # not needed - this is automatic before locking or unlocking a file!

    foreach my $file_name (@in_files) {
        if (my $pid = fork) {
            # parent
            safe_print($fh_log, "Spawned child pid: $pid for $file_name\n");
        }
        elsif (defined $pid) {   # pid==0
            # child
            safe_print($fh_log, "This is child pid $$ for $file_name. I am alive and working!\n");
            process_file($file_name);
            safe_print($fh_log, "Child $$ finished work on $file_name\n");
            exit(0);
        }
        else {
            # fork failed, pid undefined
            die "MASSIVE ERROR - FORK FAILED with $!";
        }
    }

    ### now wait for all children to finish, no matter who they are
    1 while wait != -1;   # avoid zombies - this is a blocking operation

    safe_print($fh_log, "Parenting talking...all my children are finished! Hooray!\n");
    close $fh_log;

    sub safe_print {
        my ($fh, @text) = @_;
        my $now_epoch  = time();
        my $delta_secs = $now_epoch - $start_epoch;

        flock $fh, LOCK_EX or die "flock can't get lock $!";
        printf $fh "%.3f secs %s", $delta_secs, $_ foreach @text;
        printf     "%.3f secs %s", $delta_secs, $_ foreach @text;
        flock $fh, LOCK_UN or die "flock can't release lock $!";
    }

    sub process_file {
        my $filename = shift;

        open my $IN, '<', $filename or die "can't open input $filename $!";
        my $outfile = $filename;
        $outfile =~ s/\.txt$/\.out/;
        open my $OUT, '>', $outfile or die "can't open output $outfile $!";
        safe_print($fh_log, "opened $filename and $outfile\n");

        while (<$IN>) {
            tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;   # no punct, no digits
            s/w(as|ere)/be/gi;
            s{$re1}{ $w1{lc $1} }g;   # this ~2-3 sec
            s{$re2}{ $w2{lc $1} }g;   # this ~3 sec
            s{$re3}{ $w3{lc $1} }g;   # this ~6 sec
            print $OUT "$_";
        }
        close $IN;
        close $OUT;
        safe_print($fh_log, "Child $$ finished working on $filename!\n");
        exit(0);   # CHILD has to exit itself!
    }
    __END__
    0.007 secs Spawned child pid: -1620 for nightfall1.txt
    0.008 secs This is child pid -1620 for nightfall1.txt. I am alive and working!
    0.011 secs opened nightfall1.txt and nightfall1.out
    0.013 secs Spawned child pid: -5660 for nightfall2.txt
    0.014 secs This is child pid -5660 for nightfall2.txt. I am alive and working!
    0.017 secs opened nightfall2.txt and nightfall2.out
    0.019 secs Spawned child pid: -20048 for nightfall3.txt
    0.020 secs This is child pid -20048 for nightfall3.txt. I am alive and working!
    0.022 secs opened nightfall3.txt and nightfall3.out
    0.029 secs Spawned child pid: -4840 for nightfall4.txt
    0.031 secs This is child pid -4840 for nightfall4.txt. I am alive and working!
    0.034 secs opened nightfall4.txt and nightfall4.out
    4.818 secs Child -4840 finished working on nightfall4.txt!
    4.827 secs Child -1620 finished working on nightfall1.txt!
    4.835 secs Child -5660 finished working on nightfall2.txt!
    4.842 secs Child -20048 finished working on nightfall3.txt!
    4.844 secs Parent talking...all my children are finished! Hooray!
    Note that the last process to be started finished first, but the file sizes aren't completely equal, and there is also some variability between runs depending upon how the O/S does the core assignment and what else is going on in the machine.
    File sizes for reference:
    10/04/2022  06:44 PM        80,748,006 nightfall.txt
    10/08/2022  01:30 PM        20,187,006 nightfall1.txt
    10/08/2022  01:30 PM        20,187,078 nightfall2.txt
    10/08/2022  01:30 PM        20,187,057 nightfall3.txt
    10/08/2022  01:30 PM        20,186,865 nightfall4.txt
      How does it go about determining the boundary for a 64K chunk? Is there a chance that a word or a line could get split between chunks?

      Great question. Workers, after reading a chunk, continue reading until the end of the record (end of line), so no word or line is split between chunks.
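
      A rough sketch of that idea follows. This is not MCE's actual internals; the read_chunk helper is hypothetical and only illustrates reading a fixed-size block and then extending it to the next record boundary.

      use strict;
      use warnings;

      # Hypothetical helper, for illustration only: grab roughly $chunk_size
      # bytes, then keep reading to the end of the current line so that no
      # record is split between chunks.
      sub read_chunk {
          my ($fh, $chunk_size) = @_;
          my $chunk = '';
          read($fh, $chunk, $chunk_size) or return undef;    # fixed-size block; undef at EOF
          $chunk .= readline($fh) // '' if substr($chunk, -1) ne "\n";
          return $chunk;                                      # now ends on a record boundary
      }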

      I guess that there is some slight inefficiency because MCE has to enforce sequential finishing ... I wasn't really sure how that sequencing is done/guaranteed.

      In the demonstration, workers await their turn to output in order, which is handled by chunk_id behind the scenes. The code between MCE->relay_lock and MCE->relay_unlock runs serially (a small standalone sketch follows the results below). Below are MCE results from a Fedora Linux box.

      use threads;   # add line before loading MCE (to spin threads)

      1 thread      12.473 seconds
      4 threads      3.182 seconds   3.92x

      fork

      1 process      7.331 seconds
      4 processes    1.848 seconds   3.97x
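
      To see the relay ordering on its own, here is a minimal sketch with a toy input list (1..8) rather than the benchmark data: workers enter the region between relay_lock and relay_unlock one at a time, in chunk_id order.

      use strict;
      use warnings;
      use MCE;

      MCE->new(
         max_workers => 4,
         chunk_size  => 1,
         init_relay  => 0,              # loads MCE::Relay
         input_data  => [ 1 .. 8 ],
         user_func   => sub {
            my ($mce, $chunk_ref, $chunk_id) = @_;
            MCE->relay_lock;            # serial section, ordered by chunk_id
            print "chunk $chunk_id handled by worker ", MCE->wid, "\n";
            MCE->relay_unlock;
         },
      )->run;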
      In the OP's situation, he/she says that there could be millions of files to process.

      You may have missed a post where I mentioned something similar.

      This fork business is weird on Windows, and this code may work a lot better on Unix, which can do "real" forks.

      Something to try is the MCE use_threads => 0 option. That will cause MCE to spin workers via the emulated fork.
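
      For example, a minimal sketch showing where the option goes; this uses a toy user_func, not the benchmark code.

      use strict;
      use warnings;
      use MCE;

      my $mce = MCE->new(
         max_workers => 4,
         use_threads => 0,   # spin workers via fork (emulated fork on Windows)
         user_func   => sub { MCE->say("worker ", MCE->wid, " running as PID ", $$) },
      );
      $mce->run;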

      ... there is some variability between runs depending upon how the O/S does the core assignment and what else is going on in the machine.

      Dividing the work equally by the number of workers can exhibit an imbalance. One doesn't know for sure whether the regular-expression work runs faster on the first part of the file or somewhere in between. It's possible for one worker to finish sooner than the others, leaving CPU time idle. That idle CPU time may be greater than the time MCE workers spend awaiting their turn to output serially. I'm not seeing such an imbalance for this workload.

      For reference, MCE above completes in 1.848 seconds (fork). One reason it completes faster is that the MCE code processes the entire chunk rather than going line-by-line. For an apples-to-apples comparison, I will update the MCE code to process line-by-line and report back.
      -rw-r--r--. 1 mario mario 80348000 Oct  9 22:15 nightfall.txt
      -rw-r--r--. 1 mario mario 20087000 Oct  9 22:12 nightfall1.txt
      -rw-r--r--. 1 mario mario 20087000 Oct  9 22:13 nightfall2.txt
      -rw-r--r--. 1 mario mario 20087000 Oct  9 22:13 nightfall3.txt
      -rw-r--r--. 1 mario mario 20087000 Oct  9 22:14 nightfall4.txt

      0.000 secs Spawned child pid: 11194 for nightfall1.txt
      0.000 secs This is child pid 11194 for nightfall1.txt. I am alive and working!
      0.000 secs Spawned child pid: 11195 for nightfall2.txt
      0.001 secs opened nightfall1.txt and nightfall1.out
      0.001 secs Spawned child pid: 11196 for nightfall3.txt
      0.001 secs This is child pid 11195 for nightfall2.txt. I am alive and working!
      0.001 secs Spawned child pid: 11197 for nightfall4.txt
      0.001 secs This is child pid 11196 for nightfall3.txt. I am alive and working!
      0.001 secs opened nightfall2.txt and nightfall2.out
      0.001 secs This is child pid 11197 for nightfall4.txt. I am alive and working!
      0.001 secs opened nightfall3.txt and nightfall3.out
      0.001 secs opened nightfall4.txt and nightfall4.out
      2.243 secs Child 11194 finished working on nightfall1.txt!
      2.261 secs Child 11197 finished working on nightfall4.txt!
      2.274 secs Child 11196 finished working on nightfall3.txt!
      2.275 secs Child 11195 finished working on nightfall2.txt!
      2.276 secs Parenting talking...all my children are finished! Hooray!

      I'm back. I updated the MCE code to process line-by-line and ran again.

      mce-process-entire-chunk.pl    1.848 seconds
      mce-process-line-by-line.pl    2.264 seconds   (2.235 ~ 2.281)

      Workers processing entire chunk:

      user_func => sub {
         # worker chunk routine
         my ($mce, $chunk_ref, $chunk_id) = @_;

         $$chunk_ref =~ tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
         $$chunk_ref =~ s/w(as|ere)/be/gi;
         $$chunk_ref =~ s/$RE1/ $W1{lc $1} /g;
         $$chunk_ref =~ s/$RE2/ $W2{lc $1} /g;
         $$chunk_ref =~ s/$RE3/ $W3{lc $1} /g;

         # Output orderly and serially.
         MCE->relay_lock;
         print $OUT_FH $$chunk_ref;
         $OUT_FH->flush;
         MCE->relay_unlock;
      }

      Workers processing line-by-line:

      user_func => sub {
         # worker chunk routine
         my ($mce, $chunk_ref, $chunk_id) = @_;
         my $output = '';

         open my $fh, '<', $chunk_ref;
         while (<$fh>) {
            tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
            s/w(as|ere)/be/gi;
            s/$RE1/ $W1{lc $1} /g;
            s/$RE2/ $W2{lc $1} /g;
            s/$RE3/ $W3{lc $1} /g;
            $output .= $_;
         }
         close $fh;

         # Output orderly and serially.
         MCE->relay_lock;
         print $OUT_FH $output;
         $OUT_FH->flush;
         MCE->relay_unlock;
      }
