
Re^2: Need to speed up many regex substitutions and somehow make them a here-doc list

by xnous (Sexton)
on Oct 02, 2022 at 13:00 UTC ( [id://11147214] )


in reply to Re: Need to speed up many regex substitutions and somehow make them a here-doc list
in thread Need to speed up many regex substitutions and somehow make them a here-doc list

Thank you all for your replies; here are the scripts, data, and other pertinent information as requested. My Perl version is 5.36.0 on Linux, and the sample text file I used in the following "benchmarks" was created with:
wget http://www.astro.sunysb.edu/fwalter/AST389/TEXTS/Nightfall.htm
html2text-cpp Nightfall.htm >nightfall.txt
for i in {1..1000}; do cat nightfall.txt >>in.txt; done
Now in.txt is 77 MB. The bash/sed script:
#!/bin/bash
cat *.txt | \
tr -d '[:punct:]' | \
sed 's/[0-9]//g' | \
sed 's/w\(as\|ere\)/be/gi' | \
sed 's/ need.* / need /gi' | \
sed 's/ .*meant.* / mean /gi' | \
sed 's/ .*work.* / work /gi' | \
sed 's/ .*read.* / read /gi' | \
sed 's/ .*allow.* / allow /gi' | \
sed 's/ .*gave.* / give /gi' | \
sed 's/ .*bought.* / buy /gi' | \
sed 's/ .*want.* / want /gi' | \
sed 's/ .*hear.* / hear /gi' | \
sed 's/ .*came.* / come /gi' | \
sed 's/ .*destr.* / destroy /gi' | \
sed 's/ .*paid.* / pay /gi' | \
sed 's/ .*selve.* / self /gi' | \
sed 's/ .*self.* / self /gi' | \
sed 's/ .*cities.* / city /gi' | \
sed 's/ .*fight.* / fight /gi' | \
sed 's/ .*creat.* / create /gi' | \
sed 's/ .*makin.* / make /gi' | \
sed 's/ .*includ.* / include /gi' | \
sed 's/ .*mean.* / mean /gi' | \
sed 's/ talk.* / talk /gi' | \
sed 's/ going / go /gi' | \
sed 's/ getting / get /gi' | \
sed 's/ start.* / start /gi' | \
sed 's/ goes / go /gi' | \
sed 's/ knew / know /gi' | \
sed 's/ trying / try /gi' | \
sed 's/ tried / try /gi' | \
sed 's/ told / tell /gi' | \
sed 's/ coming / come /gi' | \
sed 's/ saying / say /gi' | \
sed 's/ men / man /gi' | \
sed 's/ women / woman /gi' | \
sed 's/ took / take /gi' | \
sed 's/ tak.* / take /gi' | \
sed 's/ lying / lie /gi' | \
sed 's/ dying / die /gi' | \
sed 's/ made /make /gi' | \
sed 's/ used.* / use /gi' | \
sed 's/ using.* / use /gi' \
>|out-sed.dat
This script executes in around 5 seconds:
% time ./re.sh
real    0m5,201s
user    0m43,394s
sys     0m1,302s
First Perl script, slurping input file at once and processing line-by-line:
#!/usr/bin/perl
use strict;
use warnings;
use 5.36.0;

my $BLOCKSIZE = 1024 * 1024 * 128;
my $data;
my $IN;
my $out = 'out-perl.dat';
truncate $out, 0;
open my $OUT, '>>', $out;

my @text = glob("*.txt");
foreach my $t (@text) {
    open($IN, '<', $t) or next;
    read($IN, $data, $BLOCKSIZE);
    my @line = split /\n/, $data;
    foreach (@line) {
        s/[[:punct:]]/ /g;
        tr/[0-9]//d;
        s/w(as|ere)/be/gi;
        s/\sneed.*/ need /gi;
        s/\s.*meant.*/ mean /gi;
        s/\s.*work.*/ work /gi;
        s/\s.*read.*/ read /gi;
        s/\s.*allow.*/ allow /gi;
        s/\s.*gave.*/ give /gi;
        s/\s.*bought.*/ buy /gi;
        s/\s.*want.*/ want /gi;
        s/\s.*hear.*/ hear /gi;
        s/\s.*came.*/ come /gi;
        s/\s.*destr.*/ destroy /gi;
        s/\s.*paid.*/ pay /gi;
        s/\s.*selve.*/ self /gi;
        s/\s.*self.*/ self /gi;
        s/\s.*cities.*/ city /gi;
        s/\s.*fight.*/ fight /gi;
        s/\s.*creat.*/ create /gi;
        s/\s.*makin.*/ make /gi;
        s/\s.*includ.*/ include /gi;
        s/\s.*mean.*/ mean /gi;
        s/\stalk.*/ talk /gi;
        s/\sgoing / go /gi;
        s/\sgetting / get /gi;
        s/\sstart.*/ start /gi;
        s/\sgoes / go /gi;
        s/\sknew / know /gi;
        s/\strying / try /gi;
        s/\stried / try /gi;
        s/\stold / tell /gi;
        s/\scoming / come /gi;
        s/\ssaying / say /gi;
        s/\smen / man /gi;
        s/\swomen / woman /gi;
        s/\stook / take /gi;
        s/\stak.*/ take /gi;
        s/\slying / lie /gi;
        s/\sdying / die /gi;
        s/\smade /make /gi;
        s/\sused.*/ use /gi;
        s/\susing.*/ use /gi;
        close $IN;
        print $OUT "$_\n";
    }
}
Please ignore the technicality of failed matches before/after a newline, as this line-by-line implementation is uselessly slow anyway at over 4 minutes. Slurping the input and splitting it into lines takes under 1 second.
% time ./re1.pl
real    4m1,655s
user    4m29,242s
sys     0m0,380s
If I split on /\s/ instead, the split itself takes 5 seconds, but the substitutions take 1 minute, i.e. 12 times slower than bash/sed:
% time ./re2.pl
real    1m5,096s
user    1m11,889s
sys     0m0,524s
For a final test, I created 1000 copies of nightfall.txt (77 KB) with % for i in {1..1000}; do cp nightfall.txt nightfall-$i.txt; done. All scripts took roughly the same amount of time to complete. So it would seem that my initial estimate of "60-70% slower Perl" was very optimistic, though the full scripts perform other tasks too, where Perl's operators and conditionals obviously blow Bash's out of the water.

For the record, I do all file read/write operations on tmpfs (ramdisk), so disk I/O isn't an issue. I'll implement AnomalousMonk's solution with hash lookup and report back soonest.

An idea that just occurred to me: when matching against individual words, most regexes can terminate the processing early (with next;), as no further matches are expected once a rule has fired. Still, I'd like to exhaust all possibilities before admitting defeat.
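A minimal sketch of that idea, with just a few of the rules and whole-word anchors for clarity:

foreach my $word (@words) {
    # s/// returns true on a hit, so the or-chain stops at the first
    # rule that fires; later rules are never tried for this word
       $word =~ s/^going$/go/i
    or $word =~ s/^knew$/know/i
    or $word =~ s/^took$/take/i;
    # ... remaining rules chained the same way ...
    print $OUT "$word ";
}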

Replies are listed 'Best First'.
Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list
by hippo (Bishop) on Oct 02, 2022 at 15:08 UTC

    Thanks for providing all this, that gives us a lot more to work on. I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

$ echo Washington werewolves are wasteful | perl -pe 's/w(as|ere)/be/gi;'
behington bewolves are beteful
$ echo No work was carried out on Thursday as that as a day of rest | perl -pe 's/\s.*work.*/ work /gi;'
No work
$ echo Did you swallow all that bacon | perl -pe 's/\s.*allow.*/ allow /gi;'
Did allow
$

As there's no point optimising code which doesn't do what you want, it would be good to clear this sort of thing up first.


    🦛

      hippo> I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

Yes, you're right, the actual match/subs are non-greedy. I just wanted to provide a simpler, beautified version of my ugly script, but the code structure is exactly the same.

      Corion> Regardless of the performance problems, you may be interested in using a proper stemmer to create a search index. See Lingua::Stem.

      I don't need (yet) a full stemming solution, which might not be the ideal tool as I'd have to override numerous substitutions.

      hv: Your hash lookup implementation runs twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is it runs faster when operating on lines compared to words. sed seems unbeatable at 6 seconds.

      AnomalousMonk> Here's something that may address your needs more closely. As always, the fine details of regex definition are critical. I still have no idea as to relative speed :)

      I tested your solution last but unfortunately it took 2'23" to complete. I'll be doing more tests in the following days and report back with any progress. Thank you all for your wisdom.

        hv: Your hash lookup implementation runs twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is it runs faster when operating on lines compared to words. sed seems unbeatable at 6 seconds.

        Glad it's making some progress, at least. :)

        It occurs to me now that since you do not need the /.*/ "to end of line" behaviour, you also do not actually need to split the text on newlines: you could work directly on the full text. That would substantially reduce the number of ops you execute, which should give a further speedup.

        The next step beyond that would be to combine the three substitutions into a single one, with a single hash. The idea here would be to concatenate the three regexps from the previous iteration, but wrapping the whole in (?|...) so the three distinct captures each get saved as $1, and make a single "master" lookup combining each of %w1, %w2, %w3. If we can combine "was/were" in there as well, I think we'd be starting to get properly competitive with the sed scripts.
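Something along these lines, as a rough and untested sketch (assuming the keys of %w1, %w2 and %w3 do not collide):

my %w = (%w1, %w2, %w3);   # single "master" lookup
my $alt1 = join '|', reverse sort keys %w1;
my $alt2 = join '|', reverse sort keys %w2;
my $alt3 = join '|', reverse sort keys %w3;
# branch-reset group: whichever branch matches, its capture is $1
my $re = qr{(?| \b($alt1)\b | \b($alt2)\w* | \w*?($alt3)\w* )}xi;
s{$re}{ $w{lc $1} }g;      # one pass instead of three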

        It is also worth considering whether you need Unicode support (I have no idea whether your sed supports it). If you do not need Unicode, you should also be able to get further speed by adding aa to the regexp flags, like my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}iaa;

First of all, please don't reply to different sub-threads in one post; this makes following the discussion much harder and is damaging your cause.

Secondly, it's unlikely that the speed of the regex engine matters much when combined with the overhead of reading those amounts of data. Processing data in RAM is many orders of magnitude faster than going through the file system.

        Benchmarking the whole workflow might give you a new perspective.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list
by hv (Prior) on Oct 02, 2022 at 16:15 UTC

    Thanks for the additional detail.

    Is it the intention that each of these substitutions replaces one word with another word? Because the use of .* in many of the patterns means that's not what is actually happening. For example it looks like the intention is to replace the text "one two coworker three four" with the text "one two work three four", but it will actually be replaced with "one work " because the pattern \s.*work.* will match from the first space to the end of the line.

    Assuming that the intention is to replace one word with another word, that could look something like this:

# substitute whole word only
my %w1 = qw{
    going go   getting get   goes go    knew know   trying try
    tried try  told tell     coming come  saying say  men man
    women woman  took take   lying lie  dying die   made make
};
# substitute on prefix
my %w2 = qw{
    need need  talk talk  tak take  used use  using use
};
# substitute on substring
my %w3 = qw{
    mean mean    work work   read read   allow allow   gave give
    bought buy   want want   hear hear   came come     destr destroy
    paid pay     selve self  cities city fight fight   creat create
    makin make   includ include
};
my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;
my $re3 = qr{\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;

# then in the loop
s/[[:punct:]]/ /g;
tr/[0-9]//d;
s/w(as|ere)/be/gi;
s{$re1}{ $w1{lc $1} }g;
s{$re2}{ $w2{lc $1} }g;
s{$re3}{ $w3{lc $1} }g;
print $OUT "$_\n";

    If the input is always ASCII, the initial cleanup for punctuation and digits could potentially be something like s/[^a-z ]/ /gi or equivalently tr/a-zA-Z / /cs, unless you specifically wanted to replace "ABC123D" with the single word "ABCD" rather than the two words "ABC D". However if it may be Unicode, you would instead need something like s/[^\w ]/ /g, with no tr equivalent.

The standalone substitution for w(as|ere) should probably become two additional entries in one of the existing hashes: currently this substitution is unique in replacing a substring with another substring, so for example it will change "showered" into "shobed".
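That could be as simple as this sketch ($re1 must of course be rebuilt after adding the keys):

# make was/were whole-word rules, so "showered" stays untouched
$w1{was}  = 'be';
$w1{were} = 'be';
# ... and drop the standalone s/w(as|ere)/be/gi line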

    It will also help a bit to move the close $IN out of the loop (though it doesn't actually seem to cause a noticeable slowdown).

    The above code runs for me about five times faster than your example perl code, though as described it behaves quite differently.

      I benchmarked your code.
      Here is my implementation:
use strict;
use warnings;

# substitute whole word only
my %w1 = qw{
    going go   getting get   goes go    knew know   trying try
    tried try  told tell     coming come  saying say  men man
    women woman  took take   lying lie  dying die   made make
};
# substitute on prefix
my %w2 = qw{
    need need  talk talk  tak take  used use  using use
};
# substitute on substring
my %w3 = qw{
    mean mean    work work   read read   allow allow   gave give
    bought buy   want want   hear hear   came come     destr destroy
    paid pay     selve self  cities city fight fight   creat create
    makin make   includ include
};
my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;
my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;  # see discussion
#my $re3 = qr{\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;
#print "$re3\n"; # for debugging

my $out = 'out-perl.dat';
open my $OUT, '>', $out or die "unable to open $out $!";

my $start = time();
my $finish;

open my $IN, '<', "nightfall.txt" or die " $!";  # 75 MB file
while (<$IN>) {
    tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;  # no punct, no digits; other formulations possible
    s/w(as|ere)/be/gi;
    s{$re1}{ $w1{lc $1} }g;  # this ~2-3 sec
    s{$re2}{ $w2{lc $1} }g;  # this ~3 sec
    s{$re3}{ $w3{lc $1} }g;  # this ~6 (best) - 14 sec
    print $OUT "$_";
}
$finish = time();
my $total_seconds = $finish - $start;
my $minutes = int($total_seconds / 60);
my $seconds = $total_seconds - ($minutes * 60);
print "minutes: $minutes seconds: $seconds\n";

__END__
Time to completion with \b added to begin of $re3
minutes: 0 seconds: 12
As expected, $re1 is the fastest; $re2 has half the terms but takes a bit longer than $re1. $re3 as you posted took a LOT longer - 14 secs.
$re3 is the one where the target can be in the middle of other characters, and that is "expensive". I added a \b to $re3, which I don't think changes the meaning of what it does, but it cuts about 8 seconds off the execution time!

      I did the substitutions on a per line basis. In other testing, I found that to be faster than running "one shot" on the input as a single string. I suspect that is because less stuff needs to be moved around when doing a substitution into the much smaller line string.
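One way to compare the two approaches directly is a sketch like this with the core Benchmark module (it assumes $text holds the slurped file and $re1/%w1 come from the code above; illustrative only):

use Benchmark 'cmpthese';
cmpthese(3, {
    per_line => sub {
        for my $line (split /\n/, $text) {
            $line =~ s{$re1}{ $w1{lc $1} }g;
        }
    },
    one_shot => sub {
        my $copy = $text;   # work on a copy so $text stays reusable
        $copy =~ s{$re1}{ $w1{lc $1} }g;
    },
});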

      With a 12 second run time, this is getting into the range of the sed solution. I am not at all confident that the 5 second number can be equaled, much less bested. However, this is a lot closer to the goal.

        The following is a parallel demonstration based on Marshall's implementation.

#!/usr/bin/env perl
# https://www.perlmonks.org/?node_id=11147200

use strict;
use warnings;

use MCE;
use Time::HiRes 'time';

die "usage: $0 infile1.txt [ infile2.txt ... ]\n" unless @ARGV;

# substitute whole word only
my %W1 = qw{
    going go   getting get   goes go    knew know   trying try
    tried try  told tell     coming come  saying say  men man
    women woman  took take   lying lie  dying die   made make
};
# substitute on prefix
my %W2 = qw{
    need need  talk talk  tak take  used use  using use
};
# substitute on substring
my %W3 = qw{
    mean mean    work work   read read   allow allow   gave give
    bought buy   want want   hear hear   came come     destr destroy
    paid pay     selve self  cities city fight fight   creat create
    makin make   includ include
};
my $RE1 = qr{\b(@{[ join '|', reverse sort keys %W1 ]})\b}i;
my $RE2 = qr{\b(@{[ join '|', reverse sort keys %W2 ]})\w*}i;
my $RE3 = qr{\b\w*?(@{[ join '|', reverse sort keys %W3 ]})\w*}i;

my $OUT_FH;  # output file-handle used by workers

# Spawn worker pool.
my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => '64K',
    init_relay  => 0,  # specifying init_relay loads MCE::Relay
    use_slurpio => 1,  # enable slurpio
    user_begin  => sub {
        # worker begin routine per each file to be processed
        my ($outfile) = @{ MCE->user_args() };
        open $OUT_FH, '>>', $outfile;
    },
    user_end => sub {
        # worker end routine per each file to be processed
        close $OUT_FH if defined $OUT_FH;
    },
    user_func => sub {
        # worker chunk routine
        my ($mce, $chunk_ref, $chunk_id) = @_;

        # process entire chunk versus line-by-line
        $$chunk_ref =~ tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
        $$chunk_ref =~ s/w(as|ere)/be/gi;
        $$chunk_ref =~ s/$RE1/ $W1{lc $1} /g;
        $$chunk_ref =~ s/$RE2/ $W2{lc $1} /g;
        $$chunk_ref =~ s/$RE3/ $W3{lc $1} /g;

        # Output orderly and serially.
        MCE->relay_lock;
        print $OUT_FH $$chunk_ref;
        $OUT_FH->flush;
        MCE->relay_unlock;
    }
)->spawn;

# Process file(s).
my $status = 0;

while (my $infile = shift @ARGV) {
    if (-d $infile) {
        warn "WARN: '$infile': Is a directory, skipped\n";
        $status = 1;
    }
    elsif (! -f $infile) {
        warn "WARN: '$infile': No such file, skipped\n";
        $status = 1;
    }
    else {
        my $outfile = $infile;
        $outfile =~ s/\.txt$/.dat/;
        if ($outfile eq $infile) {
            warn "WARN: '$outfile': matches input name, skipped\n";
            $status = 1;
            next;
        }
        # truncate output file
        open my $fh, '>', $outfile or do {
            warn "WARN: '$outfile': $!, skipped\n";
            $status = 1;
            next;
        };
        close $fh;

        # process file; pass argument(s) to workers
        my $start = time();
        $mce->process($infile, { user_args => [ $outfile ] });
        my $total_seconds = time() - $start;
        my $minutes = int($total_seconds / 60);
        my $seconds = $total_seconds - ($minutes * 60);
        printf "file: $infile mins: $minutes secs: %0.3f\n", $seconds;
    }
}

$mce->shutdown;  # reap workers
exit $status;
Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list
by bliako (Monsignor) on Oct 04, 2022 at 10:25 UTC

sed can take several substitution regexes at once instead of piping each substitution's result to the next: sed 's/ need.* / need /gi' | sed 's/ .*meant.* / mean /gi' can become sed 's/ need.* / need /gi;s/ .*meant.* / mean /gi'. This may speed up IO.

For both Perl and bash/sed, IO can be improved by creating a ramdisk and placing input and output files there, if you intend to process them multiple times. Better still, if the files are created by other processes, you can write them straight into the ramdisk, process them, and then transfer them to more permanent storage. In Linux this is as easy as: mount -o size=2g -t tmpfs /mnt/ramdisk1

If all your files already live on a single physical hard disk, then parallelising their processing (which implies parallelising the IO) will show little improvement or, most likely, degradation. However, you may see some improvement by implementing a pipeline: one process copies files into the ramdisk sequentially while the other processes, in parallel, work on whatever files appear there. I assume that memory IO can be parallelised better than hard-disk IO (but I am a lot behind on what a modern OS and CPU can do, or it could be that MCE can work some magic with IO, so take this advice with a grain of salt).

Also, in a recent discussion here the issue came up that a threaded perl interpreter can be 10-30% slower than a non-threaded one. So if you do not need threads, that's a possible way to speed things up (check your perl's compilation flags with perl -V and look for useithreads=define).

This is an interesting problem to optimise because even small optimisations can lead to huge savings over your thousands to millions of files. So I would start by benchmarking a few options with, say, 20 files: sed, sed+ramdisk, perl+ramdisk, pipeline, etc. Then you will be more confident in where to place your programming efforts, or whether to invest in learning new skills like MCE. A rough timing harness is sketched below.
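For illustration only; the script names are placeholders for whichever variants you want to race against each other:

use strict;
use warnings;
use Time::HiRes 'time';

for my $cmd ('./re.sh', 'perl re1.pl', 'perl re2.pl') {  # placeholders
    my $t0 = time();
    system($cmd) == 0 or warn "'$cmd' exited non-zero\n";
    printf "%-15s %.2f sec\n", $cmd, time() - $t0;
}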

    bw, bliako

      In addition to the above good tips, there is a talk by Nicholas Clark (from 2005), "When Perl is not quite fast enough", that explores some other things you can do to make your code a bit faster.

But in general, the best optimizations come from improving the algorithm, and from first finding out exactly which parts are slow before you start optimizing.
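For example, Devel::NYTProf from CPAN (assuming it is installed) will show per-line timings:

$ perl -d:NYTProf re1.pl
$ nytprofhtml    # writes an HTML report of where the time went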

Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list
by Marshall (Canon) on Oct 05, 2022 at 03:31 UTC
    I ran your code on my Windows machine. Took 1 minute 34 seconds.
    My implementation shown below.
I don't think the huge BLOCKSIZE and read() gained you anything, because you immediately read all the data back out of memory only to create a very large array of lines, and then read each line again in a loop. A 128 MB buffer won't have much effect on the disk read time: the data is typically organized in 4 KB hunks, and on a physical drive there is often a mechanically induced delay after each "hunk" is read. I have a physical drive, and even with it the total read time for the whole 75 MB file, line by line, is << 1 sec. An SSD will of course be faster, but raw I/O speed doesn't appear to be the limit.

#!/usr/bin/perl
use strict;
use warnings;
use Time::Local;

my $out = 'out-perl.dat';
open my $OUT, '>', $out or die "unable to open $out !";
my $start;
my $finish;

foreach my $text_file (<*.txt>) {
    print STDOUT "working on file $text_file\n";
    $start = time();
    open(my $IN, '<', $text_file) or die "invalid file: $text_file !";
    # reading entire file line by line << 1 second overhead
    while (<$IN>) {
        tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
        s/w(as|ere)/be/gi;
        s/\sneed.*/ need /gi;
        s/\s.*meant.*/ mean /gi;
        s/\s.*work.*/ work /gi;
        s/\s.*read.*/ read /gi;
        s/\s.*allow.*/ allow /gi;
        s/\s.*gave.*/ give /gi;
        s/\s.*bought.*/ buy /gi;
        s/\s.*want.*/ want /gi;
        s/\s.*hear.*/ hear /gi;
        s/\s.*came.*/ come /gi;
        s/\s.*destr.*/ destroy /gi;
        s/\s.*paid.*/ pay /gi;
        s/\s.*selve.*/ self /gi;
        s/\s.*self.*/ self /gi;
        s/\s.*cities.*/ city /gi;
        s/\s.*fight.*/ fight /gi;
        s/\s.*creat.*/ create /gi;
        s/\s.*makin.*/ make /gi;
        s/\s.*includ.*/ include /gi;
        s/\s.*mean.*/ mean /gi;
        s/\stalk.*/ talk /gi;
        s/\sgoing / go /gi;
        s/\sgetting / get /gi;
        s/\sstart.*/ start /gi;
        s/\sgoes / go /gi;
        s/\sknew / know /gi;
        s/\strying / try /gi;
        s/\stried / try /gi;
        s/\stold / tell /gi;
        s/\scoming / come /gi;
        s/\ssaying / say /gi;
        s/\smen / man /gi;
        s/\swomen / woman /gi;
        s/\stook / take /gi;
        s/\stak.*/ take /gi;
        s/\slying / lie /gi;
        s/\sdying / die /gi;
        s/\smade /make /gi;
        s/\sused.*/ use /gi;
        s/\susing.*/ use /gi;
        print $OUT "$_";
    }
}
$finish = time();
my $total_seconds = $finish - $start;
my $minutes = int($total_seconds / 60);
my $seconds = $total_seconds - ($minutes * 60);
print "minutes: $minutes seconds: $seconds\n";

__END__
working on file nightfall.txt
minutes: 1 seconds: 34