PerlMonks  

Need to speed up many regex substitutions and somehow make them a here-doc list

by xnous (Sexton)
on Oct 01, 2022 at 20:22 UTC ( [id://11147200] )

xnous has asked for the wisdom of the Perl Monks concerning the following question:

Hello, monks. I've got a large number of text files (thousands to millions at a time, up to a couple of MB each) to which I need to make a lot of substitutions (more than 150 each). For quite some time I've been using a bash script which takes around 1 minute / 2000 files but, having used Perl in the long past, I decided to rewrite the script in Perl, hoping it would improve things considerably. However, using the standard (for my poor Perl skills at least) method of "open file; slurp it; loop through 150 substitutions" proved abysmally slower than bash/sed. Splitting the input down to 1 word at a time sped things up, but it was still 60-70% slower than bash. Combining the regexes into one large sequence (s/^[0-9].*\s//m|s/\S*?talk\S*\s/ talk /gi...) didn't help either, as the interpreter probably optimizes them anyway. So, the problem is twofold:

1. Speed. For context, most substitutions turn gerunds and past tenses of select verbs into infinitives, trim out numbers or convert plural to singular... nothing too fancy, no backreferences or grouping.

2. I need to change the regex list often and a long string as shown above is hard to maintain. Ideally, I want to use a here-doc to list my substitutions, but I can't find a way to tell Perl how to use the resulting string in both the match and substitution parts of s///. If all else fails, I can split the regex into match/sub pairs as a workaround but I'm pretty sure there's a more elegant way to do it.

I'd appreciate your wisdom on these matters; the snippet below shows how I'd prefer #2 to be implemented. Thank you.

#!/usr/bin/perl
use strict;
use warnings;

my @text = split /\n/, <<'TEXT';
Regular expressions have the undeserved reputation of being abstract and difficult to understand.
TEXT

my @regexlist = split /\n/, <<'REGEX';
s/a/A/g
s/i/I/g
s/e/E/g
REGEX

my $regex = join '|', @regexlist;

while (<@text>) {
    # apply $regex somehow, the fastest way possible
}
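
For reference, one possible shape for #2 (a minimal sketch added for illustration, not from the thread): keep the substitutions as plain s/// lines in the here-doc and string-eval each line once into an anonymous sub; every compiled sub then edits $_ in place, so the list stays easy to maintain while the compilation cost is paid only once.

#!/usr/bin/perl
use strict;
use warnings;

my @text = split /\n/, <<'TEXT';
Regular expressions have the undeserved reputation of being abstract and difficult to understand.
TEXT

# compile each "s///" line from the here-doc into a code ref, once
my @subs = map { eval "sub { $_ }" or die "bad substitution '$_': $@" }
           grep { /\S/ }
           split /\n/, <<'REGEX';
s/a/A/g
s/i/I/g
s/e/E/g
REGEX

for (@text) {                                  # $_ is aliased to each line
    foreach my $apply (@subs) { $apply->() }   # each compiled s/// edits $_
    print "$_\n";
}

The substitutions could also be kept as separate match/replacement pairs and combined into one alternation, as several replies below demonstrate.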

Re: Need to speed up many regex substitutions and somehow make them a here-doc list
by AnomalousMonk (Archbishop) on Oct 02, 2022 at 00:12 UTC

    The dynamic alternation building technique described here and exemplified here is good for handling multiple stringA -> stringB replacements, but for an application such as yours, with multiple regex-to-string replacements, it's not a general solution. Here's something that may address your needs more closely. As always, the fine details of regex definition are critical. I still have no idea as to relative speed :)

Win8 Strawberry 5.8.9.5 (32)   Sat 10/01/2022 19:44:07
C:\@Work\Perl\monks
>perl
use strict;
use warnings;

use Data::Dump qw(dd);  # for debug

use constant IGNORE => qr{ \A \s* (?: [#] .*)? \z }xms;

my $text = <<'TEXT';
Regexes have been tAlKed about as being abstract for
1000's of years.
TEXT

print "before ---$text--- \n";

my $regex_replacement_string = <<'REGEX';
a 'A'
i 'I'
# comment line w/o leading spaces
e 'E' # optional entry comment
^ [0-9] .*? \s ''  # optional comment
  # comment line with leading spaces
(?i) \S*? talk \S* \s ' SPOKEN '
REGEX

my ($rx_search, @replacelist) = build_search($regex_replacement_string);
dd $rx_search;    # for debug
dd \@replacelist; # for debug
print "\n";       # for debug

$text =~ s{ $rx_search }{$replacelist[$^R]}xmsg;

print "after +++$text+++ \n";

sub build_search {
    my ($rx_replace_string, ) = @_;

    my $rx_sq_body        = qr{ [^\\']* (?: \\. [^\\']*)* }xms;
    my $rx_comment_to_eol = qr{ [#] .* }xms;

    my @regexes;
    my @replacements;

    use re 'eval';

    my @regexlist = split qr{ \s* \n }xms, $rx_replace_string;

    REGEX_REPLACEMENT:
    for my $rx_replace (@regexlist) {
        next REGEX_REPLACEMENT if $rx_replace =~ IGNORE;

        my $got_rx_replace = my ($rx, $replace) = $rx_replace =~ m{
            \A \s*
            (.*?) \s*               # everything before '-pair is regex
            ' ($rx_sq_body) ' \s*   # capture body of '-pair
            $rx_comment_to_eol?     # ignore optional trailing comment
            \s* \z
        }xms;
        die "bad regex/replacement '$rx_replace'" unless $got_rx_replace;

        my $n = @regexes;
        $rx = "$rx (?{ $n })";
        push @regexes, qr{ $rx }xms;
        push @replacements, $replace;
    } # end for REGEX_REPLACEMENT

    my ($rx_combined) = map qr{ $_ }xms, join ' | ', @regexes;

    return $rx_combined, @replacements;
} # end sub build_search()
^Z
before ---Regexes have been tAlKed about as being abstract for
1000's of years.
---
qr/ (?msx-i: a (?{ 0 }) ) | (?msx-i: i (?{ 1 }) ) | (?msx-i: e (?{ 2 }) ) | (?msx-i: ^ [0-9] .*? \s (?{ 3 }) ) | (?msx-i: (?i) \S*? talk \S* \s (?{ 4 }) ) /msx
["A", "I", "E", "", " SPOKEN "]
after +++REgExEs hAvE bEEn SPOKEN About As bEIng AbstrAct for
of yEArs.
+++


    Give a man a fish:  <%-{-{-{-<

Re: Need to speed up many regex substitutions and somehow make them a here-doc list
by hv (Prior) on Oct 01, 2022 at 21:48 UTC

    While there are optimizations that are generally applicable, the greatest speed-ups generally come by taking advantage of specifics of an individual use-case. So it would be useful to have a more completely representative sample of the breadth of patterns you're likely to use.

    A likely aspect of any solution is, as shown by AnomalousMonk's code, to match many patterns and then use a hash lookup to find the string to substitute.

    It seems likely that making it easy to modify the set of substitutions may involve having a separate pre-processing step to combine some or all of them into larger regexps - either internally, as in AnomalousMonk's code, or as a separate program that writes out perl code for the combination.

    Combining the regexes into one large sequence (s/^[0-9].*\s//m|s/\S*?talk\S*\s/ talk /gi...) didn't help either ...

    This is just doing multiple, separate substitutions. The win will come from combining them into a single pattern (or failing that, into a smaller number of patterns).

Re: Need to speed up many regex substitutions and somehow make them a here-doc list
by kcott (Archbishop) on Oct 02, 2022 at 05:36 UTC

    G'day xnous,

    Welcome to the Monastery.

    Please follow the advice already given by ++hippo. In addition to representative input data, show the expected output data. Also, tell us what Perl version you're using: features in more recent Perls may improve performance.

    Here's some very general advice:

    • Have a read of "perlperf: Perl Performance and Optimization Techniques". Your basic steps should generally be: get the code working; profile to find areas of poor performance; benchmark to see if improvements are really improvements (see the benchmark sketch after this list) — you'll probably need multiple iterations of some, or all, of these steps.
    • Reading all records into an array, then reading them all again from that array, is effectively doing the work twice. Instead, read the records from the file and process them immediately. Making a backup of the original file first is advisable.
    • The performance of regexes can often be improved if they include anchors; e.g. ^, $, \b, \b{}, and so on — see "perlrebackslash: Perl Regular Expression Backslash Sequences and Escapes" for details.
    • Use Regexp::Debugger to test individual regex patterns. I use this module a lot and often find places where I can make improvements; for instance, by eliminating backtracking.
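
    As an illustration of the benchmark step, here is a minimal Benchmark sketch (added for illustration, not from this post; the word list and sample text are invented) comparing many separate per-word s/// passes against a single combined alternation pass:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my %map    = ( knew => 'know', told => 'tell', took => 'take', men => 'man' );
my $alt    = join '|', map quotemeta, reverse sort keys %map;
my $sample = "the men took what they knew and told no one " x 2_000;

cmpthese(-2, {
    separate => sub {                       # one s/// pass per word
        my $t = $sample;
        $t =~ s/\b\Q$_\E\b/$map{$_}/g for keys %map;
    },
    combined => sub {                       # one pass with an alternation + hash lookup
        my $t = $sample;
        $t =~ s/\b($alt)\b/$map{$1}/g;
    },
});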

    When you provide the requested code and data, more specific advice will likely be possible.

    — Ken

      Thank you all for your replies, here are the scripts, data and other pertinent information as requested. My Perl version is 5.36.0 on Linux and the sample text file I used in the following "benchmarks" was created with:
wget http://www.astro.sunysb.edu/fwalter/AST389/TEXTS/Nightfall.htm
html2text-cpp Nightfall.htm >nightfall.txt
for i in {1..1000}; do cat nightfall.txt >>in.txt; done
      Now, in.txt is 77MB large. Bash/sed script:
#!/bin/bash
cat *.txt | \
tr -d '[:punct:]' | \
sed 's/[0-9]//g' | \
sed 's/w\(as\|ere\)/be/gi' | \
sed 's/ need.* / need /gi' | \
sed 's/ .*meant.* / mean /gi' | \
sed 's/ .*work.* / work /gi' | \
sed 's/ .*read.* / read /gi' | \
sed 's/ .*allow.* / allow /gi' | \
sed 's/ .*gave.* / give /gi' | \
sed 's/ .*bought.* / buy /gi' | \
sed 's/ .*want.* / want /gi' | \
sed 's/ .*hear.* / hear /gi' | \
sed 's/ .*came.* / come /gi' | \
sed 's/ .*destr.* / destroy /gi' | \
sed 's/ .*paid.* / pay /gi' | \
sed 's/ .*selve.* / self /gi' | \
sed 's/ .*self.* / self /gi' | \
sed 's/ .*cities.* / city /gi' | \
sed 's/ .*fight.* / fight /gi' | \
sed 's/ .*creat.* / create /gi' | \
sed 's/ .*makin.* / make /gi' | \
sed 's/ .*includ.* / include /gi' | \
sed 's/ .*mean.* / mean /gi' | \
sed 's/ talk.* / talk /gi' | \
sed 's/ going / go /gi' | \
sed 's/ getting / get /gi' | \
sed 's/ start.* / start /gi' | \
sed 's/ goes / go /gi' | \
sed 's/ knew / know /gi' | \
sed 's/ trying / try /gi' | \
sed 's/ tried / try /gi' | \
sed 's/ told / tell /gi' | \
sed 's/ coming / come /gi' | \
sed 's/ saying / say /gi' | \
sed 's/ men / man /gi' | \
sed 's/ women / woman /gi' | \
sed 's/ took / take /gi' | \
sed 's/ tak.* / take /gi' | \
sed 's/ lying / lie /gi' | \
sed 's/ dying / die /gi' | \
sed 's/ made /make /gi' | \
sed 's/ used.* / use /gi' | \
sed 's/ using.* / use /gi' \
>|out-sed.dat
      This script executes in around 5 seconds:
% time ./re.sh
real 0m5,201s
user 0m43,394s
sys  0m1,302s
      First Perl script, slurping input file at once and processing line-by-line:
#!/usr/bin/perl
use strict;
use warnings;
use 5.36.0;

my $BLOCKSIZE = 1024 * 1024 * 128;
my $data;
my $IN;
my $out = 'out-perl.dat';
truncate $out, 0;
open my $OUT, '>>', $out;

my @text = glob("*.txt");
foreach my $t (@text) {
    open($IN, '<', $t) or next;
    read($IN, $data, $BLOCKSIZE);
    my @line = split /\n/, $data;
    foreach (@line) {
        s/[[:punct:]]/ /g;
        tr/[0-9]//d;
        s/w(as|ere)/be/gi;
        s/\sneed.*/ need /gi;
        s/\s.*meant.*/ mean /gi;
        s/\s.*work.*/ work /gi;
        s/\s.*read.*/ read /gi;
        s/\s.*allow.*/ allow /gi;
        s/\s.*gave.*/ give /gi;
        s/\s.*bought.*/ buy /gi;
        s/\s.*want.*/ want /gi;
        s/\s.*hear.*/ hear /gi;
        s/\s.*came.*/ come /gi;
        s/\s.*destr.*/ destroy /gi;
        s/\s.*paid.*/ pay /gi;
        s/\s.*selve.*/ self /gi;
        s/\s.*self.*/ self /gi;
        s/\s.*cities.*/ city /gi;
        s/\s.*fight.*/ fight /gi;
        s/\s.*creat.*/ create /gi;
        s/\s.*makin.*/ make /gi;
        s/\s.*includ.*/ include /gi;
        s/\s.*mean.*/ mean /gi;
        s/\stalk.*/ talk /gi;
        s/\sgoing / go /gi;
        s/\sgetting / get /gi;
        s/\sstart.*/ start /gi;
        s/\sgoes / go /gi;
        s/\sknew / know /gi;
        s/\strying / try /gi;
        s/\stried / try /gi;
        s/\stold / tell /gi;
        s/\scoming / come /gi;
        s/\ssaying / say /gi;
        s/\smen / man /gi;
        s/\swomen / woman /gi;
        s/\stook / take /gi;
        s/\stak.*/ take /gi;
        s/\slying / lie /gi;
        s/\sdying / die /gi;
        s/\smade /make /gi;
        s/\sused.*/ use /gi;
        s/\susing.*/ use /gi;
        close $IN;
        print $OUT "$_\n";
    }
}
      Please ignore the technicality of failed matches before/after a newline, as this line-by-line implementation is uselessly slow anyway at over 4 minutes. Time to slurp the input and split it into lines is < 1 second.
% time ./re1.pl
real 4m1,655s
user 4m29,242s
sys  0m0,380s
      If I split by /\s/ instead, the split takes 5 seconds, but the substitutions take 1 minute, i.e. 12 times slower than bash/sed:
% time ./re2.pl
real 1m5,096s
user 1m11,889s
sys  0m0,524s
      Final test, I created 1000 copies of nightfall.txt (77KB) with % for i in {1..1000}; do cp nightfall.txt nightfall-$i.txt; done. All scripts took roughly the same amount of time to complete. So, it would seem that my initial estimation of "60-70% slower Perl" was very optimistic, as the full scripts perform other tasks too, where Perl's operators and conditionals obviously blow Bash's out of the water.

      For the record, I do all file read/write operations on tmpfs (ramdisk), so disk I/O isn't an issue. I'll implement AnomalousMonk's solution with hash lookup and report back soonest.

      An idea that just occurred to me is that, when matching on word splits, most regexes can apparently terminate the loop early (with next;), as no further matches are expected for that word. Still, I'd like to exhaust all possibilities before admitting defeat.

        Thanks for providing all this, that gives us a lot more to work on. I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

$ echo Washington werewolves are wasteful | perl -pe 's/w(as|ere)/be/gi;'
behington bewolves are beteful
$ echo No work was carried out on Thursday as that as a day of rest | perl -pe 's/\s.*work.*/ work /gi;'
No work
$ echo Did you swallow all that bacon | perl -pe 's/\s.*allow.*/ allow /gi;'
Did allow
$

        As there's no point optimising code which doesn't do what you want, it would be good to clear this sort of thing up first.


        🦛

        Thanks for the additional detail.

        Is it the intention that each of these substitutions replaces one word with another word? Because the use of .* in many of the patterns means that's not what is actually happening. For example it looks like the intention is to replace the text "one two coworker three four" with the text "one two work three four", but it will actually be replaced with "one work " because the pattern \s.*work.* will match from the first space to the end of the line.

        Assuming that the intention is to replace one word with another word, that could look something like this:

# substitute whole word only
my %w1 = qw{
    going go getting get goes go knew know trying try tried try
    told tell coming come saying say men man women woman took take
    lying lie dying die made make
};
# substitute on prefix
my %w2 = qw{
    need need talk talk tak take used use using use
};
# substitute on substring
my %w3 = qw{
    mean mean work work read read allow allow gave give bought buy
    want want hear hear came come destr destroy paid pay selve self
    cities city fight fight creat create makin make includ include
};
my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;
my $re3 = qr{\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;

# then in the loop
s/[[:punct:]]/ /g;
tr/[0-9]//d;
s/w(as|ere)/be/gi;
s{$re1}{ $w1{lc $1} }g;
s{$re2}{ $w2{lc $1} }g;
s{$re3}{ $w3{lc $1} }g;
print $OUT "$_\n";

        If the input is always ASCII, the initial cleanup for punctuation and digits could potentially be something like s/[^a-z ]/ /gi or equivalently tr/a-zA-Z / /cs, unless you specifically wanted to replace "ABC123D" with the single word "ABCD" rather than the two words "ABC D". However if it may be Unicode, you would instead need something like s/[^\w ]/ /g, with no tr equivalent.

        The standalone substitution for w(as|ere) should probably be two additional entries in one of the existing hashes: currently this substitution is unique in replacing a substring with another substring, so for example it will change "showered" into "shobed".
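
        For illustration (a small example added here, not hv's code), the difference between the bare substitution and a whole-word version:

use strict;
use warnings;

my $s = "I showered and it was cold";
(my $bad  = $s) =~ s/w(as|ere)/be/gi;        # "I shobed and it be cold"
(my $good = $s) =~ s/\b(was|were)\b/be/gi;   # "I showered and it be cold"
print "$bad\n$good\n";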

        It will also help a bit to move the close $IN out of the loop (though it doesn't actually seem to cause a noticeable slowdown).

        The above code runs for me about five times faster than your example perl code, though as described it behaves quite differently.

        sed can take several substitution expressions at once instead of piping each substitution's result to the next: sed 's/ need.* / need /gi' | sed 's/ .*meant.* / mean /gi' can become sed 's/ need.* / need /gi;s/ .*meant.* / mean /gi'. This may speed up IO.

        For both Perl and bash/sed: their IO can be improved by creating a ramdisk and placing input and output files there if you intend to process them multiple times. Better still, if the files are created by other processes, you can create them straight into the ramdisk, process them, and then transfer them to more permanent storage. In Linux this is as easy as: mount -o size=2g -t tmpfs /mnt/ramdisk1

        If all your files already live on just one physical hard disk, then parallelising their processing (which implies parallelising the IO) will show little improvement or, most likely, degradation. However, you can see some improvement by implementing a pipeline: one process copies files into the ramdisk sequentially while the other processes, in parallel, work on any files found there. I assume that memory IO can be parallelised better than hard-disk IO (but I am well behind on what a modern OS and CPU can do, and it could be that MCE can work some magic with IO, so take this advice with a grain of salt).

        Also, in a recent discussion here, the issue came up that a threaded perl interpreter can be 10-30% slower than a non-threaded one. So, if you do not need threads, that's a possible way to speed things up (check your perl interpreter's compilation flags with perl -V and look for useithreads=define).

        This is an interesting problem to optimise because even small optimisations can lead to huge savings over your thousands to millions of files. So, I would start by benchmarking a few options with, say, 20 files: sed, sed+ramdisk, perl+ramdisk, pipeline, etc. Then you will be more confident in where to place your programming efforts, or whether you can invest in learning new skills like MCE.

        bw, bliako

        I ran your code on my Windows machine. Took 1 minute 34 seconds.
        My implementation is shown below.
        I don't think that a huge BLOCKSIZE and using read() gained you anything, because you immediately read all the data back out of memory, only to create a very large array of lines, and then read each line again in a loop. Having the 128MB buffer won't have much effect on the reading time of the disk. The data is typically organized in 4Kbyte hunks. On a physical drive, there will often be a mechanically induced delay after each "hunk" is read. I have a physical drive and even with it, total read time for the whole 75 MB file line by line is << 1 sec. SSD will of course be faster, but raw I/O speed doesn't appear to be the limit.

#!/usr/bin/perl
use strict;
use warnings;
use Time::Local;

my $out = 'out-perl.dat';
open my $OUT, '>', $out or die "unable to open $out !";
my $start;
my $finish;

foreach my $text_file (<*.txt>) {
    print STDOUT "working on file $text_file\n";
    $start = time();
    open(my $IN, '<', $text_file) or die "invalid file: $text_file !";
    # reading entire file line by line << 1 second overhead
    while (<$IN>) {
        tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
        s/w(as|ere)/be/gi;
        s/\sneed.*/ need /gi;
        s/\s.*meant.*/ mean /gi;
        s/\s.*work.*/ work /gi;
        s/\s.*read.*/ read /gi;
        s/\s.*allow.*/ allow /gi;
        s/\s.*gave.*/ give /gi;
        s/\s.*bought.*/ buy /gi;
        s/\s.*want.*/ want /gi;
        s/\s.*hear.*/ hear /gi;
        s/\s.*came.*/ come /gi;
        s/\s.*destr.*/ destroy /gi;
        s/\s.*paid.*/ pay /gi;
        s/\s.*selve.*/ self /gi;
        s/\s.*self.*/ self /gi;
        s/\s.*cities.*/ city /gi;
        s/\s.*fight.*/ fight /gi;
        s/\s.*creat.*/ create /gi;
        s/\s.*makin.*/ make /gi;
        s/\s.*includ.*/ include /gi;
        s/\s.*mean.*/ mean /gi;
        s/\stalk.*/ talk /gi;
        s/\sgoing / go /gi;
        s/\sgetting / get /gi;
        s/\sstart.*/ start /gi;
        s/\sgoes / go /gi;
        s/\sknew / know /gi;
        s/\strying / try /gi;
        s/\stried / try /gi;
        s/\stold / tell /gi;
        s/\scoming / come /gi;
        s/\ssaying / say /gi;
        s/\smen / man /gi;
        s/\swomen / woman /gi;
        s/\stook / take /gi;
        s/\stak.*/ take /gi;
        s/\slying / lie /gi;
        s/\sdying / die /gi;
        s/\smade /make /gi;
        s/\sused.*/ use /gi;
        s/\susing.*/ use /gi;
        print $OUT "$_";
    }
}
$finish = time();
my $total_seconds = $finish - $start;
my $minutes = int($total_seconds / 60);
my $seconds = $total_seconds - ($minutes * 60);
print "minutes: $minutes seconds: $seconds\n";

__END__
working on file nightfall.txt
minutes: 1 seconds: 34
Re: Need to speed up many regex substitutions and somehow make them a here-doc list
by AnomalousMonk (Archbishop) on Oct 01, 2022 at 21:22 UTC

    See haukex's article Building Regex Alternations Dynamically:

Win8 Strawberry 5.8.9.5 (32)   Sat 10/01/2022 17:18:27
C:\@Work\Perl\monks
>perl
use strict;
use warnings;

use Data::Dump qw(dd);  # for debug

my $text = <<'TEXT';
Regular expressions have the undeserved reputation of being abstract and difficult to understand.
TEXT

print "before ---$text--- \n";

my @regexlist = split /\n/, <<'REGEX';
a A
i I
e E
REGEX

my %replace = map split, @regexlist;
# dd \%replace;  # for debug

my ($rx_search) = map qr{ $_ }xms,
                  join ' | ',
                  map quotemeta,
                  reverse sort keys %replace
                  ;
# dd $rx_search;  # for debug

$text =~ s{ ($rx_search) }{$replace{$1}}xmsg;

print "after +++$text+++ \n";
^Z
before ---Regular expressions have the undeserved reputation of being abstract and difficult to understand.
---
after +++REgulAr ExprEssIons hAvE thE undEsErvEd rEputAtIon of bEIng AbstrAct And dIffIcult to undErstAnd.
+++

    Update: This approach assumes each text file can be slurped to memory; 2-100 MB should be no problem. It also assumes the number of substitutions is "reasonable"; 150-1000 should be no problem. Care must be exercised in building the $rx_search regex if it is more complex than shown in the example; see haukex's article for tips on this. I have no idea how fast this approach is versus the one you're using now. Good luck :)


    Give a man a fish:  <%-{-{-{-<

      > This approach assumes each text file can be slurped to memory; 2-100 MB should be no problem

      The OP could slice the input into big chunks separated at newline boundaries.

      If that's not possible, he could alternatively use a sliding window which always continues at the pos() where the last replacement ended.
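
      A minimal sketch of the chunked-read idea (added here for illustration, not LanX's code), assuming a filehandle opened on one input file: read a block, then extend it to the next newline so no line is split across two chunks.

use strict;
use warnings;

open my $IN, '<', 'in.txt' or die "in.txt: $!";   # hypothetical input file

my $CHUNK = 8 * 1024 * 1024;   # ~8 MB per read
while (read($IN, my $buf, $CHUNK)) {
    # extend the chunk to the next newline boundary
    $buf .= <$IN> // '' unless substr($buf, -1) eq "\n";
    # ... apply the combined substitution pass to $buf here ...
}
close $IN;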

      On a side note, your map qr{...} join ... irritated me a bit, because the processed list has only one element. Not sure if that's the clearest style.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        ... your map qr{...} join ... irritated me a bit, because the processed list has only one element.

        Yeah, that gets to me a bit too, whenever I use it. But that syntax is used in haukex's original article, so I'm willing to consider it an "idiom." :)

        The important point is that the regex elements be somehow converted into a regex object. It's at this stage that any necessary boundary assertions are added. The only reasonable alternative I can see is something like

my $rx_search = join ' | ',
                map quotemeta,
                reverse sort keys %replace
                ;
$rx_search = qr{ ... $rx_search ... }xms;
        That's slightly more irritating to me and doesn't seem to clarify anything either.


        Give a man a fish:  <%-{-{-{-<

Re: Need to speed up many regex substitutions and somehow make them a here-doc list
by hippo (Bishop) on Oct 01, 2022 at 21:37 UTC
    60-70% slower than bash

    That sounds sub-optimal. Please show: your bash script, your Perl script and a representative set of data (does not have to be large). Without all three, anything else would be guess work.


    🦛

      (Deleted, replied to wrong level)

        Regardless of the performance problems, you may be interested in using a proper stemmer to create a search index. See Lingua::Stem.

Re: Need to speed up many regex substitutions and somehow make them a here-doc list
by davies (Prior) on Oct 01, 2022 at 21:32 UTC

    Have a look at the discussion following BrowserUK's advice to me at Re: Combining regexes. The gist is that for single-character substitutions, tr is faster than a regex.
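
    As a quick way to check that claim on your own data, here is a minimal Benchmark sketch (added for illustration; the sample string is invented) comparing tr/// against an equivalent character-class s/// for stripping digits:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $sample = "abc123 def456 " x 5_000;

cmpthese(-2, {
    'tr'   => sub { my $t = $sample; $t =~ tr/0-9//d },   # transliteration delete
    's///' => sub { my $t = $sample; $t =~ s/[0-9]//g },  # regex substitution
});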

    Regards,

    John Davies

Re: Need to speed up many regex substitutions and somehow make them a here-doc list
by LanX (Saint) on Oct 02, 2022 at 00:33 UTC
    I wouldn't be surprised if sed was faster than Perl with trivial regexes, because all those fancy features come with complexity and slow the engine down.

    But one of those features is Trie optimization° of "or" alternations with |.

    Hence a one-pass solution with a hash lookup, as demonstrated by AnomalousMonk, could be far faster in Perl than in sed (does sed even allow hash lookups?).

    But this also depends on details only you know, such as whether you expect a snippet to be processed multiple times.

    Chains like "hidings > hiding > hide"˛ wouldn't be possible with this one-pass approach; you'd need to repeat it until it no longer replaces anything.
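
    A minimal sketch of that repeat-until-stable idea (added for illustration, not from this post); it assumes no replacement maps a word back to something the pattern still matches, otherwise the loop would never terminate:

use strict;
use warnings;

my %replace   = ( hidings => 'hiding', hiding => 'hide' );
my $rx_search = join '|', map quotemeta, reverse sort keys %replace;

my $text = "the hidings";
1 while $text =~ s/\b($rx_search)\b/$replace{$1}/g;   # repeat until no more matches
print "$text\n";   # the hide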

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    updates

    °) see how-does-perls-regex-implementation-makes-use-of-tries#57484188

    ˛) i.e. multiple reductions: plural first, then gerund

Re: Need to speed up many regex substitutions and somehow make them a here-doc list
by Fletch (Bishop) on Oct 03, 2022 at 03:54 UTC

    Other than the optimisations already suggested, you might also find speedups using Parallel::ForkManager or MCE::Loop. Any individual chunk or group of lines taking "a bit longer" could be mitigated by being able (more easily than from the shell) to run against n of them in parallel across the entire file.
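
    For instance, a minimal Parallel::ForkManager sketch (added for illustration; process_file() is a hypothetical stand-in for the substitution code, and the worker count of 8 is arbitrary):

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

sub process_file {
    my ($file) = @_;
    # ... run the substitutions on $file here ...
}

my $pm = Parallel::ForkManager->new(8);   # up to 8 children at once

FILE:
for my $file (glob '*.txt') {
    $pm->start and next FILE;   # parent: spawn a child, move to the next file
    process_file($file);        # child: do the work
    $pm->finish;                # child exits
}
$pm->wait_all_children;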

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Need to speed up many regex substitutions and somehow make them a here-doc list (MCE solution)
by marioroy (Prior) on Oct 05, 2022 at 22:10 UTC

    Running multiple sed commands enables parallelism, as can be seen in the output. Notice the user time is greater than the real time.

% time ./re.sh
real 0m5,201s
user 0m43,394s
sys  0m1,302s

    The following is a demonstration for processing a huge file (e.g. > 700 MB) using MCE, as that is the case the OP gave to tackle. I wrote this to keep MCE overhead to a minimum. For example, workers write directly to the output handle in an orderly fashion rather than passing the result to the manager process.

#!/usr/bin/env perl
# https://www.perlmonks.org/?node_id=11147200

use strict;
use warnings;

use MCE;

die "usage: $0 infile1.txt [ infile2.txt ... ]\n" unless @ARGV;

my $OUT_FH;  # output file-handle used by workers

# Spawn worker pool.
my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => '64K',
    init_relay  => 0,  # specifying init_relay loads MCE::Relay
    use_slurpio => 1,  # enable slurpio
    user_begin  => sub {
        # worker begin routine per each file to be processed
        my ($outfile) = @{ MCE->user_args() };
        open $OUT_FH, '>>', $outfile;
    },
    user_end    => sub {
        # worker end routine per each file to be processed
        close $OUT_FH if defined $OUT_FH;
    },
    user_func   => sub {
        # worker chunk routine
        my ($mce, $chunk_ref, $chunk_id) = @_;
        process_chunk($chunk_ref);
    }
)->spawn;

    The above spawns a pool of workers. Let's process a file.

# first, truncate output file
{ open my $fh, '>', "out-sed.dat" or die "$!\n"; }

$mce->process("in.txt", { user_args => [ "out-sed.dat" ] });
$mce->shutdown;

    Or replace the prior 4 lines with the following to process a list of files, re-using worker pool.

# Process file(s).
my $status = 0;

while (my $infile = shift @ARGV) {
    if (-d $infile) {
        warn "WARN: '$infile': Is a directory, skipped\n";
        $status = 1;
    }
    elsif (! -f $infile) {
        warn "WARN: '$infile': No such file, skipped\n";
        $status = 1;
    }
    else {
        my $outfile = $infile;
        $outfile =~ s/\.txt$/.dat/;
        if ($outfile eq $infile) {
            warn "WARN: '$outfile': matches input name, skipped\n";
            $status = 1;
            next;
        }
        # truncate output file
        open my $fh, '>', $outfile or do {
            warn "WARN: '$outfile': $!, skipped\n";
            $status = 1;
            next;
        };
        close $fh;
        # process file; pass argument(s) to workers
        $mce->process($infile, { user_args => [ $outfile ] });
    }
}

$mce->shutdown;  # reap workers
exit $status;

    Next is the function in which workers process the chunk line by line. Since we specified use_slurpio, $chunk_ref is a scalar reference. This appends the result per each line to the $output scalar. Finally, upon exiting the loop, workers output to the file handle serially and orderly.

# Worker function.
sub process_chunk {
    my ($chunk_ref) = @_;
    my $output = '';

    open my $fh, '<', $chunk_ref;
    while (<$fh>) {
        s/[[:punct:]]//g;
        s/[0-9]//g;
        s/w(as|ere)/be/gi;
        ...

        # append to output var
        $output .= $_;
    }
    close $fh;

    # Output orderly and serially.
    MCE->relay_lock;
    print $OUT_FH $output;
    $OUT_FH->flush;
    MCE->relay_unlock;
}

    Another way is to process the chunk all at once, omitting the while loop.

# Worker function.
sub process_chunk {
    my ($chunk_ref) = @_;

    $$chunk_ref =~ s/[[:punct:]]//g;
    $$chunk_ref =~ s/[0-9]//g;
    $$chunk_ref =~ s/w(as|ere)/be/gi;
    ...

    # Output orderly and serially.
    MCE->relay_lock;
    print $OUT_FH $$chunk_ref;
    $OUT_FH->flush;
    MCE->relay_unlock;
}

      Sometimes, I like to know "really" what the overhead is for MCE. So, here is something to measure the chunking nature of MCE. Simply comment out user_begin, user_end, and the process routine. That's it.

#!/usr/bin/env perl

use strict;
use warnings;

use MCE;
use Time::HiRes 'time';

die "usage: $0 infile1.txt\n" unless @ARGV;

my $OUT_FH;  # output file-handle used by workers

# Spawn worker pool.
my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => '64K',
    init_relay  => 0,  # specifying init_relay loads MCE::Relay
    use_slurpio => 1,  # enable slurpio
#   user_begin  => sub {
#       # worker begin routine per each file to be processed
#       my ($outfile) = @{ MCE->user_args() };
#       open $OUT_FH, '>>', $outfile;
#   },
#   user_end    => sub {
#       # worker end routine per each file to be processed
#       close $OUT_FH if defined $OUT_FH;
#   },
    user_func   => sub {
        # worker chunk routine
        my ($mce, $chunk_ref, $chunk_id) = @_;
        # process_chunk($chunk_ref);
    }
)->spawn;

my $start = time;
$mce->process($ARGV[0]);
printf "%0.3f seconds\n", time - $start;

      I have a big file which is 767 MB. The overhead is a fraction of a second.

$ ls -lh big.txt
-rw-r--r-- 1 mario mario 767M Oct 5 10:07 big.txt
$ perl demo.pl big.txt
0.154 seconds

      Edit: That was from OS level cache as I had read the file from prior testing.

        The OP mentioned a large number of text files (thousands to millions at a time, up to a couple of MB each). I think that parallelization is better broken down at the file level. Basically, create a list of input files and chunk the list instead. Since the list may range from thousands to millions, go with chunk_size 1 or 2.

        Notice that workers are spawned early, before creating a large array. Create the array and pass the array reference to MCE to not make an extra copy. This is how to tackle a big job, keeping overhead low. And then, fasten your seat belt and enjoy parallelization in top or htop.

use strict;
use warnings;

use MCE;
use Time::HiRes 'time';

sub process_file {
    my ($file) = @_;
}

my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => 2,
    user_func   => sub {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        process_file($_) for @{ $chunk_ref };
    }
)->spawn;

my @file_list = (1 .. 1_000_000);  # simulate a list of 1 million files

my $start = time;
$mce->process(\@file_list);
printf "%0.3f seconds\n", time - $start;

$mce->shutdown;  # reap workers

        Let's find out the IPC overhead. I wonder myself.

chunk_size  1    3.773 seconds    1 million chunks
chunk_size  2    1.930 seconds    500 thousand chunks
chunk_size 10    0.423 seconds    100 thousand chunks
chunk_size 20    0.234 seconds    50 thousand chunks

        It is mind-boggling nonetheless: just a fraction of a second for 50 thousand chunks. Moreover, 2 seconds will not be felt when processing 500 thousand files, nor will 4 seconds when handling 1 million files.

      I ran with the following for comparing with the sed output. See complementary post for a version based on Marshall's implementation.

# Worker function.
sub process_chunk {
    my ($chunk_ref) = @_;

    $$chunk_ref =~ s/[[:punct:]]//g;
    $$chunk_ref =~ s/[0-9]//g;
    $$chunk_ref =~ s/w(as|ere)/be/gi;
    $$chunk_ref =~ s/ need.* / need /gi;
    $$chunk_ref =~ s/ .*meant.* / mean /gi;
    $$chunk_ref =~ s/ .*work.* / work /gi;
    $$chunk_ref =~ s/ .*read.* / read /gi;
    $$chunk_ref =~ s/ .*allow.* / allow /gi;
    $$chunk_ref =~ s/ .*gave.* / give /gi;
    $$chunk_ref =~ s/ .*bought.* / buy /gi;
    $$chunk_ref =~ s/ .*want.* / want /gi;
    $$chunk_ref =~ s/ .*hear.* / hear /gi;
    $$chunk_ref =~ s/ .*came.* / come /gi;
    $$chunk_ref =~ s/ .*destr.* / destroy /gi;
    $$chunk_ref =~ s/ .*paid.* / pay /gi;
    $$chunk_ref =~ s/ .*selve.* / self /gi;
    $$chunk_ref =~ s/ .*self.* / self /gi;
    $$chunk_ref =~ s/ .*cities.* / city /gi;
    $$chunk_ref =~ s/ .*fight.* / fight /gi;
    $$chunk_ref =~ s/ .*creat.* / create /gi;
    $$chunk_ref =~ s/ .*makin.* / make /gi;
    $$chunk_ref =~ s/ .*includ.* / include /gi;
    $$chunk_ref =~ s/ .*mean.* / mean /gi;
    $$chunk_ref =~ s/ talk.* / talk /gi;
    $$chunk_ref =~ s/ going / go /gi;
    $$chunk_ref =~ s/ getting / get /gi;
    $$chunk_ref =~ s/ start.* / start /gi;
    $$chunk_ref =~ s/ goes / go /gi;
    $$chunk_ref =~ s/ knew / know /gi;
    $$chunk_ref =~ s/ trying / try /gi;
    $$chunk_ref =~ s/ tried / try /gi;
    $$chunk_ref =~ s/ told / tell /gi;
    $$chunk_ref =~ s/ coming / come /gi;
    $$chunk_ref =~ s/ saying / say /gi;
    $$chunk_ref =~ s/ men / man /gi;
    $$chunk_ref =~ s/ women / woman /gi;
    $$chunk_ref =~ s/ took / take /gi;
    $$chunk_ref =~ s/ tak.* / take /gi;
    $$chunk_ref =~ s/ lying / lie /gi;
    $$chunk_ref =~ s/ dying / die /gi;
    $$chunk_ref =~ s/ made /make /gi;
    $$chunk_ref =~ s/ used.* / use /gi;
    $$chunk_ref =~ s/ using.* / use /gi;

    # Output orderly and serially.
    MCE->relay_lock;
    print $OUT_FH $$chunk_ref;
    $OUT_FH->flush;
    MCE->relay_unlock;
}
Re: Need to speed up many regex substitutions and somehow make them a here-doc list
by xnous (Sexton) on Nov 14, 2022 at 22:04 UTC
    Hello, again. I apologize for the long absence, real life and such.

    I extensively tested both Marshall's fork implementation and marioroy's MCE "entire-chunk" implementation and, depending on the test case, the results are very variable and very interesting.

    Test case 1, a few thousand files of <1MB each: fork.pl->2.2", mce.pl->9.4". The MCE one couldn't saturate my 4C/8T CPU; load was at around 60%.

    Test case 2, the above files concatenated into a 30MB one: fork.pl->3.2", mce.pl->1.8". Here, the fork one left the CPU mostly idle.

    Test case 3, 200 files sized 1.4MB each: fork.pl->15.2", mce.pl->25.7".

    In general, MCE can't keep the CPU 100% busy, even with max_workers set at 2X the thread count. Also, chunk_size can make a big difference depending on the size of the files processed. For large single-file input, MCE simply dominates; for lots of small files, the fork implementation is the undisputed winner.

    Most importantly, you people proved that Perl, in the right hands of course, can outperform almost everything and I'd like to thank every monk who shared his wisdom with me in this thread. My OP question has been fully answered.
