Thank you all for your replies. Here are the scripts, data and other pertinent information, as requested. My Perl version is 5.36.0 on Linux, and the sample text file I used in the following "benchmarks" was created with:
wget http://www.astro.sunysb.edu/fwalter/AST389/TEXTS/Nightfall.htm
html2text-cpp Nightfall.htm >nightfall.txt
for i in {1..1000}; do cat nightfall.txt >>in.txt; done
Now, in.txt is 77MB large.
Bash/sed script:
#!/bin/bash
cat *.txt | \
tr -d '[:punct:]' | \
sed 's/[0-9]//g' | \
sed 's/w\(as\|ere\)/be/gi' | \
sed 's/ need.* / need /gi' | \
sed 's/ .*meant.* / mean /gi' | \
sed 's/ .*work.* / work /gi' | \
sed 's/ .*read.* / read /gi' | \
sed 's/ .*allow.* / allow /gi' | \
sed 's/ .*gave.* / give /gi' | \
sed 's/ .*bought.* / buy /gi' | \
sed 's/ .*want.* / want /gi' | \
sed 's/ .*hear.* / hear /gi' | \
sed 's/ .*came.* / come /gi' | \
sed 's/ .*destr.* / destroy /gi' | \
sed 's/ .*paid.* / pay /gi' | \
sed 's/ .*selve.* / self /gi' | \
sed 's/ .*self.* / self /gi' | \
sed 's/ .*cities.* / city /gi' | \
sed 's/ .*fight.* / fight /gi' | \
sed 's/ .*creat.* / create /gi' | \
sed 's/ .*makin.* / make /gi' | \
sed 's/ .*includ.* / include /gi' | \
sed 's/ .*mean.* / mean /gi' | \
sed 's/ talk.* / talk /gi' | \
sed 's/ going / go /gi' | \
sed 's/ getting / get /gi' | \
sed 's/ start.* / start /gi' | \
sed 's/ goes / go /gi' | \
sed 's/ knew / know /gi' | \
sed 's/ trying / try /gi' | \
sed 's/ tried / try /gi' | \
sed 's/ told / tell /gi' | \
sed 's/ coming / come /gi' | \
sed 's/ saying / say /gi' | \
sed 's/ men / man /gi' | \
sed 's/ women / woman /gi' | \
sed 's/ took / take /gi' | \
sed 's/ tak.* / take /gi' | \
sed 's/ lying / lie /gi' | \
sed 's/ dying / die /gi' | \
sed 's/ made /make /gi' | \
sed 's/ used.* / use /gi' | \
sed 's/ using.* / use /gi' \
>|out-sed.dat
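This script executes in around 5 seconds of wall-clock time (the pipeline stages run as separate processes in parallel, hence the much larger user time):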
% time ./re.sh
real 0m5,201s
user 0m43,394s
sys 0m1,302s
First Perl script, slurping the input file at once and processing it line by line:
#!/usr/bin/perl
use strict;
use warnings;
use 5.36.0;
# 128 MiB, larger than the 77MB in.txt, so a single read() takes the whole file
my $BLOCKSIZE = 1024 * 1024 * 128;
my $data;
my $IN;
my $out='out-perl.dat';
truncate $out, 0;
open my $OUT, '>>', $out or die "Cannot open $out: $!";
my @text = glob("*.txt");
foreach my $t (@text) {
    open($IN, '<', $t) or next;
    read($IN, $data, $BLOCKSIZE);
    close $IN;
    my @line = split /\n/, $data;
    foreach (@line) {
        s/[[:punct:]]/ /g;
        tr/0-9//d;    # tr takes a character list, not a bracketed class
        s/w(as|ere)/be/gi;
        s/\sneed.*/ need /gi;
        s/\s.*meant.*/ mean /gi;
        s/\s.*work.*/ work /gi;
        s/\s.*read.*/ read /gi;
        s/\s.*allow.*/ allow /gi;
        s/\s.*gave.*/ give /gi;
        s/\s.*bought.*/ buy /gi;
        s/\s.*want.*/ want /gi;
        s/\s.*hear.*/ hear /gi;
        s/\s.*came.*/ come /gi;
        s/\s.*destr.*/ destroy /gi;
        s/\s.*paid.*/ pay /gi;
        s/\s.*selve.*/ self /gi;
        s/\s.*self.*/ self /gi;
        s/\s.*cities.*/ city /gi;
        s/\s.*fight.*/ fight /gi;
        s/\s.*creat.*/ create /gi;
        s/\s.*makin.*/ make /gi;
        s/\s.*includ.*/ include /gi;
        s/\s.*mean.*/ mean /gi;
        s/\stalk.*/ talk /gi;
        s/\sgoing / go /gi;
        s/\sgetting / get /gi;
        s/\sstart.*/ start /gi;
        s/\sgoes / go /gi;
        s/\sknew / know /gi;
        s/\strying / try /gi;
        s/\stried / try /gi;
        s/\stold / tell /gi;
        s/\scoming / come /gi;
        s/\ssaying / say /gi;
        s/\smen / man /gi;
        s/\swomen / woman /gi;
        s/\stook / take /gi;
        s/\stak.*/ take /gi;
        s/\slying / lie /gi;
        s/\sdying / die /gi;
        s/\smade /make /gi;
        s/\sused.*/ use /gi;
        s/\susing.*/ use /gi;
        print $OUT "$_\n";
    }
}
Please ignore the technicality of failed matches before/after a newline, as this line-by-line implementation is uselessly slow anyway at over 4 minutes. The time to slurp the input and split it into lines is under 1 second.
% time ./re1.pl
real 4m1,655s
user 4m29,242s
sys 0m0,380s
If I split on /\s/ instead, the split itself takes about 5 seconds, but the substitutions take about 1 minute, i.e. roughly 12 times slower than bash/sed:
% time ./re2.pl
real 1m5,096s
user 1m11,889s
sys 0m0,524s
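For anyone who wants to reproduce the split-versus-substitution breakdown quoted above, here is a minimal sketch of how the two phases could be timed separately with the core Time::HiRes module. The helper timed() and the stand-in input string are my own assumptions for illustration; the real runs read in.txt and apply the full rule list.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run a code block, return (elapsed seconds, result list).
sub timed {
    my ($code) = @_;
    my $t0  = time();
    my @res = $code->();
    return (time() - $t0, @res);
}

# Stand-in input so the sketch is self-contained; the real runs slurp in.txt.
my $data = "He was going to work.\nThey were reading 42 books.\n" x 1000;

# Phase 1: split the slurped text on whitespace.
my ($t_split, @words) = timed(sub { split /\s+/, $data });

# Phase 2: apply (a few of) the substitutions to every word in place.
my ($t_subst) = timed(sub {
    for (@words) {
        s/[[:punct:]]//g;
        tr/0-9//d;
        s/w(as|ere)/be/gi;
    }
    return;
});

printf STDERR "split: %.3fs  substitutions: %.3fs  words: %d\n",
    $t_split, $t_subst, scalar @words;
```

Timing the phases inside the script avoids attributing interpreter start-up and file I/O to the regex engine, which the shell's time cannot separate.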
Final test: I created 1000 copies of nightfall.txt (77KB) with
% for i in {1..1000}; do cp nightfall.txt nightfall-$i.txt; done
All scripts took roughly the same amount of time to complete. So, it would seem that my initial estimate of "60-70% slower Perl" was very optimistic, as the full scripts perform other tasks too, where Perl's operators and conditionals obviously blow Bash's out of the water.
For the record, I do all file read/write operations on tmpfs (ramdisk), so disk I/O isn't an issue. I'll implement AnomalousMonk's solution with hash lookup and report back soonest.
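To make the hash-lookup idea concrete, here is a rough sketch of what it might look like. This is not AnomalousMonk's actual code, only my assumption of the general shape: the fixed-word rules from the scripts above go into a hash, and each word costs one lookup instead of a run of regex attempts. The name lemmatize_line is hypothetical, and the prefix rules (start.*, tak.* and so on) are deliberately left out, since they would still need a regex or a separate pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Exact-word rules taken from the scripts above (lowercase keys).
my %lemma = (
    going  => 'go',  getting => 'get', goes  => 'go',    knew   => 'know',
    trying => 'try', tried   => 'try', told  => 'tell',  coming => 'come',
    saying => 'say', men     => 'man', women => 'woman', took   => 'take',
    lying  => 'lie', dying   => 'die', made  => 'make',
);

# Hypothetical helper: normalise one line word by word.
sub lemmatize_line {
    my ($line) = @_;
    $line =~ s/[[:punct:]]//g;    # mirrors tr -d '[:punct:]' in the sed script
    $line =~ tr/0-9//d;           # drop digits
    # One hash lookup per word; unknown words pass through unchanged.
    return join ' ', map { $lemma{ lc $_ } // $_ } split ' ', $line;
}

print lemmatize_line("He told the men: going, going, gone in 1941!"), "\n";
# prints "He tell the man go go gone in"
```

Note that joining on a single space collapses runs of whitespace, which the sed pipeline does not do.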
An idea that just occurred to me is that, when matching against the individual words from the split, most regexes can apparently end the iteration early (with next;) once they match, as no further matches are expected from the rules below. Still, I'd like to exhaust all possibilities before admitting defeat.
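The early-exit idea could be sketched roughly as follows, assuming the intent of each rule is to match a single word. The rule table is a small subset of the patterns above, rewritten by me for per-word matching (so the leading \s and trailing space are gone), and lemmatize_word is a hypothetical name. "First match wins" is a change in behaviour compared with running every substitution in sequence, so the order of the rules matters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ordered rules as [pattern, replacement]; precompiled once with qr//.
my @rules = (
    [ qr/meant/i,   'mean'  ],
    [ qr/work/i,    'work'  ],
    [ qr/read/i,    'read'  ],
    [ qr/^talk/i,   'talk'  ],
    [ qr/^start/i,  'start' ],
    [ qr/^going$/i, 'go'    ],
);

# Hypothetical helper: return the replacement of the first matching rule.
sub lemmatize_word {
    my ($word) = @_;
    for my $rule (@rules) {
        my ($re, $lemma) = @$rule;
        return $lemma if $word =~ $re;    # stop at the first match
    }
    return $word;                         # no rule matched
}

print join(' ', map { lemmatize_word($_) }
    split ' ', 'she started reading while going home'), "\n";
# prints "she start read while go home"
```

Words that match nothing still pay for every rule, so the gain depends on how often a rule fires; putting the most frequent rules first should help.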