PerlMonks  

Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list

by Marshall (Canon)
on Oct 05, 2022 at 03:31 UTC ( [id://11147250]=note )


in reply to Re^2: Need to speed up many regex substitutions and somehow make them a here-doc list
in thread Need to speed up many regex substitutions and somehow make them a here-doc list

I ran your code on my Windows machine. It took 1 minute 34 seconds.
My implementation is shown below.
I don't think that a huge BLOCKSIZE and using read() gained you anything, because you immediately read all the data back out of memory only to create a very large array of lines, and then read each line again in a loop. Having the 128 MB buffer won't have much effect on the time spent reading from the disk. The data is typically organized in 4 Kbyte hunks. On a physical drive, there will often be a mechanically induced delay after each "hunk" is read. I have a physical drive, and even with it, the total read time for the whole 75 MB file, line by line, is well under 1 second. An SSD will of course be faster, but raw I/O speed doesn't appear to be the limit.

#!/usr/bin/perl
use strict;
use warnings;
use Time::Local;

my $out='out-perl.dat';
open my $OUT, '>', $out or die "unable to open $out !";

# start the clock once, before the loop, so the total covers every file
my $start = time();
my $finish;

foreach my $text_file (<*.txt>)
{
    print STDOUT "working on file $text_file\n";
    open(my $IN, '<', $text_file) or die "invalid file: $text_file !";

    # reading entire file line by line << 1 second overhead
    while (<$IN>)
    {
        tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
        s/w(as|ere)/be/gi;
        s/\sneed.*/ need /gi;
        s/\s.*meant.*/ mean /gi;
        s/\s.*work.*/ work /gi;
        s/\s.*read.*/ read /gi;
        s/\s.*allow.*/ allow /gi;
        s/\s.*gave.*/ give /gi;
        s/\s.*bought.*/ buy /gi;
        s/\s.*want.*/ want /gi;
        s/\s.*hear.*/ hear /gi;
        s/\s.*came.*/ come /gi;
        s/\s.*destr.*/ destroy /gi;
        s/\s.*paid.*/ pay /gi;
        s/\s.*selve.*/ self /gi;
        s/\s.*self.*/ self /gi;
        s/\s.*cities.*/ city /gi;
        s/\s.*fight.*/ fight /gi;
        s/\s.*creat.*/ create /gi;
        s/\s.*makin.*/ make /gi;
        s/\s.*includ.*/ include /gi;
        s/\s.*mean.*/ mean /gi;
        s/\stalk.*/ talk /gi;
        s/\sgoing / go /gi;
        s/\sgetting / get /gi;
        s/\sstart.*/ start /gi;
        s/\sgoes / go /gi;
        s/\sknew / know /gi;
        s/\strying / try /gi;
        s/\stried / try /gi;
        s/\stold / tell /gi;
        s/\scoming / come /gi;
        s/\ssaying / say /gi;
        s/\smen / man /gi;
        s/\swomen / woman /gi;
        s/\stook / take /gi;
        s/\stak.*/ take /gi;
        s/\slying / lie /gi;
        s/\sdying / die /gi;
        s/\smade /make /gi;
        s/\sused.*/ use /gi;
        s/\susing.*/ use /gi;
        print $OUT "$_";
    }
}
$finish = time();

my $total_seconds = $finish-$start;
my $minutes = int ($total_seconds/60);
my $seconds = $total_seconds - ($minutes*60);
print "minutes: $minutes seconds: $seconds\n";
__END__
working on file nightfall.txt
minutes: 1 seconds: 34
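
If you want to check the claim that plain line-by-line reading costs well under a second, something like the following should do it. This is only a rough sketch: the sub name time_line_read is made up for illustration, and it uses Time::HiRes (a core module) because the built-in time() only has one-second resolution. It reads every line and does no other work, so whatever it reports is roughly the I/O plus line-splitting overhead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);   # core module; gives sub-second resolution

# Hypothetical helper: read a file line by line, doing no per-line work,
# and return the number of lines seen and the elapsed wall-clock seconds.
sub time_line_read {
    my ($path) = @_;
    open(my $in, '<', $path) or die "unable to open $path: $!";
    my $start = time();
    my $lines = 0;
    $lines++ while <$in>;   # same kind of read loop as above, minus the regexes
    close $in;
    return ($lines, time() - $start);
}

# Usage: perl read_timing.pl some_file.txt
if (@ARGV) {
    my ($lines, $elapsed) = time_line_read($ARGV[0]);
    printf "%d lines read in %.3f seconds\n", $lines, $elapsed;
}
```

The operating system may cache the file after the first run, so a second run can come out faster than the first. Either way, comparing this number against the total run time shows how much of the time goes to the substitutions rather than to reading.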