Hello, monks. I've got a large number of text files (thousands to millions at a time, up to a couple of MB each) to which I need to apply a lot of substitutions (more than 150 to each). For quite some time I've been using a bash script which takes around 1 minute per 2000 files but, having used Perl in the distant past, I decided to rewrite the script in Perl, hoping it would improve things considerably. However, the standard (for my poor Perl skills at least) method of "open file; slurp it; loop through 150 substitutions" proved abysmally slower than bash/sed. Splitting the input down to 1 word at a time sped things up, but it was still 60-70% slower than bash. Combining the regexes into one large sequence (s/^[0-9].*\s//m|s/\S*?talk\S*\s/ talk /gi...) didn't help either, as the interpreter probably optimizes them anyway. So, the problem is twofold:

1. Speed. For context, most substitutions turn gerunds and past tenses of select verbs into infinitives, trim out numbers or convert plural to singular... nothing too fancy, no backreferences or grouping.

2. I need to change the regex list often and a long string as shown above is hard to maintain. Ideally, I want to use a here-doc to list my substitutions, but I can't find a way to tell Perl how to use the resulting string in both the match and substitution parts of s///. If all else fails, I can split the regex into match/sub pairs as a workaround but I'm pretty sure there's a more elegant way to do it.
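For reference, the match/sub-pair workaround mentioned in point 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: the `=>` separator, the `@rules` name and the three sample rules are made up here, and replacements are treated as literal strings (so no `$1`-style backreferences, which matches the "nothing too fancy" case in point 1). Each pattern is compiled once with qr// before any text is processed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical rule list: "pattern => replacement", one rule per line.
# The separator and the rules themselves are assumptions for illustration.
my @rules = map {
    my ($pat, $rep) = split /\s*=>\s*/, $_, 2;
    [ qr/$pat/i, defined $rep ? $rep : '' ]   # compile each pattern once
} grep { /\S/ } split /\n/, <<'RULES';
\b(?:talking|talked)\b => talk
\b(?:walking|walked)\b => walk
\bdogs\b => dog
RULES

# Stand-in for the contents of one slurped file.
my $text = "She talked while walking with 3 dogs";

# Apply every rule, in list order, to the whole string.
for my $rule (@rules) {
    my ($re, $rep) = @$rule;
    $text =~ s/$re/$rep/g;
}

print "$text\n";
```

The list stays easy to edit, but the flags (here /i and /g for every rule) are fixed in the code rather than per rule, which is one reason this feels like a workaround.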

I'd appreciate your wisdom on both matters. The snippet below, with the @regexlist here-doc, shows how I'd prefer #2 to be implemented. Thank you.

#!/usr/bin/perl
use strict;
use warnings;

my @text = split /\n/, << 'TEXT';
Regular expressions have the undeserved reputation
of being abstract and difficult to understand.
TEXT

my @regexlist = split /\n/, << 'REGEX';
s/a/A/g
s/i/I/g
s/e/E/g
REGEX

my $regex = join '|', @regexlist;

for (@text) {
    # apply $regex somehow, the fastest way possible
}
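One way the loop above might be filled in, as a hedged sketch and not a benchmarked answer: since each here-doc line is already a complete s/// statement, the lines can be joined with `;` (not `|`) and compiled once with string eval into a single anonymous sub. The names `$body` and `$apply` are assumptions introduced here. Because the list is executed as Perl code, this only makes sense when the here-doc is trusted, as it is when it lives in the script itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @text = split /\n/, <<'TEXT';
Regular expressions have the undeserved reputation
of being abstract and difficult to understand.
TEXT

my @regexlist = grep { /\S/ } split /\n/, <<'REGEX';
s/a/A/g
s/i/I/g
s/e/E/g
REGEX

# Turn the list into one block of statements and compile it a single time.
# for ($_[0]) aliases $_ to the caller's variable, so each s/// edits it in place.
my $body  = join ";\n", @regexlist;
my $apply = eval "sub { for (\$_[0]) { $body; } return }"
    or die "Could not compile substitution list: $@";

# Run the compiled sub over every line (or over a whole slurped file).
$apply->($_) for @text;

print "$_\n" for @text;
```

With this shape, every rule keeps its own flags (/g, /i, /m) exactly as written in the here-doc, and the regexes are compiled when the sub is built, not on each pass through the loop. Whether it beats sed on the real data would still need measuring.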

In reply to Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
