Re: Massive regexp search and replace

I assume that the source patterns are unique. That assumption comes from the fact that if they are not, you end up applying only the first one. If the assumption is correct, I suggest you parse the patterns into a hash instead of a list; this would remove some of the splits. Like this:

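# assume REGEX is the pattern filehandle
# assume INPUT is your input filehandle
my %regex = ();
while (<REGEX>) {
  chomp;
  my ($key, $value) = split(/\t/, $_);
  # wrap the replacement in quotes so /ee interpolates $1, \u and so on
  $value = "\"$value\"";
  $regex{$key} = $value;
}

while (<INPUT>) {
  # a statement-modifier foreach cannot declare "my $key", so use a block
  foreach my $key (keys %regex) {
    s/$key/$regex{$key}/gee;
  }
  # $_ now holds the rewritten line; print or store it here
}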

This could also let you test with 'exists()' whether there is a regex you want to use (depending on the input, e.g. if you change only a certain column within a CSV file or something). But since I don't know whether the input is suitable for this, I can't know if exists could be used. If it could, you might be able to drop the foreach loop over the patterns completely.
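A minimal sketch of the 'exists()' idea, assuming the input is comma-separated and only one column needs replacing. The `%simple` hash, the `replace_column` helper, and the sample row are hypothetical names made up for illustration, and the naive `split` does not handle quoted commas.

```perl
use strict;
use warnings;

# Hypothetical literal replacements, keyed by the exact field value.
my %simple = (
    Treecontrol => 'Tree Control',
    Tile        => 'Teilbild',
);

# Replace the field at index $col when its whole value is a key in %simple.
# One hash lookup per line replaces the loop over every pattern.
sub replace_column {
    my ($line, $col) = @_;
    my @fields = split /,/, $line, -1;    # naive split: no quoted commas
    $fields[$col] = $simple{ $fields[$col] }
        if defined $fields[$col] && exists $simple{ $fields[$col] };
    return join ',', @fields;
}

print replace_column('1,Tile,unchanged Tile', 1), "\n";
```

This only works when the whole field is the thing to replace; a pattern such as `[Tt]ile` still needs the regex loop.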


Re^2: Massive regexp search and replace by albert.llorens (Initiate) on Feb 10, 2005 at 13:31 UTC
Thanks, Hena. I will try what you suggest and see if it reduces processing time sufficiently. As for your assumptions, a sample replacement pattern list (REGEX) could be:

\b([a-z])([a-z]*)ung\b	\u$1\l$2ung
Treecontrol	Tree Control
[Tt]abreiter	Reiterelement
[Tt]ile	Teilbild

And a sample input text (INPUT) for the replacements could be: `Die Segnung ist gestern erfolgt. Die segnung ist gestern erfolgt. Die Rechnung wird geschickt. Die rechnung wird geschickt. Die Treecontrol. Die Tabreiter. Die tabreiter. Die Tile. Die tile.`

I wonder if this changes anything in what you suggest...
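To see how the /ee substitution behaves on these samples, here is a small self-contained sketch. It keeps the pairs in an array instead of a hash so that the order of application is fixed; that is a choice of this example, not something from the posts above, and `apply_pairs` is a made-up helper name.

```perl
use strict;
use warnings;

# Pattern/replacement pairs from the sample list, kept in list order.
my @pairs = (
    [ '\b([a-z])([a-z]*)ung\b', '\u$1\l$2ung'   ],
    [ 'Treecontrol',            'Tree Control'  ],
    [ '[Tt]abreiter',           'Reiterelement' ],
    [ '[Tt]ile',                'Teilbild'      ],
);

# Apply every pair to a copy of the text and return the result.
sub apply_pairs {
    my ($text) = @_;
    for my $pair (@pairs) {
        my ($pattern, $replacement) = @$pair;
        # Quote the replacement so the second eval of /ee treats it as a
        # double-quoted string and expands $1, $2, \u and \l.
        my $quoted = "\"$replacement\"";
        $text =~ s/$pattern/$quoted/gee;
    }
    return $text;
}

print apply_pairs("Die segnung ist gestern erfolgt. Die Tile."), "\n";
```

Because the replacement text is evaluated as Perl code, a replacement containing a double quote or an `@` would break this approach, so it is only suitable for pattern files you trust.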
Re^3: Massive regexp search and replace by Hena (Friar) on Feb 10, 2005 at 14:05 UTC
Well, all direct text translations might be handled faster... but unless there are a lot of them compared to the others, it probably won't help (it might actually be slower). The actual benefit should be tested, as this is pure speculation :). Basically, make two hashes instead of one. Something like this:

my (%simple, %regex);
while (<REGEX>) {
  chomp;
  my ($key, $value) = split(/\t/, $_);
  if ($key =~ /^\w+$/) {
    $simple{$key} = $value;         # literal word: plain hash lookup later
  } else {
    $regex{$key} = "\"$value\"";    # real pattern: quoted for /ee
  }
}

while (<INPUT>) {
  foreach my $key (keys %regex) {
    s/$key/$regex{$key}/gee;
  }
  my @line = ();
  foreach (split(/\s+/, $_)) {
    # a token with attached punctuation, such as "Tile.", will not match
    push(@line, exists($simple{$_}) ? $simple{$_} : $_);
  }
  print OUT "@line\n";
}

Note that in the given examples, you might write out the '`[Tt]ile`' pattern as separate Tile and tile rows, which would move it from the slower pattern group to the faster one. But as I said, I'm not sure how much this would help.
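One limitation of splitting on whitespace is that a word followed by punctuation ("Tile.") is never found in the hash. A possible variation, which is not from the posts above, is to do the lookup per `\w+` token inside a single substitution. The `%simple` contents and the `replace_words` name are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical literal rows, with '[Tt]ile' written out as two entries.
my %simple = (
    Treecontrol => 'Tree Control',
    Tile        => 'Teilbild',
    tile        => 'Teilbild',
);

# Look up each run of word characters in %simple; punctuation and spacing
# between the words are left exactly as they were.
sub replace_words {
    my ($line) = @_;
    $line =~ s/(\w+)/exists $simple{$1} ? $simple{$1} : $1/ge;
    return $line;
}

print replace_words("Die Treecontrol. Die Tile. Die tile."), "\n";
```

Whether this is faster than looping over the patterns would still need to be measured. Without Unicode handling enabled, `\w` may also not treat German umlauts as word characters.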
Re^3: Massive regexp search and replace by hsinclai (Deacon) on Feb 10, 2005 at 14:07 UTC
Expanding on Hena's idea, I wonder if it would be even more efficient to use Tie::File to go through the file, writing replacements as you go (untested):

use Tie::File;

my $inputfile = "samplein.txt";
replacer($inputfile);

sub replacer {
  my ($file) = @_;
  tie my @currentfile, 'Tie::File', $file or die "$!";
  # each element aliases a line of the file; changing it rewrites the file
  foreach my $inputline (@currentfile) {
    foreach my $key (keys %regex) {    # %regex built as in Hena's code
      $inputline =~ s/$key/$regex{$key}/gee;
    }
  }
  untie @currentfile;
}
## Totally untested

Seems like the write operation would be faster with Tie::File.

