hashes in regexes

DeusVult has asked for the wisdom of the Perl Monks concerning the following question:

Ok, monks, here is an interesting little problem for you. Suppose you have a scalar, $string, which is a multi-line text string (slurped up a whole file into it). Suppose you also have a hash, let's call it %substitute_hash. The keys of this hash are strings which may or may not appear in $string, but you most certainly do not want them to appear there anymore. Handily enough, the values inside the hash are the things you would rather have in place of those keys.

So, if you had three guys named John, Jack, and Joe who just changed their names to Mike, Mark, and Moe, %substitute_hash might look like this:

%substitute_hash = ( John => 'Mike', Jack => 'Mark', Joe => 'Moe' );

Now the problem proper: I would like to write the following regex:

$string =~ s/(anything which is a key in %substitute_hash)/$substitute_hash{the thing found in between the first two /'s}/;

The best thing I can think of is to do the cheap little hackish version:

foreach (keys %substitute_hash) {
  $string =~ s/$_/$substitute_hash{$_}/g;
}

I'll do that as a last resort, but the real application will run this substitution over many, many files (several dozen, maybe 100+), and %substitute_hash will likely be very big (several hundred entries, maybe even a few thousand), so efficiency really is a factor.

Also, I'm not particularly married to this implementation, so if you can think of a really efficient way of doing this with some other data structure than a hash, that isn't a problem. Thanks in advance.

Some people drink from the fountain of knowledge, others just gargle.


Re: hashes in regexes by japhy (Canon) on Mar 29, 2001 at 01:24 UTC
Your method goes through the string N times, where N is the number of keys in your hash. Instead, join the keys into a single alternation and compile it with `qr//`: `$keys = join '|', map "\Q$_\E", keys %sub_hash; $keys_REx = qr/$keys/; $string =~ s/($keys_REx)/$sub_hash{$1}/g;` `japhy` -- Perl and Regex Hacker
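A minimal, self-contained sketch of the single-pass approach described above. It assumes the three-name hash from the question; the sample text and the names `$alternation` and `$keys_re` are illustrative and not from the thread.

```perl
use strict;
use warnings;

# Hash from the question; the sample text is made up for illustration.
my %substitute_hash = ( John => 'Mike', Jack => 'Mark', Joe => 'Moe' );
my $string = "Dear John,\nJack and Joe say hello.\n";

# Escape each key so regex metacharacters are matched literally,
# then join the keys into one alternation and compile it once.
my $alternation = join '|', map { quotemeta($_) } keys %substitute_hash;
my $keys_re     = qr/$alternation/;

# One pass over the string: every match is looked up in the hash.
$string =~ s/($keys_re)/$substitute_hash{$1}/g;

print $string;
```

Because the pattern is compiled once, the same `$keys_re` can be reused for every file in a batch.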
Re^2: hashes in regexes by tadman (Prior) on Mar 29, 2001 at 04:12 UTC
I'm not sure about the application, but it might be advisable to at least put in some `\b`s to prevent unfortunate collisions that would render some names as "Mikestone" or "Moeseph". Here is a slight modification of japhy's (or merlyn's, you decide) code: `$keys = join '|', map "\\b\Q$_\E\\b", keys %sub_hash; $keys_REx = qr/$keys/; $string =~ s/($keys_REx)/$sub_hash{$1}/g;`
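A small sketch of the word-boundary idea, assuming every key begins and ends with a word character (otherwise `\b` will not anchor where you expect). This variant wraps the whole alternation in one non-capturing group rather than adding `\b` around each key; the sample sentence is made up for illustration.

```perl
use strict;
use warnings;

my %substitute_hash = ( Joe => 'Moe' );

# One \b on each side of a non-capturing group covers every alternative.
my $alternation = join '|', map { quotemeta($_) } keys %substitute_hash;
my $keys_re     = qr/\b(?:$alternation)\b/;

my $string = "Joe met Joey.";
$string =~ s/($keys_re)/$substitute_hash{$1}/g;

print "$string\n";   # "Joey" is left alone because "Joe" is not a whole word there
```

With this input the result is "Moe met Joey."; without the `\b` anchors it would be "Moe met Moey.".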
Re: hashes in regexes by merlyn (Sage) on Mar 29, 2001 at 01:26 UTC
untested, but I often get this stuff right on the first try: {grin} `my $regex = join "|", map quotemeta, keys %substitute_hash; $regex = qr/$regex/;` (this compiles the pattern once) ... `$string =~ s/($regex)/$substitute_hash{$1}/g;` -- Randal L. Schwartz, Perl hacker
Re (tilly) 2: hashes in regexes by tilly (Archbishop) on Mar 29, 2001 at 19:32 UTC
You might want to throw in a reverse sort there. It was not in the spec, but it may happen that someone will want to do one substitution on "foo" and another on "foobar". Since Perl's regex engine is a backtracking NFA rather than a DFA, an alternation takes the first alternative that matches rather than the longest one, so this is not going to work unless the RE sees foobar before foo.
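A sketch of the ordering issue, using the "foo" and "foobar" keys mentioned above with made-up replacement values. Sorting by descending length is a swapped-in variant of the reverse sort suggested here; either one puts a longer key ahead of any key that is its prefix.

```perl
use strict;
use warnings;

# Replacement values X and Y are placeholders for illustration.
my %substitute_hash = ( foo => 'X', foobar => 'Y' );

# Longest keys first, so "foobar" is tried before its prefix "foo".
my @keys = sort { length($b) <=> length($a) || $a cmp $b } keys %substitute_hash;
my $alternation = join '|', map { quotemeta($_) } @keys;
my $keys_re     = qr/$alternation/;

my $string = "foo foobar";
$string =~ s/($keys_re)/$substitute_hash{$1}/g;

print "$string\n";
```

With the keys ordered this way the result is "X Y". If "foo" came first in the alternation, "foobar" would become "Xbar" instead.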
Re: hashes in regexes by bjelli (Pilgrim) on Mar 29, 2001 at 05:26 UTC
As a lowly acolyte I have great respect for `map` and a fondness for good old `foreach`. So I naturally thought: isn't a loop with simple patterns faster than a pattern with `|` in it? Here's my benchmarking code:

use Benchmark;

timethese(500, {
  'or pattern'    => \&orpattern,
  'many patterns' => \&manypatterns,
});

sub init {
  %sub_hash = (John => 'Mike', Jack => 'Mark', Joe => 'Moe');
  $string = "Dear John!.\nI've run off with Jack and Joe.\nSue\n\n" x 1000;
}

sub orpattern {
  init();
  my $keys = join '|', map "\Q$_\E", keys %sub_hash;
  my $keys_REx = qr/$keys/;
  $string =~ s/($keys_REx)/$sub_hash{$1}/g;
  return $string;
}

sub manypatterns {
  init();
  my %patterns = map { qr/$_/ => $sub_hash{$_} } keys %sub_hash;
  foreach my $pat (keys %patterns) {
    my $replace = $patterns{$pat};
    $string =~ s/$pat/$replace/g;
  }
  return $string;
}

And the results:

Benchmark: timing 500 iterations of many patterns, or pattern...
many patterns:  2 wallclock secs ( 1.61 usr +  0.00 sys =  1.61 CPU)
   or pattern: 12 wallclock secs (11.94 usr +  0.01 sys = 11.95 CPU)

Of course there's a difference in functionality between the two, just think of: `%substitute_hash = ( Jack => 'Chris', Chris => 'Jaquline' );`

-- Brigitte 'I never met a chocolate I didnt like' Jellinek http://www.horus.com/~bjelli/ http://perlwelt.horus.at
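The functional difference mentioned above can be sketched as follows, using the Jack/Chris hash from this reply. The sample text and the variable names are illustrative, and the loop sorts its keys only to make the order deterministic (hash order is otherwise unspecified).

```perl
use strict;
use warnings;

my %substitute_hash = ( Jack => 'Chris', Chris => 'Jaquline' );
my $text = "Jack and Chris";

# Single pass: each original name is replaced exactly once.
my $alternation = join '|', map { quotemeta($_) } sort keys %substitute_hash;
(my $single_pass = $text) =~ s/($alternation)/$substitute_hash{$1}/g;

# Loop: a later pattern can rewrite the output of an earlier one.
# Processing order here is Jack first, then Chris.
my $looped = $text;
for my $key (reverse sort keys %substitute_hash) {
    $looped =~ s/\Q$key\E/$substitute_hash{$key}/g;
}

print "single pass: $single_pass\n";
print "looped:      $looped\n";
```

With these inputs the single pass yields "Chris and Jaquline", while the loop yields "Jaquline and Jaquline" because the freshly inserted "Chris" is substituted again.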
Re: hashes in regexes by davorg (Chancellor) on Mar 29, 2001 at 13:25 UTC
If you're doing a lot of this kind of thing then maybe you should consider one of the many templating systems available from the CPAN. I'm a particular fan of the Template Toolkit. -- <http://www.dave.org.uk> "Perl makes the fun jobs fun and the boring jobs bearable" - me
Re: hashes in regexes by lachoy (Parson) on Mar 29, 2001 at 01:46 UTC
Just wanted to say: I think it's pretty cool that both answers given are exactly the same (barring minor naming differences). Gave me a chuckle, anyway... Chris M-x auto-bs-mode
Re: Re: hashes in regexes by merlyn (Sage) on Mar 29, 2001 at 02:25 UTC
That's because this is an idiomatic way of doing it. We've all copied the same notes from the same authors. {grin} -- Randal L. Schwartz, Perl hacker
Re: hashes in regexes by satchboost (Scribe) on Mar 29, 2001 at 01:51 UTC
The question I'd be asking is why you're slurping the whole file in at one time and wanting to do one substitution. It's neat and all, but don't make things too complicated if you don't have to. While the following sounds inefficient, it actually works quite well in practice:

while (<IN_FILE>) {
  foreach my $key (keys %substitute_hash) {
    s/$key/$substitute_hash{$key}/g;
  }
}

Now, obviously, you'll need to assign the substituted line to something if you want to save it, and a later key can also overwrite the result of an earlier substitution. That may or may not be a factor. Now, if you want to do this real-time, you'll want to do it as above. But I have a script that reads through some 100,000 lines of code in ~1450 files doing a given match, and it runs in 10-30 seconds. (This is a compilation script, in case you're wondering.) I also have another script that does a number of matches and cross-correlations on those same files, and it runs in about 2 minutes. Unless you really need to get faster than that, KISS.
