http://qs321.pair.com?node_id=67956

DeusVult has asked for the wisdom of the Perl Monks concerning the following question:

Ok, monks, here is an interesting little problem for you. Suppose you have a scalar, $string, which is a multi-line text string (slurped up a whole file into it). Suppose you also have a hash, let's call it %substitute_hash. The keys of this hash are strings which may or may not appear in $string, but you most certainly do not want them to appear there anymore. Handily enough, the values inside the hash are the things you would rather have in place of those keys.

So, if you had three guys named John, Jack, and Joe who just changed their names to Mike, Mark, and Moe, %substitute_hash might look like this:

%substitute_hash = ( John => 'Mike', Jack => 'Mark', Joe => 'Moe' );

Now the problem proper: I would like to write the following regex:

 $string =~ s/(anything which is a key in %substitute_hash)/$substitute_hash{the thing found in between the first two /'s}/;

The best thing I can think of is to do the cheap little hackish version:

foreach (keys %substitute_hash) { $string =~ s/$_/$substitute_hash{$_}/g; }

I'll do that as a last resort, but the application I need to do this for will involve doing this substitution on many, many files (several dozen, maybe 100+), and %substitute_hash will likely be very big (several hundred entries, maybe even a few thousand), so efficiency is really a factor.

Also, I'm not particularly married to this implementation, so if you can think of a really efficient way of doing this with some other data structure than a hash, that isn't a problem. Thanks in advance.

Some people drink from the fountain of knowledge, others just gargle.

Re: hashes in regexes
by japhy (Canon) on Mar 29, 2001 at 01:24 UTC
    Your method goes through the string N times, where N is the number of keys in your hash. Create the regex with qr//:
    $keys = join '|', map "\Q$_\E", keys %sub_hash;
    $keys_REx = qr/$keys/;
    $string =~ s/($keys_REx)/$sub_hash{$1}/g;


    japhy -- Perl and Regex Hacker
      I'm not sure about the application, but it might be advisable to at least put in some '\b's to prevent unfortunate collisions that would render some names as "Mikestone", or "Moeseph".

      Here is a slight modification of japhy's (or merlyn's, you decide) code:
      $keys = join '|', map "\\b\Q$_\E\\b", keys %sub_hash;
      $keys_REx = qr/$keys/;
      $string =~ s/($keys_REx)/$sub_hash{$1}/g;
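To make the collision concrete, here is a minimal sketch comparing the unanchored alternation with a `\b`-anchored one. The sample sentence and the variable names `$text`, `$loose`, and `$strict` are made up for illustration; the hash is the one from the question.

```perl
use strict;
use warnings;

my %sub_hash = (John => 'Mike', Jack => 'Mark', Joe => 'Moe');

# Hypothetical sample text: "Johnstone" and "Joey" contain keys as substrings.
my $text = "John Johnstone met Joe and Joey.";

# Escape each key and join them into one alternation.
my $alternation = join '|', map { quotemeta } keys %sub_hash;

# Unanchored: a key is replaced even when it is part of a longer word.
(my $loose = $text) =~ s/($alternation)/$sub_hash{$1}/g;

# Anchored with \b on both sides: only whole-word occurrences are replaced.
(my $strict = $text) =~ s/\b($alternation)\b/$sub_hash{$1}/g;

print "$loose\n";     # Mike Mikestone met Moe and Moey.
print "$strict\n";    # Mike Johnstone met Moe and Joey.
```

Note that `\b` only helps when the keys begin and end with word characters; keys with leading or trailing punctuation would need a different kind of anchor.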
Re: hashes in regexes
by merlyn (Sage) on Mar 29, 2001 at 01:26 UTC
    untested, but I often get this stuff right on the first try: {grin}
    my $regex = join "|", map quotemeta, keys %substitute_hash;
    $regex = qr/$regex/; # compile
    # ...
    $string =~ s/($regex)/$substitute_hash{$1}/g;

    -- Randal L. Schwartz, Perl hacker

      You might want to throw in a reverse sort there. It was not in the spec, but it may happen that someone will want to do one substitution on "foo" and another on "foobar". Since Perl's REs are NFAs rather than DFAs, the first alternative that matches wins, so this is not going to work unless foobar appears before foo in the RE.
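A minimal sketch of that ordering, using a made-up two-entry table in which one key is a prefix of the other. It sorts by descending length rather than using a plain reverse sort; both put "foobar" ahead of "foo", and the length sort is simply one way to make the intent explicit.

```perl
use strict;
use warnings;

# Hypothetical table in which one key is a prefix of another.
my %sub_hash = (foo => 'X', foobar => 'Y');

# Longest keys first, so "foobar" is tried before "foo" at each position.
my @keys = sort { length($b) <=> length($a) or $a cmp $b } keys %sub_hash;
my $alternation = join '|', map { quotemeta } @keys;
my $re = qr/($alternation)/;

(my $out = "foo foobar") =~ s/$re/$sub_hash{$1}/g;
print "$out\n";    # X Y
```

With the keys in the opposite order, "foo" would match first inside "foobar" and the result would be "X Xbar".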
Re: hashes in regexes
by bjelli (Pilgrim) on Mar 29, 2001 at 05:26 UTC

    As a lowly acolyte I have great respect for map and a fondness for good old foreach. So I naturally thought: isn't a loop with simple patterns faster than a pattern with | in it?

    Here's my benchmarking code:

    use Benchmark;

    timethese(500, {
        'or pattern'    => \&orpattern,
        'many patterns' => \&manypatterns,
    });

    sub init {
        %sub_hash = (John => 'Mike', Jack => 'Mark', Joe => 'Moe');
        $string = "Dear John!.\nI've run off with Jack and Joe.\nSue\n\n" x 1000;
    }

    sub orpattern {
        init;
        my $keys = join '|', map "\Q$_\E", keys %sub_hash;
        my $keys_REx = qr/$keys/;
        $string =~ s/($keys_REx)/$sub_hash{$1}/g;
        return $string;
    }

    sub manypatterns {
        init;
        my %patterns = map { qr/$_/ => $sub_hash{$_} } keys %sub_hash;
        foreach $pat (keys %patterns) {
            my $replace = $patterns{$pat};
            $string =~ s/$pat/$replace/g;
        }
        return $string;
    }

    And the results:

    Benchmark: timing 500 iterations of many patterns, or pattern...
    many patterns:  2 wallclock secs ( 1.61 usr +  0.00 sys =  1.61 CPU)
    or pattern: 12 wallclock secs (11.94 usr +  0.01 sys = 11.95 CPU)

    Of course there's a difference in functionality between the two, just think of:

    %substitute_hash = ( Jack => 'Chris', Chris => 'Jaquline' );
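A minimal sketch of that difference, assuming a made-up input string and a fixed key order for the loop (Jack first, then Chris). Real hash order is not predictable, so the looped result can vary from run to run.

```perl
use strict;
use warnings;

my %substitute_hash = (Jack => 'Chris', Chris => 'Jaquline');
my $text = "Jack and Chris";

# One pattern per key, applied in an assumed fixed order.
# The second pass also rewrites the "Chris" produced by the first pass.
my $looped = $text;
for my $key ('Jack', 'Chris') {
    $looped =~ s/\Q$key\E/$substitute_hash{$key}/g;
}

# Single pass with one alternation: replaced text is never rescanned.
my $alternation = join '|', map { quotemeta } keys %substitute_hash;
(my $single = $text) =~ s/($alternation)/$substitute_hash{$1}/g;

print "$looped\n";    # Jaquline and Jaquline
print "$single\n";    # Chris and Jaquline
```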
    --
    Brigitte    'I never met a chocolate I didnt like'    Jellinek
    http://www.horus.com/~bjelli/         http://perlwelt.horus.at
Re: hashes in regexes
by davorg (Chancellor) on Mar 29, 2001 at 13:25 UTC

    If you're doing a lot of this kind of thing then maybe you should consider one of the many templating systems available from the CPAN. I'm a particular fan of the Template Toolkit.

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

Re: hashes in regexes
by lachoy (Parson) on Mar 29, 2001 at 01:46 UTC

    Just wanted to say: I think it's pretty cool that both answers given are exactly the same (barring minor naming differences). Gave me a chuckle, anyway...

    Chris
    M-x auto-bs-mode

Re: hashes in regexes
by satchboost (Scribe) on Mar 29, 2001 at 01:51 UTC
    The question I'd be asking is why you're slurping the whole file in at one time and wanting to do one substitution. It's neat and all, but don't make things too complicated if you don't have to. While the following sounds inefficient, this actually works quite well, in practice:
    while (<IN_FILE>) {
        foreach my $key (keys %substitute_hash) {
            s/$key/$substitute_hash{$key}/g;
        }
    }

    Now, obviously, you'll need to assign the substituted string to something if you want to save it, and later substitutions can overwrite the results of earlier ones. That may or may not be a factor.
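One way to keep the line-by-line reading while saving the result is sketched below. It swaps the per-key inner loop for the single compiled alternation from the earlier replies, so a replacement is never rewritten by a later key. The in-memory filehandles and the names `$input` and `$output` are stand-ins for real files, assumed only for the demo.

```perl
use strict;
use warnings;

my %substitute_hash = (John => 'Mike', Jack => 'Mark', Joe => 'Moe');

# Build and compile the alternation once, outside the read loop.
my $alternation = join '|', map { quotemeta } keys %substitute_hash;
my $re = qr/($alternation)/;

# In-memory filehandles stand in for real input and output files.
my $input  = "Dear John!\nI've run off with Jack and Joe.\n";
my $output = '';
open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$output or die "cannot open output: $!";

# Substitute on each line and write the result out instead of discarding it.
while (my $line = <$in>) {
    $line =~ s/$re/$substitute_hash{$1}/g;
    print {$out} $line;
}
close $in;
close $out;

print $output;
```

Whether this is faster than the loop-per-key version depends on the data, as the benchmark in this thread suggests, so it is worth timing both on real input.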

    Now, if you want to do this real-time, you'll want to do it as above. But, I have a script that reads through some 100,000 lines of code in ~1450 files doing a given match and it runs in 10-30 seconds. (This is a compilation script, in case you're wondering.) I also have another script that does a number of matches and cross-correlations on those same files and it runs in about 2 minutes. Unless you really need to get faster than that, KISS.