http://qs321.pair.com?node_id=455179

mdi has asked for the wisdom of the Perl Monks concerning the following question:

I need to do multiple substitutions in several large (1-10MB) files. I've been using this:
    use strict;
    use warnings;
    use Tie::File;

    foreach my $x (@ARGV) {
        tie my @f, 'Tie::File', $x or die "Could not tie $x: $!\n";
        for (@f) {
            s/^\|/\\N\|/;
            s/\|\s*$/\|\\N/;
            s/\|\s*\|/\|\\N\|/g;
            s/\|\.\s*\|/\|\\N\|/g;
            s/\|\s+/\|/g;
            s/\s+\|/\|/g;
            s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
            s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
        }
    }
...but this is taking entirely too long, and using up too much CPU. How can I do this more efficiently?

Replies are listed 'Best First'.
Re: Multiple substitutions in large files
by Joost (Canon) on May 09, 2005 at 13:48 UTC
Re: Multiple substitutions in large files
by dragonchild (Archbishop) on May 09, 2005 at 13:46 UTC
    #!/usr/bin/perl -p
    s/^\|/\\N\|/;
    s/\|\s*$/\|\\N/;
    s/\|\s*\|/\|\\N\|/g;
    s/\|\.\s*\|/\|\\N\|/g;
    s/\|\s+/\|/g;
    s/\s+\|/\|/g;
    s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
    s/(\d{5})-(?:\d{1,4}|\s+)/$1/;

    Execute as so:

    my_scriptydoo.pl file1 > file2

    Update: ikegami is absolutely correct. I should be doing a redirect. The next 1st level response provides the -pi version.


    • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
    • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
      Shouldn't that be -pi (or -pi.bak if a backup is desired)? With just -p, the usage would be my_scriptydoo.pl file1 > file1.new
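For readers who want the in-place variant spelled out, here is a minimal sketch of what -pi.bak amounts to, written as an ordinary script. The names fix_record and edit_in_place are made up for illustration. One assumption worth flagging: under plain -p each line still carries its newline, so the \s*$ in the second substitution can swallow it, whereas the Tie::File version (with its default autochomp, as far as I know) never sees the newline. The sketch therefore chomps first and prints the newline back; adding -l to the switches (-lpi.bak) should have the same effect.

```perl
use strict;
use warnings;

# Hypothetical helper: the substitutions from the question, applied to $_.
# Assumes $_ has already been chomped.
sub fix_record {
    s/^\|/\\N\|/;
    s/\|\s*$/\|\\N/;
    s/\|\s*\|/\|\\N\|/g;
    s/\|\.\s*\|/\|\\N\|/g;
    s/\|\s+/\|/g;
    s/\s+\|/\|/g;
    s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
    s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
    return;
}

# Hypothetical helper: rewrite each file in place, keeping a .bak copy.
# Setting $^I is what the -i.bak switch does behind the scenes.
sub edit_in_place {
    my @files = @_;
    local ($^I, @ARGV) = ('.bak', @files);
    while (<>) {
        chomp;            # keep \s*$ from eating the line ending
        fix_record();
        print "$_\n";     # goes to the rewritten file, not the terminal
    }
    return;
}

edit_in_place(@ARGV) if @ARGV;
```

Run as something like my_scriptydoo.pl file1 file2; each file is read once, line by line, and the originals are left behind as file1.bak and file2.bak.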
Re: Multiple substitutions in large files
by ikegami (Patriarch) on May 09, 2005 at 14:58 UTC

    a|b||d becomes a|b|\N|d
    |b|c|d becomes \N|b|c|d
    a|b|c| becomes a|b|c|\N
    and similarly,
    a|b|.|d becomes a|b|\N|d
    but
    .|b|c|d does not become \N|b|c|d
    a|b|c|. does not become a|b|c|\N
    Is that a bug?
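The cases above are easy to check for yourself. The following sketch wraps the substitutions from the original question in a sub (original_subs is just a name picked for this example) and runs the six sample records through it; it assumes each record has already had its newline removed.

```perl
use strict;
use warnings;

# The substitutions from the original question, applied to one chomped record.
sub original_subs {
    my ($line) = @_;
    for ($line) {                 # alias $_ to our private copy
        s/^\|/\\N\|/;
        s/\|\s*$/\|\\N/;
        s/\|\s*\|/\|\\N\|/g;
        s/\|\.\s*\|/\|\\N\|/g;
        s/\|\s+/\|/g;
        s/\s+\|/\|/g;
        s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
        s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
    }
    return $line;
}

# Print each sample record next to its transformed form.
for my $in ('a|b||d', '|b|c|d', 'a|b|c|', 'a|b|.|d', '.|b|c|d', 'a|b|c|.') {
    printf "%-8s => %s\n", $in, original_subs($in);
}
```

The last two records come out unchanged, which is the asymmetry being asked about.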

    If the above is a bug, the following regexps are probably faster:

    s/\s*\|\s*/\|/g;
    s/^\.?(?=\|)/\\N/;
    s/(?<=\|)\.?(?=\||$)/\\N/g;
    s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    s/(?<=\d{5})-(?:\d{1,4}|\s+)//;

    If the above is not a bug, the following regexps are probably faster:

    s/\s*\|\s*/\|/g;
    s/^(?=\|)/\\N/;
    s/(?<=\|)(?=\||$)/\\N/g;
    s/(?<=\|)\.(?=\|)/\\N/g;
    s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    s/(?<=\d{5})-(?:\d{1,4}|\s+)//;

    I reduced the number of regexps by combining a few, shortened them by removing the spaces first (not last), and used zero-width positive lookaheads and lookbehinds to minimize the text being captured and substituted.

    Use this in conjunction with the -p or -pi suggestion for better results.
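To see how the two regexp sets above differ, here is a small sketch that wraps each in a sub and feeds both the same records. The sub names dot_is_null and dot_between_pipes_only are invented for this example, and the records are assumed to be chomped already (under -p, adding -l takes care of that).

```perl
use strict;
use warnings;

# First set: a lone "." counts as an empty field anywhere in the record.
sub dot_is_null {
    my ($line) = @_;
    for ($line) {
        s/\s*\|\s*/\|/g;                   # strip spaces around every pipe first
        s/^\.?(?=\|)/\\N/;                 # empty (or ".") first field
        s/(?<=\|)\.?(?=\||$)/\\N/g;        # empty (or ".") later fields
        s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;  # drop fractional seconds
        s/(?<=\d{5})-(?:\d{1,4}|\s+)//;    # drop the ZIP suffix
    }
    return $line;
}

# Second set: "." is only treated as empty when it sits between two pipes.
sub dot_between_pipes_only {
    my ($line) = @_;
    for ($line) {
        s/\s*\|\s*/\|/g;
        s/^(?=\|)/\\N/;
        s/(?<=\|)(?=\||$)/\\N/g;
        s/(?<=\|)\.(?=\|)/\\N/g;
        s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
        s/(?<=\d{5})-(?:\d{1,4}|\s+)//;
    }
    return $line;
}

# Show both results side by side for a few sample records.
for my $in ('a | b |  | d', '.|b|c|d', 'a|b|c|.', '12:34:56.789|12345-6789') {
    printf "%-24s  %-20s  %s\n", $in, dot_is_null($in), dot_between_pipes_only($in);
}
```

The two subs should agree on every record except those that start or end with a lone ".", which only the first set turns into \N.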