Re: Large scale search and replace with perl -i

by antifun (Sexton) on Apr 14, 2003 at 19:23 UTC


in reply to Large scale search and replace with perl -i

First question: how many is a "large number"? If it's on the order of 10^4 or less, you will probably spend more time fiddling with a script than the job would take with a more "brute-force" approach. (Given a reasonably fast computer, yadda yadda yadda.)
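To make that concrete, here is a minimal sketch of such a brute-force pass (not from the original post; the s/foo/bar/ substitution and the *.html glob are assumed). It simply runs one perl per file via find, which is exactly the per-file overhead the rest of this node tries to avoid:

find . -name "*.html" -type f -exec perl -pi -e 's/foo/bar/g' {} \;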

As for the more theoretical question, you would certainly want to use the second approach (with find -exec grep -l foo) to reduce your working file set as much as possible.

Then your next issue is avoiding the overhead of running multiple perls. The -i switch relies on the magic of <>, which reads from the files named in @ARGV if there are command-line arguments, and from STDIN if there are not (paraphrasing slightly). However, what you need to do in this case is use both kinds of magic, so your perl will have to be a little more creative. It's harder to replicate the file-replacement shuffle that -i does than to read the file list from STDIN manually, so here's one way to try it:

find . -name "*.html" -type f -exec grep -l foo {} \; | perl -pi -e 'BEGIN{ @ARGV = <STDIN>; chomp @ARGV }; while (<>) { s/foo/bar/g; } continue { print }'

Notice that you can fiddle with @ARGV before the <> magic takes place. The internals of the script are basically what the -p option does.
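For reference, this is roughly the loop that -p wraps around the -e code (per the perlrun documentation), which is why the explicit while/continue above behaves the same way:

LINE: while (<>) {
    # the code given to -e runs here, with the current line in $_
} continue {
    print or die "-p destination: $!\n";
}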


---
"I hate it when I think myself into a corner."
Matt Mitchell

Replies are listed 'Best First'.
Re^2: Large scale search and replace with perl -i (don't grep(1))
by Aristotle (Chancellor) on Apr 14, 2003 at 20:52 UTC
    you would certainly want to use the second approach (with find -exec grep -l foo) to reduce your working file set as much as possible.

    You would certainly not, because you will have to open all the files anyway - even if just to check. The difference is that grepping for matches first makes you spawn one process per file, as well as open every matching file a second time (in Perl) to actually process it. You have a (large) net loss that way.

    Taking that out, and using the -print0 option to avoid some nasty surprises (though not all, unfortunately, due to the darn magic open), leaves us with the following. Note that I have removed the continue {} block, as it isn't necessary and just costs time. I'm also setting the record separator so that the diamond operator reads fixed-size blocks (64 kbytes in this example), rather than scanning for an end-of-line character.

    find . -name "*.html" -type f -print0 | \
    perl -i -p0e \
      'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = "\n" };
       while (<>) { s/foo/bar/g; print }'

    That should be about as efficient as it gets.

    If you have a lot of nonmatching files, you might save work by hooking a grep in there - but not with find's -exec. That's what xargs was invented for.

    find . -name "*.html" -type f -print0 | \
    xargs -r0 grep -lZ foo | \
    perl -i -p0e \
      'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = "\n" };
       while (<>) { s/foo/bar/g; print }'
    Update: changed $/ = \65536 to $/ = "\n" in the code above, as per runrig's observation below.

    Makeshifts last the longest.

      find . -name "*.html" -type f -print0 | perl -i -p0e \
        'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = \65536 };
         while (<>) { s/foo/bar/g; print }'
      You don't want to do that. If 'foo' spans across one of those read blocks, then you'll miss the substitution.
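      A tiny sketch of that failure mode (not in the original reply; a 4-byte record is used so the effect is easy to see): with a fixed-size $/, the pattern can be split across two reads and never match.

      perl -e '$/ = \4; open my $fh, "<", \"xxfooxx" or die $!;
               while (<$fh>) { print s/foo/bar/g ? "hit: $_\n" : "miss: $_\n" }'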
        Duh.. I can't believe I didn't think of that.

        Makeshifts last the longest.
