http://qs321.pair.com?node_id=132177

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a file (called in.txt) that is formatted with field delimiters and text like this: BEGTITLE blah,blah...blah ENDTITLE. The title field is followed by other fields delimited the same way. I have been assigned the glorious task of replacing specific words or useless metacharacters only between the title tags, leaving all other instances untouched. I have tried various methods with zero luck (this will probably not be surprising to an experienced Perl programmer!). Here is a sample of my code and INFILE:
Sample in.txt

    BEGPUB Wirey Haired Dog's Life ENDPUB
    BEGTITLE My dog has wirey hair ENDTITLE
    BEGTXT My wirey hair dog...blah, blah. ENDTXT

My Script

    #!/usr/bin/perl -w
    open(INFILE,$ARGV[0]) or die "INFILE CROAKED";
    open(OUTFILE,">$ARGV[1]) or die "OUTFILE CROAKED";
    While(<INFILE>){
        while(/BEGTITLE.*wirey.*?ENDTITLE/){
            $_=~s/wirey/smooth/i;
            print OUTFILE $_;}
    }
    close INFILE;
    close OUTFILE;

I want to replace wirey with smooth in just the title field (which may span multiple lines). My example will replace the word wirey in the other fields as well. I have no basis to believe that my code is anywhere near being correct or useful. I do not even know why I thought this could lead anywhere good. Any ideas or suggestions will be much appreciated. Thanks in advance, Stephen

Re: Regex: find/Replace words between tags
by TomK32 (Monk) on Dec 15, 2001 at 07:49 UTC
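    1) It's while, not While (Perl is case-sensitive).
    2) The replacement is wrong: s/wirey/smooth/i is not confined to the title, so it changes the first wirey on the line, wherever that is.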

        #!/usr/bin/perl -w
        while(<DATA>){
            # use $1 and $3 in the replacement; \1 and \3 draw a warning under -w
            $_=~s/^(BEGTITLE.*) (wirey) (.*ENDTITLE)$/$1 smooth $3/i;
            print $_;
        }
        __DATA__
        BEGPUB Wirey Haired Dogs Life ENDPUB
        BEGTITLE My dog has wirey hair ENDTITLE
        BEGTXT My wirey hair dog...blah, blah. ENDTXT

    And don't drop the spaces around the groups in the pattern; the replacement puts them back.

    -- package Lizard::King; sub can { do { 'anything'} };
Re: Regex: find/Replace words between tags (non greedy re )
by mandog (Curate) on Dec 15, 2001 at 14:44 UTC
    This will probably do it.

    btw, You might want your die msgs to include the $! variable to see why your file ops failed.

    The .*? was the tricky part. (Non greedy matching.)

        #!/usr/bin/perl -w
        use strict;

        open(IN,$ARGV[0]) or die ("$! : couldn't open $ARGV[0] for reading\n");

        # slurp whole file into $text
        local $/ =undef;
        my $text=<IN>;
        close(IN) or die "$! : couldn't close $ARGV[0]\n";

        # assume no nesting of tags
        $text=~s/(BEGTITLE.*?)wirey(.*?ENDTITLE)/$1smooth$2/gsi;

        open(OUT,'>',$ARGV[1]) or die ("$! : couldn't open $ARGV[1] for writing");
        print OUT $text;
        close(OUT) or die "$! couldn't close $ARGV[1]\n";


    email: mandog

      While the above *may* be sufficient for the problem at hand, it is not a general solution. 1) it will only replace a single occurrence of the target pattern in a record, and 2) if there may be multiple title records in a given file, it can easily match across records (non-greedy matching does *not* prevent this) causing changes in non-target records, and/or missing changes in valid target records. Witness:

          #!/usr/bin/perl -w
          use strict;
          $/ = undef;
          my $text = <DATA>;
          $text =~ s/(BEGTITLE.*?)wirey(.*?ENDTITLE)/$1smooth$2/gsi;
          print $text;
          __DATA__
          BEGTITLE The wirey life of a wirey haired dog ENDTITLE
          BEGTXT blah blah blah ENDTXT
          BEGTITLE Grooming dogs ENDTITLE
          BEGTXT Grooming a wirey haired dog is ... ENDTXT
          BEGTITLE Last wirey haired dog story ENDTITLE
          BEGTXT we don't need no steenkin' wirey haired dogs here ENDTXT

      Output is:

          BEGTITLE The smooth life of a wirey haired dog ENDTITLE
          BEGTXT blah blah blah ENDTXT
          BEGTITLE Grooming dogs ENDTITLE
          BEGTXT Grooming a smooth haired dog is ... ENDTXT
          BEGTITLE Last wirey haired dog story ENDTITLE
          BEGTXT we don't need no steenkin' wirey haired dogs here ENDTXT

      Notice, we only changed the first 'wirey' in the first title, inadvertently changed 'wirey' in the second text section, and missed the occurrence of 'wirey' in the third title. (because the second successful match started at the second BEGTITLE and went to the third ENDTITLE, incorporating the entire second BEGTXT record).
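      To see the span of such a match directly, here is a minimal sketch. match_spans is a made-up helper name used only for illustration; it collects the text covered by each match of the single-regex approach. The sample is cut down to two title records with a text record between them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# match_spans() is a hypothetical helper, not part of the code in this thread.
# It returns the full text covered by each match of the single-regex approach.
sub match_spans {
    my ($text) = @_;
    my @spans;
    while ($text =~ /(BEGTITLE.*?)wirey(.*?ENDTITLE)/gsi) {
        push @spans, $&;    # $& holds everything the last match covered
    }
    return @spans;
}

my $sample = join "\n",
    'BEGTITLE Grooming dogs ENDTITLE',
    'BEGTXT Grooming a wirey haired dog is ... ENDTXT',
    'BEGTITLE Last wirey haired dog story ENDTITLE',
    '';

# Each span is printed in brackets so its start and end are visible.
print "[$_]\n" for match_spans($sample);
```

      With this sample there is a single match, and it runs from the first BEGTITLE through the last ENDTITLE. The non-greedy .*? only makes each piece as short as it can be once the regex has committed to a starting BEGTITLE; it does not stop the match from walking past an ENDTITLE to find a wirey.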

      Let's look at two other techniques (each with their own failings depending on the structure of the data). First, if we can assume that no line of data will contain more than one record (or partial record) --- but multi-line records are ok --- the solution is simple and involves the range/flip-flop operator:

          #!/usr/bin/perl -w
          use strict;
          while(<DATA>){
              s/wirey/smooth/gi if /BEGTITLE/ .. /ENDTITLE/;
              print;
          }
          __DATA__
          BEGTITLE The wirey life of a wirey haired dog ENDTITLE
          BEGTXT blah blah blah ENDTXT
          BEGTITLE Grooming dogs ENDTITLE
          BEGTXT Grooming a wirey haired dog is ... ENDTXT
          BEGTITLE Last wirey haired dog story ENDTITLE
          BEGTXT we don't need no steenkin' wirey haired dogs here ENDTXT

      That works so long as you never have multiple records on one line (or parts of multiple records). It can fail with multiple records on a line, or mixed partial records on a line as in:

          BEGPUB Wirey Haired Dogs Life ENDPUB BEGTITLE My dog has wirey hair ENDTITLE BEGTXT My wirey hair dog...blah, blah. ENDTXT

      or

          BEGTITLE My dog still has wirey hair ENDTITLE BEGTXT more wirey haired dog stuff ENDTXT

      I would argue that whoever or whatever produced data such as that should be put out of our misery. But ...

      To handle this more irregular data requires a little extra work --- namely, finding and extracting just the target record, replacing target words in the record, and then putting the changed record back into the stream. You can do this in a variety of ways; one would be to use a double regex:

          #!/usr/bin/perl -w
          use strict;
          $/ = undef;
          while(<DATA>){
              s{(BEGTITLE.*?ENDTITLE)}
               {
                 my $rec = $1;
                 $rec =~ s/wirey/smooth/ig;
                 $rec;
               }gse;
              print;
          }
          __DATA__
          BEGTITLE The wirey life of a wirey haired dog ENDTITLE
          BEGTXT blah blah blah ENDTXT
          BEGTITLE Grooming dogs ENDTITLE
          BEGTXT Grooming a wirey haired dog is ... ENDTXT
          BEGTITLE Last wirey haired dog story ENDTITLE
          BEGTXT we don't need no steenkin' wirey haired dogs here ENDTXT
          BEGPUB Wirey Haired Dogs Life ENDPUB BEGTITLE My dog has wirey hair ENDTITLE BEGTXT My wirey hair dog...blah, blah. ENDTXT
          BEGTITLE My dog still has wirey hair ENDTITLE BEGTXT more wirey haired dog stuff ENDTXT

      This handles all the example data shown so far, but wouldn't handle nested records, which can't be handled by quite such simple techniques (and which I won't bother to go into because it seems unlikely that these kinds of records are meant to be nested). Also note, you may want to surround the target word with \b anchors to avoid changing partial "words" (but that is a function of your pattern search, not the overall technique). Other problems crop up if the target pattern may match (or partially match) a target delimiter in which case one may separately capture delimiters and record text in the regex.
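      As a rough sketch of those last two points: the variant below captures the delimiters and the record text separately, so the inner substitution never sees BEGTITLE or ENDTITLE, and it wraps the target word in \b anchors. The name fix_titles and its arguments are invented for this example, and it assumes the target word begins and ends with a word character (otherwise \b will not behave as intended).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# fix_titles() is a hypothetical helper: replace whole-word occurrences of
# $from with $to, but only in the text between BEGTITLE and ENDTITLE.
sub fix_titles {
    my ($text, $from, $to) = @_;
    $text =~ s{(BEGTITLE)(.*?)(ENDTITLE)}{
        # copy the captures first: the inner s/// resets $1, $2, $3
        my ($open, $body, $close) = ($1, $2, $3);
        # \Q...\E quotes any metacharacters in $from; \b avoids partial words
        $body =~ s/\b\Q$from\E\b/$to/gi;
        $open . $body . $close;
    }gse;
    return $text;
}

# 'title' also occurs inside the delimiters themselves, which stay untouched
# because only the record text is handed to the inner substitution.
print fix_titles(
    "BEGTITLE A title about dogs ENDTITLE BEGTXT title ENDTXT\n",
    'title', 'heading'
);
```

      The delimiters are reassembled unchanged around the edited text, so a target pattern that happens to match part of a delimiter cannot damage it.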

      Alternatively, you could write a script that regularizes the data first (putting newlines before and after each record delimiter) so that simple line-by-line processing using the range op technique can be applied.
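      A minimal sketch of that regularizing pass, under a loud assumption: every delimiter is BEG or END followed only by uppercase letters (BEGTITLE, ENDTXT and so on), and no ordinary all-caps word in the data looks like that. The names regularize and fix_titles_by_line are made up for this example. Note that it changes whitespace around the delimiters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# regularize() is a hypothetical helper: put every BEGxxx at the start of a
# line and end the line after every ENDxxx.
# ASSUMPTION: delimiters are BEG/END plus uppercase letters only.
sub regularize {
    my ($text) = @_;
    $text =~ s/\s*\b(BEG[A-Z]+)\b/\n$1/g;    # newline before each opener
    $text =~ s/\b(END[A-Z]+)\b\s*/$1\n/g;    # newline after each closer
    $text =~ s/\A\n//;                       # drop the newline added at the very start
    return $text;
}

# Apply the range (flip-flop) technique line by line to the regularized text.
# The range operator keeps its state between calls, so this assumes every
# BEGTITLE in the input has a matching ENDTITLE.
sub fix_titles_by_line {
    my ($text) = @_;
    my $out = '';
    for my $line (split /^/, regularize($text)) {
        $line =~ s/wirey/smooth/gi
            if $line =~ /BEGTITLE/ .. $line =~ /ENDTITLE/;
        $out .= $line;
    }
    return $out;
}

print fix_titles_by_line(
    "BEGPUB Wirey Haired Dogs Life ENDPUB BEGTITLE My dog has wirey hair ENDTITLE BEGTXT My wirey hair dog ENDTXT\n"
);
```

      Once each record starts and ends on its own line, the one-line range test is enough, including for titles that span several lines.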

      perhaps this clears up the AM's follow-up post

        Million Thanks.... Hit the nail right on the head. I can finally make sense of the perlop/Range Operators section!
      Thanks to all... Your suggestions worked perfectly. After looking at the code, I could see where you were going straight off with the backtracking. This was much simpler than I thought.... My brain must have locked up 8^). Thanks again,
        DOOOH! I seem to have missed something. I don't mean to be a pest, but something is not making sense. Anyway, I have made a few renditions of the code offered and every example seems to suffer from a greed problem. I have inserted a while loop to have the regex range over the data until all instances of wirey have been removed between the appropriate tags. The same happens w/o the while loop if the in.txt looks like this:

        sample in.txt

            blah...blah...blah
            BEGTITLE My dog is stinkey ENDTITLE
            BEGTEXT My dog has wirey hair ENDTEXT
            BEGTITLE My dog's name is skip ENDTITLE
            blah...blah...blah

        The problem enters when wirey is replaced in the TEXT field. The code seems to gravitate towards testing the last instance of ENDTITLE rather than the 1st. Is this a nesting issue? I was thinking that nesting is like this: BEGTITLE title BEGTITLE title2 ENDTITLE ENDTITLE, where there is not a balanced symmetry like you might find in a comma delimited file. In my case there is always an ENDTITLE before the appearance of the 2nd+ instance of a BEGTITLE tag. Thanks. I am just trying to learn as much of this as I can. I thought the (.*?) usage would keep the script from being greedy like this.
Re: Regex: find/Replace words between tags
by dthacker (Deacon) on Dec 15, 2001 at 12:53 UTC