While the above *may* be sufficient for the problem at hand, it is not
a general solution: 1) it will only replace a single occurrence
of the target pattern in a record, and 2) if there are multiple
title records in a given file, it can easily match across records
(non-greedy matching does *not* prevent this), causing changes in non-target
records and/or missed changes in valid target records. Witness:
#!/usr/bin/perl -w
use strict;
$/ = undef;
my $text = <DATA>;
$text =~ s/(BEGTITLE.*?)wirey(.*?ENDTITLE)/$1smooth$2/gsi;
print $text;
__DATA__
BEGTITLE The wirey life of a wirey haired dog ENDTITLE
BEGTXT blah blah blah ENDTXT
BEGTITLE Grooming dogs ENDTITLE
BEGTXT Grooming a wirey haired dog is ... ENDTXT
BEGTITLE Last wirey haired dog story ENDTITLE
BEGTXT we don't need no steenkin' wirey haired dogs here ENDTXT
# output is:
BEGTITLE The smooth life of a wirey haired dog ENDTITLE
BEGTXT blah blah blah ENDTXT
BEGTITLE Grooming dogs ENDTITLE
BEGTXT Grooming a smooth haired dog is ... ENDTXT
BEGTITLE Last wirey haired dog story ENDTITLE
BEGTXT we don't need no steenkin' wirey haired dogs here ENDTXT
Notice that we only changed the first 'wirey' in the first title,
inadvertently changed 'wirey' in the second text section, and missed
the occurrence of 'wirey' in the third title. That happens because the
second successful match started at the second BEGTITLE and ran to the
third ENDTITLE, swallowing the entire second BEGTXT record.
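To make that runaway second match visible, here is a small sketch that prints each span the pattern matches. The sub name title_matches is made up for illustration, and the two capture groups are merged into one so the whole match can be printed; the pattern is otherwise the same.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every span matched by the pattern from the
# example above (capture groups merged so the whole match is visible).
sub title_matches {
    my ($text) = @_;
    return $text =~ /(BEGTITLE.*?wirey.*?ENDTITLE)/gsi;
}

my $sample = <<'END';
BEGTITLE The wirey life of a wirey haired dog ENDTITLE
BEGTXT blah blah blah ENDTXT
BEGTITLE Grooming dogs ENDTITLE
BEGTXT Grooming a wirey haired dog is ... ENDTXT
BEGTITLE Last wirey haired dog story ENDTITLE
BEGTXT we don't need no steenkin' wirey haired dogs here ENDTXT
END

# Print each match between markers so multi-line matches stand out.
print "<<$_>>\n" for title_matches($sample);
```

With the sample data there are only two matches, and the second one spans three lines: non-greedy .*? stops at the first place the rest of the pattern can succeed, but it is free to walk past an ENDTITLE to get there.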
Let's look at two other techniques (each with their own failings
depending on the structure of the data). First, if we can assume
that no line of data will contain more than one record (or partial
record) --- but multi-line records are ok --- the solution is simple
and involves the range/flip-flop operator:
#!/usr/bin/perl -w
use strict;
while (<DATA>) {
    s/wirey/smooth/gi if /BEGTITLE/ .. /ENDTITLE/;
    print;
}
__DATA__
BEGTITLE The wirey life of
a wirey haired dog ENDTITLE
BEGTXT blah
blah blah ENDTXT
BEGTITLE Grooming dogs ENDTITLE
BEGTXT Grooming a wirey haired dog is ... ENDTXT
BEGTITLE Last wirey haired dog story ENDTITLE
BEGTXT we don't need no steenkin' wirey haired dogs here ENDTXT
That works so long as no line ever holds more than one record (or
parts of more than one record). It can fail when records, or partial
records, share a line, as in:
BEGPUB Wirey Haired Dogs Life ENDPUB BEGTITLE My dog
has wirey hair ENDTITLE
BEGTXT My
wirey hair dog...blah, blah.
ENDTXT
or
BEGTITLE My dog
still has wirey
hair ENDTITLE BEGTXT more wirey
haired dog stuff ENDTXT
I would argue that whoever or whatever produced data such as that
should be put out of our misery. But ...
Handling this more irregular data requires a little extra work: namely,
finding and extracting just the target record, replacing target words
in the record, and then putting the changed record back into the
stream. You can do this in a variety of ways; one is to use a
double regex:
#!/usr/bin/perl -w
use strict;
$/ = undef;
while (<DATA>) {
    s{(BEGTITLE.*?ENDTITLE)}
     { my $rec = $1;
       $rec =~ s/wirey/smooth/ig;
       $rec;
     }gse;
    print;
}
__DATA__
BEGTITLE The wirey life of a wirey haired dog ENDTITLE
BEGTXT blah blah blah ENDTXT
BEGTITLE Grooming dogs ENDTITLE
BEGTXT Grooming a wirey haired dog is ... ENDTXT
BEGTITLE Last wirey haired dog story ENDTITLE
BEGTXT we don't need no steenkin' wirey haired dogs here ENDTXT
BEGPUB Wirey Haired Dogs Life ENDPUB BEGTITLE My dog
has wirey hair ENDTITLE
BEGTXT My
wirey hair dog...blah, blah.
ENDTXT
BEGTITLE My dog
still has wirey
hair ENDTITLE BEGTXT more wirey
haired dog stuff ENDTXT
This handles all the example data shown so far, but it wouldn't handle
nested records, which need more than such simple techniques
(and which I won't go into, because it seems unlikely that
these kinds of records are meant to be nested). Also note that you may
want to surround the target word with \b anchors to avoid
changing partial "words" (but that is a function of your search
pattern, not of the overall technique). Other problems crop up if the
target pattern can match (or partially match) a record delimiter, in
which case you can capture the delimiters and the record text
separately in the regex and apply the replacement to the text only.
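As a rough sketch of those last two points, the variant below captures the delimiters and the record body separately, so only the body is ever edited, and wraps the target word in \b anchors. The sub name fix_titles and the test string are made up for illustration; the tags are the ones from the sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace whole-word 'wirey' inside title records only.
# The delimiters ($1, $3) are captured apart from the body ($2), so the
# inner substitution can never alter a delimiter.
sub fix_titles {
    my ($text) = @_;
    $text =~ s{(BEGTITLE)(.*?)(ENDTITLE)}
              { my ($beg, $body, $end) = ($1, $2, $3);
                $body =~ s/\bwirey\b/smooth/ig;   # whole words only
                $beg . $body . $end;
              }gse;
    return $text;
}

print fix_titles("BEGPUB Wirey Dogs ENDPUB BEGTITLE My wirey dog\nis not wireyish ENDTITLE\n");
```

Copying $1, $2 and $3 into lexicals first matters here, because the inner substitution resets the capture variables.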
Alternatively, you could write a script that regularizes the
data first (putting newlines before and after each record delimiter)
so that simple line-by-line processing using the range op technique
can be applied.
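A minimal sketch of that regularize-first idea follows. The sub name regularize_and_fix is hypothetical, and the list of tags (TITLE, TXT, PUB) is an assumption taken from the sample data, so adjust it to whatever delimiters your files really use. Note that it also squeezes out blank lines, including any that were in the original data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: put every record delimiter on a line of its own,
# then apply the range-operator technique line by line.
sub regularize_and_fix {
    my ($text) = @_;

    # Assumed tag set, taken from the sample data.
    my $tags = qr/(?:BEG|END)(?:TITLE|TXT|PUB)/;

    # Newline before and after each delimiter, swallowing nearby whitespace.
    $text =~ s/\s*\b($tags)\b\s*/\n$1\n/g;
    $text =~ s/\n{2,}/\n/g;    # squeeze the blank lines this creates
    $text =~ s/\A\n//;         # drop the newline added before the first tag

    my @out;
    for my $line (split /\n/, $text) {
        # Delimiters now sit alone on their lines, so the range is reliable.
        $line =~ s/\bwirey\b/smooth/gi
            if $line =~ /^BEGTITLE$/ .. $line =~ /^ENDTITLE$/;
        push @out, $line;
    }
    return join("\n", @out) . "\n";
}

print regularize_and_fix(
    "BEGPUB Wirey Dogs ENDPUB BEGTITLE My dog\nhas wirey hair ENDTITLE BEGTXT more wirey\nstuff ENDTXT\n"
);
```

The output keeps the regularized layout (one delimiter per line) rather than restoring the original line breaks, which may or may not be acceptable for your data.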
Perhaps this clears up the AM's follow-up post.