"be consistent" | |
PerlMonks
Re: Re: Regex: find/Replace words between tags (non greedy re)
by danger (Priest) on Dec 16, 2001 at 02:57 UTC ( [id://132268]=note )
While the above *may* be sufficient for the problem at hand, it is not a general solution: 1) it will only replace a single occurrence of the target pattern in a record, and 2) if a file contains multiple title records, it can easily match across records (non-greedy matching does *not* prevent this), causing changes in non-target records and/or missed changes in valid target records. Witness:
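The following is a minimal sketch of that failure, not the code from the earlier reply. It assumes the earlier solution was a single non-greedy substitution of roughly this shape, and the sample records and the replacement word 'wiry' are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: three title records and two text records.
my $data = <<'END';
BEGTITLE wirey dogs and wirey cats ENDTITLE
BEGTXT text about wirey dogs ENDTXT
BEGTITLE nothing to change here ENDTITLE
BEGTXT more wirey text ENDTXT
BEGTITLE a wirey title ENDTITLE
END

# Assumed form of the single-regex approach: non-greedy, with /s so that
# '.' also matches newlines, and /g to repeat over the whole string.
(my $changed = $data) =~ s/(BEGTITLE.*?)wirey(.*?ENDTITLE)/$1wiry$2/gs;

print $changed;
```

With this data the output keeps the second 'wirey' in the first title, rewrites 'more wirey text' to 'more wiry text', and leaves 'a wirey title' untouched.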
Notice that we only changed the first 'wirey' in the first title, inadvertently changed 'wirey' in the second text section, and missed the occurrence of 'wirey' in the third title. This happens because the second successful match started at the second BEGTITLE and ran to the third ENDTITLE, swallowing the entire second BEGTXT record along the way.

Let's look at two other techniques (each with its own failings, depending on the structure of the data). First, if we can assume that no line of data will contain more than one record or partial record (multi-line records are ok), the solution is simple and involves the range/flip-flop operator:
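A minimal sketch of the range-operator technique follows. The sample data, the in-memory filehandle, and the replacement word 'wiry' are assumptions for illustration; a real script would more likely read from a file or STDIN.

```perl
use strict;
use warnings;

# Hypothetical data: at most one record (or part of one) per line,
# with the first title spanning two lines.
my $data = <<'END';
BEGTITLE wirey dogs
and wirey cats ENDTITLE
BEGTXT text about wirey dogs ENDTXT
BEGTITLE nothing here ENDTITLE
BEGTXT more wirey text ENDTXT
BEGTITLE a wirey title ENDTITLE
END

open my $fh, '<', \$data or die "cannot open in-memory handle: $!";

my $out = '';
while (my $line = <$fh>) {
    # In scalar context '..' is the flip-flop: it turns true on a line
    # matching BEGTITLE and stays true through the line matching ENDTITLE
    # (which may be the same line).
    if ($line =~ /BEGTITLE/ .. $line =~ /ENDTITLE/) {
        $line =~ s/wirey/wiry/g;
    }
    $out .= $line;
}
close $fh;

print $out;
```

Here every 'wirey' inside a title record becomes 'wiry', including both lines of the first title, and the two BEGTXT records are left alone.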
That works so long as you never have multiple records, or parts of multiple records, on one line. It can fail when a line holds several records or a mix of partial records, as in:
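As an illustration, here is the same flip-flop loop run over invented data in which title and text records share lines. The data is hypothetical, chosen only to trigger the problem.

```perl
use strict;
use warnings;

# Hypothetical irregular data: records start and end mid-line.
my $data = <<'END';
BEGTITLE a wirey title ENDTITLE BEGTXT some wirey text
more wirey text ENDTXT BEGTITLE another wirey
wirey title ENDTITLE
END

open my $fh, '<', \$data or die "cannot open in-memory handle: $!";

my $out = '';
while (my $line = <$fh>) {
    # The flip-flop only knows about whole lines, so any line that
    # touches a title record gets every 'wirey' on it replaced.
    if ($line =~ /BEGTITLE/ .. $line =~ /ENDTITLE/) {
        $line =~ s/wirey/wiry/g;
    }
    $out .= $line;
}
close $fh;

print $out;
```

Every line here contains part of a title record, so the 'wirey' occurrences inside the BEGTXT record are changed as well.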
I would argue that whoever or whatever produced data such as that should be put out of our misery. But ... handling this more irregular data requires a little extra work: finding and extracting just the target record, replacing the target words in that record, and then putting the changed record back into the stream. You can do this in a variety of ways; one is to use a double regex:
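One way to sketch the double regex is an outer substitution with /e whose replacement runs an inner substitution on the captured record. The data and the word 'wiry' are again hypothetical, and this is only one possible form of the idea.

```perl
use strict;
use warnings;

# Hypothetical irregular data: records start and end mid-line.
my $data = <<'END';
BEGTITLE a wirey title ENDTITLE BEGTXT some wirey text
more wirey text ENDTXT BEGTITLE another wirey
wirey title ENDTITLE
END

# Outer regex: isolate each title record (non-greedy, /s for multi-line).
# Inner regex: fix every target word inside a copy of that record, then
# hand the changed copy back as the replacement (/e evaluates the block).
(my $changed = $data) =~ s{(BEGTITLE.*?ENDTITLE)}{
    my $record = $1;             # copy: the inner match resets $1
    $record =~ s/wirey/wiry/g;
    $record;
}gse;

print $changed;
```

With this data both title records end up with 'wiry' throughout, while the 'wirey' occurrences in the BEGTXT record are left as they were.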
This handles all the example data shown so far, but it wouldn't handle nested records, which can't be handled by quite such simple techniques (and which I won't bother to go into because it seems unlikely that these kinds of records are meant to be nested). Also note that you may want to surround the target word with \b anchors to avoid changing partial "words" (but that is a function of your pattern search, not of the overall technique). Other problems crop up if the target pattern may match (or partially match) a record delimiter, in which case one may capture the delimiters and the record text separately in the regex.

Alternatively, you could write a script that regularizes the data first, putting newlines before and after each record delimiter so that simple line-by-line processing using the range op technique can be applied.

Perhaps this clears up the AM's follow-up post.
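To make the points about \b anchors and separately captured delimiters concrete, here is a small sketch. The helper name replace_in_titles and the sample string are hypothetical, introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: apply $pattern => $replacement inside title
# records only, capturing the delimiters separately so the pattern can
# never touch them.
sub replace_in_titles {
    my ($text, $pattern, $replacement) = @_;
    $text =~ s{(BEGTITLE)(.*?)(ENDTITLE)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $body =~ s/$pattern/$replacement/g;
        "$open$body$close";
    }gse;
    return $text;
}

my $data = "BEGTITLE wirey title, wireyness ENDTITLE BEGTXT wirey title text ENDTXT\n";

# \b anchors keep 'wireyness' from being partially rewritten.
my $anchored = replace_in_titles($data, qr/\bwirey\b/, 'wiry');

# A pattern that would also match the delimiters themselves; they
# survive because only the record body is searched.
my $safe = replace_in_titles($data, qr/title/i, 'heading');

print $anchored, $safe;
```

The first call changes only the standalone 'wirey' in the title; the second replaces 'title' in the title body while BEGTITLE, ENDTITLE, and the text record stay intact.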