The question came to me one afternoon: "How long would it take you to write a Perl script that would remove a specific character from within XML tags?"
I replied,
A quick and dirty (i.e., error-prone) version would take minutes.
Using a real XML parser, and guaranteeing that I don’t corrupt the file in the process, might take a couple of hours because I’m actually not that familiar with XML.
I wouldn’t be surprised if you could ask the question politely at perlmonks.org and get it written for free in about a half hour.
This was a need-it-now situation, so we went with the quick and dirty:
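use strict;
use warnings;

while (<>) {
    # Match any <...> that contains a dot but no '?' before that dot,
    # so that the <?xml ...?> declaration is left alone.
    s{(<[^?<>]*\.[^<>]*>)}{
        (my $tagname = $1) =~ tr/.//d;   # delete every dot inside the tag
        $tagname;
    }eg;
    print;
}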
The input to deal with looks like this:
<?xml version="1.0"?>
<TOP>
<SUB>
<THIS>STUFF</THIS>
<SOME.TYPE>T</SOME.TYPE>
<SOME.OTHER.TYPE>BLAH</SOME.OTHER.TYPE>
</SUB>
</TOP>
The problem is the dots in the tag names. They need to be stripped out. The output should look like this:
<?xml version="1.0"?>
<TOP>
<SUB>
<THIS>STUFF</THIS>
<SOMETYPE>T</SOMETYPE>
<SOMEOTHERTYPE>BLAH</SOMEOTHERTYPE>
</SUB>
</TOP>
Note that my first implementation also took the dot out of the version number in <?xml version="1.0"?>. Luckily, I had the good sense to look at a diff of the input and output before I stopped debugging.
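The script above still deletes dots anywhere between < and >, so an attribute value such as version="1.5" on an ordinary element would be damaged as well. A slightly tighter sketch, still regex-based and still not a real parser, is to touch only the element name that directly follows < or </. The helper name strip_dots_in_tag_names is made up for illustration, and the pattern assumes ASCII element names and no tag-like text inside comments or CDATA sections.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): remove dots from
# element names only, leaving attributes, text, and <?...?> or <!...> alone.
sub strip_dots_in_tag_names {
    my ($line) = @_;
    # $1 is "<" or "</"; $2 is the element name. A name must start with a
    # letter, underscore, or colon, so "<?xml" and "<!--" never match.
    $line =~ s{(</?)([A-Za-z_:][\w:.-]*)}{
        (my $name = $2) =~ tr/.//d;   # delete dots in the name only
        $1 . $name;
    }eg;
    return $line;
}

# Small in-memory demonstration (sample data, assumed for illustration).
my @sample = (
    qq{<?xml version="1.0"?>\n},
    qq{<SOME.TYPE rev="1.5">T</SOME.TYPE>\n},
);
print strip_dots_in_tag_names($_) for @sample;
```

To use this as a filter, the demonstration at the bottom could be swapped for a while (<>) loop that prints strip_dots_in_tag_names($_). Because only the name right after the opening bracket is examined, a tag whose attributes wrap onto the next line is handled too.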
So, monks, I seek your wisdom. What is the right way to do this so that I don't someday accidentally annihilate some important input? Any guidance you can offer would be appreciated.
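For comparison, here is roughly what I imagine a parser-based version would look like. This is only a sketch: it assumes the CPAN module XML::LibXML is installed, that the whole document fits in memory, and that element names carry no namespace prefixes containing dots. The helper name strip_dots_with_parser is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;   # CPAN module, assumed to be installed

# Hypothetical helper: parse the XML, rename every element whose name
# contains a dot, and return the re-serialized document as a string.
sub strip_dots_with_parser {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);   # dies on malformed XML
    for my $node ($doc->findnodes('//*')) {            # every element node
        my $name = $node->nodeName;
        next unless $name =~ tr/.//;                   # counts dots, changes nothing
        (my $new = $name) =~ tr/.//d;                  # copy with the dots deleted
        $node->setNodeName($new);
    }
    return $doc->toString;
}

# Small in-memory demonstration (sample data, assumed for illustration).
my $sample = <<'XML';
<?xml version="1.0"?>
<TOP><SUB><SOME.TYPE>T</SOME.TYPE></SUB></TOP>
XML
print strip_dots_with_parser($sample);
```

Since the parser only ever hands back element names, the XML declaration, attribute values, comments, and text are never candidates for modification, and malformed input causes an error instead of silently corrupted output.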