
Use of one of the fine CPAN XML parsing modules is almost certainly the best course. Perhaps some monk better versed than I in XML parsing can suggest appropriate choices. Novice monks often protest that these XML modules represent "too much code for my application" and want "just a simple" solution. This desire is usually a snare and a delusion: XML is complicated, and "simple" solutions are fragile and scale poorly.

However, if you are set on a simple solution, here are a couple of regex-based ones. Both can operate on strings containing embedded double-quotes and other stuff. Again, both are inherently fragile. The second approach is both more specific as to the tags to be deleted and more tolerant of tag casing and whitespace.

Update: Changed the following code example to be friendlier to double-quoting on the Windows command line.

    >perl -wMstrict -le
    "my $s = '<a>foo</a><bc>bar</bc> <def>baz</def> \"x\" <ghij>%&*</ghij>';
     print qq{'$s'};
     ;;
     $s =~ s{ < ([^>]+) > ((?: (?! </ \1) .)*) </ \1 > }{$2}xmsg;
     print qq{'$s'};
     ;;
     ;;
     $s = '<B>foo</ b > <efg>bar</efg> \"stuff\" <cD >*&!</ Cd>';
     print qq{'$s'};
     ;;
     my @tags = qw(b cd);
     my $tag = join '|', @tags;
     $tag = qr{ (?i) $tag }xms;
     use re 'eval';
     $s =~ s{ < \s* ($tag) \s* >
              ((?: (?! </ \s* \1) .)*)
              </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* >
            }
            {$2}xmsg;
     print qq{'$s'};
     "
    '<a>foo</a><bc>bar</bc> <def>baz</def> "x" <ghij>%&*</ghij>'
    'foobar baz "x" %&*'
    '<B>foo</ b > <efg>bar</efg> "stuff" <cD >*&!</ Cd>'
    'foo <efg>bar</efg> "stuff" *&!'
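
For readers who prefer a script file to a one-liner, the first substitution above can be laid out with comments. This is only a sketch: the sub name unwrap_all_tags is invented here for illustration, and the regex is the same naive one, with the same fragility.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): strip every
# <tag>...</tag> pair but keep the text between the tags.
sub unwrap_all_tags {
    my ($s) = @_;
    $s =~ s{
        < ([^>]+) >                # opening tag; the name goes into capture group 1
        ( (?: (?! </ \1 ) . )* )   # content: any character that does not begin the matching closing tag
        </ \1 >                    # closing tag with exactly the same name
    }{$2}xmsg;
    return $s;
}

print unwrap_all_tags('<a>foo</a><bc>bar</bc> <def>baz</def> "x" <ghij>%&*</ghij>'), "\n";
# prints: foobar baz "x" %&*
```

Because the closing tag must repeat the captured name exactly, this version does not tolerate differences in case or whitespace between the opening and closing tags.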

Update: I just noticed the "and their content" requirement in the OP's title and output examples. Here's a two-pass regex solution (Update: changed to be more modular and self-documenting):

    >perl -wMstrict -le
    "my $s = '<B>foo</ b > <EfG>bar</eFg> \"stuff\" <cD >*&!</ Cd> <x>baz</x>';
     print qq{'$s'};
     ;;
     my $ar_tag_delete_content = [ 1, tag_group_regex(qw(efg) ) ];
     my $ar_tag_leave_content = [ 0, tag_group_regex(qw(b cd)) ];
     ;;
     for my $pass ($ar_tag_leave_content, $ar_tag_delete_content) {
         my ($delete_content, $tag) = @$pass;
         use re 'eval';
         $s =~ s{ < \s* ($tag) \s* >
                  ((?: (?! </ \s* \1) .)*)
                  </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* >
                }
                { $delete_content ? '' : $2 }xmsge;
         print qq{'$s'};
     }
     ;;
     sub tag_group_regex {
         my $alternation = join '|', @_;
         return qr{ (?i) $alternation }xms;
     }
     "
    '<B>foo</ b > <EfG>bar</eFg> "stuff" <cD >*&!</ Cd> <x>baz</x>'
    'foo <EfG>bar</eFg> "stuff" *&! <x>baz</x>'
    'foo  "stuff" *&! <x>baz</x>'

Further Update:
Hey, wait a minute...
Does the foregoing even work?
Answer: No. Try it with the string  '<b>foo</B> bar <b>baz</B>' and it falls over.
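
One plausible reading of the failure, not spelled out in the post: the (?i) inside the qr// object covers only the tag alternation, so the \1 in the lookahead is compared case-sensitively. It never recognizes '</B' as the closing tag for 'b', and the content capture runs on to the last closing tag in the string. A minimal sketch that isolates the effect (the sub names are hypothetical):

```perl
use strict;
use warnings;

my $tag = qr{ (?i) b }xms;    # (?i) applies only inside this qr// group

# Content capture as in the two-pass version: the backreference in the
# lookahead is matched case-sensitively.
sub content_tempered {
    my ($s) = @_;
    return $s =~ m{ < \s* ($tag) \s* > ((?: (?! </ \s* \1) .)*) </ }xms ? $2 : undef;
}

# Content capture with a lazy quantifier and a case-insensitive backreference.
sub content_lazy {
    my ($s) = @_;
    return $s =~ m{ < \s* ($tag) \s* > (.*?) </ \s* (?i) \1 \s* > }xms ? $2 : undef;
}

my $s = '<b>foo</B> bar <b>baz</B>';
print "tempered: '", content_tempered($s), "'\n";   # tempered: 'foo</B> bar <b>baz'
print "lazy:     '", content_lazy($s),     "'\n";   # lazy:     'foo'
```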

The following works better, is simpler, and also gets rid of the quite unnecessary  (?(?{ lc($1) ne lc($^N) }) (*F)) business. (But this is still quite naive and fragile code for processing XML!)

    >perl -wMstrict -le
    "my @strings = (
       '<B>foo</ b > <EfG>bar</eFg> \"stuff\" <cD >*&!</ Cd> <x>baz</x>',
       '<b>fee</B> P <b>fie</B> Q <efg>foe</EFG> R <efg>fum</EFG> S',
       '<b>hee</b> W <b>hie</b> X <efg>hoe</efg> Y <efg>hum</efg> Z',
       );
     ;;
     my $ar_keep_tag_content = [ 1, tag_group_regex(qw(b cd)) ];
     my $ar_kill_tag_content = [ 0, tag_group_regex(qw(efg) ) ];
     ;;
     for my $s (@strings) {
         print qq{'$s'};
         for my $pass ($ar_keep_tag_content, $ar_kill_tag_content) {
             my ($keep_content, $tag) = @$pass;
             $s =~ s{ < \s* ($tag) \s* > (.*?) </ \s* (?i) \1 \s* > }
                    { $keep_content ? $2 : '' }xmsge;
             print qq{'$s'};
         }
         print '';
     }
     ;;
     sub tag_group_regex {
         my $alternation = join '|', @_;
         return qr{ (?i) $alternation }xms;
     }
     "
    '<B>foo</ b > <EfG>bar</eFg> "stuff" <cD >*&!</ Cd> <x>baz</x>'
    'foo <EfG>bar</eFg> "stuff" *&! <x>baz</x>'
    'foo  "stuff" *&! <x>baz</x>'

    '<b>fee</B> P <b>fie</B> Q <efg>foe</EFG> R <efg>fum</EFG> S'
    'fee P fie Q <efg>foe</EFG> R <efg>fum</EFG> S'
    'fee P fie Q  R  S'

    '<b>hee</b> W <b>hie</b> X <efg>hoe</efg> Y <efg>hum</efg> Z'
    'hee W hie X <efg>hoe</efg> Y <efg>hum</efg> Z'
    'hee W hie X  Y  Z'
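
The final one-liner can also be packaged as a small script. The following is a sketch under stated assumptions: strip_tags is a hypothetical name, the quotemeta call is a precaution the original does not take, and the substitution is otherwise the same one. The last call illustrates the fragility already warned about: nested tags of the same name defeat the lazy match.

```perl
use strict;
use warnings;

# Same idea as tag_group_regex() in the one-liner, with quotemeta added
# as a precaution (an addition in this sketch, not in the original).
sub tag_group_regex {
    my $alternation = join '|', map { quotemeta($_) } @_;
    return qr{ (?i: $alternation ) }xms;
}

# Hypothetical helper: tags named in @$unwrap are removed but their content
# is kept; tags named in @$delete are removed together with their content.
sub strip_tags {
    my ($s, $unwrap, $delete) = @_;
    for my $pass ([ 1, $unwrap ], [ 0, $delete ]) {
        my ($keep_content, $names) = @$pass;
        next unless @$names;            # nothing to do for an empty tag list
        my $tag = tag_group_regex(@$names);
        $s =~ s{ < \s* ($tag) \s* > (.*?) </ \s* (?i) \1 \s* > }
               { $keep_content ? $2 : '' }xmsge;
    }
    return $s;
}

print strip_tags('<b>fee</B> P <efg>foe</EFG> R', ['b', 'cd'], ['efg']), "\n";
# prints: fee P  R   (two spaces remain where the <efg> element was)

print strip_tags('<b>one <b>two</b> three</b>', ['b'], []), "\n";
# prints: one <b>two three</b>   (nested same-name tags are mishandled)
```

Cases like the nested one are where a real XML parser earns its keep.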

In reply to Re: remove xml tag and their content by AnomalousMonk
in thread remove xml tag and their content by zac_carl
