PerlMonks  

Excluding groups of characters in regular expressions

by semirhage (Initiate)
on Dec 21, 2007 at 14:37 UTC [id://658453]

semirhage has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to remove any nested paragraph tags from a huge file.

I have come up with the following regex so far...

s/<paragraph>(.*?)<paragraph>(.*?)<\/paragraph>(.*?)<\/paragraph>/<paragraph>$1$2$3<\/paragraph>/ig;

This appears to work in many cases; however, it also removes every other pair of properly formatted paragraph tags...

e.g.

If the input was the following:

<paragraph>some data</paragraph><paragraph>more data</paragraph><paragraph>even more data</paragraph>

The regex would result in this being changed to:

<paragraph>some data</paragraph>more data<paragraph>even more data</paragraph>

After thinking about it, this makes sense: I am trying to match four tag units within the text, and (.*?) doesn't exclude other paragraph tags from being matched...

Is there any way to exclude <paragraph> or </paragraph> from the (.*?) matches?

Thanks...

Tom
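One way to express "anything except a paragraph tag" is a negative lookahead placed in front of each character. The sketch below is not taken from the replies; the sub name strip_nested_paragraphs and the variable $tag_free are made up for illustration. It assumes the tags appear exactly as <paragraph> and </paragraph> (no attributes) and that the whole text is held in one string.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread.
sub strip_nested_paragraphs {
    my ($text) = @_;

    # Any run of characters that does not contain <paragraph> or </paragraph>:
    # the lookahead is checked before every single character is consumed.
    my $tag_free = qr{(?:(?!</?paragraph>).)*}is;

    # An opening tag followed by tag-free text and then ANOTHER opening tag
    # means the second pair is nested. Drop that inner pair's tags, keep its
    # content, and repeat until nothing nested is left.
    1 while $text =~ s{(<paragraph>$tag_free)<paragraph>($tag_free)</paragraph>}{$1$2}gi;

    return $text;
}

# The three sibling paragraphs from the question are left alone.
print strip_nested_paragraphs(
    '<paragraph>some data</paragraph><paragraph>more data</paragraph><paragraph>even more data</paragraph>'
), "\n";

# A nested pair loses only its inner tags.
print strip_nested_paragraphs(
    '<paragraph>some <paragraph>some data</paragraph>data</paragraph>'
), "\n";
# prints: <paragraph>some some datadata</paragraph>
```

The lookahead runs at every character, so this may be slow on a very large file; it is meant to show the exclusion technique rather than to be the fastest option.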

Replies are listed 'Best First'.
Re: Excluding groups of characters in regular expressions
by suaveant (Parson) on Dec 21, 2007 at 14:53 UTC
    You'd be better off treating this as a parser problem rather than a regexp problem, because matching nested items with a regexp is advanced stuff. Here is some code that should get you started; not ideal, but easy :)
    my $text = "<paragraph>some <paragraph>some data</paragraph>data</paragraph><paragraph>more data</paragraph><paragraph>even more data</paragraph>";
    my $depth = 0;
    $text =~ s{(<(/)?paragraph>)}{check($depth)}gie;
    print "$text\n";

    # Keeps a tag only at nesting depth 1. $1 and $2 are still set from
    # the s{}{} match above, and $_[0] is an alias for $depth.
    sub check {
        if ($2) {                      # closing tag
            $_[0]--;
            if ($_[0] == 0) { return $1; }
            $_[0] = 0 if $_[0] < 0;    # stray closing tag: reset
        }
        else {                         # opening tag
            $_[0]++;
            return $1 if $_[0] == 1;
        }
        return '';
    }

                    - Ant
                    - Some of my best work - (1 2 3)
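    For comparison, here is a rough sketch of the same depth-counting idea written so that it does not rely on $1 and $2 being visible inside the sub. It tokenizes the text with split instead. The sub name keep_outer_paragraphs is hypothetical, and the sketch assumes the tags carry no attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the outermost <paragraph>...</paragraph>
# tags and drop every nested pair, leaving the enclosed text in place.
sub keep_outer_paragraphs {
    my ($text) = @_;
    my $depth = 0;
    my $out   = '';

    # The capturing group makes split return the tags themselves as well
    # as the text between them.
    for my $piece (split m{(</?paragraph>)}i, $text) {
        if ($piece =~ m{\A<paragraph>\z}i) {
            $out .= $piece if ++$depth == 1;   # outermost opening tag only
        }
        elsif ($piece =~ m{\A</paragraph>\z}i) {
            next if $depth == 0;               # stray closing tag: drop it
            $out .= $piece if --$depth == 0;   # outermost closing tag only
        }
        else {
            $out .= $piece;                    # ordinary text is always kept
        }
    }
    return $out;
}

print keep_outer_paragraphs(
    '<paragraph>some <paragraph>some data</paragraph>data</paragraph><paragraph>more data</paragraph>'
), "\n";
# prints: <paragraph>some some datadata</paragraph><paragraph>more data</paragraph>
```

    Like the code above, this drops a closing tag that has no matching opening tag; whether that is the right thing for malformed input is a judgment call.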

      A friend of mine was able to help me... here is the answer for anybody who needs help with this sort of thing in the future.

      $_='$0<paragraph>$1<paragraph>$2</paragraph>$3<paragraph>$4<paragraph>$5</paragraph>$6</paragraph>$7</paragraph>$8<paragraph>$9</paragraph>$10<paragraph>$11</paragraph>$12 ';
      ($re=$_)=~s/((<paragraph>)|(<\/paragraph>)|[^<]+|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;
      $re=join"|",map{quotemeta}(eval{/$re/});
      s{($re)}{local $_=$1;s#</?paragraph>##g;$_}eg;
      print;


      Tom
        Have you ever heard anyone discussing your friend's code before... and if so, was there a lot of swearing involved?

        No offense, but the code looks like an entry in an obfuscation contest. Not so bad if it's a one-off script, but barely maintainable if it's going to be around for a bit.

                        - Ant
                        - Some of my best work - (1 2 3)

Re: Excluding groups of characters in regular expressions
by Jaap (Curate) on Dec 21, 2007 at 14:50 UTC
    That kind of thing is pretty hard with regexes, so people usually recommend using a real parser like HTML::Parser.
