PerlMonks  

Parse with XML::Simple: how to keep some tags "unparsed"?

by dda (Friar)
on Jul 01, 2004 at 10:08 UTC ( [id://371025]=perlquestion )

dda has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I need to parse a simple XML file which contains HTML tags, for example:

<page id="1">
  <content>
    This is <div class="red">some HTML text</div>
  </content>
</page>

Is there a way to specify that everything inside the <content> element should not be parsed when using XML::Simple? If not, which module should I use instead?

Thank you for your help.

--dda

Replies are listed 'Best First'.
Re: Parse with XML::Simple: how to keep some tags "unparsed"?
by bageler (Hermit) on Jul 01, 2004 at 14:30 UTC
    Others have said to use CDATA but have not given you an illustration:

    <page id="1">
      <content><![CDATA[ This is <div class="red">some HTML text</div>]]> </content>
    </page>
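
As a rough sketch of how this plays out with XML::Simple (assuming the module is installed; the variable names and the explicit KeyAttr/ForceArray settings are illustrative choices, not something from the posts above), the CDATA section should come back as one plain string, and the <div> markup inside it is not turned into nested data:

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# The sample document from this thread, with the HTML wrapped in CDATA.
my $xml = <<'XML';
<page id="1">
  <content><![CDATA[This is <div class="red">some HTML text</div>]]></content>
</page>
XML

# KeyAttr and ForceArray are set explicitly so the shape of the result
# does not depend on the module defaults.
my $page = XMLin($xml, KeyAttr => [], ForceArray => 0);

# The whole CDATA section arrives as one scalar, markup included.
my $html = $page->{content};
print "id:      $page->{id}\n";
print "content: $html\n";
```

The trade-off is that the producer of the XML has to add the CDATA wrapper; the parser itself needs no special configuration.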
Re: Parse with XML::Simple: how to keep some tags "unparsed"?
by tinita (Parson) on Jul 01, 2004 at 10:13 UTC
    If I understand you correctly, you don't want to parse the <content> tags because you want to save time.
    In that case it would probably be better to use XML::Parser.
    Also have a look at http://perl-xml.sourceforge.net/ for the FAQ and examples.
      No, it is not a matter of time. I need to keep the entire XHTML content of each <content> tag in a single place, without parsing it into Perl data structures.

      --dda

        If you are embedding data that has >'s and <'s, then you probably want to use a CDATA section within your XML to denote, "this is data of the XML document, not part of the XML structure".

        If that's beyond your control, you can always create a SAX parser that does just what you want.

        Or you can write some XSLT that transforms the content nested data into what I described above.

        Bart: God, Schmod. I want my monkey-man.

Re: Parse with XML::Simple: how to keep some tags "unparsed"?
by pbeckingham (Parson) on Jul 01, 2004 at 13:03 UTC

    This is not valid XML. You have a tag, "<content>" that contains both a value "This is", and a child tag "<div class="red">some HTML text</div>". Pick one, or hide the <> characters with &lt;&gt;, or do the right thing and use CDATA.

    Update: I stand corrected. I just checked the XML spec (http://www.w3.org/TR/2004/REC-xml-20040204) and gellyfish and ktingle are correct. Sorry.
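
For the entity-escaping option mentioned in this reply, a minimal sketch using only core Perl might look like the following. The helper name escape_xml_text is hypothetical and not part of any module; it only covers the characters that matter in element content.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): escape the characters that
# must not appear literally in XML character data.
sub escape_xml_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # ampersand first, so later entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $html    = 'This is <div class="red">some HTML text</div>';
my $escaped = escape_xml_text($html);

# The escaped string can now sit inside <content> as ordinary text.
print qq{<page id="1"><content>$escaped</content></page>\n};
```

A parser will hand the text back with the entities decoded, so the reader sees the original HTML string again.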

      It actually is valid - a node is allowed to have mixed content. This snippet will give rise to a schema like:

      <?xml version="1.0"?>
      <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="page">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="content">
                <xs:complexType mixed="true">
                  <xs:sequence>
                    <xs:element name="div">
                      <xs:complexType>
                        <xs:simpleContent>
                          <xs:extension base="xs:string">
                            <xs:attribute use="required" type="xs:string" name="class" />
                          </xs:extension>
                        </xs:simpleContent>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute use="required" type="xs:unsignedByte" name="id" />
          </xs:complexType>
        </xs:element>
      </xs:schema>

      However this is certainly not what was intended - if the contents of <content /> are to be taken literally it should be a CDATA section.

      /J\

      An element can have a value and a child element; check the XML spec. It's awkward, but valid XML.
Re: Parse with XML::Simple: how to keep some tags "unparsed"?
by abclex (Monk) on Jul 01, 2004 at 19:51 UTC

    Did you take a look at XML::Twig? AFAIK it's very flexible in parsing/filtering tags.

      use XML::Twig;

      my $xml = '<?xml version="1.0" ?>
      <page id="1">
        <content> This is <div class="red">some HTML text</div> </content>
      </page>';

      my $twig = XML::Twig->new(
          twig_handlers => {
              content => sub { $_->print; print "\n"; },
          },
      );
      $twig->parse($xml);
      $twig->purge;
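
If the goal is to keep the markup inside <content> as a single string rather than print it, a variation on the same idea could look like this. It is only a sketch: it assumes XML::Twig is installed, and the %content_for hash is an illustrative name. It relies on the element method inner_xml, which returns an element's content without its own enclosing tags.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'XML';
<?xml version="1.0" ?>
<page id="1">
  <content>This is <div class="red">some HTML text</div></content>
</page>
XML

my %content_for;    # page id => raw markup found inside <content>

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once each <content> element has been completely parsed.
        content => sub {
            my ($t, $content) = @_;
            my $id = $content->parent->att('id');
            # inner_xml keeps the nested tags but drops <content> itself.
            $content_for{$id} = $content->inner_xml;
        },
    },
);
$twig->parse($xml);

print "$_: $content_for{$_}\n" for sort keys %content_for;
```

Unlike the CDATA approach, this needs no change to the XML file, at the cost of the content having to be well-formed XML.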
Re: Parse with XML::Simple: how to keep some tags "unparsed"?
by Kyoichi (Novice) on Jul 05, 2004 at 00:26 UTC
    Heya dda
    Using CDATA is a good choice, but you may also want to check out XML::Smart.

    --
    Kyoichi

Node Type: perlquestion [id://371025]
Approved by tinita
Front-paged by grinder