http://qs321.pair.com?node_id=828156

liverpole has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks,

I've just started working on a project at work that requires the parsing of XML.  I've never used an XML parser before this week.

After considering XML::Twig, I decided to focus on XML::Simple instead.  For one thing, it seemed easier to delve into XML parsing with.  For another, it's clear that the XML this project involves is really basic stuff; no arrays of data, nor even any nested data after the root.  (Although I'm aware that I may be being naive).

I've read through the module documentation a few times, and figured out how to preserve the root of the XML with 'KeepRoot' => 1, as well as what other arguments are required if I want to go with the qw{ :strict } option.

On the very first set of actual XML data that I tried, there happened to be an error in the input.  Here's my example program with simplified test XML at the end that still exhibits the error:

#!/usr/bin/perl -w

###############
## Libraries ##
###############
use strict;
use warnings;
use XML::Simple qw{ :strict };

##################
## User-defined ##
##################
my $h_xml_args = {
    'forcearray' => [ ],
    'keyattr'    => [ ],
    'KeepRoot'   => 1,
};

##################
## Main program ##
##################
chomp(my @xml = <DATA>);
my $xml = join("\n", @xml);

# Read and display the input XML
print "[Input XML]\n";
print "-" x 79, "\n";
for (my $i = 0; $i < @xml; $i++) {
    printf " %3d. %s\n", $i+1, $xml[$i];
}
print "-" x 79, "\n\n";

# Parse the XML with XML::Simple
my $h_xml = eval { XMLin($xml, %$h_xml_args) };
if ($@) {
    die "Error while parsing XML:\n$@\n";
}

__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE MSG PUBLIC "SYSTEM" "MESSAGE.dtd">
<ROOT>
<TYPE>notification</TYPE>
<IDENT>1972308645</IDENT>
<STATE>processed</STATE>
<HEADER>Example of XML::Simple</HEADER>
<ERROR>The error: no closing "/ERROR" tag<ERROR>
<MORE_INFO>More info</MORE_INFO>
<STILL_MORE>Still more info</STILL_MORE>
<AND_SO_ON>And so on ...</AND_SO_ON>
</ROOT>

Note that the 8th line of XML is incorrect, as the closing tag is missing its slash "/" character:

<ERROR>The error: no closing "/ERROR" tag<ERROR>

The error this program gives me back is:

Error while parsing XML:
mismatched tag at line 12, column 2, byte 377 at /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/XML/Parser.pm line 187

which seems awfully non-specific.  I had to visually scan through the XML to see that the error was really on line 8 of the XML, not on line 12 (the final line).

In actuality, the XML will be much longer, of course, and it would be nice to provide something along the lines of:

Error while parsing XML: The tag <ERROR> (line 8) was never closed.

or some such.

I suspect this is just an XML beginner's confusion, but a Google search for the error message I got didn't help.  Can anyone enlighten me as to the best way to pinpoint exactly where the error occurred?  Is there something in XML::Simple that I've overlooked?  Should I be using a different module?


s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

Re: XML::Simple giving a non-specific error
by ikegami (Patriarch) on Mar 12, 2010 at 00:00 UTC
    You could edit the module to display which tag was mismatched. The error probably occurred at the last occurrence of that tag before the given line. You could maybe add some meta data to provide an even clearer message.

    Update: Well, it appears it's not simple to update the message as it originates in expat, an external library. But that means you'll get a different message if you use a different backend for XML::Simple. For example, if you change

    $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
    to
    $XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX';
    You get
    Entity: line 12: parser error : Opening and ending tag mismatch: ERROR line 8 and ROOT
    </ROOT>
           ^
    Entity: line 13: parser error : Premature end of data in tag ERROR line 8
    
    ^
    Entity: line 13: parser error : Premature end of data in tag ROOT line 3
    
    ^
    

    My test shows that XML::Parser is the fastest backend for XML::Simple, but you can switch to a slower parser that gives better error messages for debugging.
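
A minimal sketch of the backend switch described above, assuming XML::Simple, XML::Parser, XML::SAX and XML::LibXML (which provides XML::LibXML::SAX) are all installed. The helper name parse_with_backend and the shortened test document are made up for illustration; only $XML::Simple::PREFERRED_PARSER and XMLin come from XML::Simple itself.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical helper (not part of XML::Simple): parse $xml with the given
# backend and return ($data, $error); exactly one of the two is defined.
sub parse_with_backend {
    my ($xml, $backend) = @_;

    # 'local' restores the previous setting when the sub returns.
    local $XML::Simple::PREFERRED_PARSER = $backend;

    my $data = eval {
        XMLin($xml, ForceArray => [], KeyAttr => [], KeepRoot => 1);
    };
    return defined $data ? ($data, undef) : (undef, $@ || 'unknown error');
}

# Shortened, deliberately broken input: the second <ERROR> should be </ERROR>.
my $broken = "<ROOT>\n<ERROR>no closing tag<ERROR>\n</ROOT>\n";

for my $backend ('XML::Parser', 'XML::LibXML::SAX') {
    my (undef, $error) = parse_with_backend($broken, $backend);
    print "[$backend]\n", (defined $error ? $error : "no error\n"), "\n";
}
```

Running both backends over the same input makes it easy to compare the messages side by side, and to keep the fast parser for production while using the more verbose one only when diagnosing bad input.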

      ... if you change
      $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
      to
      $XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX';

      That output certainly looks more like what I want, even though I never said my code contained:

      $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
      I've just installed XML::SAX with ppm, but the addition of this line:
      $XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX';
      throws the error:
      Can't locate object method "new" via package "XML::SAX" at C:/Perl/site/lib/XML/SAX/ParserFactory.pm line 41, <DATA> line 13.

      But thanks, that does give me an alternative that I can do more research on.   And as I said, the XML is really basic (and reasonably short), so I'm quite certain speed won't be an issue.

      Update:  On a hunch I tried "ppm install XML::LibXML::SAX", and that got me past the "Cannot locate object" error above.  Now I'm getting a much nicer error message:

      Error while parsing XML:
      Opening and ending tag mismatch: ERROR line 8 and ROOT
      Premature end of data in tag ERROR line 8
      Premature end of data in tag ROOT line 3
      at C:/Perl/site/lib/XML/LibXML/SAX.pm line 64
      at C:/Perl/lib/XML/Simple.pm line 370

      So ++thanks again; it appears to fit my needs nicely.


      s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

        I never said my code contained $XML::Simple::PREFERRED_PARSER = 'XML::Parser';

        I realise it wasn't there explicitly, but by virtue of not setting it explicitly and having XML::Parser installed, it's as if you had.

        Can't locate object method "new" via package "XML::SAX"

        Hum, weird. Did you also install XML-LibXML to provide XML::LibXML::SAX? It would be a bad failure mode if not having XML::LibXML::SAX is the cause of that error, but you never know.

        ( I see from your update that this was indeed the problem )

Re: XML::Simple giving a non-specific error
by crashtest (Curate) on Mar 12, 2010 at 00:31 UTC

    Should I be using a different module?
    I've been told: yes.

    I recently recommended XML::Simple in another thread, only to have Your Mother strongly advise against it. I haven't had a chance to use XML::LibXML, XML::Twig and their ilk, so I can't comment on their quality. I have used both XML::Parser and XML::Simple, and can attest that I don't really like them - it's mostly the API I find clunky, although I can't complain about bugs or crashes.

    What I can say is that the next time I embark on an XML project, I'll be taking Your Mother's advice and looking into the more modern alternatives on CPAN.

    Some food for thought.

      That is a good plan, crashtest. For general-purpose parsing, I recommend XML::Twig for the combination of the following reasons:
      • The API is superior to the others you have tried, offering method calls and xpaths.
      • The extended documentation is fairly easy to follow, well-organized and comprehensive.
      • The tutorial is easy to complete and extremely useful.
      • The author is a PerlMonk and offers prompt support here at the Monastery.
      I started out with XML::Parser and stuck with it for a while because I didn't even know others existed. A colleague made me aware of XML::Simple, but I have never used it because it is too limited for the XML I have.

      When I discovered the CPAN Recommended Modules website, I tried using XML::LibXML, but I quickly became flustered. Then, I tried XML::Twig, and I haven't needed anything else since.

      I marked this node as OT because it has nothing to do with solving the OP's specific problem regarding getting a parser to return more useful diagnostic messages.

Re: XML::Simple giving a non-specific error
by almut (Canon) on Mar 12, 2010 at 00:11 UTC

    I think that from a parser's point of view there isn't really much more that could be done (except maybe reporting the tag name).  Unless you have a DTD that forbids nesting of <ERROR> tags etc. (and XML::Simple actually validated against it), the parser cannot tell that there's an error (i.e. an unclosed <ERROR> tag) before it has reached the end of the file.

    Update: I've never used the module myself, but maybe XML::Simple::DTDReader can help with the issue...

      You seem to be saying that being unable to determine that an error occurred before having reached the end of file means it can't be reported accurately. That's not the case, as seen in the update to my post.

      By the way, the error wasn't reported at the end of the file, it was reported when the closing tag of the parent element (</ROOT>) was found.

        By the way, the error wasn't reported at the end of the file

        Judging by the byte position (377), it was (the closing angle bracket of </ROOT> is byte 375(*)).  I don't know why the line number is reported one less than it should be; maybe the <?xml ...?> header isn't being counted.

        As for your other point, I think you're right, provided the parser keeps track of the starting positions of all tags that are still unclosed.

        ___

        (*) assuming unix newlines, which I did after having seen i386-linux-thread-multi in the OP's error message.
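
The idea of keeping track of the starting positions of still-unclosed tags can be sketched in plain Perl. This is only an illustration under strong assumptions: find_unclosed_tags is a hypothetical helper, and it uses a naive regex scan that suits flat XML like the OP's but ignores comments, CDATA sections and a '>' inside attribute values. It is not how expat or libxml2 work internally.

```perl
use strict;
use warnings;

# Hypothetical helper: scan $xml line by line, keep a stack of open tags
# together with the line each one was opened on, and return a list of
# human-readable problem descriptions.
sub find_unclosed_tags {
    my ($xml) = @_;
    my (@open, @problems);
    my $line_no = 0;

    for my $line (split /\n/, $xml) {
        $line_no++;

        # Matches <NAME ...>, </NAME> and <NAME ... />; declarations such
        # as <?xml ...?> and <!DOCTYPE ...> do not match and are skipped.
        while ($line =~ m{<(/?)([A-Za-z_][\w.:-]*)[^<>]*?(/?)>}g) {
            my ($closing, $name, $empty) = ($1, $2, $3);
            next if $empty;                      # <NAME/> opens nothing

            if (!$closing) {                     # start tag: remember it
                push @open, [$name, $line_no];
                next;
            }

            # End tag: search the stack from the top for the matching start.
            my $idx = $#open;
            $idx-- while $idx >= 0 && $open[$idx][0] ne $name;
            if ($idx < 0) {
                push @problems, sprintf 'Unexpected </%s> (line %d)',
                    $name, $line_no;
                next;
            }

            # Everything above the match was opened but never closed.
            for my $unclosed (reverse splice @open, $idx + 1) {
                push @problems, sprintf 'The tag <%s> (line %d) was never closed.',
                    @$unclosed;
            }
            pop @open;                           # the matched start tag
        }
    }

    # Tags still open at the end of the input.
    push @problems, map {
        sprintf 'The tag <%s> (line %d) was never closed.', @$_
    } reverse @open;

    return @problems;
}

my $xml = <<'END_XML';
<ROOT>
<STATE>processed</STATE>
<ERROR>no closing tag<ERROR>
<MORE_INFO>More info</MORE_INFO>
</ROOT>
END_XML

print "$_\n" for find_unclosed_tags($xml);
```

For this input the sketch reports the <ERROR> tag on line 3 twice, because the mistyped closing tag is itself counted as a second start tag. The report is a guess at where the mistake is, based on the nearest unclosed tag; the parser still cannot know what the author intended.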

        That's not the case, as seen in the update to my post.

        Your update shows how LibXML, a parser which builds a tree (takes more memory), can provide better error messages than a simpler parser like expat.

Re: XML::Simple giving a non-specific error
by FloydATC (Deacon) on Mar 12, 2010 at 13:34 UTC
    I'd say that the reason why the parser does not complain about line 8 is that it has no valid reason to say an ERROR element cannot contain another ERROR element.

    In order for the parser to know this you would have to use a DTD to describe the valid document tree.

    Consider this syntactically valid set of tags:

    <ERROR>
      <FOO>
      </FOO>
      <ERROR>
        <BAR>
        </BAR>
      </ERROR>
    </ERROR>
    Now, if you remove the last line, the XML becomes invalid. It would be tempting to fix the problem in line 4 but this would give a completely different document structure.
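
A rough sketch of the DTD point, assuming XML::LibXML is installed. The DTD here is invented purely for illustration (it is not the OP's MESSAGE.dtd) and says that ERROR may contain only a single FOO, so a nested ERROR is well-formed but invalid. The helper name check is hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative DTD (an assumption, not taken from the thread):
# ERROR contains exactly one FOO, and FOO contains only text.
my $dtd = XML::LibXML::Dtd->parse_string(<<'END_DTD');
<!ELEMENT ERROR (FOO)>
<!ELEMENT FOO (#PCDATA)>
END_DTD

# Hypothetical helper: classify a document string.
sub check {
    my ($xml) = @_;

    # load_xml throws on input that is not well-formed.
    my $doc = eval { XML::LibXML->load_xml(string => $xml) };
    return 'not well-formed' unless $doc;

    # is_valid returns a boolean instead of throwing.
    return $doc->is_valid($dtd) ? 'valid' : 'well-formed but invalid';
}

for my $xml (
    '<ERROR><FOO></FOO></ERROR>',
    '<ERROR><FOO></FOO><ERROR><BAR></BAR></ERROR></ERROR>',
    '<ERROR><FOO></FOO><ERROR><BAR></BAR></ERROR>',
) {
    printf "%-55s %s\n", $xml, check($xml);
}
```

Only with a rule like this can a parser say that the second <ERROR> is the mistake; without one, the nested form is a perfectly acceptable document.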

    -- Time flies when you don't know what you're doing