PerlMonks  

XML::Parser breaks on

by r.joseph (Hermit)
on Aug 16, 2001 at 11:59 UTC [id://105300]

r.joseph has asked for the wisdom of the Perl Monks concerning the following question:

Hello again everyone,

Been a long time since my last post, but I am really stuck this time, and can't figure why.

I am using XML::Parser to parse .RSS documents from Linux.com and Newsforge.net for a "live news feed" if you will. First, below is a snippet from Linux.com's RSS doc:

<item>
  <title>Big software companies lose their minds!</title>
  <link>http://linux.com/newsitem.phtml?sid=1&amp;aid=12492</link>
  <description>Linux.com corresponent Mark Miller has some views on big software companies.</description>
</item>

Now, if you look at the <link> tag, you see that there is a proper escape sequence, &amp;, to represent an ampersand. However, here is the problem. When XML::Parser encounters this chunk of data, it calls the Char handler, whatever you define it to be. Mine happens to be very simple, at least right now (BTW, I am using the Subs style for the parser, but that shouldn't matter):

sub found_char {
    my ($ex, $str) = @_;
    if ($ex->in_element('link') && $ex->within_element('item')) {
        print "\t\tLink: $str\n";
    }
}

So I should expect a simple string that has Link: and then the link, whatever that may be. However, it seems that XML::Parser instead, for some reason, splits on that escape sequence, so I get this output:

Link: http://linux.com/newsitem.phtml?sid=1
Link: &
Link: aid=12492

What I CANNOT figure out is why it seems to treat the string within the <link> element as three strings!

Does anyone know how this can be fixed? I have seen this problem happen with other "non-element" data, and I just want it to grab all of the pertinent data at one time.

Thanks a ton!

r. j o s e p h
"Violence is a last resort of the incompetent" - Salvor Hardin, Foundation by Isaac Asimov

Replies are listed 'Best First'.
Re: XML::Parser breaks on
by mirod (Canon) on Aug 16, 2001 at 12:09 UTC

    This is documented behaviour of XML::Parser. Strictly speaking, the documentation only says that this "can" happen. In practice it happens at every entity, line break and expat input buffer boundary crossed. The review gives you a way to deal with it: basically you cannot use the data directly in the Char handler, you just buffer it until you hit a tag (open _or_ close).
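
A minimal sketch of that buffering idea, for illustration only. The <rss>/<channel> wrapper around the item, the $buffer and @links variables, and the choice to reset the buffer at every tag are assumptions made for this example, not code from the thread or from the review mentioned above.

```perl
use strict;
use warnings;
use XML::Parser;

# Sample input modelled on the <item> from the question
# (the rss/channel wrapper is assumed for this sketch).
my $xml = <<'XML';
<rss><channel><item>
<title>Big software companies lose their minds!</title>
<link>http://linux.com/newsitem.phtml?sid=1&amp;aid=12492</link>
</item></channel></rss>
XML

my $buffer = '';   # character data collected since the last tag
my @links;         # complete <link> values found inside an <item>

my $parser = XML::Parser->new(
    Handlers => {
        # Any open tag starts a fresh buffer.
        Start => sub { $buffer = '' },
        # Char may fire several times per text node, so only append here.
        Char  => sub { my ($xp, $str) = @_; $buffer .= $str },
        # At the close tag the buffer holds the element's whole text.
        # The closing element is already off the context stack here,
        # so within_element('item') checks the enclosing elements.
        End   => sub {
            my ($xp, $el) = @_;
            push @links, $buffer
                if $el eq 'link' && $xp->within_element('item');
            $buffer = '';
        },
    },
);

$parser->parse($xml);
print "\t\tLink: $_\n" for @links;
```

Because the text is only used in the End handler, it no longer matters how many pieces the parser delivers to the Char handler.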

    By the way, did you try XML::RSS? Maybe it would make it easier for you to process your data.
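
For comparison, a rough sketch of what the XML::RSS route might look like. The feed below is a made-up RSS 0.91 wrapper around the item from the question, and @links is just an illustrative variable. It relies on XML::RSS exposing parsed items as hash references under $rss->{items}, as its documentation describes.

```perl
use strict;
use warnings;
use XML::RSS;

# Hypothetical RSS 0.91 feed built around the item from the question.
my $xml = <<'XML';
<?xml version="1.0"?>
<rss version="0.91">
<channel>
<title>Linux.com</title>
<link>http://linux.com/</link>
<description>Sample feed for illustration</description>
<item>
<title>Big software companies lose their minds!</title>
<link>http://linux.com/newsitem.phtml?sid=1&amp;aid=12492</link>
</item>
</channel>
</rss>
XML

my $rss = XML::RSS->new;
$rss->parse($xml);

# Each item is a hash reference with keys such as title and link.
my @links = map { $_->{link} } @{ $rss->{items} };
print "\t\tLink: $_\n" for @links;
```

Here the module does the gluing of character data itself, so no Char handler is needed.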

Re: XML::Parser breaks on
by blakem (Monsignor) on Aug 16, 2001 at 12:14 UTC
    I think you'll have to glue it all together. Here is a snippet that might help a bit.

    sub xml_char {
        my ($xp, $txt) = @_;
        my $el = $xp->current_element();
        $val{$el} .= $txt if $txt =~ /\S/;
    }

    Notice that %val is sort of a buffer area that will need to be cleaned up when you hit the end tag.

    -Blake

      I've used something like this before, and like the general technique, but I question:
      $val{$el}.=$txt if $txt =~ /\S/;
      Since the parser is actually allowed to break anywhere, you could lose intra-word spaces or newlines (if they're significant).
        You're probably right, and it's a piece of code I haven't looked at in a long time. I do remember there being a reason for it, but can't remember it right now. Anyone looking at using this should probably get rid of the if conditional.

        -Blake
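
Putting the two points from this sub-thread together, here is a hedged sketch of the %val approach with the conditional removed and the buffer cleaned up at the end tag. The xml_end handler, the %seen hash and the sample input are assumptions added for illustration. Moving the whitespace check to the end tag is one possible way to ignore whitespace-only content without dropping chunks mid-text, not necessarily what the original filter was for.

```perl
use strict;
use warnings;
use XML::Parser;

my %val;    # per-element text buffers, keyed by element name
my %seen;   # finished text per element (hypothetical result store)

sub xml_char {
    my ($xp, $txt) = @_;
    my $el = $xp->current_element();
    return unless defined $el;
    $val{$el} .= $txt;              # no /\S/ filter: keep every chunk
}

sub xml_end {
    my ($xp, $el) = @_;
    my $text = delete $val{$el};    # read the buffer and clear it in one step
    # Skip elements whose whole content is missing or only whitespace.
    return unless defined $text && $text =~ /\S/;
    $seen{$el} = $text;
}

my $parser = XML::Parser->new(
    Handlers => { Char => \&xml_char, End => \&xml_end },
);

# Two entities separated by a single space: a chunk-level /\S/ test
# could drop that space if it arrived as a chunk of its own.
$parser->parse('<item><title>a &amp; &amp; b</title></item>');
print "title: $seen{title}\n";
```

Keying the buffer by element name keeps the sketch close to the original snippet, but it would mix text together if an element were nested inside another element of the same name.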
