PerlMonks  

Writing a simple RSS feed 'grabber' with XML::Parser.

by DigitalKitty (Parson)
on Oct 20, 2004 at 07:42 UTC ( [id://400769]=perlquestion )

DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I am cultivating an interest in scraping RSS feeds and the following program is my first real attempt:

    #!/usr/bin/perl -w
    use strict;
    use XML::Parser;
    use LWP::Simple;

    my $feed;
    open( FH, ">feed.xml") or die "Error: $!\n";
    $feed = get("http://rss.news.yahoo.com/rss/science");
    print FH $feed;

    my $parser = new XML::Parser ( Handlers => {
                                      Start => \&hdl_start,
                                      End   => \&hdl_end,
                                      Char  => \&hdl_char,
                                   } );

    $parser->parsefile("feed.xml");

    sub hdl_start {
        my ($p, $ele, %attribs) = @_;
        $attribs{'string'} = '';
        $feed = \%attribs;
    }

    sub hdl_end {
        my ($p, $ele) = @_;
        display_feed($feed) if $ele eq 'title';
        display_feed($feed) if $ele eq 'link';
    }

    sub hdl_char {
        my ($p, $str) = @_;
        no strict 'refs';
        $feed->{'string'} .= $str;
    }

    sub display_feed {
        my $attribs = shift;
        $attribs =~ s/\n//g;
        print "$attribs->{'string'}\n\n";
    }


The last title tag doesn't generate a corresponding link and the following message appears once the feed is displayed:

no element found at line 594, column 12, byte 45056 at C:/Perl/site/lib/XML/Parser.pm line 185

Ideally, the program would display the feed in html formatted links with the title directly above them for easy navigation to the story in question.
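
A minimal sketch of that output, assuming each story is an <item> element whose <title> and <link> are direct children. The sample feed, the example.com URLs and the escape_html helper are hypothetical stand-ins and not part of the original program; the XML is parsed from a string with parse() rather than from a temporary file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample feed so the sketch runs without network access.
my $xml = <<'XML';
<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Sample channel</title>
<link>http://example.com/</link>
<item><title>First &amp; foremost</title><link>http://example.com/1</link></item>
<item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>
XML

my (@items, $item, $text);

# Hypothetical helper: escape text for safe inclusion in HTML.
sub escape_html {
    my $s = shift;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($p, $ele) = @_;
            $item = {} if $ele eq 'item';   # begin a new story
            $text = '';                     # reset the text buffer for every element
        },
        Char => sub {
            my ($p, $str) = @_;
            $text .= $str;                  # text may arrive in several pieces
        },
        End => sub {
            my ($p, $ele) = @_;
            return unless $item;            # skip channel-level title and link
            if ($ele eq 'title' or $ele eq 'link') {
                ($item->{$ele} = $text) =~ s/^\s+|\s+$//g;
            }
            elsif ($ele eq 'item') {
                push @items, $item;         # story complete
                undef $item;
            }
        },
    },
);

$parser->parse($xml);   # parse the in-memory string; no temporary file needed

# Title on one line, clickable link directly below it.
my $html = '';
for my $story (@items) {
    $html .= sprintf qq{<p>%s<br><a href="%s">%s</a></p>\n},
        escape_html($story->{title}),
        escape_html($story->{link}),
        escape_html($story->{link});
}
print $html;
```

Keeping the per-story state in its own hash, separate from the text buffer, means a title and its link are only emitted together once the closing </item> has been seen.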

I've used XML::RSS before but I decided to experiment with an additional module.

Thanks,
~Katie

Replies are listed 'Best First'.
Re: Writing a simple RSS feed 'grabber' with XML::Parser.
by ajt (Prior) on Oct 20, 2004 at 11:02 UTC

    I wouldn't start any project with XML::Parser; it's a bit antique, and XML::LibXML is a more feature-rich and much faster parser to start with.

    Parsing RSS is a real pain, as it's often not well formed, so anything using a proper XML parser will die. XML::RSS and XML::RSS::Tools get round this by having a pre-filter in them that cleans up well-known bad code before attempting to pass the file on to the XML parser.
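
As an illustration of what such a pre-filter could look like, here is a minimal sketch that escapes bare ampersands, one common cause of feeds that are not well formed. The escape_bare_ampersands helper is hypothetical; this is not the filter that XML::RSS or XML::RSS::Tools actually ships.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape every "&" that does not already start a
# named entity (&amp;), a decimal reference (&#169;) or a hex reference (&#xA9;).
sub escape_bare_ampersands {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

my $dirty = '<title>Science & Nature &amp; more &#169; &lt;b&gt;</title>';
my $clean = escape_bare_ampersands($dirty);
print "$clean\n";
# prints: <title>Science &amp; Nature &amp; more &#169; &lt;b&gt;</title>
```

A filter this simple would also rewrite ampersands inside CDATA sections, where they are legal, so it is only a starting point.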

    The XML::RSS::Tools module (which I wrote) uses XML::RSS for parsing RSS, one of several HTTP clients for getting RSS feeds, and the XML::LibXSLT module for converting the feed into something else.

    Some useful nodes:


    --
    ajt
Re: Writing a simple RSS feed 'grabber' with XML::Parser.
by demerphq (Chancellor) on Oct 20, 2004 at 10:04 UTC

    Hi DK. I'm trying to figure out what your objective is here. Are you trying to learn how XML::Parser works? Or are you trying to do something with RSS? I mean, if it's the latter then I would do it like this:

    #!/usr/bin/perl -w
    use strict;
    use XML::Simple;
    use LWP::Simple;
    use Data::Dump::Streamer;
    $|++;

    my $ticker=['http://perlmonks.org/index.pl?node_id=30175&xmlstyle=rss',
                "http://rss.news.yahoo.com/rss/science"]->[rand 2];
    print "Getting RSS from $ticker\n";
    my $feed = get($ticker);
    print "Parsing RSS...\n";
    my $ref = XMLin($feed);
    print "Dumping Parse Tree...\n";
    Dump $ref;

    If it's the former then I can't really help much beyond pointing out that what you are doing with the lexical var "$feed" in there scares the willies out of me.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi

      Flux8


      And in today's "making Perl work a lot harder than you need to do", let's nominate the following entry:
      my $ticker=['http://perlmonks.org/index.pl?node_id=30175&xmlstyle=rss', "http://rss.news.yahoo.com/rss/science"]->[rand 2];
      So, we've asked Perl to construct an array, take a reference to it, then dereference that reference to pick out one of the items, then discard the reference, which then garbage-collects the array. All when we could have written that this way:
      my $ticker=('http://perlmonks.org/index.pl?node_id=30175&xmlstyle=rss', "http://rss.news.yahoo.com/rss/science")[rand 2];
      saving two characters of typing, and all that mess of creating the new array and reference and garbage collecting. We're simply constructing a list, then picking out an element of that list with a literal slice (a construct I suggested for Perl 3, by the way {grin}).

      To optimize this further, I'd go with a qw for that first list:

      my $ticker=(qw(http://perlmonks.org/index.pl?node_id=30175&xmlstyle=rss http://rss.news.yahoo.com/rss/science))[rand 2];
      And in recent versions of Perl, you can even drop that outer set of parens:
      my $ticker=qw(http://perlmonks.org/index.pl?node_id=30175&xmlstyle=rss http://rss.news.yahoo.com/rss/science)[rand 2];
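
A small sketch that checks the forms above against each other. It assumes a fixed index (the variable $i, introduced here for illustration) in place of rand 2, so that the result is repeatable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $i = 1;   # fixed index instead of rand 2, so every run picks the same element

# Anonymous array, dereferenced immediately (the "bracket-arrow-bracket" form).
my $via_ref = ['http://perlmonks.org/index.pl?node_id=30175&xmlstyle=rss',
               "http://rss.news.yahoo.com/rss/science"]->[$i];

# Literal list slice: no array or reference is created.
my $via_list = ('http://perlmonks.org/index.pl?node_id=30175&xmlstyle=rss',
                "http://rss.news.yahoo.com/rss/science")[$i];

# The same slice over a qw list, with and without the outer parentheses.
my $via_qw = (qw(http://perlmonks.org/index.pl?node_id=30175&xmlstyle=rss
                 http://rss.news.yahoo.com/rss/science))[$i];
my $via_bare_qw = qw(http://perlmonks.org/index.pl?node_id=30175&xmlstyle=rss
                     http://rss.news.yahoo.com/rss/science)[$i];

print "$_\n" for $via_ref, $via_list, $via_qw, $via_bare_qw;
```

With rand 2 as the subscript, the fractional value is truncated to 0 or 1, so both feeds remain reachable in every form.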

      I saw bracket-arrow-bracket as a "cute syntax" once. I'm trying to stomp it out, because there's an equivalent construct (as I showed) that is a lot less work for Perl. Please don't propagate "cute syntax" that is more expensive.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        perl -MBenchmark=cmpthese -e'cmpthese -1, { cute => sub { [0,1]->[rand 2] }, list => sub { qw(0 1)[rand 2] } }'

                  Rate cute list
        cute  768000/s   -- -74%
        list 2899719/s 278%   --

        Yes, the list slice is much faster. But for something that probably runs only once in every 5 minutes, isn't 768000 per second fast enough? Optimizing seems premature here. For things like this, I am against choosing a particular language or syntax for its speed.

        I'm not saying that your reply is useless. It's important to know what code does and this information will certainly help some of the readers when they do have to optimize. But the code is written now and not much is gained by changing it, so I'd just let it be. Programmer time is still much more expensive than computer time.

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        Yep, guilty as charged. But lighten up a little. It was just a snippet to make things a little more interesting.


        ---
        demerphq



        at least you have your disclaimer to guard against anyone thinking that you're an ass...
Re: Writing a simple RSS feed 'grabber' with XML::Parser.
by gellyfish (Monsignor) on Oct 20, 2004 at 08:33 UTC

    The XML is not properly formed - probably a missing closing tag of an element. It would probably help if you could post the source of your RSS.
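
A truncated copy on disk would produce the same symptom even if the feed itself is fine. The program in the question never closes FH before calling parsefile, and the byte offset in the error message, 45056, is exactly 11 x 4096, which suggests a buffered write that had not been fully flushed. A minimal sketch of that effect, assuming a 50,000-byte string of filler as a stand-in for the downloaded feed and a temporary directory (both hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/feed.xml";
my $data = 'x' x 50_000;          # stand-in for the downloaded feed text

open(my $fh, '>', $file) or die "Error: $!\n";
print {$fh} $data;

# Output is buffered, so the tail of $data has not reached the disk yet.
my $size_before_close = (-s $file) || 0;

close($fh) or die "Error closing $file: $!\n";   # flushes the buffer
my $size_after_close = (-s $file) || 0;

printf "before close: %d bytes, after close: %d bytes\n",
    $size_before_close, $size_after_close;
```

Calling close on the handle before parsefile, or skipping the file and handing the string straight to $parser->parse($feed), should avoid the problem.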

    /J\

Re: Writing a simple RSS feed 'grabber' with XML::Parser. (detailed review)
by demerphq (Chancellor) on Oct 20, 2004 at 18:40 UTC

    bobf asked me to expand on my scary comment in my original reply. Here goes.


    ---
    demerphq



Re: Writing a simple RSS feed 'grabber' with XML::Parser.
by inman (Curate) on Oct 20, 2004 at 10:33 UTC
    I have used XML::RSSLite with some success. The following is a simple CGI that displays an RSS feed as a web page.
Re: Writing a simple RSS feed 'grabber' with XML::Parser.
by Anonymous Monk on Apr 15, 2007 at 01:31 UTC
    Try XML::RAI
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple 'get';
    use XML::RAI;

    my $rss = XML::RAI->parse(get(shift||die"please enter rss uri"));
    my $title = $rss->channel->title;
    my $link = $rss->channel->link;
    print "$title\n$link\n\n";
    for my $item (@{$rss->items}) {
        $title = $item->title;
        $link = $item->link;
        print "$title\n$link\n";
    }
    Also see XML::RSS::SimpleGen
Re: Writing a simple RSS feed 'grabber' with XML::Parser.
by Your Mother (Archbishop) on Jan 25, 2009 at 05:51 UTC

    Check out XML::Feed if you haven't. I've been really happy with it.

Re: Writing a simple RSS feed 'grabber' with XML::Parser.
by Plankton (Vicar) on Jan 25, 2009 at 05:46 UTC
    I tried XML::RSS too, but found it to be overkill for what I wanted, so I tried XML::RSS::Parser::Lite and was happy with it until I hit "CDATA". As many other monks have pointed out, RSS feeds are not always "well formed", so I just ended up doing something like this ...
    ...
    use WWW::Mechanize;

    my $url = shift;   # any .xml RSS feed url
    my $mech = WWW::Mechanize->new();
    $mech->get( $url );
    my @content = split /\n/, $mech->content;

    my $title_pattern = "<title>(.*?)</title>";
    my $description_pattern = "<description>(.*?)</description>";
    my @titletags = grep s/$title_pattern/$1/i, @content;
    my @descriptiontags = grep s/$description_pattern/$1/i, @content;

    my $thetitle=$titletags[0];
    if ( $thetitle !~ s/<\!\[CDATA\[//g ) {}
    if ( $thetitle !~ s/Librivox\://g ) {}
    if ( $thetitle !~ s/]]>//g ) {}
    print "$thetitle\n";

    my $thedescription=$descriptiontags[0];
    if ($thedescription !~ s/<\!\[CDATA\[//g ) {}
    if ($thedescription !~ s/]]>//g ) {}
    print "$thedescription\n";
    Not the best general solution but it worked for me in my particular case.
