PerlMonks  

Writing a simple RSS aggregator.

by DigitalKitty (Parson)
on Dec 06, 2003 at 02:11 UTC ( [id://312708]=perlquestion )

DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I am in the process of writing an RSS aggregator for a professor and he is not willing to install modules himself or permit anyone else to do so ( annoying indeed ). Therefore, I am relegated to using regexes. In the process of testing the tool on thraxil.org, I noticed there is no output displayed. My code ( thus far ):

    #!c:\perl\bin\perl.exe -w
    use strict;
    use LWP::Simple;
    use CGI qw( :standard );

    print "Content-type: text/html\n\n";
    print start_html;

    my $data = get("http://thraxil.org/rss");
    my $scalar;

    open (F, ">test.txt") or die $!;
    print F $data;
    close F;

    open( F2, "<test.txt" ) or die "Error : $!\n";
    while(<F2>) {
        if ( /<title>\s*(.*?)\s*<\/title><link>(.*?)<\/link>/m ) {
            print "<a href=$2>$1</a><br><br>";
        }
    }
    close F2;

    print end_html();


The rss source I am trying to parse can be seen at http://thraxil.org/rss.
I need to capture the title, link, and description data then display each group of three with the <link> info as a hyperlink to the article / node.

I feel as though I am quite close but a little assistance would be quite beneficial.

Thanks,
-Katie.

Replies are listed 'Best First'.
Re: Writing a simple RSS aggregator.
by Zaxo (Archbishop) on Dec 06, 2003 at 02:43 UTC

    The xml in the feed is spread over several lines, but you're reading only one at a time. No one line matches all your regex.

    Try setting local $/ = '</item>'; before reading. The alternative is to forget the intermediate file, rely on the linebreaks, and do global matching a la,

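        # qr// compiles the pattern for later use; a bare /.../ here would
        # match against $_ immediately and store only the result.
        my $regex = qr/<title>(.*?)<\/title>\n<link>(.*?)<\/link>\n<description> +(.*?)\n<\/description>/;
        while ($data =~ /$regex/g) {
            #...
        }
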
    That is pretty fragile, however. I suspect you're doing this as a favor and it seems odd that you have to rewrite the good xml modules to do it.

    LWP::Simple is just as optional as the XML modules, which you should be able to use. There is even one for rss.

    After Compline,
    Zaxo

      Here are a couple of other methods that were inspired by Zaxo:

      Method #1

          #!/usr/bin/perl -w
          use strict;
          use LWP::Simple;
          use CGI qw( :standard );
          require 5.8.0;

          print "Content-type: text/html\n\n";
          print start_html;

          my $RSS = get("http://thraxil.org/rss");

          {
              local $/ = "</item>";
              open my $rss, "<", \$RSS or die "Aaiiigh - $!";
              while (<$rss>) {
                  my ($title) = m!<title>(.*?)</title>!is;
                  my ($link)  = m!<link>(.*?)</link>!is;
                  my ($desc)  = m!<description>(.*?)</description>!is;
                  next unless $title && $link && $desc;
                  print "Title: $title\nLink: $link\nDescription: $desc\n\n";
              }
              close $rss;
          }

      Method #2

          #!/usr/bin/perl -w
          use strict;
          use LWP::Simple;
          use CGI qw( :standard );

          print "Content-type: text/html\n\n";
          print start_html;

          my $RSS = get("http://thraxil.org/rss");

          my @items = $RSS =~ m!<item.*?>(.*?)</item>!gis;
          for (@items) {
              my ($title) = m!<title>(.*?)</title>!is;
              my ($link)  = m!<link>(.*?)</link>!is;
              my ($desc)  = m!<description>(.*?)</description>!is;
              next unless $title && $link && $desc;
              print "Title: $title\nLink: $link\nDescription: $desc\n\n";
          }

      Each of these has its own merits, but if you want to do it right, use a real parser from CPAN. :-)
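Both methods print plain text, while the original question asked for each title to be a hyperlink. A minimal sketch of that last step follows, using only core Perl. The sample feed, `escape_html`, and `item_to_html` are hypothetical names made up for illustration, and the sketch assumes the feed text contains no XML entities or CDATA sections (a feed that already has `&amp;` in it would end up double-escaped).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal HTML escaping for text and attribute values.
# Assumption: the input holds no XML entities or CDATA sections.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Turn one <item> body into an HTML fragment; return nothing if a field is missing.
sub item_to_html {
    my ($item) = @_;
    my ($title) = $item =~ m!<title>\s*(.*?)\s*</title>!is;
    my ($link)  = $item =~ m!<link>\s*(.*?)\s*</link>!is;
    my ($desc)  = $item =~ m!<description>\s*(.*?)\s*</description>!is;
    return unless defined $title && defined $link && defined $desc;
    return sprintf qq{<p><a href="%s">%s</a><br>%s</p>\n},
        escape_html($link), escape_html($title), escape_html($desc);
}

# Hypothetical sample feed; the second item lacks a description and is skipped.
my $feed = <<'XML';
<item>
<title>Notes on "regexes"</title>
<link>http://example.com/1</link>
<description>Why one line at a time fails</description>
</item>
<item>
<title>No description here</title>
<link>http://example.com/2</link>
</item>
XML

# \b keeps <item ...> from also matching a tag such as <items>.
my @items = $feed =~ m!<item\b[^>]*>(.*?)</item>!gis;
my $html  = join '', map { item_to_html($_) } @items;
print $html;
```

Quoting the `href` value and escaping each field keeps a stray quote or angle bracket in the feed from breaking the page, which the unquoted `<a href=$2>` in the original question would not survive.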

Re: Writing a simple RSS aggregator.
by thraxil (Prior) on Dec 06, 2003 at 02:50 UTC

    also, aside from parsing issues, i'd like to point out that an RSS aggregator is an HTTP client and should be polite by properly supporting HTTP response codes and using things like Etags and If-Modified-Since headers to not overload the server (especially when it's mine ;)

    for my RSS gathering and parsing, i actually prefer to use Mark Pilgrim's ultra-liberal feed parser, since he's even more anal about that stuff than i am. it doesn't require anything beyond the *cough* python core library, so if the server has python installed, that may be an option...

      Incidentally, Spidering Hacks has Perl code for doing all the friendly ETag, If-Modified-Since stuff. Which at some point I'll be building into WWW::Mechanize::Cached, along with Expires awareness.
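For anyone who wants to see what the polite-client advice amounts to, here is a rough sketch of a conditional GET. It is not the Spidering Hacks code. It assumes HTTP::Tiny is available (it ships with Perl 5.14 and later, so nothing needs installing on a recent Perl), and the names `conditional_headers`, `apply_response`, `fetch_feed`, and the `$cache` hash layout are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;   # core module since Perl 5.14

# Build conditional-request headers from what the last fetch remembered.
sub conditional_headers {
    my ($cache) = @_;
    my %headers;
    $headers{'If-None-Match'}     = $cache->{etag}          if defined $cache->{etag};
    $headers{'If-Modified-Since'} = $cache->{last_modified} if defined $cache->{last_modified};
    return \%headers;
}

# Fold an HTTP::Tiny-style response hash into the cache; return the body to parse.
sub apply_response {
    my ($cache, $res) = @_;

    # 304 Not Modified: the server sent no body, so reuse the stored copy.
    return $cache->{content} if $res->{status} == 304;

    die "fetch failed: $res->{status} $res->{reason}\n" unless $res->{success};

    # HTTP::Tiny lower-cases response header names.
    $cache->{etag}          = $res->{headers}{etag};
    $cache->{last_modified} = $res->{headers}{'last-modified'};
    $cache->{content}       = $res->{content};
    return $cache->{content};
}

# One polite fetch; $cache is a hash ref the caller keeps between runs.
sub fetch_feed {
    my ($url, $cache) = @_;
    my $res = HTTP::Tiny->new->get($url, { headers => conditional_headers($cache) });
    return apply_response($cache, $res);
}
```

A caller would keep one `%cache` per feed, call `fetch_feed($url, \%cache)` on each poll, and persist the hash between runs however is convenient. If LWP::Simple is already in use, its `mirror($url, $file)` function covers the If-Modified-Since half of this by using the local file's timestamp.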

Re: Writing a simple RSS aggregator.
by demerphq (Chancellor) on Dec 06, 2003 at 10:45 UTC

    Personally I think you should direct your Professor here to one of the many, many different threads discussing the "you can't install modules" meme. A professor should know better. And certainly shouldn't be asking you to reinvent wheels to satisfy his perverse reluctance to stay up to date. Especially as, insofar as pure-Perl modules go, he doesn't have a leg to stand on. (And should really be made to know it.)


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


Re: Writing a simple RSS aggregator.
by thraxil (Prior) on Dec 06, 2003 at 02:33 UTC

    your regexp is never matching. the <link> and <title> are never on the same line in the feed.

    that's your immediate problem i think. parsing the feed with regexes, you're likely to run into plenty of other problems that i'm sure the other monks will point out.
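The line-at-a-time problem is easy to reproduce without touching the network. The sketch below uses a hypothetical two-item feed held in a string and a slightly relaxed version of the original pattern (with `\s*` allowed between `</title>` and `<link>`, which is an addition, not part of the posted regex). It counts matches both ways.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-item feed, laid out one element per line.
my $feed = <<'XML';
<item>
<title>First post</title>
<link>http://example.com/1</link>
</item>
<item>
<title>Second post</title>
<link>http://example.com/2</link>
</item>
XML

my $pattern = qr{<title>\s*(.*?)\s*</title>\s*<link>(.*?)</link>}s;

# Line at a time: no single line holds both elements, so nothing matches.
my $line_hits = 0;
for my $line (split /^/m, $feed) {
    $line_hits++ if $line =~ $pattern;
}

# Whole document at once: \s* between the elements absorbs the newline.
my @pairs;
while ($feed =~ /$pattern/g) {
    push @pairs, [$1, $2];
}

print "line-by-line matches: $line_hits\n";
print "whole-string matches: ", scalar(@pairs), "\n";
print "$_->[0] => $_->[1]\n" for @pairs;
```

Note that the `/m` flag in the original code does not help here: it only changes what `^` and `$` anchor to, and the loop still hands the regex one line at a time.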

Re: Writing a simple RSS aggregator.
by DigitalKitty (Parson) on Dec 06, 2003 at 03:33 UTC
    Thanks Zaxo and Anders.

    I'm quite the rss neophyte (as if that wasn't already obvious). *wink*

    -Katie.
Re: Writing a simple RSS aggregator.
by mtve (Deacon) on Dec 07, 2003 at 08:19 UTC

    try my aggregator
