Slashdot headlines

by moxliukas (Curate)
on May 29, 2002 at 14:27 UTC ( [id://170090]=sourcecode )
Category: HTML Utilities
Author/Contact Info moxliukas <moxliukas AT delfi DOT lt>
Description: Well, I'm only a beginner in Perl, but I have whacked up a script that extracts headlines from Slashdot. Yes, it is very probably buggy, and one can definitely come up with a more elegant solution. It uses modules from CPAN: LWP::UserAgent, HTTP::Request and HTML::TreeBuilder.
use HTTP::Request;
use LWP::UserAgent;
use HTML::TreeBuilder;

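my $ua = LWP::UserAgent->new;
my $request = HTTP::Request->new(GET => 'http://slashdot.org/');
my $resp = $ua->request($request);
# Stop early if the fetch failed; there is nothing useful to parse.
die "Could not fetch page: ", $resp->status_line, "\n" unless $resp->is_success;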

my $tree = HTML::TreeBuilder->new;
$tree->parse($resp->content);
$tree->eof;    # tell the parser the document is complete

my @titles;
my @authors;
my @dept;

foreach my $h ($tree->look_down('_tag', 'font', 
    sub {
      return unless ($_[0]->attr('face') and $_[0]->attr('size') and $_[0]->attr('color'));
      my @c = $_[0]->content_list;
      return unless ref $c[0];
      my @d = $c[0]->content_list;
      return 1 if not ref $d[0];
      $#c == 0 and $c[0]->tag eq 'b' and $d[0]->tag ne 'map';
    }
  )) {
  push @titles, $h->as_text;
}
foreach my $h ($tree->look_down('_tag', 'b', 
    sub {
      my @c = $_[0]->content_list;
      @c == 3 and ref $c[1] and $c[1]->tag eq 'a' and $c[0] =~ /^Posted by/;
    }
  )) {
  push @authors, $h->as_text;
}
foreach my $h ($tree->look_down('_tag', 'b', 
    sub {
      my @c = $_[0]->content_list;
      @c == 1 and $c[0] =~ /^from the .* dept\.$/;
    }
  )) {
  push @dept, $h->as_text;
}

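# Print at most the first 10 stories (fewer if the page yielded fewer titles).
my $n = @titles < 10 ? @titles : 10;
for my $i (0 .. $n - 1) {
  print "$titles[$i]\n";
  print "$authors[$i]\n";
  print "$dept[$i]\n\n";
}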

$tree = $tree->delete;
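
To see how the look_down criteria above behave without hitting the network, here is a minimal sketch run against a made-up HTML fragment. The fragment, its attribute values and the variable names are assumptions for illustration only; Slashdot's real markup may well differ.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up fragment, loosely modelled on the markup the script expects.
my $html = <<'HTML';
<html><body>
<font face="arial" size="4" color="#FFFFFF"><b>First story</b></font>
<b>Posted by <a href="/u/alice">alice</a> on Wednesday</b>
<b>from the just-an-example dept.</b>
<font size="2">not a headline</font>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Criteria are ANDed: the tag must be <font> and the code ref must return true.
my @titles = map { $_->as_text } $tree->look_down(
    _tag => 'font',
    sub {
        my $el = shift;
        return unless $el->attr('face') && $el->attr('size') && $el->attr('color');
        my @kids = $el->content_list;
        return @kids == 1 && ref $kids[0] && $kids[0]->tag eq 'b';
    },
);

# A <b> whose only child is plain text matching the "dept." pattern.
my @depts = map { $_->as_text } $tree->look_down(
    _tag => 'b',
    sub {
        my @kids = $_[0]->content_list;
        return @kids == 1 && !ref $kids[0] && $kids[0] =~ /^from the .* dept\.$/;
    },
);

print "$_\n" for @titles, @depts;
$tree->delete;
```

The second font element is skipped because it lacks the face and color attributes, which is the same filter the script uses to separate headlines from other text.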
Replies are listed 'Best First'.
•Re: Slashdot headlines
by merlyn (Sage) on May 29, 2002 at 18:00 UTC
    Here's a slightly simpler version that I whipped up in a few minutes, requiring XML::RSS and Template as well as LWP:
    #!/usr/bin/perl -w
    use strict;

    my $SLASHCACHE = "/Users/merlyn/slashdot.rdf";

    use LWP::Simple;
    use Template;

    mirror("http://slashdot.org/slashdot.rdf", $SLASHCACHE);
    my $t = Template->new;
    $t->process(\*DATA, { slashcache => $SLASHCACHE} ) or die $t->error;
    __END__
    ===
    Here's the news:
    [% USE news = XML.RSS(slashcache);
       FOREACH item = news.items;
       LAST IF loop.count > 10; -%]
    [% loop.count %]: [% item.title %] at [% item.link %]
    [% END -%]
    ===

    -- Randal L. Schwartz, Perl hacker
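
For anyone without Template installed, the loop in that template can be approximated in plain Perl with XML::RSS alone. This is only a rough sketch, not merlyn's code: it parses a made-up inline feed (written in the RSS 0.91 layout for brevity, which is an assumption and not the format of slashdot.rdf) instead of mirroring the real one.

```perl
use strict;
use warnings;
use XML::RSS;

# Made-up feed for illustration; a real run would read the mirrored file.
my $feed = <<'XML';
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example</title>
    <link>http://example.com/</link>
    <description>Made-up feed</description>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>
XML

my $rss = XML::RSS->new;
$rss->parse($feed);

# Same shape as the template: number each item, stop after ten.
my @lines;
my $count = 0;
for my $item (@{ $rss->{items} }) {
    last if ++$count > 10;
    push @lines, "$count: $item->{title} at $item->{link}";
}
print "$_\n" for @lines;
```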

      While merlyn's solution works well, I prefer a more XML-native approach. The RDF file should be valid XML, in which case you only need to apply an XSL Transformation to the file once you have it, to get valid XHTML, WML or whatever out.

      This is the basis of XML::RSS::Tools, a module I've been developing. Here is an example of the usage:

      #!/usr/bin/perl -w
      use strict;
      use XML::RSS::Tools;

      my $site = shift;
      my $xsl  = shift;
      my $rss  = XML::RSS::Tools->new;

      $rss->rss_uri($site) or die $rss->as_string('error');
      $rss->xsl_file($xsl) or die $rss->as_string('error');
      if ($rss->transform) { print $rss->as_string };
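
To make the transformation step itself concrete, here is a minimal sketch that applies a stylesheet to a feed using XML::LibXML and XML::LibXSLT directly, rather than going through XML::RSS::Tools. The feed, the stylesheet and the r: prefix are all made up for illustration; the feed is assumed to use the RSS 0.9 namespace.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Made-up feed in the RSS 0.9 layout; a real feed would be fetched first.
my $feed = <<'XML';
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://my.netscape.com/rdf/simple/0.9/">
  <item><title>First story</title><link>http://example.com/1</link></item>
  <item><title>Second story</title><link>http://example.com/2</link></item>
</rdf:RDF>
XML

# Hypothetical stylesheet: turn each item into an XHTML list entry.
my $xsl = <<'XSL';
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://my.netscape.com/rdf/simple/0.9/"
    exclude-result-prefixes="r">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <ul>
      <xsl:for-each select="//r:item">
        <li><a href="{r:link}"><xsl:value-of select="r:title"/></a></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
XSL

# Compile the stylesheet, then apply it to the parsed feed.
my $stylesheet = XML::LibXSLT->new->parse_stylesheet(
    XML::LibXML->load_xml(string => $xsl));
my $result = $stylesheet->transform(XML::LibXML->load_xml(string => $feed));
my $xhtml  = $stylesheet->output_string($result);
print $xhtml;
```

Because the feed elements live in a default namespace, the stylesheet has to bind a prefix to that namespace before its XPath expressions will match anything. The parser also rejects a feed that is not well-formed XML, which is where the problems listed next come in.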

      There are however problems:

      • XML::RSS module has some defects
      • Perl before 5.7.x isn't exactly the best for working with Unicode
      • RSS feeds often are not valid XML - How do I clean RSS feeds to make them usable?
      • My module still has bugs in... :-(
      • UPDATE: Versions later than 0.05 have fewer bugs, and partially get round the defects in XML::RSS and dirty feeds

      My humble 2p
