PerlMonks
Writing a news retrieval application.

by DigitalKitty (Parson)
on Oct 27, 2003 at 06:18 UTC ( [id://302336] )

DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I have been considering writing a news retrieval application for a few weeks. Essentially, I envision it as having the following capabilities:

  • When 'fed' a list of keywords, it will search an array of news-related sites ( http://news.yahoo.com, http://www.google.com, etc.) for stories that contain the keywords mentioned above.

  • Since the articles are often (if not consistently) accessible behind a hyperlink, my program needs to be able to follow them. One prime example is http://www.perl.com. It has an article entitled 'A Chromosome at a Time with Perl, Part 1'. The fact that it is behind a link adds another layer of complexity to my objective.

  • After having made its daily 'sweeps', each complete article would be stored on my local machine in HTML form so I could peruse them at my leisure.

I am considering using WWW::Mechanize, but having never used it, I would greatly appreciate advice from those who have, or feedback from anyone who might suggest an alternate course of action. Included below is my fledgling foray into this vast arena.

use warnings;
use strict;
use LWP::UserAgent;

my $agent    = LWP::UserAgent->new();
my $site     = 'http://www.perl.com';
my $response = $agent->get( $site );
my $content  = $response->content();

if ( $content =~ m/Chromosome/i ) {
    open( FH, ">>news.html" ) || die "Error : $!\n";
    print FH $content;
    close( FH ) || die "Error : $!\n";
}
else {
    print "Nothing!\n";
}


Thanks,
-Katie.

Replies are listed 'Best First'.
Re: Writing a news retrieval application.
by PodMaster (Abbot) on Oct 27, 2003 at 07:03 UTC
    Getting started with WWW::Mechanize is easy (even if WWW::Mechanize::Shell is slightly behind the times)
    C:\new\WWW-Mechanize-Shell-0.29>perl -MWWW::Mechanize::Shell -e shell
    Module File::Modified not found. Automatic reloading disabled.
    >get http://perl.com/
    Retrieving http://perl.com/(200)
    http://perl.com/>open /Chromosome/
     83: A Chromosome at a Time with Perl, Part 2
     99: A Chromosome at a Time with Perl, Part 1
    http://perl.com/>open 83
    (200)
    http://www.perl.com/pub/a/2003/10/15/bioinformatics.html>content bioinformatics2.html
    http://www.perl.com/pub/a/2003/10/15/bioinformatics.html>back
    http://perl.com/>open 99
    (200)
    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html>content bioinformatics1.html
    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html>script bionformatics.pl
    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html>q

    C:\new\WWW-Mechanize-Shell-0.29>dir bio*
     Directory of C:\new\WWW-Mechanize-Shell-0.29
    10/26/2003  11:05p        38,616 bioinformatics.html
    10/26/2003  11:08p        38,617 bioinformatics1.html
    10/26/2003  11:08p        30,658 bioinformatics2.html
    10/26/2003  11:09p           721 bionformatics.pl

    C:\new\WWW-Mechanize-Shell-0.29>cat bionformatics.pl
    #!C:\Perl\bin\perl.exe -w
    use strict;
    use WWW::Mechanize;
    use WWW::Mechanize::FormFiller;
    use URI::URL;

    my $agent      = WWW::Mechanize->new();
    my $formfiller = WWW::Mechanize::FormFiller->new();
    $agent->env_proxy();

    $agent->get('http://perl.com/');
    $agent->form(1) if $agent->forms and scalar @{$agent->forms};
    $agent->follow('83');
    {
        my $filename = q{bioinformatics2.html};
        local *F;
        open F, "> $filename" or die "$filename: $!";
        binmode F;
        print F $agent->content, "\n";
        close F;
    }
    $agent->back();
    $agent->follow('99');
    {
        my $filename = q{bioinformatics1.html};
        local *F;
        open F, "> $filename" or die "$filename: $!";
        binmode F;
        print F $agent->content, "\n";
        close F;
    }

    C:\new\WWW-Mechanize-Shell-0.29>
    This is practically the first time I've used these (I have before, but since I don't remember anything, it's as if I haven't).


Re: Writing a news retrieval application.
by pg (Canon) on Oct 27, 2003 at 06:54 UTC

    You can simply utilize the search ability of those news sites. For example, if you want to search for news about the sniper case through Yahoo, you can simply send an HTTP request for the URL "http://search.news.yahoo.com/search/news/?c=&p=sniper". Don't do it with brute force on your side.
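    This suggestion can be sketched with core Perl alone: build the search URL by percent-encoding the keyword before interpolating it. The query-string layout below just mirrors the URL quoted above; the helper names are my own:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Percent-encode a keyword for use in a query string
    # (RFC 3986 unreserved characters pass through untouched).
    sub url_escape {
        my ($text) = @_;
        $text =~ s/([^A-Za-z0-9_.~-])/sprintf("%%%02X", ord($1))/ge;
        return $text;
    }

    # Build a Yahoo News search URL for a keyword, following the
    # URL shape quoted in the reply above.
    sub yahoo_news_url {
        my ($keyword) = @_;
        return 'http://search.news.yahoo.com/search/news/?c=&p='
             . url_escape($keyword);
    }

    print yahoo_news_url('sniper'), "\n";
    print yahoo_news_url('perl bioinformatics'), "\n";
    ```

    The resulting URL can then be handed straight to LWP::UserAgent's get(), as in the question's own snippet.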

Re: Writing a news retrieval application.
by Art_XIV (Hermit) on Oct 27, 2003 at 13:53 UTC

    LWP or WWW::Mechanize should work just fine for your scraping needs, but here's a hint -

    Let one of the HTML:: modules do your parsing for you. You'll be glad you did after your scraped site(s) go through a few layout changes.
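    As a sketch of that advice: HTML::LinkExtor (part of the HTML::Parser distribution that LWP already depends on) collects hrefs via a parser callback, so the extraction survives markup reshuffles that would break a hand-rolled regex. The sample page below is made up:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::LinkExtor;

    # Collect the href of every <a> tag via the parser callback;
    # HTML::LinkExtor only reports link-carrying attributes.
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    });

    # A made-up page; in the real application this would be
    # $response->content() from LWP.
    my $html = <<'HTML';
    <html><body>
    <p>Stories: <a href="/pub/a/2003/09/10/bioinformatics.html">A
    Chromosome at a Time with Perl, Part 1</a> and
    <a class="hot" href="/pub/a/2003/10/15/bioinformatics.html">Part 2</a>.</p>
    </body></html>
    HTML

    $parser->parse($html);
    $parser->eof;

    print "$_\n" for @links;
    ```

    Because the parser works from the tag structure rather than the byte layout, adding attributes or reflowing the markup (as in the second link) doesn't disturb the result.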

Re: Writing a news retrieval application.
by chromatic (Archbishop) on Oct 28, 2003 at 01:17 UTC

    Perhaps searching RSS feeds would be simpler; they're often provided for similar purposes.
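    Once a feed has been fetched and parsed (XML::RSS is the usual tool), the keyword filter itself is a one-line grep. A sketch over hand-written items standing in for parsed feed entries (the second entry's URL is invented for illustration):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stand-ins for entries a feed parser such as XML::RSS would
    # hand back as hashrefs with 'title' and 'link' keys.
    my @items = (
        { title => 'A Chromosome at a Time with Perl, Part 1',
          link  => 'http://www.perl.com/pub/a/2003/09/10/bioinformatics.html' },
        { title => 'This Week on perl5-porters',
          link  => 'http://example.com/p5p-digest.html' },
    );

    # Keep only items whose title mentions any of the keywords,
    # case-insensitively.
    sub match_keywords {
        my ($items, @keywords) = @_;
        my $pattern = join '|', map { quotemeta } @keywords;
        return grep { $_->{title} =~ /$pattern/i } @$items;
    }

    my @hits = match_keywords(\@items, 'chromosome', 'genome');
    print "$_->{title}\n" for @hits;
    ```

    Matching against feed titles and descriptions sidesteps the link-following problem entirely, since each RSS item already carries the article's URL.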
