PerlMonks
Writing a news retrieval application.

by DigitalKitty (Parson)
on Oct 27, 2003 at 06:18 UTC ( [id://302336] )

DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I have been considering writing a news retrieval application for a few weeks. Essentially, I envision it as having the following capabilities:

  • When 'fed' a list of keywords, it will search an array of news-related sites ( http://news.yahoo.com, http://www.google.com, etc.) for stories that contain the keywords mentioned above.

  • Since the articles are often (if not consistently) accessible behind a hyperlink, my program needs to be able to follow them. One prime example is http://www.perl.com. It has an article entitled 'A Chromosome at a Time with Perl, Part 1'. The fact that it is behind a link adds another layer of complexity to my objective.

  • After having made its daily 'sweeps', each complete article would be stored on my local machine in HTML form so I could peruse them at my leisure.

I am considering using WWW::Mechanize, but having never used it, I would greatly appreciate advice from those who have, or feedback from anyone who might suggest an alternate course of action. Included below is my fledgling foray into this vast arena.

use warnings;
use strict;
use LWP::UserAgent;

my $agent    = LWP::UserAgent->new();
my $site     = 'http://www.perl.com';
my $response = $agent->get( $site );
my $content  = $response->content();

if ( $content =~ m/Chromosome/i ) {
    open( FH, ">>news.html" ) || die "Error : $!\n";
    print FH $content;
    close( FH ) || die "Error : $!\n";
}
else {
    print "Nothing!\n";
}


Thanks,
-Katie.

Replies are listed 'Best First'.
Re: Writing a news retrieval application.
by PodMaster (Abbot) on Oct 27, 2003 at 07:03 UTC
    Getting started with WWW::Mechanize is easy (even if WWW::Mechanize::Shell is slightly behind the times)
    C:\new\WWW-Mechanize-Shell-0.29>perl -MWWW::Mechanize::Shell -e shell
    Module File::Modified not found. Automatic reloading disabled.
    >get http://perl.com/
    Retrieving http://perl.com/(200)
    http://perl.com/>open /Chromosome/
     83: A Chromosome at a Time with Perl, Part 2
     99: A Chromosome at a Time with Perl, Part 1
    http://perl.com/>open 83
    (200)
    http://www.perl.com/pub/a/2003/10/15/bioinformatics.html>content bioinformatics2.html
    http://www.perl.com/pub/a/2003/10/15/bioinformatics.html>back
    http://perl.com/>open 99
    (200)
    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html>content bioinformatics1.html
    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html>script bionformatics.pl
    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html>q

    C:\new\WWW-Mechanize-Shell-0.29>dir bio*
     Directory of C:\new\WWW-Mechanize-Shell-0.29
    10/26/2003  11:05p        38,616 bioinformatics.html
    10/26/2003  11:08p        38,617 bioinformatics1.html
    10/26/2003  11:08p        30,658 bioinformatics2.html
    10/26/2003  11:09p           721 bionformatics.pl

    C:\new\WWW-Mechanize-Shell-0.29>cat bionformatics.pl
    #!C:\Perl\bin\perl.exe -w
    use strict;
    use WWW::Mechanize;
    use WWW::Mechanize::FormFiller;
    use URI::URL;

    my $agent      = WWW::Mechanize->new();
    my $formfiller = WWW::Mechanize::FormFiller->new();
    $agent->env_proxy();

    $agent->get('http://perl.com/');
    $agent->form(1) if $agent->forms and scalar @{$agent->forms};
    $agent->follow('83');
    {
        my $filename = q{bioinformatics2.html};
        local *F;
        open F, "> $filename" or die "$filename: $!";
        binmode F;
        print F $agent->content, "\n";
        close F;
    }
    $agent->back();
    $agent->follow('99');
    {
        my $filename = q{bioinformatics1.html};
        local *F;
        open F, "> $filename" or die "$filename: $!";
        binmode F;
        print F $agent->content, "\n";
        close F;
    }

    C:\new\WWW-Mechanize-Shell-0.29>
    This is practically the first time I've used these (I have before, but since I don't remember anything, it's as if I haven't).


Re: Writing a news retrieval application.
by pg (Canon) on Oct 27, 2003 at 06:54 UTC

    You can simply utilize the search ability of those news sites. For example, if you want to search for news about the sniper case through Yahoo, you can simply send an HTTP request for the URL "http://search.news.yahoo.com/search/news/?c=&p=sniper". Don't do it with brute force on your side.
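    This suggestion can be sketched with core Perl alone: build the search URL by percent-encoding the keyword before interpolating it. The query-string layout below just mirrors the URL quoted above; the helper names are my own:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Percent-encode a keyword for use in a query string
    # (RFC 3986 unreserved characters pass through untouched).
    sub url_escape {
        my ($text) = @_;
        $text =~ s/([^A-Za-z0-9_.~-])/sprintf("%%%02X", ord($1))/ge;
        return $text;
    }

    # Build a Yahoo News search URL for a keyword, following the
    # URL shape quoted in the reply above.
    sub yahoo_news_url {
        my ($keyword) = @_;
        return 'http://search.news.yahoo.com/search/news/?c=&p='
             . url_escape($keyword);
    }

    print yahoo_news_url('sniper'), "\n";
    print yahoo_news_url('perl bioinformatics'), "\n";
    ```

    The resulting URL can then be handed straight to LWP::UserAgent's get(), as in the question's own snippet.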

Re: Writing a news retrieval application.
by Art_XIV (Hermit) on Oct 27, 2003 at 13:53 UTC

    LWP or WWW::Mechanize should work just fine for your scraping needs, but here's a hint -

    Let one of the HTML:: modules do your parsing for you. You'll be glad you did after your scraped site(s) go through a few layout changes.
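    As a sketch of that advice: HTML::LinkExtor (part of the HTML::Parser distribution that LWP already depends on) collects hrefs via a parser callback, so the extraction survives markup reshuffles that would break a hand-rolled regex. The sample page below is made up:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::LinkExtor;

    # Collect the href of every <a> tag via the parser callback;
    # HTML::LinkExtor only reports link-carrying attributes.
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    });

    # A made-up page; in the real application this would be
    # $response->content() from LWP.
    my $html = <<'HTML';
    <html><body>
    <p>Stories: <a href="/pub/a/2003/09/10/bioinformatics.html">A
    Chromosome at a Time with Perl, Part 1</a> and
    <a class="hot" href="/pub/a/2003/10/15/bioinformatics.html">Part 2</a>.</p>
    </body></html>
    HTML

    $parser->parse($html);
    $parser->eof;

    print "$_\n" for @links;
    ```

    Because the parser works from the tag structure rather than the byte layout, adding attributes or reflowing the markup (as in the second link) doesn't disturb the result.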

Re: Writing a news retrieval application.
by chromatic (Archbishop) on Oct 28, 2003 at 01:17 UTC

    Perhaps searching RSS feeds would be simpler; they're often provided for similar purposes.
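    Once a feed has been fetched and parsed (XML::RSS is the usual tool), the keyword filter itself is a one-line grep. A sketch over hand-written items standing in for parsed feed entries (the second entry's URL is invented for illustration):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stand-ins for entries a feed parser such as XML::RSS would
    # hand back as hashrefs with 'title' and 'link' keys.
    my @items = (
        { title => 'A Chromosome at a Time with Perl, Part 1',
          link  => 'http://www.perl.com/pub/a/2003/09/10/bioinformatics.html' },
        { title => 'This Week on perl5-porters',
          link  => 'http://example.com/p5p-digest.html' },
    );

    # Keep only items whose title mentions any of the keywords,
    # case-insensitively.
    sub match_keywords {
        my ($items, @keywords) = @_;
        my $pattern = join '|', map { quotemeta } @keywords;
        return grep { $_->{title} =~ /$pattern/i } @$items;
    }

    my @hits = match_keywords(\@items, 'chromosome', 'genome');
    print "$_->{title}\n" for @hits;
    ```

    Matching against feed titles and descriptions sidesteps the link-following problem entirely, since each RSS item already carries the article's URL.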
