PerlMonks  

save a page as text

by Anonymous Monk
on Apr 22, 2005 at 00:17 UTC ( [id://450236]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I posted a SoPW question yesterday asking how to screen scrape a page (rather than work from the source code itself), and one person suggested saving the page as text from the browser.

This idea might just work and I was wondering if anyone had any experience with such things with Perl/CGI? I really want Perl to do the entire process rather than having to open the browser and save it myself.

I use IE, and I assume this is probably going to depend on the browser and on the server it's running on.

Thanks.

Replies are listed 'Best First'.
Re: save a page as text
by Hero Zzyzzx (Curate) on Apr 22, 2005 at 00:43 UTC

    No need to involve a browser at all. Here's one way, using the excellent HTML::TokeParser::Simple by the monastery's own Ovid.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TokeParser::Simple;

    my $page = get('http://www.page.you.want.com/some/path');
    die "Could not fetch the page\n" unless defined $page;

    my $p = HTML::TokeParser::Simple->new( \$page );
    while ( my $token = $p->get_token ) {
        # This prints all text in an HTML doc (i.e., it strips the HTML)
        next unless $token->is_text;
        print $token->as_is;
    }

    Stuffing it into a file is left as an exercise for the poster.
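
    For readers who want that exercise spelled out, here is a minimal sketch that writes the extracted text to a file instead of printing it. The helper name save_text and the command-line handling are assumptions made for illustration, not part of the original reply. Like the snippet above, it only sees text already present in the HTML, so it will not capture anything that JavaScript prints.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the thread): write the text tokens of an
# HTML string to $outfile, dropping all tags.
sub save_text {
    my ( $html, $outfile ) = @_;

    my $p = HTML::TokeParser::Simple->new( \$html );
    open my $out, '>', $outfile or die "Cannot write $outfile: $!\n";
    while ( my $token = $p->get_token ) {
        next unless $token->is_text;    # skip tags, comments, etc.
        print {$out} $token->as_is;
    }
    close $out or die "Cannot close $outfile: $!\n";
    return;
}

# Assumed usage: script.pl URL OUTPUT_FILE
if ( @ARGV == 2 ) {
    require LWP::Simple;
    my $page = LWP::Simple::get( $ARGV[0] );
    die "Could not fetch $ARGV[0]\n" unless defined $page;
    save_text( $page, $ARGV[1] );
}
```

    Keeping the extraction in a sub that takes a string means it can be tried on a literal HTML snippet without any network access.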

    -Any sufficiently advanced technology is
    indistinguishable from doubletalk.

    My Biz

      I can't just strip the HTML; if I could, I would know how to do that myself. There is JavaScript in the page that prints something out, and I need to retrieve that output.

      Retrieving the source code doesn't help because it contains only the JS code, not the data it prints. So I need a way to make a Perl screen scraper that captures the text of a page as displayed, without dealing with the HTML at all.

        JavaScript has long been a problem for page scraping. People have tried to get around it by, say, recording the actual HTTP parameters, but that is not relevant to your problem. The other approach is to drive IE using Win32::OLE. I have used Win32::IE::Mechanize before, but it's mainly for navigation and parsing; someone would need to figure out how to call the "Save As" method through COM.

        I didn't know that "Save As Text" evaluates JavaScript output. I tried it, and apparently it works.

        Update: I just saw the module Win32::CaptureIE; it looks more promising.

        Sorry, I was a bit confused by the question. I do very little on or for windows, so hopefully someone more experienced with automating the evil empire will speak up.

        -Any sufficiently advanced technology is
        indistinguishable from doubletalk.

        My Biz

Re: save a page as text
by jpeg (Chaplain) on Apr 22, 2005 at 00:35 UTC
    I may be misunderstanding you, but a .cgi script written in Perl doesn't look any different from an HTML, ASP, or PHP page. It just spits out HTML as far as the client is concerned, so you'll still be using LWP or WWW::Mechanize or whatever. You'll still need to strip out the HTML tags.

    If you're asking how to script the actions of opening IE, clicking the File menu, and so on, maybe look into Win32::OLE or the ActiveState Win32 mailing lists.
    --
    jpg
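
    As a rough illustration of the Win32::OLE route suggested above, the sketch below drives IE through its COM automation object and reads the text of the rendered page, which should include anything JavaScript wrote while the page loaded. It assumes a Windows machine with Internet Explorer and Win32::OLE installed, and it has not been run against the poster's page. The helper name rendered_text is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the text of a page as
# rendered by Internet Explorer.
sub rendered_text {
    my ($url) = @_;

    # Loaded lazily because Win32::OLE is only available on Windows.
    require Win32::OLE;

    # 'Quit' is called on the IE object when it is destroyed.
    my $ie = Win32::OLE->new( 'InternetExplorer.Application', 'Quit' )
        or die "Cannot start IE: " . Win32::OLE->LastError . "\n";
    $ie->{Visible} = 0;
    $ie->Navigate($url);

    # Wait until IE reports the document as complete (ReadyState 4).
    sleep 1 while $ie->{Busy} || $ie->{ReadyState} != 4;

    # innerText is the visible text of the body, without markup.
    return $ie->{Document}{body}{innerText};
}

# Assumed usage: script.pl URL
print rendered_text( $ARGV[0] ) if @ARGV;
```

    Output that a script adds some time after the page has finished loading may need an extra wait before innerText is read.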

Re: save a page as text
by Thelonious (Scribe) on Apr 22, 2005 at 11:55 UTC
    You can use Win32::IE::Mechanize. It will drive IE:

    use strict;
    use warnings FATAL => 'all';
    use Win32::IE::Mechanize;

    my $ie = Win32::IE::Mechanize->new(visible => 1);
    $ie->get('http://www.perlmonks.com/');
    $ie->follow_link(text => 'Seekers of Perl Wisdom');
    $ie->follow_link(text => 'save a page as text');
    print $ie->content;
