PerlMonks  

I want to save web pages as text rather than as HTML.

by anautismobserver (Sexton)
on Sep 06, 2019 at 17:18 UTC ( [id://11105732]=perlquestion )

anautismobserver has asked for the wisdom of the Perl Monks concerning the following question:

My goal is to develop an automated way to obtain the number of followers for each WordPress blog feed in a list (e.g. https://wordpress.com/read/feeds/93815501). These pages display the follower count in a way that ends up in a text-only save of the page (such as Firefox's "Save Page as... Text Files") but is not in the HTML page source.

I'm a Perl novice, and (regrettably) don't currently have the patience or interest to learn Perl "from the ground up". My strategy has been to find working code samples that do pieces of what I want, then change them incrementally until they do all I want. Thanks.


Replies are listed 'Best First'.
Re: I want to save web pages as text rather than as HTML. -- oneliner
by Discipulus (Canon) on Sep 06, 2019 at 19:55 UTC
    Hello anautismobserver, and welcome to the monastery and to the wonderful world of Perl!

    Perl is powerful enough to achieve this with a one-liner (note the double quotes, which are needed on Windows):

    perl -MHTML::TreeBuilder -e "print HTML::TreeBuilder->new_from_url('http://perl.org')->as_text"

    The above combines two steps: getting the raw HTML content from the URL (using LWP::UserAgent under the hood) and formatting the output as text.

    Web scraping is a dark art and can be done in many different ways. You can follow some of the links in my bibliotheca: web scraping, or visit previous threads like Re: How can I download HTML and save it as txt?

    As you presented yourself as a beginner, please note that the -M switch of perl imports a module, as described in perlrun, and that the chaining of methods ( ->new_from_url(..)->as_text ) is just a shortcut to avoid unnecessary variable declarations.
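    For readers who prefer one step at a time, the one-liner can be unrolled roughly as follows. This is only a sketch: the helper name html_to_text and the 30-second timeout are choices made for illustration, and the URL is taken from the command line rather than hard-coded.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Step 2: turn an HTML string into plain text.
# (html_to_text is a name invented for this sketch.)
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;    # release the parse tree
    return $text;
}

# Step 1: fetch the raw HTML, but only when a URL is given as an argument.
if (@ARGV) {
    my $ua  = LWP::UserAgent->new( timeout => 30 );
    my $res = $ua->get( $ARGV[0] );
    die "Request failed: ", $res->status_line, "\n" unless $res->is_success;
    print html_to_text( $res->decoded_content ), "\n";
}
```

    Saved as, say, page2text.pl (a hypothetical file name), it would be run as: perl page2text.pl http://perl.org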

    PS: you can also use other modules for the web scraping part, as suggested by Task::Kensho, which is a fairly good collection of modules from CPAN. Other modules are also worth trying, like Mojo::DOM or Web::Scraper, as suggested in The State of Web spidering in Perl.

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      The text method in WWW::Mechanize wraps that TreeBuilder code. This is useful to know because one is often already working with Mechanize or a class derived from it.
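      A minimal sketch of that approach, assuming WWW::Mechanize is installed. The wrapper name page_text is invented for this example, and the URL is taken from the command line.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# page_text is a name invented for this sketch: fetch a URL and
# return the page content as plain text via Mechanize's text method.
sub page_text {
    my ($url) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors
    $mech->get($url);
    return $mech->text;
}

# Only touch the network when a URL is given as an argument.
print page_text( $ARGV[0] ), "\n" if @ARGV;
```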

      Thanks for all that info. It's a lot to digest.

      Despite the elegance of a one-liner, I prefer to take one step at a time.

      When I try to run the following code:

      use strict;
      use warnings;
      use LWP::UserAgent;
      use LWP::Simple;
      use HTML::TreeBuilder;
      print HTML::TreeBuilder->new_from_url('http://perl.org')->as_text;

      I get the error message << Can't locate object method "new_from_url" via package "HTML::TreeBuilder" >>

      What else do I need to add to the code to make it work?

        Maybe you have a really old version and need an update. The method was added 2012-06-12 according to its change file. The example as you posted it works fine for me; relatively current Perl installation with HTML::TB version 5.03 on OS X.
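        One way to check whether the installed module is recent enough is to ask it directly. The short script below is just a sketch: it prints the installed version and tests for the method with can, without assuming any particular version number.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Report the installed version of the module.
print "HTML::TreeBuilder version: ", HTML::TreeBuilder->VERSION, "\n";

# can() returns a true value only if the class provides the method.
my $status = HTML::TreeBuilder->can('new_from_url')
    ? "available"
    : "missing (the module probably needs updating)";
print "new_from_url is $status\n";
```

        If the method is reported as missing, updating the module from CPAN (for example with the cpan client) should bring it in.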

      Now I have Strawberry Perl up and running and the previous TreeBuilder code example now works (using 'http://perl.org' as input).

      When I change the input to 'https://wordpress.com/read/feeds/94271045' using the following code:

      use strict;
      use warnings;
      use LWP::UserAgent;
      use LWP::Simple;
      use HTML::TreeBuilder;
      print HTML::TreeBuilder->new_from_url('https://wordpress.com/read/feeds/94271045')->as_text;

      The output is << WordPress.comPlease enable JavaScript in your browser to enjoy WordPress.com. >>

      Do you know how to fix this? One complicating factor is that pages like https://wordpress.com/read/feeds/94271045 won't display properly in my browser unless I'm logged into a WordPress account.

      Thanks.

Re: I want to save web pages as text rather than as HTML.
by jcb (Parson) on Sep 06, 2019 at 23:01 UTC

    Modern Mozilla browsers do not save the page source anymore; they serialize the DOM tree instead. If the information you seek is not in the page source, but does appear when saved, then it is being added to the page using JavaScript. You will need to use the Web Developer tools (Network tab) in Firefox to find the request that loads that data and figure out how to replicate that request and parse the response (probably JSON) in your Perl code.

    Finding the request you need to make is the hard part. Making the request with LWP::UserAgent and parsing the response with JSON should be easy.
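    Once the request has been identified, the Perl side might look something like the sketch below. Everything specific here is an assumption: the URL is a placeholder to be replaced with the one seen in the Network tab, and the field name subscribers_count is hypothetical, so inspect the real response to find the right key. JSON::PP is used because it ships with Perl and offers the same decode_json function as the JSON module.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON::PP qw(decode_json);

# PLACEHOLDER: replace with the request URL found in the Network tab.
my $api_url = 'https://example.com/path/from/network/tab';

# Pull the follower count out of a JSON response body.
# HYPOTHETICAL field name: the real key must be read off the actual response.
sub follower_count {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return $data->{subscribers_count};
}

# Only make the request when asked to, e.g.: perl followers.pl --fetch
if ( @ARGV && $ARGV[0] eq '--fetch' ) {
    my $ua  = LWP::UserAgent->new( timeout => 30 );
    my $res = $ua->get( $api_url, 'Accept' => 'application/json' );
    die "Request failed: ", $res->status_line, "\n" unless $res->is_success;

    # decode_json expects raw UTF-8 bytes, so skip charset decoding.
    my $count = follower_count( $res->decoded_content( charset => 'none' ) );
    print defined $count ? "$count\n" : "field not found\n";
}
```

    If the page only works when logged in, the request will probably also need the cookies or authorization headers that the browser sends; those are visible in the same Network tab entry.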

      << You will need to use the Web Developer tools (Network tab) in Firefox to find the request that loads that data and figure out how to replicate that request and parse the response (probably JSON) in your Perl code. >>

      Can you give me guidance regarding how to go about this? Or link to somewhere that explains it for novices like me?

      Thanks.

Re: I want to save web pages as text rather than as HTML.
by Anonymous Monk on Sep 06, 2019 at 17:29 UTC

      I apologize for my ignorance.

      When I try to run the first "self-contained example" as written in the provided link, it produces no output (except for "Press any key to continue...").

      When I replace the body of the script (starting with "my $re ="...) with the next code example (that uses "HTML::TokeParser::Simple;"), I get the error message "Can't locate HTML::TokeParser::Simple.pm in @INC"

      Could you please give me a code example that truly is self-contained, that reads (and parses) an HTML file and outputs it as a text file?

      Thank you.

        Could you please give me a code example that truly is self-contained, that reads (and parses) an HTML file and outputs it as a text file?

        Sure, go fish, strip HTML tags
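        For what it's worth, here is a sketch of a script that reads an HTML file and writes its text to another file. It is self-contained apart from HTML::TreeBuilder, which must be installed. The function name html_file_to_text_file is invented for this example, and the input file is assumed to be UTF-8.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Read an HTML file, strip the markup, and write the text to another file.
# (Function name invented for this sketch; input is assumed to be UTF-8.)
sub html_file_to_text_file {
    my ( $in_path, $out_path ) = @_;

    open my $in, '<:encoding(UTF-8)', $in_path
        or die "Cannot read $in_path: $!\n";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;                        # release the parse tree

    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!\n";
    print {$out} $text, "\n";
    close $out or die "Cannot close $out_path: $!\n";
    return $text;
}

# Usage: perl html2text.pl input.html output.txt
html_file_to_text_file(@ARGV) if @ARGV == 2;
```

        Note that as_text joins the text of adjacent elements without inserting separators, so headings and paragraphs may run together in the output.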
