I want to save web pages as text rather than as HTML.

anautismobserver has asked for the wisdom of the Perl Monks concerning the following question:

Re: I want to save web pages as text rather than as HTML. -- oneliner by Discipulus (Canon) on Sep 06, 2019 at 19:55 UTC
Hello anautismobserver, and welcome to the monastery and to the wonderful world of Perl! Perl is powerful enough to achieve this with a one-liner (pay attention to the double quotes on Windows): `perl -MHTML::TreeBuilder -e "print HTML::TreeBuilder->new_from_url('http://perl.org')->as_text"` The above combines two steps: getting the raw HTML content from the URL (using LWP::UserAgent under the hood) and formatting the output as text. Web scraping is a dark art and can be done in many different ways. You can follow some links in my bibliotheca (web scraping) or visit previous threads like Re: How can I download HTML and save it as txt? Since you presented yourself as a beginner, please note that the `-M` switch of perl loads a module, as described in `perlrun`, and that the chaining of methods ( `->new_from_url(..)->as_text` ) is just a shortcut to avoid an unnecessary variable declaration. PS: you can also use other modules for the web scraping part, as suggested by Task::Kensho, which is a fairly good collection of modules from CPAN. Other modules are worth trying too, like Mojo::DOM or Web::Scraper, as suggested in The State of Web spidering in Perl.
L* There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
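For readers who prefer the two steps spelled out rather than chained, here is a minimal sketch of the same idea. The helper names `fetch_html` and `html_to_text` are made up for illustration, and the 30-second timeout is an arbitrary choice; `new_from_content` is used instead of `new_from_url` so that the fetch and the parse stay separate.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Step 1: fetch the raw HTML for a URL (hypothetical helper name).
sub fetch_html {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new( timeout => 30 );    # timeout is an assumption
    my $response = $ua->get($url);
    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Step 2: parse the HTML and return only its text (hypothetical helper name).
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;    # release the parse tree
    return $text;
}

# Run only when a URL is given on the command line.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print html_to_text( fetch_html( $ARGV[0] ) ), "\n";
}
```

Saved as, say, `page2text.pl`, it could be run as `perl page2text.pl http://perl.org > perl.txt` to get the text into a file.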
Re^2: I want to save web pages as text rather than as HTML. -- oneliner by daxim (Curate) on Sep 09, 2019 at 10:46 UTC
Method `text` in WWW::Mechanize wraps that TreeBuilder code. This is useful to know because oftentimes one already works with Mechanize or a class derived from it.
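A short sketch of what that looks like with Mechanize, assuming WWW::Mechanize is installed. The helper name `page_text` is invented for the example; `get` and `text` are the module's own methods.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Fetch a page and return its text via Mechanize's text() method.
# page_text is a hypothetical helper name.
sub page_text {
    my ($url) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors
    $mech->get($url);
    return $mech->text;
}

# Run only when a URL is given on the command line.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print page_text( $ARGV[0] ), "\n";
}
```

Like the TreeBuilder one-liner, this only sees the HTML the server sends; it does not run JavaScript.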
Re^2: I want to save web pages as text rather than as HTML. -- oneliner by anautismobserver (Sexton) on Sep 10, 2019 at 21:18 UTC
Thanks for all that info. It's a lot to digest. Despite the elegance of a one-liner, I prefer to take one step at a time. When I try to run the following code: `use strict; use warnings; use LWP::UserAgent; use LWP::Simple; use HTML::TreeBuilder; print HTML::TreeBuilder->new_from_url('http://perl.org')->as_text;` I get the error message << Can't locate object method "new_from_url" via package "HTML::TreeBuilder" >> What else do I need to add to the code to make it work?
Re^3: I want to save web pages as text rather than as HTML. -- oneliner by Your Mother (Archbishop) on Sep 10, 2019 at 21:25 UTC
Maybe you have a really old version and need an update. The method was added 2012-06-12 according to its change file. The example as you posted it works fine for me on a relatively current Perl installation with HTML::TreeBuilder version 5.03 on OS X.
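One way to check whether the installed module is too old is to ask it directly. This is only a diagnostic sketch: it prints the installed version and whether the `new_from_url` method is present.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Report the installed version of HTML::TreeBuilder.
print "HTML::TreeBuilder version: ", HTML::TreeBuilder->VERSION, "\n";

# can() returns a true value only if the class provides the method.
print "new_from_url is ",
    ( HTML::TreeBuilder->can('new_from_url') ? "available" : "missing" ), "\n";
```

If the method is reported as missing, updating the HTML-Tree distribution from CPAN should bring it in.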
Re^4: I want to save web pages as text rather than as HTML. -- oneliner by anautismobserver (Sexton) on Sep 11, 2019 at 01:05 UTC
Re^5: I want to save web pages as text rather than as HTML. -- oneliner by Your Mother (Archbishop) on Sep 11, 2019 at 06:59 UTC
Re^5: I want to save web pages as text rather than as HTML. -- oneliner by Anonymous Monk on Sep 11, 2019 at 17:25 UTC
Some notes below your chosen depth have not been shown here
Re^2: I want to save web pages as text rather than as HTML. -- oneliner by anautismobserver (Sexton) on Sep 11, 2019 at 19:18 UTC
Now I have Strawberry Perl up and running, and the previous TreeBuilder code example now works (using 'http://perl.org' as input). When I change the input to 'https://wordpress.com/read/feeds/94271045' using the following code: `use strict; use warnings; use LWP::UserAgent; use LWP::Simple; use HTML::TreeBuilder; print HTML::TreeBuilder->new_from_url('https://wordpress.com/read/feeds/94271045')->as_text;` the output is << WordPress.comPlease enable JavaScript in your browser to enjoy WordPress.com. >> Do you know how to fix this? One complicating factor is that pages like https://wordpress.com/read/feeds/94271045 won't display properly in my browser unless I'm logged into a WordPress account. Thanks.
Re^3: I want to save web pages as text rather than as HTML. -- oneliner by Anonymous Monk on Sep 23, 2019 at 06:41 UTC
As the WWW::Mechanize FAQ says (https://metacpan.org/pod/WWW::Mechanize::FAQ#I-have-this-web-page-that-has-JavaScript-on-it,-and-my-Mech-program-doesn%27t-work.), use a browser agent that supports JavaScript, such as WWW::Mechanize::Chrome, WWW::Mechanize::Firefox, or WWW::Mechanize::PhantomJS.
Re^4: I want to save web pages as text rather than as HTML. -- oneliner by marto (Cardinal) on Sep 23, 2019 at 08:34 UTC
Re: I want to save web pages as text rather than as HTML. by jcb (Parson) on Sep 06, 2019 at 23:01 UTC
Modern Mozilla browsers do not save the page source anymore; they serialize the DOM tree instead. If the information you seek is not in the page source, but does appear when saved, then it is being added to the page using JavaScript. You will need to use the Web Developer tools (Network tab) in Firefox to find the request that loads that data and figure out how to replicate that request and parse the response (probably JSON) in your Perl code. Finding the request you need to make is the hard part. Making the request with LWP::UserAgent and parsing the response with JSON should be easy.
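Once the request has been found in the Network tab, the Perl side might look roughly like the sketch below. Everything specific here is an assumption: the URL is whatever you copied from the browser, and the `posts` and `title` keys are a made-up response shape that will differ for a real site. JSON::PP is used because it ships with Perl; the JSON module exports the same `decode_json` function.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON::PP;    # core module; "use JSON;" provides the same decode_json

# Replicate the request found in the browser's Network tab.
# fetch_json is a hypothetical helper name.
sub fetch_json {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new( timeout => 30 );
    my $res = $ua->get( $url, 'Accept' => 'application/json' );
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;

    # decode_json expects UTF-8 bytes, so skip the charset decoding step.
    return decode_json( $res->decoded_content( charset => 'none' ) );
}

# Pull the interesting fields out of the decoded structure.
# The posts/title layout is hypothetical; inspect the real response first.
sub post_titles {
    my ($data) = @_;
    return map { $_->{title} } @{ $data->{posts} || [] };
}

# Run only when a URL is given on the command line.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for post_titles( fetch_json( $ARGV[0] ) );
}
```

If the site requires a login, the request seen in the browser will usually carry cookies or an authorization header, and those would have to be replicated as well.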
Re^2: I want to save web pages as text rather than as HTML. by anautismobserver (Sexton) on Sep 12, 2019 at 02:33 UTC
<< You will need to use the Web Developer tools (Network tab) in Firefox to find the request that loads that data and figure out how to replicate that request and parse the response (probably JSON) in your Perl code. >> Can you give me guidance regarding how to go about this? Or link to somewhere that explains it for novices like me? Thanks.
Re^3: I want to save web pages as text rather than as HTML. by marto (Cardinal) on Sep 12, 2019 at 06:41 UTC
https://developer.mozilla.org/en-US/docs/Tools/Web_Console and https://developer.mozilla.org/en-US/docs/Tools/Network_Monitor
Re: I want to save web pages as text rather than as HTML. by Anonymous Monk on Sep 06, 2019 at 17:29 UTC
You need a true HTML parser to help you get to the particular nodes that you want. Very good write-up here: http://radar.oreilly.com/2014/02/parsing-html-with-perl-2.html
Re^2: I want to save web pages as text rather than as HTML. by anautismobserver (Sexton) on Sep 06, 2019 at 22:14 UTC
I apologize for my ignorance. When I try to run the first "self-contained example" as written in the provided link, it produces no output (except for "Press any key to continue..."). When I replace the body of the script (starting with "my $re ="...) with the next code example (the one that uses "HTML::TokeParser::Simple;"), I get the error message "Can't locate HTML::TokeParser::Simple.pm in @INC". Could you please give me a code example that truly is self-contained, that reads (and parses) an HTML file and outputs it as a text file? Thank you.
Re^3: I want to save web pages as text rather than as HTML. by Anonymous Monk on Sep 07, 2019 at 01:30 UTC
<< Could you please give me a code example that truly is self-contained, that reads (and parses) an HTML file and outputs it as a text file? >> Sure, go fish, strip HTML tags
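For reference, a file-to-file version of the TreeBuilder approach from elsewhere in this thread could be sketched as follows. It assumes HTML::TreeBuilder is installed and that the input file is UTF-8; the helper name `html_file_to_text_file` is invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Read an HTML file, extract its text, and write it to a text file.
# html_file_to_text_file is a hypothetical helper name.
sub html_file_to_text_file {
    my ( $in, $out ) = @_;

    # Slurp the whole input file (assumed to be UTF-8).
    open my $in_fh, '<:encoding(UTF-8)', $in or die "Cannot read $in: $!\n";
    my $html = do { local $/; <$in_fh> };
    close $in_fh;

    # Parse the HTML and keep only the text.
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;    # release the parse tree

    open my $out_fh, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!\n";
    print {$out_fh} $text, "\n";
    close $out_fh or die "Cannot close $out: $!\n";
    return $text;
}

# Usage: perl html2txt.pl input.html output.txt
if ( @ARGV == 2 ) {
    html_file_to_text_file(@ARGV);
}
```

Note that `as_text` runs all the text together with no line breaks between paragraphs, so the result is plain but not nicely laid out.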

