PerlMonks  

Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')

by bronto (Priest)
on Oct 15, 2006 at 15:55 UTC (#578396=perlquestion)

bronto has asked for the wisdom of the Perl Monks concerning the following question:

Dearest Monks

I am writing a couple of web-page-scraping tools to help me in my job search. I already have something working, but what I am missing is a nice pure-Perl solution that would render a web page as readable plain text, so that if an announcement is removed for any reason, I still have a copy of its contents.

And hence the question: is there anything like lynx -dump in Perl? I dug into CPAN for about half an hour and tried html2text, but it didn't really do a good job...

For the few of you that don't know what lynx is and what it does:

    NAME
        lynx - a general purpose distributed information browser for the World Wide Web
    ...
    DESCRIPTION
        Lynx is a fully-featured World Wide Web (WWW) client for users running
        cursor-addressable, character-cell display devices (e.g., vt100 terminals,
        vt100 emulators running on Windows 95/NT or Macintoshes, or any other
        "curses-oriented" display).
    ...
    OPTIONS
    ...
        -dump  dumps the formatted output of the default document or one specified
               on the command line to standard output. This can be used in the
               following way:

               lynx -dump http://www.subir.com/lynx.html

Thanks a lot in advance for your help

Ciao!
--bronto


In theory, there is no difference between theory and practice. In practice, there is.

Replies are listed 'Best First'.
Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by grep (Monsignor) on Oct 15, 2006 at 17:23 UTC
    A quick search of cpan gave me HTML::FormatText::WithLinks.

    From the POD:
    DESCRIPTION

    HTML::FormatText::WithLinks takes HTML and turns it into plain text but prints all the links in the HTML as footnotes. By default, it attempts to mimic the format of the lynx text based web browser's --dump option.
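    A minimal sketch of how the module might be used, assuming it is installed from CPAN and that the page has already been fetched into a string. The sample markup and the example.com URL are made up for illustration; the constructor is called with its defaults.

```perl
use strict;
use warnings;
use HTML::FormatText::WithLinks;

# Made-up sample markup; in a scraper this would be the fetched page body.
my $html = <<'HTML';
<html><body>
<h1>Job posting</h1>
<p>Apply via <a href="http://example.com/apply">this form</a>.</p>
</body></html>
HTML

# Default options: links are numbered inline and listed as footnotes.
my $formatter = HTML::FormatText::WithLinks->new();

# parse() returns the formatted text, or undef on failure.
my $text = $formatter->parse($html);
die $formatter->error unless defined $text;

print $text;
```

    The module also documents options such as before_link, after_link and footnote for changing how links are marked, should the default layout not suit.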

    Also please use '<code>' not '<pre>' tags when posting, then preview your post before creating.



    grep
    One dead unjugged rabbit fish later
      A quick search of cpan gave me HTML::FormatText::WithLinks

      ...and THIS was the answer! Thanks grep!

      Ciao!
      --bronto


      In theory, there is no difference between theory and practice. In practice, there is.
Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by davidrw (Prior) on Oct 15, 2006 at 16:22 UTC
    WWW::Mechanize has a method for that (it requires that HTML::TreeBuilder is installed as well):

        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get('http://example.com');
        print $mech->content(format => 'text');

    If you're not already using WWW::Mechanize for your scraping, I highly recommend it (note that it uses LWP underneath).
    Update: added 'print' so that snippet has output
Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by davidrw (Prior) on Oct 15, 2006 at 17:39 UTC
    This may or may not help with your specific case, but in general HTML::TableExtract can be extremely useful as well.
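    For illustration, a small sketch of pulling rows out of a table with HTML::TableExtract, assuming the module is installed. The table contents and the column names "Position" and "Location" are invented for the example.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample table standing in for a scraped listing.
my $html = <<'HTML';
<table>
  <tr><th>Position</th><th>Location</th></tr>
  <tr><td>Perl developer</td><td>Oslo</td></tr>
  <tr><td>Sysadmin</td><td>Milan</td></tr>
</table>
HTML

# Select the table whose header row contains these column names.
my $te = HTML::TableExtract->new( headers => [ 'Position', 'Location' ] );
$te->parse($html);

# Collect each data row as one line of plain text.
my @lines;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        push @lines, join ' | ', map { defined $_ ? $_ : '' } @$row;
    }
}

print "$_\n" for @lines;
```

    Matching on header text rather than on table position tends to survive small layout changes on the scraped page.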
Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by spatterson (Pilgrim) on Oct 16, 2006 at 10:16 UTC
    I've had reasonable success with SGML::StripParser, though that just rips the tags out, leaving the rest formatted as-is.

    just another cpan module author
Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by monarch (Priest) on Oct 16, 2006 at 08:52 UTC
    I tend to do these things by hand, even though I know I really shouldn't.
        my $string = "..htmlstuff..";

        # strip out newlines
        $string =~ s/[\r\n]+/ /sg;

        # replace <p> with custom paragraph marker
        my $marker_paragraph = "**PARAGRAPHHERE**";
        $string =~ s/<p(\s[^>]*)?>/$marker_paragraph/isg;

        # remove all HTML tags
        $string =~ s/<[^>]*>//sg;

        # replace custom paragraph marker with blank line
        $string =~ s/\Q$marker_paragraph\E/\n\n/sg;

    You can add other transforms, such as wrapping at a particular column etc.
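    One way the column-wrapping transform could look, sketched with the core Text::Wrap module. The helper name wrap_paragraphs and the 30-column width are assumptions made for the example, not part of the snippet above.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: re-wrap each blank-line-separated paragraph so that
# no line reaches $columns characters.
sub wrap_paragraphs {
    my ($text, $columns) = @_;
    local $Text::Wrap::columns = $columns;

    my @paragraphs = grep { /\S/ } split /\n{2,}/, $text;
    return join "\n\n", map {
        my $p = $_;
        $p =~ s/\s+/ /g;       # collapse runs of whitespace
        $p =~ s/^ | $//g;      # trim leading and trailing space
        wrap('', '', $p);      # no indent on first or later lines
    } @paragraphs;
}

my $string = "First paragraph with enough words to need wrapping.\n\nSecond one.";
my $wrapped = wrap_paragraphs($string, 30);
print $wrapped, "\n";
```

    This would run after the paragraph marker has been turned back into blank lines, since it relies on those blank lines to find paragraph boundaries.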
