PerlMonks  

Parsing HTML

by Gorby (Monk)
on Jun 13, 2004 at 15:24 UTC

Gorby has asked for the wisdom of the Perl Monks concerning the following question:

Hello Wise Monks,

I'm fetching web pages with the get() function from LWP::Simple. My problem is that I need a module to take out all the HTML tags so that I end up with plain formatted text.

Is there a module that does this?

Thanks in advance.

Gorby

Replies are listed 'Best First'.
Re: Parsing HTML
by davido (Cardinal) on Jun 13, 2004 at 16:12 UTC
    HTML is the markup language used to format the text. If you take it out, you can approximate some things (paragraph breaks, for example), but you can't have bold, italics, tables, etc., at least not in an OS-independent way. If formatting isn't an absolute requirement, read on...

    As far as removing all HTML, I happen to like HTML::Strip. It's easy to use, and results in pretty readable output. It has a habit of indenting a lot, but that's easy to strip out too if you want. Here's the synopsis from its POD:

    use HTML::Strip;

    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;

    $clean_text will now contain the HTML-free version of $raw_html. It's as easy to use as LWP::Simple.


    Dave
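The extra indentation mentioned above can be cleaned up after the tags are gone. The following is only a minimal sketch using core Perl: tidy_whitespace is a hypothetical helper name, not part of HTML::Strip, and the sample string merely stands in for whatever $clean_text would contain.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): tidy text left over
# after the tags have been removed.
sub tidy_whitespace {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;        # normalise line endings to \n
    $text =~ s/[ \t]+/ /g;        # collapse runs of spaces and tabs
    $text =~ s/^ //mg;            # drop the leading space on each line
    $text =~ s/ $//mg;            # drop the trailing space on each line
    $text =~ s/\n{3,}/\n\n/g;     # allow at most one blank line in a row
    $text =~ s/\A\n+|\n+\z//g;    # trim blank lines at both ends
    return $text;
}

# Assumed sample input, standing in for stripped HTML output.
my $stripped = "   Hello   world \n\n\n\n     Second   paragraph\t\n";
print tidy_whitespace($stripped), "\n";
```

Whether to keep one blank line between paragraphs or flatten everything onto a single line is a matter of taste; the sketch keeps paragraph breaks because those are about the only formatting that survives tag removal.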

Re: Parsing HTML
by atcroft (Abbot) on Jun 13, 2004 at 16:06 UTC

    One possible way, shown in the Perl Cookbook (1st ed.), problem 20.6, is:

    use HTML::Parse;
    use HTML::FormatText;

    $plain_text = HTML::FormatText->new->format(parse_html($html_text));

    (I thought there was a function in LWP::Simple that would just return the text, but I can't recall for sure.)

    Hope this helps.

Re: Parsing HTML
by thunders (Priest) on Jun 14, 2004 at 06:05 UTC
    Many of the modules that inherit from HTML::Parser are quite capable for this kind of task. I like HTML::TokeParser::Simple.
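    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::TokeParser::Simple;

    my $site    = $ARGV[0];
    my $content = get($site);
    # get() returns undef on failure, so stop before handing it to the parser.
    defined $content or die "Could not fetch $site\n";

    my $parser = HTML::TokeParser::Simple->new(\$content);
    # Print only the text tokens; tags and other markup are skipped.
    while ( my $token = $parser->get_token ) {
        next unless $token->is_text;
        print $token->as_is;
    }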
Re: Parsing HTML
by injunjoel (Priest) on Jun 14, 2004 at 00:09 UTC
    Greetings all,
    I know there are modules out there to do this, but this regexp has yet to let me down for stripping HTML-style tags.
    $strContainingHTML =~ s|<[^>]*>||sg;


    ...and you should check some of them out :}
    -injunjoel
    Update:
    Thanks dragonchild for pointing out the error of my assumptive ways. I guess I've usually (not always) deferred JavaScript logic to functions and let the event handlers simply trigger the call to them. A case where personal style has saved me many potential headaches. Good catch.
      Sorry, but I couldn't resist.
      use strict;
      use warnings;

      my $x = "<html onLoad='if (x > 5) { exit; }' />";
      print "'$x'\n";
      $x =~ s|</?[^>]*>||sg;
      print "'$x'\n";
      -----
      '<html onLoad='if (x > 5) { exit; }' />'
      ' 5) { exit; }' />'

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested
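The failure shown above comes from the pattern stopping at the first ">" it sees, even when that ">" sits inside a quoted attribute value. As a rough sketch only, a regex can be taught to skip over quoted values. strip_tags_quote_aware is a hypothetical helper name, and this still assumes well-formed tags with balanced quotes; it does nothing sensible with comments, script or style bodies, or a stray "<" in ordinary text, which is why the parser-based modules mentioned in this thread remain the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: remove tags while treating quoted attribute
# values as opaque, so a '>' inside quotes does not end the tag early.
sub strip_tags_quote_aware {
    my ($html) = @_;
    $html =~ s{
        <                 # start of a tag
        (?:
            [^>'"]        # any character other than a quote or a closing angle
          | '[^']*'       # a complete single-quoted attribute value
          | "[^"]*"       # a complete double-quoted attribute value
        )*
        >                 # end of the tag
    }{}gsx;
    return $html;
}

my $x = "<html onLoad='if (x > 5) { exit; }' />";
print "'", strip_tags_quote_aware($x), "'\n";
```

With the sample string from the post above, the whole tag matches in one go instead of being cut off at "x >".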

Node Type: perlquestion [id://363875]
Approved by integral
Front-paged by grinder