PerlMonks  

Remove HTML tags from document

by matth (Monk)
on Aug 03, 2003 at 18:09 UTC ( #280476=perlquestion )

matth has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

What is now regarded as the best way to remove all tags from an HTML document? I have briefly tried to work with HTML::Parser but I don't understand it all that well.

20030803 Edit by jeffa: Changed title from 'HTML tags '

Replies are listed 'Best First'.
Re: Remove HTML tags from document
by pzbagel (Chaplain) on Aug 03, 2003 at 18:25 UTC

    You could use HTML::TokeParser::Simple and only print text tokens.

    # almost straight from the TokeParser::Simple POD
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );
    while ( my $token = $p->get_token ) {
        print $token->as_is if $token->is_text;
    }

    HTH

      This works nicely. Is there an easy adaptation that would allow me to maintain the spacing that is in the HTML document?

        I'm not sure I understand. I recall that HTML::TokeParser::Simple does in fact maintain newlines in the text. I tested the code quickly just to make sure and it does maintain newlines in the html. Do you have tags that are multi-line? What exactly is happening?

Re: Remove HTML tags from document
by fglock (Vicar) on Aug 03, 2003 at 21:35 UTC

    HTML::Strip - Perl extension for stripping HTML markup from text.

    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;

Re: Remove HTML tags from document
by Juerd (Abbot) on Aug 04, 2003 at 09:26 UTC

    RTFM.

    perldoc -q 'remove html'

    How do I remove HTML from a string?

    The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.

    Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities, like &lt; for example.

    Here's one "simple-minded" approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

    If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz.
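
To make the failure modes listed in the FAQ concrete, here is a minimal sketch in core Perl that runs both regexes over two sample inputs. The sub names naive_strip and faq_strip and the sample strings are made up for this illustration; they are not part of the FAQ.

```perl
use strict;
use warnings;

# The naive pattern from the FAQ text: stop at the first '>'.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# The FAQ's "simple-minded" pattern: quoted attribute values are
# consumed as a unit, so a '>' inside quotes does not end the tag.
sub faq_strip {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Hypothetical inputs: a quoted angle bracket, and a tag split over two lines.
my $quoted = q{<img alt="a > b" src="x.png">text};
my $multi  = "<a\nhref='x'>link</a>";

for my $input ( $quoted, $multi ) {
    print "naive: [", naive_strip($input), "]\n";
    print "faq:   [", faq_strip($input),   "]\n";
}
```

With these inputs the naive pattern leaves part of the img tag behind and does not touch the opening tag that spans two lines, while the FAQ pattern reduces the strings to "text" and "link". Neither pattern decodes entities such as &lt;, which is one reason the FAQ recommends a real parser.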

    Also, with Super Search or Google, you can find hundreds of answers.

    See also How (Not) To Ask A Question.

    Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: Remove HTML tags from document
by ido50 (Scribe) on Aug 03, 2003 at 20:37 UTC
    If you want a good module with good documentation, I suggest you try HTML::TokeParser. oreilly.com has a free full chapter from "Perl & LWP" which deals with this module exclusively. You can find it at http://www.oreilly.com/catalog/perllwp/ as a nice PDF document.

    ------------------------
    Live fat, die young
Re: Remove HTML tags from document
by trs80 (Priest) on Aug 03, 2003 at 20:09 UTC
    You might want to try w3m; it also preserves the formatting of tables in plain text fairly well. It's not Perl, but it works :)
      This is an old package. Is it really any good?
        I use this package to convert my HTML reports into text so they can be emailed to users whose email clients don't support HTML. It works well with the content I deal with. I don't feel the value of a package should be derived from its age if it solves the problem at hand.
Re: Remove HTML tags from document
by LazerRed (Pilgrim) on Aug 03, 2003 at 22:12 UTC
    Here's something I've been playing with lately. Maybe it'll help you.

    use HTML::PullParser;

    # Return only the text content of an HTML string.
    sub strip {
        my $html = shift;
        my $p = HTML::PullParser->new(
            doc  => $html,
            text => 'text',    # report text events only
        );
        my $result = '';
        while ( my $t = $p->get_token ) {
            $result .= $t->[0];
        }
        return $result;
    }

    I use this sub in a script that checks a status page on many different servers. It feeds the raw stats pages through the above sub, then parses the output text to generate a consolidated status report.

    Whip me, Beat me, Make me use Y-ModemG.
Re: Remove HTML tags from document
by daeve (Deacon) on Aug 04, 2003 at 03:52 UTC
    And in the spirit of TIMTOWTDI...

    If you just need to strip all the html tags from a page, and are on a platform with lynx, you can use:

    #! /usr/bin/perl
    use strict;
    use warnings;
    my $text=`lynx -dump htmlDocument.html`;
    print "$text";

    HTH
    Daeve

      How can I get this to print out to a file instead of the STDOUT? I have very large HTML files.
        perldoc -f open
        perldoc -f print
        perldoc perlopentut

        Abigail
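
For readers following those perldoc pointers, here is a rough sketch of one way to send the output to a file instead of STDOUT. It reads the command's output through a pipe and writes it line by line, so a very large document is never held in memory at once. The helper name dump_to_file and the output file name are assumptions made for this example, and the list form of a piped open may not be available on older Perls on Windows.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and stream its output into a file.
sub dump_to_file {
    my ( $out_path, @cmd ) = @_;
    open my $in,  '-|', @cmd      or die "Cannot run $cmd[0]: $!";
    open my $out, '>',  $out_path or die "Cannot write $out_path: $!";
    while ( my $line = <$in> ) {
        print {$out} $line;
    }
    close $in  or die "$cmd[0] failed: " . ( $! || "exit status $?" );
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Example call (file names are placeholders, and lynx must be installed):
# dump_to_file( 'htmlDocument.txt', 'lynx', '-dump', 'htmlDocument.html' );
```

If the output only needs to land in a file once, plain shell redirection (lynx -dump htmlDocument.html > htmlDocument.txt) does the same job without any Perl.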
