Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

How do I remove HTML from a string?

by faq_monk (Initiate)
on Oct 08, 1999 at 00:32 UTC ( [id://758]=perlfaq nodetype: print w/replies, xml ) Need Help??

Current Perl documentation can be found at perldoc.perl.org.

Here is our local, out-dated (pre-5.6) version:

The most correct way (albeit not the fastest) is to use HTML::Parse from CPAN (part of the libwww-perl distribution, which is a must-have module for all web hackers).

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comment may be present. Plus folks forget to convert entities, like &lt; for example.

Here's one ``simple-minded'' approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

    <IMG SRC = "foo.gif" ALT = "A > B">

    <IMG SRC = "foo.gif" 
         ALT = "A > B">

    <!-- <A comment> -->

    <script>if (a<b && a>c)</script>

    <# Just data #>

    <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

    <!-- This section commented out.
        <B>You can't see me!</B>
    -->

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (4)
As of 2024-04-19 04:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found