PerlMonks  

Re: Stripping of HTML content

by Molt (Chaplain)
on Sep 12, 2002 at 16:10 UTC ( [id://197261]=note )


in reply to Stripping of HTML content

Look at an HTML parser, either HTML::Parser itself or (even more simply) HTML::TokeParser. Your regexps are fragile and will break: try feeding <img src="this.gif" alt="<<THIS>>"> into them and watch them fall over screaming.
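As a rough illustration of the parser route, here is a minimal sketch using HTML::TokeParser. The sub name strip_html and the sample string are made up for this example, and it assumes the whole document is already in a scalar. It keeps only the text tokens and throws everything else away:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return only the text portions of an HTML string.
sub strip_html {
    my ($html) = @_;

    # Passing a scalar reference makes the parser read from the string
    # instead of treating the argument as a file name.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser: $!";

    my $text = '';
    while (my $token = $parser->get_token) {
        # Each token is an array ref whose first element is the type.
        # "T" marks text; start tags, end tags, comments and so on are skipped.
        $text .= $token->[1] if $token->[0] eq 'T';
    }
    return $text;
}

my $html = 'Before <img src="this.gif" alt="<<THIS>>"> after';
print strip_html($html), "\n";
```

The parser tokenizes the quoted attribute as part of the tag, so the angle brackets inside alt="..." never reach the output. One thing to check for your case: this sketch does nothing special for the contents of script or style elements, so if those matter you would want to track the surrounding start and end tags and skip what lies between them.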

If you want a nice, full description of HTML parsing, and this is going to be something you're doing a lot of, then peer into 'Perl and LWP', published by O'Reilly.

Re: Re: Stripping of HTML content
by Nemp (Pilgrim) on Sep 12, 2002 at 16:18 UTC
    Hi Molt,

    Thanks for the reply, but as I stated in my first post, I don't really mind that your line of code would leave me with >"> in my output right now, as long as there are no valid tags left that could alter formatting, run scripts, etc. I'm working on learning this from the ground up :)

    But the book sounds good - I'll look into it for future reference :)

    Thanks!,
    Neil
      Depending on how much inaccuracy you can tolerate, you can get a reasonable facsimile of stripping all HTML by doing:
      $page =~ s/<[^<>]*>//g; # Note the added < inside []
      assuming the entire page content is in $page. A line-by-line approach like the one in your original post will fail on tags that span multiple lines. The regexp above will break if you have unbalanced < or > inside HTML tags, but it may be good enough for your use.
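To make both points concrete, here is a small sketch that wraps the substitution in a sub (strip_tags is just a name chosen for the example, and the sample strings are made up) and runs it on a tag that spans two lines and on the awkward img tag from earlier in the thread:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-line substitution above.
sub strip_tags {
    my ($page) = @_;
    $page =~ s/<[^<>]*>//g;
    return $page;
}

# A tag spanning two lines is removed, because the negated character
# class [^<>] also matches newlines when the whole page is one string.
my $multi = "Hello <a\n  href=\"x.html\">world</a>!";
print strip_tags($multi), "\n";     # prints: Hello world!

# Unbalanced brackets inside an attribute defeat it: only the inner
# <THIS> matches, so the tag itself survives.
my $tricky = '<img src="this.gif" alt="<<THIS>>">';
print strip_tags($tricky), "\n";    # prints: <img src="this.gif" alt="<>">
```

Note that in the second case a working img tag is still left in the output, so whether this is good enough depends on how much you trust the input.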
