PerlMonks
Re: Removing text between HTML tags
by Laurent_R (Canon) on Sep 14, 2014 at 21:44 UTC
Of course you can.
(Having said that, I've just upvoted the previous comment saying that you can't).
It is in general a very bad idea to try to parse HTML with regexes, and I absolutely agree with that. But there are numerous cases where you can still use regexes to get what you want efficiently, as a quick test with the OP's data under the Perl debugger shows. That's what you need, isn't it? Anything wrong with the output? Seems OK to me.
So the bottom line is that, yes, you can't really parse HTML (or XHTML or XML, for that matter) with regexes, and you need a real parser to do it. Everyone here pretty much agrees with this. But there are still numerous cases where you can extract data relatively efficiently and reliably from an HTML page with regexes. No point in being fundamentalist about this. There are many simple cases where you can get useful data from XML, XHTML, HTML, JSON or CSV data with regexes, without having to bring in the heavy artillery of full-fledged parsers.
Agreed, regexes won't work on some complicated HTML or XML structures. But there are so many cases where a proper state-of-the-art DOM or SAX parser just chokes and dies on the first formatting error (and, yes, our world is not perfect, formatting errors do occur) that it is questionable whether they are any better. OK, XML source files are usually machine generated and are hopefully generally bug free (although...), but with HTML content found on the Internet, this is far from being the case.
The number 3 is a poor approximation of pi, but there are a number of cases where it is just good enough for your purpose. When it comes to just removing HTML tags from an HTML file, yes, it can often be done with regexes. Admittedly, the very simple regex used in that test will not work on every possible piece of HTML, but it does work on the OP's data, doesn't it?
To the OP: the main problem with your regex is that it was greedy, so it would remove everything from the first "<" to the last ">".
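To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample string is made up for illustration (it is not the OP's data), and the greedy pattern is only an assumption about what the original regex looked like, since it is not quoted here.

```perl
use strict;
use warnings;

# Hypothetical sample line; the OP's actual data is not reproduced here.
my $html = '<p>Some <b>bold</b> text</p>';

# Greedy (assumed shape of the original regex): ".+" grabs as much as it
# can, so the match runs from the first "<" to the last ">".
(my $greedy = $html) =~ s/<.+>//g;

# Non-greedy: ".+?" stops at the first ">" after each "<",
# so only the tags themselves are removed.
(my $lazy = $html) =~ s/<.+?>//g;

print "greedy: [$greedy]\n";   # prints: greedy: []
print "lazy:   [$lazy]\n";     # prints: lazy:   [Some bold text]
```

With the greedy version the whole line is treated as one big "tag" and vanishes; with the added question mark each tag is matched and removed separately.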
The question mark added after the "+" made the quantifier non-greedy, meaning that the match stopped at the first closing ">" after each opening "<". The other typical solution is to use a negated character class between the angle brackets, where [^>] builds a character class matching any character except a closing ">". I hope that makes your error and its solution clear.
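The negated character class variant might look like the sketch below. Again the input strings are hypothetical, and the exact pattern is a guess at the usual form of this idiom rather than a quote from the thread. The second half shows one of the cases where such a simple regex is not enough.

```perl
use strict;
use warnings;

# Hypothetical sample line, not the OP's data.
my $html = '<p>Some <b>bold</b> text</p>';

# [^>]+ matches one or more characters that are not ">", so each match
# ends at the nearest closing ">" without needing a non-greedy quantifier.
(my $text = $html) =~ s/<[^>]+>//g;
print "$text\n";     # prints: Some bold text

# A known limit of this simple pattern: a ">" inside an attribute value
# ends the match too early and leaves part of the tag behind.
my $tricky = '<a title="a > b">link</a>';
(my $broken = $tricky) =~ s/<[^>]+>//g;
print "$broken\n";   # prints:  b">link
```

For input like the first string this is good enough; for markup like the second one, a real HTML parser is the safer choice.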