Re: Removing text between HTML tags

in reply to Removing text between HTML tags

Of course you can. (Having said that, I've just upvoted the previous comment saying that you can't.)

It is in general a very bad idea to try to parse HTML with regexes, and I absolutely agree with that. But there are numerous cases where you can still use regexes to get what you want efficiently, as this example shows, run under the Perl debugger with the OP's data:

  DB<1> $html = qq( <td class="body3" valign="top"><p style="margin-top:1ex; margin-bottom:1ex;">The purpose of this study is to compare two types of care - standard <span class="hit_org">oncology</span> care and standard <span class="hit_org">oncology</span> care with early palliative care (started soon after diagnosis) to see which is better for improving the experience of patients and families with advanced lung and non-colorectal GI cancer.  The study will use questionnaires to measure patients' and caregivers' quality of life, mood, coping and understanding of their illness.</p></td>)

  DB<2> $html =~ s/<.+?>//g;
 
  DB<3> print $html
 The purpose of this study is to compare two types of care - standard oncology care and standard oncology care with early palliative care (started soon after diagnosis) to see which is better for improving the experience of patients and families with advanced lung and non-colorectal GI cancer.  The study will use questionnaires to measure patients' and caregivers' quality of life, mood, coping and understanding of their illness.

That's what you need, isn't it? Anything wrong with the output? Seems OK to me.

So the bottom line is this: no, you can't really parse HTML (or XHTML or XML, for that matter) with regexes, and you need a real parser to do that; everyone here pretty much agrees. But there are still numerous cases where you can extract data relatively efficiently and reliably from an HTML page with regexes.

There is no point in being fundamentalist about this. There are many simple cases where you can get useful data from XML, XHTML, HTML, JSON or CSV data with regexes, without having to bring in the heavy artillery of a full-fledged parser. Agreed, regexes won't work on some complicated HTML or XML structures. But there are so many cases where a proper state-of-the-art DOM or SAX parser just chokes and dies on the first formatting error (and, yes, our world is not perfect, formatting errors do occur) that it is questionable whether parsers are any better there. OK, XML source files are usually machine-generated and hopefully mostly bug-free (although...), but for HTML content found on the Internet, this is far from being the case.

The number 3 is a poor approximation of pi, but there are a number of cases where it is just accurate enough for your purpose.

When it comes to just removing HTML tags from an HTML file, yes, it can often be done with regexes. Admittedly, the very simple regex presented above will not work on every possible piece of HTML, but it does work on the OP's data, doesn't it?
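To be fair about where "not every possible piece of HTML" starts, here is a small sketch of two inputs that trip the simple approach. The strings and variable names are made up for illustration; they are not taken from the OP's page.

```perl
use strict;
use warnings;

# Made-up inputs (assumptions for illustration only).
my $multiline  = "<a\n href=\"x.html\">link</a>";  # a tag split over two lines
my $gt_in_attr = '<img alt="a > b">text';           # a ">" inside an attribute value

# "." does not match a newline unless /s is given, so the opening
# <a ...> tag is not matched and stays in the string.
(my $dot_version = $multiline) =~ s/<.+?>//g;

# [^>] does match a newline, so the negated class handles this input.
(my $class_version = $multiline) =~ s/<[^>]+>//g;

# The match ends at the ">" inside the quotes, so part of the
# attribute value is left behind as debris.
(my $attr_version = $gt_in_attr) =~ s/<[^>]+>//g;

print "dot:   [$dot_version]\n";
print "class: [$class_version]\n";
print "attr:  [$attr_version]\n";
```

If your input can contain that sort of thing, that is the point where a real parser starts to earn its keep.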

To the OP: the main problem with your regex is that it was greedy, so that it would remove everything from the first "<" to the last ">". The question mark added after the "+" makes the quantifier non-greedy, meaning that the match stops at the first closing ">" after each opening "<". The other typical solution is this:

$html =~ s/<[^>]+>//g;

where [^>] is a negated character class that matches any character except a closing ">".
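For what it's worth, the three behaviours can be compared side by side on a short string. This is only a sketch: the sample is a cut-down, made-up line rather than the OP's full data, and the greedy pattern shown is my assumption about the general shape of the problem, not the OP's exact regex.

```perl
use strict;
use warnings;

# Hypothetical sample, much shorter than the OP's data.
my $sample = '<p>standard <span class="hit_org">oncology</span> care</p>';

# Greedy: .+ runs on to the LAST ">" in the string, so everything goes.
(my $greedy = $sample) =~ s/<.+>//g;

# Non-greedy: .+? stops at the first ">" after each "<".
(my $lazy = $sample) =~ s/<.+?>//g;

# Negated class: [^>]+ cannot run past a ">" in the first place.
(my $negated = $sample) =~ s/<[^>]+>//g;

print "greedy:  [$greedy]\n";
print "lazy:    [$lazy]\n";
print "negated: [$negated]\n";
```

On this sample the greedy version leaves an empty string, while the other two both leave "standard oncology care".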

I hope that makes your error and its solution clear.
