Re: Removing text between HTML tags

Of course you can. (Having said that, I've just upvoted the previous comment saying that you can't.)

It is in general a very bad idea to try to parse HTML with regexes, and I absolutely agree with that. But there are numerous cases where you can still use regexes to get what you want efficiently, as this example shows, run under the Perl debugger with the OP's data:

  DB<1> $html = qq( <td class="body3" valign="top"><p style="margin-top:1ex; margin-bottom:1ex;">The purpose of this study is to compare two types of care - standard <span class="hit_org">oncology</span> care and standard <span class="hit_org">oncology</span> care with early palliative care (started soon after diagnosis) to see which is better for improving the experience of patients and families with advanced lung and non-colorectal GI cancer.  The study will use questionnaires to measure patients' and caregivers' quality of life, mood, coping and understanding of their illness.</p></td>)

  DB<2> $html =~ s/<.+?>//g;
 
  DB<3> print $html
 The purpose of this study is to compare two types of care - standard oncology care and standard oncology care with early palliative care (started soon after diagnosis) to see which is better for improving the experience of patients and families with advanced lung and non-colorectal GI cancer.  The study will use questionnaires to measure patients' and caregivers' quality of life, mood, coping and understanding of their illness.

That's what you need, isn't it? Anything wrong with the output? Seems OK to me.

So the bottom line is: no, you can't really parse HTML (or XHTML or XML, for that matter) with regexes; you need a real parser for that, and everyone here pretty much agrees. But there are still numerous cases where you can extract data relatively efficiently and reliably from an HTML page with regexes.

No point in being fundamentalist about this. There are many simple cases where you can get useful data out of XML, XHTML, HTML, JSON or CSV data with regexes, without having to bring in the heavy artillery of full-fledged parsers. Agreed, regexes won't work on some complicated HTML or XML structures. But there are so many cases where a proper state-of-the-art DOM or SAX parser just chokes and dies on the first formatting error (and, yes, our world is not perfect, formatting errors do occur) that it is questionable whether they are any better. OK, XML source files are usually machine-generated and are hopefully generally bug-free (although...), but with HTML content found on the Internet, this is far from being the case.

The number 3 is a poor approximation of pi, but there are a number of cases where it is just good enough for your purpose.

When it comes to just removing HTML tags from an HTML file, yes, it can often be done with regexes. Admittedly, the very simple regex presented above will not work on every possible piece of HTML, but it does work on the OP's data, doesn't it?
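To make "will not work on every possible piece of HTML" concrete, here is a small sketch of two inputs where the simple regexes go wrong. The sample strings are made up for illustration and are not from the OP's data.

```perl
use strict;
use warnings;

# Case 1: a tag that spans two lines. "." does not match "\n" by default,
# so <.+?> cannot match the opening tag, while [^>] does match newlines.
my $multiline = "<p\nclass=\"x\">text</p>";
(my $dot   = $multiline) =~ s/<.+?>//g;
(my $class = $multiline) =~ s/<[^>]+>//g;

# Case 2: a ">" inside an attribute value. Both regexes stop at that ">"
# and leave the rest of the tag behind as if it were text.
my $quoted = '<a title="a > b">link</a>';
(my $attr = $quoted) =~ s/<[^>]+>//g;

print "dot:   [$dot]\n";     # opening tag survives, only </p> is removed
print "class: [$class]\n";   # class: [text]
print "attr:  [$attr]\n";    # attr:  [ b">link]
```

Adding the /s modifier lets "." match newlines and fixes the first case for the non-greedy version. The second case is the kind of thing that needs a real parser.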

To the OP: the main problem with your regex is that it was greedy, so that it would remove everything from the first "<" to the last ">". The question mark added after the "+" made it non-greedy, meaning that it stopped at the first closing ">" after the first opening "<". The other typical solution is to have this:

$html =~ s/<[^>]+>//g;

where [^>] is a negated character class matching any single character (newlines included) except a closing ">".
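For a side-by-side look, here is a minimal sketch running the three variants on a short made-up string. The greedy pattern is my assumption of what the OP's original regex looked like, based on the description above.

```perl
use strict;
use warnings;

my $html = '<p>standard <span class="hit_org">oncology</span> care</p>';

# Greedy: .+ runs to the LAST ">", so a single match swallows the whole string.
(my $greedy = $html) =~ s/<.+>//g;

# Non-greedy: .+? stops at the first ">" after each "<".
(my $lazy = $html) =~ s/<.+?>//g;

# Negated class: [^>]+ can never cross a ">", so each match is exactly one tag.
(my $negated = $html) =~ s/<[^>]+>//g;

print "greedy:  '$greedy'\n";    # greedy:  ''
print "lazy:    '$lazy'\n";      # lazy:    'standard oncology care'
print "negated: '$negated'\n";   # negated: 'standard oncology care'
```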

I hope that makes your error and its solution clear.


Replies are listed 'Best First'.
Re^2: Removing text between HTML tags by perll (Novice) on Sep 23, 2014 at 10:10 UTC
Thanks, `s/<.+?>//g;` is awesome, it removes all HTML tags, but I agree that I should use an HTML parser, as I have to parse thousands of URLs and it is a big risk to use a regex. Also, I found that the website generates XML pages, so we can parse those :) Is there any XML parser you can suggest? I found XML::Parser and will try that.
Re^3: Removing text between HTML tags by choroba (Cardinal) on Sep 23, 2014 at 21:19 UTC
I prefer XML::LibXML which can handle HTML as well. XML::Twig is also quite popular. They are both a bit higher level than XML::Parser. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
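As a rough sketch of what the XML::LibXML route could look like for stripping tags, assuming the module is installed from CPAN (the sample string is made up, and this is only one of several ways to get at the text):

```perl
use strict;
use warnings;
use XML::LibXML;

my $html = '<html><body><p>standard <span class="hit_org">oncology</span> care</p></body></html>';

# load_html uses libxml2's HTML parser; recover => 2 asks it to carry on
# quietly past malformed markup instead of dying on the first error.
my $dom = XML::LibXML->load_html(string => $html, recover => 2);

# textContent concatenates all text nodes under the root element,
# which amounts to the document with its tags removed.
my $text = $dom->documentElement->textContent;

print "$text\n";
```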
Re^3: Removing text between HTML tags by Laurent_R (Canon) on Sep 23, 2014 at 17:53 UTC
That's the XML parser that I would have recommended to start with, but I don't use XML very much, and what I do use is usually simple and well-formed, so I haven't needed anything fancier and never really tried the others.

