Of course you can. (Having said that, I've just upvoted the previous comment saying that you can't).

It is in general a very bad idea to try to parse HTML with regexes, and I absolutely agree with that. But there are numerous cases where you can still use regexes to get what you want efficiently, as shown by this example, run under the Perl debugger with the OP's data:

  DB<1> $html = qq( <td class="body3" valign="top"><p style="margin-top:1ex; margin-bottom:1ex;">The purpose of this study is to compare two types of care - standard <span class="hit_org">oncology</span> care and standard <span class="hit_org">oncology</span> care with early palliative care (started soon after diagnosis) to see which is better for improving the experience of patients and families with advanced lung and non-colorectal GI cancer. The study will use questionnaires to measure patients' and caregivers' quality of life, mood, coping and understanding of their illness.</p></td>)
  DB<2> $html =~ s/<.+?>//g;
  DB<3> print $html
 The purpose of this study is to compare two types of care - standard oncology care and standard oncology care with early palliative care (started soon after diagnosis) to see which is better for improving the experience of patients and families with advanced lung and non-colorectal GI cancer. The study will use questionnaires to measure patients' and caregivers' quality of life, mood, coping and understanding of their illness.
That's what you need, isn't it? Anything wrong with the output? Seems OK to me.

So the bottom line is that, yes, you can't really parse HTML (or XHTML or XML, for that matter) with regexes, and you need a real parser to do it. Everyone here pretty much agrees with this. But there are still numerous cases where you can extract data relatively efficiently and reliably from an HTML page with regexes.

No point in being fundamentalist about this. There are many simple cases where you can get useful data out of XML, XHTML, HTML, JSON or CSV data with regexes, without having to use the heavy artillery of full-fledged parsers. Agreed, regexes won't work on some complicated HTML or XML structures, but there are so many cases where a proper state-of-the-art DOM or SAX parser just chokes and dies on the first formatting error (and, yes, our world is not perfect, formatting errors do occur) that it is questionable whether they are any better. OK, XML source files are usually machine generated and are hopefully generally bug free (although...), but with HTML content found on the Internet, this is far from being the case.

The number 3 is a poor approximation of pi, but there are a number of cases where it is just accurate enough for your purpose.

When it comes to just removing HTML tags from an HTML file, yes, it can often be done with regexes. Admittedly, the very simple regex presented above will not work on every possible piece of HTML, but it does work on the OP's data, doesn't it?
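To make "will not work on every possible piece of HTML" concrete, here is a minimal sketch of two inputs on which the simple non-greedy regex goes wrong. The sub name strip_tags and both input strings are made up for illustration; they are not from the OP's page.

```perl
use strict;
use warnings;

# Strip tags with the simple non-greedy regex from the debugger session.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<.+?>//g;
    return $html;
}

# Hypothetical inputs, not taken from the OP's page.
my $attr_with_gt = '<a title="a > b">link</a>';   # ">" inside an attribute value
my $multiline    = "<p\nclass=\"x\">text</p>";    # tag split across two lines

# The match stops at the ">" inside the attribute, so part of the tag survives.
print strip_tags($attr_with_gt), "\n";

# Without the /s modifier, "." does not match a newline, so the opening
# tag is never matched and stays in the output.
print strip_tags($multiline), "\n";
```

On the first input the result is ` b">link`, and on the second the whole opening tag is left in place. If your data can contain either pattern, that is the point where a real parser starts to earn its keep.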

To the OP: the main problem with your regex is that it was greedy, so that it would remove everything from the first "<" to the last ">". The question mark added after the "+" made it non-greedy, meaning that it stopped at the first closing ">" after the first opening "<". The other typical solution is to have this:

$html =~ s/<[^>]+>//g;
where the [^>] builds a character class containing anything but a closing ">".
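A short sketch comparing the three variants side by side may help. The sample string is made up for illustration (it is not the OP's data), and the greedy pattern is my guess at the general shape of the OP's regex, not a copy of it.

```perl
use strict;
use warnings;

# Hypothetical sample input, not the OP's data.
my $sample = '<p>early <b>palliative</b> care</p>';

# Greedy: .+ runs on to the last ">" in the string, so one match eats everything.
(my $greedy = $sample) =~ s/<.+>//g;

# Non-greedy: .+? stops at the first ">" after each "<".
(my $lazy = $sample) =~ s/<.+?>//g;

# Negated character class: [^>]+ can never cross a ">".
(my $negated = $sample) =~ s/<[^>]+>//g;

print "greedy:  [$greedy]\n";    # prints "greedy:  []"
print "lazy:    [$lazy]\n";      # prints "lazy:    [early palliative care]"
print "negated: [$negated]\n";   # prints "negated: [early palliative care]"
```

The last two give the same result here. One difference worth knowing: [^>] also matches a newline, whereas "." does not unless you add the /s modifier, so the character-class version copes with a tag that is split across lines.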

I hope that makes your error and its solution clear.

In reply to Re: Removing text between HTML tags by Laurent_R
in thread Removing text between HTML tags by perll
