PerlMonks  

Removing text between HTML tags

by perll (Novice)
on Sep 14, 2014 at 14:14 UTC (#1100529=perlquestion)

perll has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to parse HTML data using a regex. Below is the HTML code:
<td class="body3" valign="top"><p style="margin-top:1ex; margin-bottom:1ex;">The purpose of this study is to compare two types of care - standard <span class="hit_org">oncology</span> care and standard <span class="hit_org">oncology</span> care with early palliative care (started soon after diagnosis) to see which is better for improving the experience of patients and families with advanced lung and non-colorectal GI cancer. The study will use questionnaires to measure patients' and caregivers' quality of life, mood, coping and understanding of their illness.</p></td>
I tried to extract the text using the code below:
($bs) = $pre_bs =~ m/\>(.*)\</;
Only the first tag is removed, not all of them. So I also tried this:
$bt =~ s/<.*>//gi;
but it does not work: everything is removed in this case. I want to remove all tags in a line, no matter how many there are. I have tried multiple combinations, but nothing works. Thanks

Replies are listed 'Best First'.
Re: Removing text between HTML tags
by Utilitarian (Vicar) on Sep 14, 2014 at 14:32 UTC
    The substitution below works for the sample provided; however, this is the wrong way to do it. I've assumed that this is very simple HTML with nothing that would break a very simple-minded substitution (e.g., what would happen with a button whose alternative text is "Next >"?). There is a famous response to this on another site, but the Perl-specific response is to use an HTML parsing module, e.g. HTML::TokeParser::Simple, which helpfully has extracting the content from an HTML file as its first example.
    s/<[^>]+>//g;
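
To make the "Next >" caveat concrete, here is a minimal sketch using only core Perl. The helper name `strip_tags_naive` and both sample strings are made up for illustration; the substitution itself is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: strip tags with the simple substitution shown above.
sub strip_tags_naive {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]+>//g;
    return $text;
}

# Simple markup, like the sample in the question: the tags are removed cleanly.
my $simple = '<p>standard <span class="hit_org">oncology</span> care</p>';
print strip_tags_naive($simple), "\n";

# An attribute value containing a literal ">" ends the match too early,
# so the rest of the tag leaks into the output.
my $tricky = '<input type="button" alt="Next >" value="go">Continue';
print strip_tags_naive($tricky), "\n";
```

The first call prints `standard oncology care`; the second leaves `" value="go">Continue` behind, which is the kind of breakage a real parser avoids.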

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
      Thanks. I know about HTML::TokeParser::Simple, but I am working on my office laptop and the firewall blocks CPAN :( It will take time for me to get that module. Anyway, it is a known set of HTML that will be the same for all pages. Thank you.
        Is Metacpan blocked?


        “Without any further ado,” talk with your boss and ask him or her to arrange for you to have access to CPAN. (You can, if necessary, install all of the modules that you need locally to just your own account and machine, so there are no system-integrity risks.) There is zero doubt in my mind that there is really no other business-justifiable way to get this job done. (And there undoubtedly will be more business cases like this one. You must have the Right Tools For The Job.)

Re: Removing text between HTML tags
by Grimy (Pilgrim) on Sep 14, 2014 at 16:27 UTC

    You can’t.

    More seriously, you can get closer to what you were trying to achieve using reluctant quantifiers: s/<.*?>//g; This will still fail in some corner cases, so you’re better off using a full-blown HTML parser.
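
One such corner case, sketched with core Perl only: a tag that spans a line break. The sample string is invented for illustration. By default `.` does not match a newline, so a reluctant `.*?` cannot get from `<` to `>` across one unless the /s modifier is added; a negated character class is not affected.

```perl
use strict;
use warnings;

# Made-up sample: the opening <p> tag is split over two lines.
my $html = "<p\n   class=\"intro\">Hello <b>world</b></p>";

(my $lazy    = $html) =~ s/<.*?>//g;     # "." stops at the newline, so <p ...> survives
(my $lazy_s  = $html) =~ s/<.*?>//gs;    # /s lets "." match newlines as well
(my $negated = $html) =~ s/<[^>]+>//g;   # [^>] matches newlines without any modifier

print "lazy:    [$lazy]\n";
print "lazy /s: [$lazy_s]\n";
print "negated: [$negated]\n";
```

Only the first variant leaves the multi-line tag in place; the other two reduce the string to `Hello world`.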

      In general, you can't reliably parse entire HTML or XML documents with a single regex.

      You CAN, however, parse individual bits of well-formed XML out of a string using a regex. For example, you can have a regex that correctly matches a single empty element (though it won't parse individual attributes, just recognize that there are zero or more of them). Or you can have a regex that will match an element. Or you can have a regex that matches a non-empty element that contains only text.
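
A rough sketch of what two of those regexes might look like. The variable names are hypothetical, and the patterns assume ASCII element names and quoted attribute values that do not contain their own quote character; real XML allows more than this.

```perl
use strict;
use warnings;

# Assumed simplifications: ASCII names, attribute values always quoted.
my $name = qr/[A-Za-z_][\w.\-]*/;
my $attr = qr/\s+$name\s*=\s*(?:"[^"]*"|'[^']*')/;

# A single empty element such as <br/> or <img src="a.png" />.
# Attributes are recognized (zero or more) but not parsed individually.
my $empty_element = qr/<($name)(?:$attr)*\s*\/>/;

# A non-empty element that contains only text, no nested tags.
my $text_element = qr/<($name)(?:$attr)*\s*>([^<]*)<\/\1\s*>/;

if ('<img src="a.png" alt="x > y" />' =~ /\A$empty_element\z/) {
    print "empty element: $1\n";
}
if ('<span class="hit_org">oncology</span>' =~ /\A$text_element\z/) {
    print "element: $1, text: $2\n";
}
```

Because attribute values are matched as quoted strings, a ">" inside a value no longer ends the tag early.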

      By combining several such regular expressions in a small number of simple functions, some of which are directly or indirectly recursive, I believe it is possible to parse well-formed XHTML in a couple of screens full of reasonably maintainable Perl code. If I'm wrong, it would be because XML allows something I'm not aware of and never use (e.g., if there were some kind of non-trivial quoting mechanism for embedding non-entity-ized quotation marks in attribute values, that could really gum up the works).
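
As an illustration of that idea (not a complete XHTML parser), the sketch below combines three small regexes in one recursive function that walks a well-formed fragment and collects its text. The function names are hypothetical, and comments, CDATA sections, processing instructions, doctypes and entity decoding are deliberately left out.

```perl
use strict;
use warnings;

# Assumed simplifications: ASCII names, attribute values always quoted.
my $name = qr/[A-Za-z_][\w.\-]*/;
my $attr = qr/\s+$name\s*=\s*(?:"[^"]*"|'[^']*')/;

# Consume text and elements starting at pos($$ref) and return the text found.
# Calls itself for the content of each nested element.
sub parse_content {
    my ($ref) = @_;
    my $text = '';
    while (1) {
        if ($$ref =~ /\G([^<]+)/gc) {                      # run of plain text
            $text .= $1;
        }
        elsif ($$ref =~ /\G<$name(?:$attr)*\s*\/>/gc) {    # empty element: no text
        }
        elsif ($$ref =~ /\G<($name)(?:$attr)*\s*>/gc) {    # opening tag
            my $tag = $1;
            $text .= parse_content($ref);                  # nested content
            $$ref =~ /\G<\/\Q$tag\E\s*>/gc
                or die "expected </$tag> at offset " . pos($$ref) . "\n";
        }
        else {
            last;    # closing tag or end of input: the caller decides
        }
    }
    return $text;
}

# Entry point: parse the whole string or die on leftover input.
sub text_of {
    my ($xhtml) = @_;
    pos($xhtml) = 0;
    my $text = parse_content(\$xhtml);
    die "unexpected input at offset " . pos($xhtml) . "\n"
        if pos($xhtml) < length $xhtml;
    return $text;
}

print text_of('<td class="body3"><p>standard <span class="hit_org">oncology</span> care<br/></p></td>'), "\n";
```

The \G anchor with the /gc modifiers keeps the match position on the string itself, so the recursive calls simply continue where the caller stopped. Mismatched tags such as `<p>a</b>` make the function die instead of silently producing output.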

      Thanks for the suggestion: s/<.*?>//g; works like a charm. However, I agree with everyone that it should be parsed with an HTML parser, so I am rewriting the code using HTML::TokeParser. Hopefully that goes well :)
Re: Removing text between HTML tags
by Laurent_R (Canon) on Sep 14, 2014 at 21:44 UTC
    Of course you can. (Having said that, I've just upvoted the previous comment saying that you can't).

    It is in general a very bad idea to try to parse HTML with regexes, I absolutely agree with this, but there are numerous cases where you can still use regexes to get what you want efficiently, as shown with this example under the Perl debugger with the OP's data:

    DB<1> $html = qq( <td class="body3" valign="top"><p style="margin-top:1ex; margin-bottom:1ex;">The purpose of this study is to compare two types of care - standard <span class="hit_org">oncology</span> care and standard <span class="hit_org">oncology</span> care with early palliative care (started soon after diagnosis) to see which is better for improving the experience of patients and families with advanced lung and non-colorectal GI cancer. The study will use questionnaires to measure patients' and caregivers' quality of life, mood, coping and understanding of their illness.</p></td>)
    DB<2> $html =~ s/<.+?>//g;
    DB<3> print $html
    The purpose of this study is to compare two types of care - standard oncology care and standard oncology care with early palliative care (started soon after diagnosis) to see which is better for improving the experience of patients and families with advanced lung and non-colorectal GI cancer. The study will use questionnaires to measure patients' and caregivers' quality of life, mood, coping and understanding of their illness.
    That's what you need, isn't it? Anything wrong with the output? Seems OK to me.

    So the bottom line is that, yes, you can't really parse HTML (or XHTML or XML, for that matter) with regexes, and you need a real parser to do it; everyone here pretty much agrees with this. But there are still numerous cases where you can extract data relatively efficiently and reliably from an HTML page with regexes.

    There is no point in being fundamentalist about this. There are many simple cases where you can get useful data from XML, XHTML, HTML, JSON or CSV data with regexes, without having to use the heavy artillery of full-fledged parsers. Agreed, regexes won't work on some complicated HTML or XML structures, but there are so many cases where a proper state-of-the-art DOM or SAX parser just chokes and dies on the first formatting error (and, yes, our world is not perfect, formatting errors do occur) that it is questionable whether they are any better. OK, XML source files are usually machine generated and are hopefully generally bug free (although...), but with HTML content found on the Internet, this is far from being the case.

    The number 3 is a poor approximation of pi, but there are a number of cases where it is accurate enough for your purpose.

    When it comes to just removing HTML tags from an HTML file, yes, it can often be done with regexes. Admittedly, the very simple regex presented above will not work on every possible piece of HTML, but it does work on the OP's data, doesn't it?

    To the OP: the main problem with your regex is that it was greedy, so that it would remove everything from the first "<" to the last ">". The question mark added after the "+" made it non-greedy, meaning that it stopped at the first closing ">" after the first opening "<". The other typical solution is to have this:

    $html =~ s/<[^>]+>//g;
    where the [^>] builds a character class containing anything but a closing ">".

    I hope that makes your error and its solution clear.
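
For a side-by-side view of the greedy, non-greedy and negated-class variants discussed in this reply, here is a small core-Perl sketch. The sample string is a shortened, made-up version of the OP's data.

```perl
use strict;
use warnings;

my $html = '<p>standard <span class="hit_org">oncology</span> care</p>';

(my $greedy  = $html) =~ s/<.*>//g;      # one match, from the first "<" to the last ">"
(my $lazy    = $html) =~ s/<.+?>//g;     # each match stops at the nearest ">"
(my $negated = $html) =~ s/<[^>]+>//g;   # each match cannot run past a ">"

print "greedy:  [$greedy]\n";
print "lazy:    [$lazy]\n";
print "negated: [$negated]\n";
```

The greedy version leaves an empty string, which is the "everything is removed" symptom from the question; the other two both leave `standard oncology care`.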

      Thanks, s/<.+?>//g; is awesome; it removes all HTML tags. But I agree that I should use an HTML parser, as I have to parse thousands of URLs and it is a big risk to use a regex. Also, I found that the website generates XML pages, so we can parse those :) Is there any XML parser you can suggest? I found XML::Parser and will try that.
        I prefer XML::LibXML which can handle HTML as well. XML::Twig is also quite popular. They are both a bit higher level than XML::Parser.
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        That's the XML parser that I would have recommended for a start, but I do not use very much XML, and it is usually simple and well-formed XML, so that I don't need anything fancier and did not really try others.
Re: Removing text between HTML tags
by choroba (Archbishop) on Sep 15, 2014 at 09:57 UTC
    Use a proper HTML handling tool. XML::LibXML can handle HTML if it follows the standard decently. I usually use its wrapper XML::XSH2, in which you can extract the text simply as
    open :F html input.html ; for //p echo .//text() ;

Node Type: perlquestion [id://1100529]
Approved by Athanasius
Front-paged by toolic