http://qs321.pair.com?node_id=2989

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question: (regular expressions)

How can I use Perl to strip away some nested HTML markup code, like <SCRIPT>?

Originally posted as a Categorized Question.

Replies are listed 'Best First'.
Re: How can I strip away some nested markup code in html by perl, like <SCRIPT> ?
by chromatic (Archbishop) on Apr 03, 2000 at 07:27 UTC
    Unless you're dealing with very simple HTML (either generated by a program or written by a beginner), you will probably find that these regex approaches have limited success. ender's is the best, as it is the least greedy.

    For all non-trivial HTML parsing, look to CPAN modules: HTML::Parser and HTML::TokeParser.
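As a minimal sketch of the parser-based approach, the following uses HTML::Parser (assumed to be installed from CPAN) to copy a document through unchanged while dropping <script> elements. The helper name strip_scripts and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return $html with <script> elements removed.
sub strip_scripts {
    my ($html) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Pass the original source text of every event straight through.
        default_h => [
            sub {
                my ($text) = @_;
                $out .= $text if defined $text;
            },
            'text',
        ],
    );
    # Suppress the start tag, the content, and the end tag of these elements.
    $p->ignore_elements('script');
    $p->parse($html);
    $p->eof;
    return $out;
}

my $page = '<p>Hi</p><script type="text/javascript">if (a < b) { go(); }</script><p>Bye</p>';
print strip_scripts($page), "\n";
```

Because the parser knows that the content of a <script> element is not markup, the stray < inside the JavaScript does not confuse it the way it can confuse a regex.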

Re: How can I strip away some nested markup code in html by perl, like <SCRIPT> ?
by ender (Novice) on Mar 23, 2000 at 01:04 UTC
    If you can get the whole page in one string, then you can use:

    s/<script>.*?<\/script>//igs;

    This will eat everything between the <script> and </script> tags (and the tags themselves as well).

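Real pages usually write the opening tag with attributes, such as <script type="text/javascript">, which the pattern above will not match. A slightly more tolerant sketch follows; it is still a regex, so the caveats about non-trivial HTML apply, and the sample markup is invented for illustration.

```perl
use strict;
use warnings;

my $html = <<'END';
<p>Before</p>
<SCRIPT type="text/javascript">
  document.write("hello");
</SCRIPT>
<p>After</p>
END

# \b[^>]* allows attributes in the opening tag; /s lets . match newlines,
# /i ignores case, and .*? stops at the first closing tag.
(my $clean = $html) =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;

print $clean;
```

Using s{...}{} as the delimiter avoids having to escape the slash in the closing tag.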
Re: How can I use Perl to strip away some nested HTML markup code, like <SCRIPT>?
by Pedro Picasso (Sexton) on Oct 15, 2003 at 14:10 UTC
    Let's say you have some HTML like this:
    <b>I like</b> <i>squirrels!</i>
    You could use this:
    $html =~ s/<[^>]*>([^<]*)<\/[^>]*>/$1/gs;
    To turn it into this:
    I like squirrels!
    {QandAEditors note: merlyn points out in a followup that the above regexp only works for simple HTML and cannot be relied upon with real-life HTML. See the followup for details.}
      Sure, that works for simple HTML, but such a simple regex can fail on real-life HTML. For example:
      <!-- > this is still the comment --> and some more text
      In that case, "this is still the comment" would be left within the output, when it shouldn't be.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.
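The failure described above is easy to reproduce. In this sketch the first substitution is the one from the post; the second, a plain tag stripper, is an assumption added for comparison.

```perl
use strict;
use warnings;

my $html = '<!-- > this is still the comment --> and some more text';

# The substitution from the post needs a closing tag, so nothing matches here.
(my $paired = $html) =~ s/<[^>]*>([^<]*)<\/[^>]*>/$1/gs;

# A plain tag stripper ends the "tag" at the first >, inside the comment.
(my $stripped = $html) =~ s/<[^>]*>//g;

print "$paired\n";
print "$stripped\n";
```

Either way the comment text survives in the output, because neither pattern knows that an HTML comment only ends at -->.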

Re: How can I use Perl to strip away some nested HTML markup code, like <SCRIPT>?
by songahji (Friar) on May 11, 2005 at 17:58 UTC
    If you have lynx (a program for browsing the World Wide Web that works on simple text terminals), then call it:
    $text_only = `lynx -dump $filename`;
    OR

    If you have Netscape, use its "Save as" option with the type set to "Text". This one works with tables.
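Backticks pass the command line through the shell, so a filename containing spaces or shell metacharacters can break the lynx call above. Here is a hedged sketch using the list form of open, which bypasses the shell. The helper name run_capture and the file name page.html are hypothetical, and lynx must be on the PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return its output.
sub run_capture {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;                 # slurp mode: read everything at once
    my $output = <$fh>;
    close($fh) or die "$cmd[0] exited with status $?";
    return $output;
}

# Assumed file name; only attempt the call if the file exists.
my $filename = 'page.html';
if (-e $filename) {
    my $text_only = run_capture('lynx', '-dump', $filename);
    print $text_only;
}
```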