PerlMonks  

Re: Why a regex *really* isn't good enough for HTML, even for "simple" tasks

by hippo (Bishop)
on May 05, 2020 at 14:04 UTC ( [id://11116482] )


in reply to Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

Just for fun, here's a low-level solution using vanilla HTML::Parser.

use strict;
use warnings;
use HTML::Parser;

my @html = (<<EOT
<a href = "http://www.example.com/1" > One </a >
<a id="Two" title="href="></a>
<!-- <a href="http://www.example.com/3">Three</a> -->
<a title=' href="http://www.example.com/4">Four' href="http://www.example.com/5">Five</a>
<script> console.log(' <a href="http://www.example.com/6">Six</a> '); /* <!-- */ </script>
<a href="http://www.example.com/7">Se<span >v&#101;</span>n</a>
<script>/* --> */</script>
EOT
, <<EOT
<a href = "http://www.example.com/1" > One </a >
<a id="Two" title="href="></a>
<!-- <a href="http://www.example.com/3">Three</a> -->
<a title=' href="http://www.example.com/4">Four' href="http://www.example.com/5">Five</a>
<script type="text/javascript">/*<![CDATA[ </script> */
console.log(' <a href="http://www.example.com/6">Six</a> ');
/* <!-- ]]>*/</script>
<a href="http://www.example.com/7"><![CDATA[Se]]><span >v&#101;</span>n</a>
<script type="text/javascript">/*<![CDATA[ --> ]]>*/</script>
<![CDATA[ <a href="http://www.example.com/8">Eight</a> ]]>
EOT
);

my $state = 0;
my $p = HTML::Parser->new (
    api_version => 3,
    start_h => [
        sub {
            shift eq 'a' or return;
            my $href = shift->{href} or return;
            $state = 1;
            print "$href\t";
            shift->handler (text => sub { print trim(shift); }, 'dtext, self');
        },
        'tagname, attr, self'],
    end_h => [
        sub {
            return unless shift eq 'a' && $state;
            $state = 0;
            print "\n";
            shift->handler (text => '');
        },
        'tagname, self'],
);

print "HTML:\n";
$p->parse ($html[0]);

print "XHTML:\n";
$p->xml_mode (1);
$p->marked_sections (1);
$p->parse ($html[1]);

sub trim {
    (my $str = shift) =~ s/^\s+|\s+$//g;
    return $str;
}
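A possible variation on the above, offered only as a sketch and not taken from the original post: rather than installing and removing the text handler on the fly and printing as it goes, keep a single text handler active and collect each link into an array. The name extract_links and the [href, text] pair layout are made up for illustration. Unlike the code above, this version trims the link text once, after all chunks are joined, and it accepts any defined href, including an empty one.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns a list of [href, text] pairs for every
# <a> element that carries an href attribute.
sub extract_links {
    my ($html) = @_;
    my @links;
    my $current;    # pair being filled while we are inside <a href=...>

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                return unless $tag eq 'a' && defined $attr->{href};
                $current = [ $attr->{href}, '' ];
            },
            'tagname, attr' ],
        text_h => [
            sub {
                my ($text) = @_;
                # Only text seen between <a href=...> and </a> is kept.
                $current->[1] .= $text if $current;
            },
            'dtext' ],
        end_h => [
            sub {
                my ($tag) = @_;
                return unless $tag eq 'a' && $current;
                $current->[1] =~ s/^\s+|\s+$//g;    # trim the joined text
                push @links, $current;
                undef $current;
            },
            'tagname' ],
    );

    $p->parse($html);
    $p->eof;    # signal end of document and flush anything still buffered
    return @links;
}

# Usage: print one tab-separated line per link.
print join("\t", @$_), "\n"
    for extract_links('<a href="http://www.example.com/7">Se<span >v&#101;</span>n</a>');
```

Returning the pairs instead of printing them makes the result easier to test or post-process. The cost is that the text handler is called for every text chunk in the document, not just those inside links.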

Re^2: Why a regex *really* isn't good enough for HTML, even for "simple" tasks
by haukex (Archbishop) on May 08, 2020 at 07:14 UTC

    Thank you! I've added a slightly modified version to the Gist!
