PerlMonks  


TL;DR: Working code below!

Say you "just" want to extract some links. Are you sure the HTML's formatting will never change (whitespace, order of attributes, its structure, and so on)? Well, here's some perfectly valid HTML - good luck!

This is valid HTML5:

    <a href = "http://www.example.com/1" > One </a >
    <a id="Two" title="href="></a>
    <!-- <a href="http://www.example.com/3">Three</a> -->
    <a title=' href="http://www.example.com/4">Four' href="http://www.example.com/5">Five</a>
    <script>
    console.log(' <a href="http://www.example.com/6">Six</a> ');
    /* <!-- */
    </script>
    <a href=http://www.example.com/7>Se<span >v&#101;</span>n</a>
    <script>/* --> */</script>

In addition, replace everything starting with the first <script> tag with this, and you've got valid XHTML - in other words, valid XML as well:

    <script type="text/javascript">/*<![CDATA[ </script> */
    console.log(' <a href="http://www.example.com/6">Six</a> ');
    /* <!-- ]]>*/</script>
    <a href="http://www.example.com/7"><![CDATA[Se]]><span >v&#101;</span>n</a>
    <script type="text/javascript">/*<![CDATA[ --> ]]>*/</script>
    <![CDATA[ <a href="http://www.example.com/8">Eight</a> ]]>

(There are only three links: "One", "Five", and "Seven".)
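To make the problem concrete, here is a minimal sketch of what a naive regex does with the HTML5 sample. The pattern is a hypothetical example of a "simple" approach, not something from the original post, and the sample is reflowed at whitespace, which does not change its meaning as HTML.

```perl
use strict;
use warnings;

# The HTML5 sample from above, reflowed at whitespace.
my $html = <<'END_HTML';
<a href = "http://www.example.com/1" > One </a >
<a id="Two" title="href="></a>
<!-- <a href="http://www.example.com/3">Three</a> -->
<a title=' href="http://www.example.com/4">Four' href="http://www.example.com/5">Five</a>
<script> console.log(' <a href="http://www.example.com/6">Six</a> '); /* <!-- */ </script>
<a href=http://www.example.com/7>Se<span >v&#101;</span>n</a>
<script>/* --> */</script>
END_HTML

# Hypothetical naive pattern: assumes no whitespace around "="
# and double-quoted attribute values.
my @found = $html =~ /href="([^"]*)"/g;

printf "%d match(es)\n", scalar @found;
print "  [$_]\n" for @found;
```

This pattern reports four matches, and only http://www.example.com/5 is a real link. It misses "One" (whitespace around the =) and "Seven" (unquoted value), picks up /4 from inside a title attribute and /6 from inside a script, and its first match is a stretch of markup that begins at the title="href=" attribute. Patching the regex for each case just moves the problem somewhere else.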

Solutions that work on all of the above:

Mojo::DOM (my personal favorite):

    use Mojo::DOM;
    my $links = Mojo::DOM->new($html)->find('a[href]');
    for my $link (@$links) {
        ( my $txt_trim = $link->all_text ) =~ s/^\s+|\s+$//g;
        print $link->{href}, "\t", $txt_trim, "\n";
    }

(Note: you can use Mojo::Collection methods instead of the for loop if you like. On Perl 5.14 and above, the body of the for loop can be simplified to: print $link->{href}, "\t", $link->all_text =~ s/^\s+|\s+$//gr, "\n"; To parse XML, including XHTML, with this module, use Mojo::DOM->new->xml(1)->parse($xml).)
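As a rough sketch of the Mojo::Collection variant mentioned in the note, the loop can be replaced with the collection's each method. The @rows array and the shortened input string are assumptions made for illustration; Mojolicious must be installed.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Shortened input, taken from the start of the HTML5 sample.
my $html = q{<a href = "http://www.example.com/1" > One </a > <a id="Two" title="href="></a>};

# Collect [href, trimmed text] pairs; @rows is only here for illustration.
my @rows;
Mojo::DOM->new($html)->find('a[href]')->each(sub {
    my ($link) = @_;    # each() passes the current element as the first argument
    ( my $txt = $link->all_text ) =~ s/^\s+|\s+$//g;
    push @rows, [ $link->{href}, $txt ];
});

print join("\t", @$_), "\n" for @rows;
```

Only the first element is selected, since the second a element has no href attribute.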

HTML::TreeBuilder::XPath (a bit older, but still works):

    use HTML::TreeBuilder::XPath;
    my $p = HTML::TreeBuilder::XPath->new;
    $p->marked_sections(1);
    $p->xml_mode(1);  # DEPENDING ON INPUT
    my @links = $p->parse($html)->findnodes('//a[@href]');
    for my $link (@links) {
        print $link->attr('href'), "\t", $link->as_text_trimmed, "\n";
    }

HTML::LinkExtor (a well-established module, based on HTML::Parser like the previous solution; only gets link attributes, no text content):

    use HTML::LinkExtor;
    my $p = HTML::LinkExtor->new;
    $p->marked_sections(1);
    $p->xml_mode(1);  # DEPENDING ON INPUT
    my @links = $p->parse($html)->links;
    for my $link (@links) {
        my ($tag, %attrs) = @$link;
        print $attrs{href}, "\n";
    }

(Note: for the previous two solutions, you might be tempted to write $p->xml_mode( $html=~/^\s*<\?xml/ ); but this isn't completely reliable: some XML documents have no XML declaration, and this regex is very simplistic. It's much more reliable to know your inputs.)
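A small sketch of how that check can go wrong, using only core Perl. The looks_like_xml helper and the three sample strings are hypothetical; each sample is well-formed XML, but the check recognizes only the first.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the simplistic check from the note above.
sub looks_like_xml { return $_[0] =~ /^\s*<\?xml/ ? 1 : 0 }

my $root = q{<html xmlns="http://www.w3.org/1999/xhtml"/>};
my @samples = (
    [ 'with declaration'    => qq{<?xml version="1.0"?>\n$root} ],
    [ 'without declaration' => $root ],
    # A decoded string that still starts with a byte order mark (U+FEFF),
    # which \s does not match.
    [ 'leading BOM'         => qq{\x{FEFF}<?xml version="1.0"?>\n$root} ],
);

my %guess = map { $_->[0] => looks_like_xml( $_->[1] ) } @samples;
printf "%-20s %s\n", $_->[0], $guess{ $_->[0] } ? 'XML' : 'not XML'
    for @samples;
```

The second and third samples are reported as "not XML", so xml_mode would stay off for input that needs it.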

For even more potential solutions, see the thread Parsing HTML/XML with Regular Expressions. For example, XHTML can be parsed with XML::LibXML.
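For the XML::LibXML route, a minimal sketch might look like the following. It assumes the XHTML is a complete, well-formed document in the XHTML namespace; the wrapper document, the x prefix, and the @rows array are choices made for this illustration, and XML::LibXML must be installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# A complete XHTML document wrapped around parts of the sample above.
my $xhtml = <<'END_XHTML';
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<!-- <a href="http://www.example.com/3">Three</a> -->
<a href="http://www.example.com/7"><![CDATA[Se]]><span >v&#101;</span>n</a>
<![CDATA[ <a href="http://www.example.com/8">Eight</a> ]]>
</body>
</html>
END_XHTML

my $doc = XML::LibXML->load_xml( string => $xhtml );

# XHTML elements live in a namespace, so XPath needs a registered prefix.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );

my @rows;
for my $link ( $xpc->findnodes('//x:a[@href]') ) {
    ( my $txt = $link->textContent ) =~ s/^\s+|\s+$//g;
    push @rows, [ $link->getAttribute('href'), $txt ];
}
print join("\t", @$_), "\n" for @rows;
```

Only the "Seven" link is found: the commented-out link and the one inside the CDATA section are not elements, so the XPath expression never sees them.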

All of the above code (and more!) is also available as a Gist: https://gist.github.com/haukex/fd76efa16f0b07ce6a7441d9b2265b2a

Update 2020-05-28: Edited title to reflect that the XHTML example is just as much about XML as HTML.


In reply to Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks by haukex
