
Re: More efficient use of HTML::TokeParser::Simple

by GrandFather (Saint)
on Jul 10, 2006 at 20:54 UTC ( [id://560228]=note )


in reply to More efficient use of HTML::TokeParser::Simple

If the HTML you are processing is modest in size, you might consider HTML::TreeBuilder, which lets you search for elements using various match criteria and may clean up code that needs to skip about the document.
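To give a rough idea of what "various match criteria" can look like, here is a minimal sketch using the look_down method that TreeBuilder trees inherit from HTML::Element. The sample markup, the class name "main" and the variable names are made up for illustration and do not come from the original question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, invented for this sketch.
my $html = <<'HTML';
<html><body>
<h1 class="main">Header 1</h1>
<p>First paragraph</p>
<h1>Header 2</h1>
<a href="http://example.com/">A link</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Criterion 1: tag name plus an attribute value.
my @main = $tree->look_down(_tag => 'h1', class => 'main');

# Criterion 2: a code ref; here, any element that carries an href attribute.
my @links = $tree->look_down(sub { defined $_[0]->attr('href') });

# Copy out what we need before the tree is torn down.
my @main_text = map { $_->as_text } @main;
my @hrefs     = map { $_->attr('href') } @links;

print "$_\n" for @main_text, @hrefs;

# Trees hold circular references, so release them explicitly.
$tree->delete;
```

In list context look_down returns every matching element, which is why the results are collected into arrays; in scalar context it returns only the first match.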


DWIM is Perl's answer to Gödel

Re^2: More efficient use of HTML::TokeParser::Simple
by henka (Novice) on Jul 11, 2006 at 06:17 UTC
    I poked around HTML::TreeBuilder, but my goodness, things are complicated. It may not seem like it to seasoned monks, but to a C programmer, the OO aspects and data structures of Perl are, well, daunting. Gleaning how to do something as simple as the one I posted here from the Perl module docs is almost always an exercise in frustration.

      Here's a trivial example that seems to do something like what you want and may be enough to get you started with TreeBuilder:

      use warnings;
      use strict;
      use HTML::TreeBuilder;

      # Slurp the sample HTML from the __DATA__ section below
      my $html = do {local $/; <DATA>};
      my $tree = HTML::TreeBuilder->new ();

      $tree->parse ($html);
      $tree->eof ();
      $tree->elementify();

      # In list context find () returns every element with the given tag name
      my ($title) = $tree->find ('title');
      my @h1 = $tree->find ('h1');

      print $title->as_text (), "\n";
      print $_->as_text (), "\n" for @h1;

      __DATA__
      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
      <!-- Took this out for IE6ites "http://www.w3.org/TR/REC-html40/loose.dtd" -->
      <html lang="en">
      <head>
      <title>More efficient use of HTML::TokeParser::Simple perlquestion id:560199</title>
      </head>
      <body>
      <h1>Header 1</h1>
      <p>First paragraph</p>
      <h1>Header 2</h1>
      <p>Second paragraph</p>
      <h2>Level 2 header 1</h2>
      </body>
      </html>

      Prints:

      More efficient use of HTML::TokeParser::Simple perlquestion id:560199
      Header 1
      Header 2

        What does
        $tree->elementify();
        do here? It appears to run ok if it is commented out. I've often seen it in snippets and have no idea what purpose it serves.
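        As far as I understand the HTML::TreeBuilder documentation, elementify reblesses the tree object from the builder class into the plain element class (normally HTML::Element) and drops the builder's internal parsing state. Searching works either way, which would explain why the example runs with the call commented out. A minimal sketch to observe the class change, with a made-up one-line document and hypothetical variable names:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><head><title>Demo</title></head><body><h1>Hi</h1></body></html>');
$tree->eof;

my $before = ref $tree;    # class of the root while it is still a builder
$tree->elementify;
my $after  = ref $tree;    # class of the root after elementify

print "before: $before\n";
print "after:  $after\n";

# Element methods such as find and as_text still work on the reblessed root.
my ($h1) = $tree->find('h1');
my $heading = $h1->as_text;
print "$heading\n";

$tree->delete;
```

        The main practical effect appears to be that parser-only methods such as parse can no longer be called on the finished tree by accident.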
