HTML::Tree(Builder) in 6 minutes

There are numerous posts about parsing HTML, and many seem to skip over HTML::Tree(Builder), due in part to its name, I believe. This is a lightning-fast intro to HTML::Tree and what it can (and can't) do for you.

The "tree" is a way to represent the structure of a semi-structured markup language such as HTML. A tree's validity is directly related to the quality of the HTML; that is, bad markup will get you a bad tree. HTML::Tree can recover from some markup problems, but there are several it cannot. So if you have a problem with the results, validate the source HTML before you curse HTML::Tree.

HTML::Tree(Builder) inherits from a couple of other modules (HTML::Parser and HTML::Element), most notably HTML::Element. As it parses your content, it converts each tag into an HTML::Element object. So when you work with an individual tag, you are working with an HTML::Element object stored in your tree. Read the docs for HTML::Element if you really want to find the strength of HTML::Tree.

This sample script uses LWP::Simple to retrieve the content of a page to build our "tree" from. You can also read content from a file; see the docs for more info.

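use strict;
use warnings;
use HTML::Tree;
use LWP::Simple;

my $funky = "http://www.google.com";

# get() returns undef on failure, so check before parsing
my $content = get($funky);
die "Could not fetch $funky\n" unless defined $content;

my $tree = HTML::Tree->new();

$tree->parse($content);
$tree->eof;    # tell the parser the document is complete

print $tree->as_text;

$tree->delete; # release the tree's memory when finished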

The as_text method is inherited from the HTML::Element module. There is an as_HTML method as well. These methods, when used on the entire tree, simply walk down the tree and expand each HTML::Element object into either the text it contains (as_text) or the HTML code it represents (as_HTML).
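To make the difference between the two methods concrete, here is a minimal sketch that parses a tiny in-memory page instead of fetching one. The markup in $html is made up for illustration, and it uses HTML::TreeBuilder's new_from_content constructor directly rather than the parse call shown above; treat it as one way to try this offline, not the only one.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup standing in for a fetched page (an assumption for this sketch).
my $html = '<html><body><p>Hello <b>world</b></p></body></html>';

# new_from_content parses the string(s) and signals end-of-input for us.
my $tree = HTML::TreeBuilder->new_from_content($html);

my $text   = $tree->as_text;   # only the text, tags dropped
my $markup = $tree->as_HTML;   # the tree serialized back to HTML

print "as_text: $text\n";
print "as_HTML: $markup\n";

# Free the tree explicitly when done.
$tree->delete;
```

Note that as_HTML output is regenerated from the tree, so it will not necessarily match the input byte for byte (for example, optional end tags may be left out by default).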

Let's do another quick run-through to show how we get what we want (a single tag in this case) out of the page.


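use strict;
use warnings;
use HTML::Tree;
use LWP::Simple;

my $funky = "http://www.google.com";

# get() returns undef on failure, so check before parsing
my $content = get($funky);
die "Could not fetch $funky\n" unless defined $content;

my $tree = HTML::Tree->new();

$tree->parse($content);
$tree->eof;    # tell the parser the document is complete

# In list context look_down returns every match; we keep the first
my ($title) = $tree->look_down( '_tag' , 'title' );

print $title->as_text , "\n";
print $title->as_HTML , "\n";

$tree->delete; # release the tree's memory when finished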

The '_tag' tells HTML::Tree's look_down method which 'key' to look at, and 'title' is the value that 'key' should have. Instead of 'title' it could be 'a' for anchor, 'img' for image, etc. If you want to capture every occurrence of a particular tag on the page, simply use an array instead of a scalar to collect them, such as:

my @a_tags = $tree->look_down( '_tag' , 'a' );
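Once you have the matching elements, you usually want something from each of them. The sketch below is one hedged way to pull the href out of every link; the sample markup is invented for the example, and the extra 'href', qr/./ pair is an additional look_down criterion (the attribute must exist and match the pattern), so anchors without an href are skipped.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample markup (an assumption for this sketch).
my $html = <<'END_HTML';
<html><body>
<a href="/one">First</a>
<a name="anchor-only">No link here</a>
<a href="/two">Second</a>
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Criteria are ANDed: the tag must be 'a' AND an href attribute must match qr/./.
my @links = $tree->look_down( '_tag', 'a', 'href', qr/./ );

# attr() reads an attribute value from an HTML::Element object.
my @hrefs = map { $_->attr('href') } @links;

print "$_\n" for @hrefs;

$tree->delete;
```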

Beyond this intro I recommend the documentation and the article the author of HTML::Tree has in The Perl Journal.

One last caveat: use HTML::Tree if you want to parse HTML, not create it. If you want to create HTML, use CGI or HTML::Element (or another module) by itself.
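For the "create HTML with HTML::Element by itself" case, a minimal sketch might look like the following. The tag names, the class attribute, and the labels are made up for illustration; the empty hash reference passed as the third argument to as_HTML asks for every end tag to be written out.

```perl
use strict;
use warnings;
use HTML::Element;

# Build a small list by hand; no parsing involved.
my $list = HTML::Element->new( 'ul', class => 'menu' );

for my $label (qw(Home About)) {
    my $item = HTML::Element->new('li');
    $item->push_content($label);    # plain strings become text children
    $list->push_content($item);     # elements become child nodes
}

# undef, undef: default entity encoding, no indentation; {}: close all tags.
my $out = $list->as_HTML( undef, undef, {} );
print $out, "\n";

$list->delete;
```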

I hope you enjoy HTML::Tree.

UPDATE: added readmore tags


Replies are listed 'Best First'.
Re: HTML::Tree(Builder) in 6 minutes by jeffa (Bishop) on Aug 03, 2003 at 17:52 UTC
Excellent, but ...
"... if you want to create HTML use CGI or HTML::Element (or other) ..."
cough HTML::Template hermph ahem Template cough
Sorry, but someone had to mention them. ;)
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: Re: HTML::Tree(Builder) in 6 minutes by trs80 (Priest) on Aug 03, 2003 at 19:45 UTC
Do those create HTML directly, or do they rely on other modules to create the HTML tags themselves? If you want to do a large-scale application then by all means look into HTML::Template and Template, but I feel they (c\|w)ould be overkill for a quick and simple one-time "thing".
3Re: HTML::Tree(Builder) in 6 minutes by jeffa (Bishop) on Aug 03, 2003 at 20:07 UTC
They actually do neither ... they are templating modules and have no responsibility of producing valid HTML - that's up to the HTML coder. As for being overkill, well ... the more you use these tools, the quicker you get at coding with them. You can see an example that i am proud of over at 4Re: How do I extract text from an HTML page? that uses HTML::Template. The template is stored inside DATA - creating a new H::T object that uses the DATA filehandle is a snap: `my $template = HTML::Template->new(filehandle => \*DATA);`

For the Template-Toolkit quick and simple scripts, check out Inline::TT, it's slow as hell, but when you combine it with Class::DBI you get some amazing results. I am nearly finished with my C::D mini-tut that will demonstrate using C::D with multiple tables, but here is a snippet just to show you the power of the Class::DBI and Template combo. (and by the way, i learned most of this from How to Avoid Writing Code and the poop-group mailing list)
Read more... (1516 Bytes)
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: 3Re: HTML::Tree(Builder) in 6 minutes by trs80 (Priest) on Aug 03, 2003 at 20:19 UTC
Re^2: 3Re: HTML::Tree(Builder) in 6 minutes by Anonymous Monk on Jan 04, 2008 at 16:45 UTC
•Re: HTML::Tree(Builder) in 6 minutes by merlyn (Sage) on Aug 03, 2003 at 20:58 UTC
Also consider XML::LibXML, which despite its name, can be coaxed into reading HTML, and then provides DOM and XPath interfaces into your HTML tree. It's also far faster than HTML::Tree, keeping the tree in C space, only converting to Perl scalars when necessary. I wrote a column about using it to extract data from a web page.
-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
Re: HTML::Tree(Builder) in 6 minutes by Anonymous Monk on Nov 30, 2004 at 19:16 UTC
XML::LibXML is very fast, but it can barely parse 1% of the web pages one can find on the Internet because it expects HTML that is too strict. That's why the 8-line Perl program at the end of your column doesn't work. Tree::Builder is very slow and provides neither DOM nor XPath. I think that there is nothing in Perl that can parse real web pages while being fast and giving access to DOM or XPath.
fred
Re^2: HTML::Tree(Builder) in 6 minutes by mirod (Canon) on Nov 07, 2009 at 07:53 UTC
A little late to the party... but for future reference, HTML::TreeBuilder::XPath gives you XPath on an HTML::Tree object. And I agree with XML::LibXML not being great at dealing with "real" HTML.
Re: HTML::Tree(Builder) in 6 minutes by ido50 (Scribe) on Aug 04, 2003 at 12:28 UTC
Thank you very much for the intro, I think I got a little idea from it (And I'll get back here with it if it works out well).
-------------------------
Live fat, die young
Re: HTML::Tree(Builder) in 6 minutes by princepawn (Parson) on Aug 04, 2003 at 17:53 UTC
If you want to see HTML::TreeBuilder in action, download and read the source code to HTML::Seamstress.
Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality
Re: HTML::Tree(Builder) in 6 minutes by Kanishka.black0 (Scribe) on Nov 07, 2009 at 00:15 UTC
Thanks for the Tuit.... This definitely helps beginners like me ....
Re: HTML::Tree(Builder) in 6 minutes by szabgab (Priest) on May 29, 2012 at 20:09 UTC
The article from the Perl Journal can now be found here and here


	PerlMonks