using the headers method of HTML::TableExtract to find an image

brainpan has asked for the wisdom of the Perl Monks concerning the following question:

brainpan emerges from the shadows of the monastery and humbly seeks enlightenment from his elders.

I'm wanting to parse the content out of an html table using HTML::TableExtract. For most of the data this only takes a few lines, but for some reason I can't make it search for a header when that header consists only of an image (for which I know the URL). I assume that the source of the problem lies in the fact that, as TableExtract is a subclass of HTML::Parser, it's no longer seeing the url for the image as text that it should be parsing. If I were dealing with HTML::TokeParser I'd work around this with a line like this:

$tokeparser->{textify} = {img => 'src'};

However, I can't figure out how to do this with HTML::Parser. Am I approaching this the right way? Do I need to 'textify' HTML::Parser objects to make HTML::TableExtract search for the image's url, or can all this be done interfacing only with TableExtract? Is there some better way to extract the data from an HTML table when using an image as an anchor point?
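
For reference, here is a minimal sketch of the HTML::TokeParser workaround mentioned above. It assumes a made-up table whose only header cell is an image; the example.com URL and the sample markup are hypothetical. With `textify` set to `{img => 'src'}`, `get_trimmed_text` hands back the image's `src` where it would otherwise use the alt text:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; the image URL is illustrative only.
my $html = <<'HTML';
<table>
  <tr><th><img src="http://www.example.com/images/Character.jpeg"></th></tr>
  <tr><td>&amp;</td></tr>
</table>
HTML

my $p = HTML::TokeParser->new(\$html)
    or die "Couldn't create parser: $!\n";

# Treat an <img> tag as text, using its src attribute as that text.
$p->{textify} = { img => 'src' };

# Collect the text of every <th> cell.
my @headers;
while ($p->get_tag('th')) {
    push @headers, $p->get_trimmed_text('/th');
}

print "$_\n" for @headers;
```

This only shows what the textify setting does in TokeParser; it does not make HTML::TableExtract see the image.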

And no, I don't own 2 7 pairs of sweatpants.


Replies are listed 'Best First'.
Re: using the headers method of HTML::TableExtract to find an image by kal (Hermit) on Apr 02, 2001 at 17:19 UTC
Forgive me, but I'm not exactly sure if I understand your question. If I haven't, try to rephrase - with examples, if possible.

Now, by my understanding, you're trying to pick out a table with a <img ..> tag in the <th..> tag? I've never tried this myself, but it's quite possible that it's only evaluating text nodes - that is, the tag is markup, not content, even if it has attributes. This is obvious, because <img ..> is an empty tag - in X/HTML, it would be written <img ../>, making it plain it contains no text nodes.

Probably the best way will be to write your own parser in HTML::Parser, or (better) extend HTML::TableExtract to make it possible to use 'nodes' (the tags :) and their attributes within the evaluation.

Or, if you're dealing with XHTML, you could parse it using an XML::Parser, and then use XML::XPath to generate a query which would automatically find your answer! (Check out XPath if you haven't before - you can search through parsed XML trees for tags based on their name, their text content, their attributes, their lineage, etc. - sooper :) That's the preferred way, probably, but I suspect you're parsing someone else's web pages, so I guess it's probably not possible.

Have I made any sense??
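
As a rough illustration of the "write your own parser in HTML::Parser" suggestion, the sketch below registers start and end handlers and records the `src` of any image found inside a `<th>` cell. The sample markup, the example.com URL, and the variable names are assumptions made for the example, not anything from the thread:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample markup; the image URL is illustrative only.
my $html = '<table><tr>'
         . '<th><img src="http://www.example.com/images/Entity.jpeg"></th>'
         . '<th>Description</th></tr>'
         . '<tr><td>&amp;amp;</td><td>ampersand</td></tr></table>';

my $in_th = 0;        # true while the parser is inside a <th> cell
my @header_images;    # src values of images seen in header cells

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        $in_th = 1 if $tag eq 'th';
        push @header_images, $attr->{src}
            if $in_th && $tag eq 'img' && defined $attr->{src};
    }, 'tagname, attr' ],
    end_h => [ sub {
        my ($tag) = @_;
        $in_th = 0 if $tag eq 'th';
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @header_images;
```

This only locates the header images; pulling out the rows beneath them would still need cell and row tracking of the kind HTML::TableExtract already does.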
Re: using the headers method of HTML::TableExtract to find an image by brainpan (Monk) on Apr 03, 2001 at 00:16 UTC
I should have known better than to create a root node that only contained one line of actual perl `code`. Let's try this again, this time fueled by a bit more sleep.

My goal is to extract the data from a table (for this example we'll use this one), where I know only the headers for the fields. Thanks to HTML::TableExtract's `headers` method, this is quite simple:

    use strict;
    use HTML::TableExtract;

    # I'm using LWP in the real code, but this is a minimalistic attempt at a working example
    my $html_doc_name = '/tmp/symbols.html';
    my $html_doc_string;
    my $te = new HTML::TableExtract( headers => ['Character', 'Entity'] );
    my $ts;
    my $row;

    undef $/; # the absence of this one little line always causes me so much trouble

    open(HTML, $html_doc_name) or die "Couldn't open html file: $!\n";
    $html_doc_string = <HTML>;
    close(HTML) or die "Couldn't close html file: $!\n";

    $te->parse($html_doc_string);

    # Examine all matching tables
    foreach $ts ($te->table_states) {
        print "Table (", join(',', $ts->coords), "):\n";
        foreach $row ($ts->rows) {
            print join("\t\t", @$row), "\n";
        }
    }

This gives me the data I'm looking for. However, if the header I'm looking for is an image (usually of stylized text stating what the columns represent), this ceases to work. Say that, rather than those columns being labeled 'Character' and 'Entity' they were `<img src="http://www.htmlhelp.com/images/Character.jpeg">` and `<img src="http://www.htmlhelp.com/images/Entity.jpeg">`, respectively. With this one seemingly minor change to the headers, this code suddenly won't work, even if I make the appropriate modifications to the header criteria.

As stated above, my suspicion is that this is due to the fact that, as the image urls are now HTML::Parser objects rather than plain text, HTML::TableExtract is skipping over them and looking only in the plaintext portion of the html. My question is this: is there a way to make TableExtract look in the image tags for my selection criteria? If I can't do that directly, can I tell HTML::Parser itself that I'd like it to treat image tags as plain text (presumably making TableExtract work as it does with plaintext headers)? Is there perhaps some other method entirely which I should be using?

Hopefully this time my question is clear enough to warrant something other than upvotes for effort. :)

And no, I don't own 2 7 pairs of sweatpants.
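
One possible workaround, offered as a sketch and not as a feature of HTML::TableExtract itself: rewrite the document before parsing so that each `<img>` tag is replaced by the text of its `src` attribute, then match headers against that text. The regex substitution is a swapped-in technique that assumes simple, quoted `src` attributes; the sample markup is made up, reusing the image URLs from the post. The `tables` accessor is assumed to be available (it is the name used in newer releases, where the post's code calls `table_states`):

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample document built around the image URLs from the post.
my $html = <<'HTML';
<table>
  <tr>
    <th><img src="http://www.htmlhelp.com/images/Character.jpeg"></th>
    <th><img src="http://www.htmlhelp.com/images/Entity.jpeg"></th>
  </tr>
  <tr><td>&lt;</td><td>&amp;lt;</td></tr>
</table>
HTML

# Replace every <img ...> tag with the value of its quoted src attribute,
# so the URL becomes ordinary cell text. Unquoted attributes are not handled.
(my $textified = $html) =~ s{<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1[^>]*>}{$2}gis;

# Header strings are applied as patterns, so the dot is escaped.
my $te = HTML::TableExtract->new(
    headers => [ 'Character\.jpeg', 'Entity\.jpeg' ],
);
$te->parse($textified);
$te->eof;

# Examine all matching tables
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join("\t\t", map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

The trade-off is that the substitution touches every image in the document, not just the ones in header cells, so image URLs will also show up as text inside extracted data cells.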

