Extracting ALT text from image links

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extracting ALT text from image links by rob_au (Abbot) on Jun 25, 2002 at 12:47 UTC
How about this, using the ever-venerable HTML::TokeParser? Also have a look at the HTML::TokeParser tutorial on this site.

    use HTML::TokeParser;
    use LWP::Simple;

    my $content = get('http://www.yoursite.com');
    my (@alt, $link);

    my $parser = HTML::TokeParser->new(\$content) || die $!;
    while (my $token = $parser->get_token) {
        my $type = shift @{$token};
        if ($type eq 'E') {
            # End tag: leaving an <a> element clears the flag
            my ($tag) = @{$token};
            $link = 0 if $tag eq 'a';
        }
        elsif ($type eq 'S') {
            # Start tag: remember when we are inside an <a> element
            my ($tag, $attr, $attrseq, $text) = @{$token};
            $link = 1 if $tag eq 'a';
            next unless $tag eq 'img';
            next unless defined $attr->{'alt'} and length $attr->{'alt'};
            # Record src => alt, but only for images inside a link
            push @alt, { $attr->{'src'} => $attr->{'alt'} } if $link;
        }
    }
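The loop above only collects its results: `@alt` ends up as a list of hash references, each holding a single `src => alt` pair. A minimal sketch of reading that structure back follows. The sample data and the helper name `format_alt_entries` are made up for illustration and are not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape the loop above builds:
# one hash reference per linked image, mapping src => alt.
my @alt = (
    { 'foo.gif'  => 'blah' },
    { 'logo.png' => 'Company logo' },
);

# Turn each single-pair hash reference into a "src: alt" string.
sub format_alt_entries {
    my @entries = @_;
    my @lines;
    for my $entry (@entries) {
        my ($src, $text) = %$entry;    # each hash holds exactly one pair
        push @lines, "$src: $text";
    }
    return @lines;
}

print "$_\n" for format_alt_entries(@alt);
```

If the pairing of `src` and `alt` is all that matters, a flat list of `[$src, $alt]` array references would arguably be simpler to unpack, but the sketch keeps the shape used in the reply.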
Re: Extracting ALT text from image links by broquaint (Abbot) on Jun 25, 2002 at 13:23 UTC
There's also the oft-neglected `HTML::PullParser` to come to your aid. Here's an incomplete example of how you might use it:

    use strict;
    use HTML::PullParser;

    # Ask for start tags only, as the tag name followed by a flat
    # list of attribute names and values
    my $p = HTML::PullParser->new(
        file  => shift @ARGV,
        start => 'tagname, @attr'
    );

    while (my $t = $p->get_token()) {
        my ($tagname, %attr) = @$t;
        print "alt text is $attr{alt}", $/
            if exists $attr{alt};
    }

Remember to check out the docs for more info on the module (specifically, the `start` and `end` events will be needed to get `img` tags from within `a` tags).

HTH

broquaint
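As the reply notes, both `start` and `end` events are needed to restrict the output to `img` tags inside `a` tags. The following is one possible sketch of that idea, not broquaint's own code: the function name `linked_alt_texts`, the `$in_link` flag, and the sample HTML (borrowed from a later reply in this thread) are assumptions for illustration. It relies on the `doc` constructor option and the `event` argspec of `HTML::PullParser`.

```perl
use strict;
use warnings;
use HTML::PullParser;

# Return the alt text of every <img> found inside an <a> element.
sub linked_alt_texts {
    my ($html) = @_;
    my $p = HTML::PullParser->new(
        doc   => \$html,
        start => 'event, tagname, @attr',   # 'start', tag name, attributes
        end   => 'event, tagname',          # 'end', tag name
    );

    my $in_link = 0;
    my @alts;
    while (my $t = $p->get_token) {
        my ($event, $tagname, %attr) = @$t;
        if ($tagname eq 'a') {
            # Entering or leaving a link toggles the flag
            $in_link = $event eq 'start' ? 1 : 0;
        }
        elsif ($event eq 'start' && $tagname eq 'img' && $in_link
               && defined $attr{alt} && length $attr{alt}) {
            push @alts, $attr{alt};
        }
    }
    return @alts;
}

my $sample = '<a href="foo.html"><img src="foo.gif" alt="blah"></a> '
           . '<img src="bar.gif" alt="nenene">';
print "$_\n" for linked_alt_texts($sample);   # prints "blah"
```

A simple flag is enough here because `a` elements cannot legally nest; for malformed input a depth counter might be more robust.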
Re: Extracting ALT text from image links by gav^ (Curate) on Jun 25, 2002 at 14:23 UTC
Just for completeness, here is an example using HTML::TreeBuilder:

    use HTML::TreeBuilder;

    # $html is assumed to already hold the page source
    my $tree = HTML::TreeBuilder->new_from_content($html);
    foreach my $img ($tree->look_down('_tag', 'img')) {
        if ($img->attr('alt')) {
            print "Alt tag found: ", $img->attr('alt'), "\n";
        }
    }
    # Free the tree explicitly once we are done with it
    $tree->delete;

gav^
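The TreeBuilder example reports every image on the page. Since the question is about image links, here is a hedged sketch that keeps only images with an `a` ancestor, using `look_up` from `HTML::Element`. The function name `linked_alt_texts` and the sample HTML are assumptions for illustration, not part of gav^'s post.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the alt text of every <img> that has an <a> ancestor.
sub linked_alt_texts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @alts;
    for my $img ($tree->look_down(_tag => 'img')) {
        my $alt = $img->attr('alt');
        next unless defined $alt && length $alt;
        # look_up searches from the element towards the root
        next unless $img->look_up(_tag => 'a');
        push @alts, $alt;
    }
    $tree->delete;
    return @alts;
}

my $sample = '<a href="foo.html"><img src="foo.gif" alt="blah"></a> '
           . '<img src="bar.gif" alt="nenene">';
print "$_\n" for linked_alt_texts($sample);
```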
Re: Extracting ALT text from image links by Matts (Deacon) on Jun 25, 2002 at 15:13 UTC
Ooh, lots of different solutions. Here's one using XML::LibXML:

    #!/usr/bin/perl -w
    use strict;
    use XML::LibXML;

    my $file = $ARGV[0] || die "Usage: $0 [uri|filename]\n";

    # Parse as HTML, then select every alt attribute with XPath
    my $doc = XML::LibXML->new->parse_html_file($file);

    print "Alt tags in $file:\n";
    foreach my $alt ($doc->findnodes('//img/@alt')) {
        print "Alt tag: ", $alt->nodeValue, "\n";
    }
    print "Done\n";
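The XPath expression `//img/@alt` matches every image in the document. To limit the result to images inside links, the expression can be narrowed to `//a//img/@alt`. A minimal sketch, assuming the HTML is already in a string and parsed with `parse_html_string`; the function name `linked_alt_texts` and the sample HTML are illustrative only:

```perl
use strict;
use warnings;
use XML::LibXML;

# Return the alt text of every <img> that is a descendant of an <a>.
sub linked_alt_texts {
    my ($html) = @_;
    my $doc = XML::LibXML->new->parse_html_string($html);
    # //a//img/@alt: alt attributes of img elements anywhere below an a
    return map { $_->nodeValue } $doc->findnodes('//a//img/@alt');
}

my $sample = '<html><body><a href="foo.html"><img src="foo.gif" alt="blah"></a> '
           . '<img src="bar.gif" alt="nenene"></body></html>';
print "$_\n" for linked_alt_texts($sample);
```

Unlike the flag-based parsers, this also returns empty `alt` attributes; add a predicate such as `[@alt != ""]` on the `img` step if those should be skipped.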
Re: Extracting ALT text from image links by Jenda (Abbot) on Jun 25, 2002 at 20:05 UTC
For completeness' sake ... this time using HTML::Parser:

    use HTML::Parser;

    my $p = HTML::Parser->new(
        api_version     => 3,
        start_h         => [\&start, "tagname, attr"],
        end_h           => [\&end, "tagname"],
        marked_sections => 1,
    );

    {
        # State shared by both handlers: are we inside an <a> element?
        my $in_link = 0;

        sub start {
            my ($tagname, $attr) = @_;
            if ($tagname eq 'a') {
                $in_link = 1;
            } elsif ($in_link and $tagname eq 'img' and exists $attr->{alt}) {
                print "IMG: $attr->{src} = $attr->{alt}\n";
            }
        }

        sub end {
            $in_link = 0 if ($_[0] eq 'a');
        }
    }

    $p->parse('sadf dsfg<a href="foo.html"><iMg src="foo.gif" alt="blah"></a> <img src="bar.gif" alt="nenene"> sdf');
    $p->eof();

Jenda

