PerlMonks  

Extracting ALT text from image links

by Anonymous Monk
on Jun 25, 2002 at 12:41 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Is there an easy way to extract the alt text of images that are links from an HTML document? Thanks for your help! S.

Replies are listed 'Best First'.
Re: Extracting ALT text from image links
by rob_au (Abbot) on Jun 25, 2002 at 12:47 UTC
    How about this, using the ever-venerable HTML::TokeParser? Also have a look at the HTML::TokeParser tutorial on this site.
    use HTML::TokeParser;
    use LWP::Simple;

    my $content = get('http://www.yoursite.com');
    my (@alt, $link);
    my $parser = HTML::TokeParser->new(\$content) || die $!;
    while (my $token = $parser->get_token) {
        my $type = shift @{$token};
        if ($type eq 'E') {
            my ($tag) = @{$token};
            $link = 0 if $tag eq 'a';
        }
        elsif ($type eq 'S') {
            my ($tag, $attr, $attrseq, $text) = @{$token};
            $link = 1 if $tag eq 'a';
            next unless $tag eq 'img';
            next unless defined $attr->{'alt'} and length $attr->{'alt'};
            push @alt, { $attr->{'src'} => $attr->{'alt'} } if $link;
        }
    }

     

Re: Extracting ALT text from image links
by broquaint (Abbot) on Jun 25, 2002 at 13:23 UTC
    There's also the oft-neglected HTML::PullParser to come to your aid. Here's an incomplete example of how you might use it:
    use strict;
    use HTML::PullParser;

    my $p = HTML::PullParser->new(
        file  => shift @ARGV,
        start => 'tagname, @attr'
    );
    while (my $t = $p->get_token()) {
        my ($tagname, %attr) = @$t;
        print "alt text is $attr{alt}", $/ if exists $attr{alt};
    }
    Remember to check out the docs for more info on the module (specifically, the start and end events will be needed to get img tags from *within* <a> tags).
    HTH

    _________
    broquaint
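
broquaint's hint about start and end events could be sketched roughly as follows. This is only one possible way to do it: the function name linked_alt_texts and the sample markup are made up for illustration, and the depth counter is an assumption about how to track whether the parser is currently inside a link.

```perl
use strict;
use warnings;
use HTML::PullParser;

# Hypothetical helper: return the alt text of every <img> inside an <a>.
sub linked_alt_texts {
    my ($html) = @_;

    my $p = HTML::PullParser->new(
        doc   => \$html,
        start => 'event, tagname, attr',   # "start", tag name, attribute hashref
        end   => 'event, tagname',         # "end", tag name
    ) or die "Cannot create parser: $!";

    my $depth = 0;    # number of <a> elements currently open
    my @alts;
    while (my $t = $p->get_token) {
        my ($event, $tagname, $attr) = @$t;
        if ($tagname eq 'a') {
            # Only start and end events are collected, so anything
            # that is not a start event here is an end event.
            if    ($event eq 'start') { $depth++ }
            elsif ($depth > 0)        { $depth-- }
        }
        elsif ($event eq 'start' and $tagname eq 'img' and $depth > 0) {
            push @alts, $attr->{alt}
                if defined $attr->{alt} and length $attr->{alt};
        }
    }
    return @alts;
}

# Sample markup invented for this sketch.
my $sample = '<a href="foo.html"><img src="foo.gif" alt="blah"></a>'
           . '<img src="bar.gif" alt="nenene">';
print "$_\n" for linked_alt_texts($sample);
```

Tag and attribute names arrive lowercased by default, so markup such as <IMG ALT="..."> should be matched as well.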

Re: Extracting ALT text from image links
by gav^ (Curate) on Jun 25, 2002 at 14:23 UTC
    Just for completeness, here is an example using HTML::TreeBuilder:
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    foreach my $img ($tree->look_down('_tag', 'img')) {
        if ($img->attr('alt')) {
            print "Alt tag found: ", $img->attr('alt'), "\n";
        }
    }
    $tree->delete;

    gav^
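
The HTML::TreeBuilder example reports every img with an alt attribute, whether or not it is a link. Since the question asks about images that are links, here is a hedged sketch of one way to narrow it down, using look_up to check each image for an enclosing a element. The function name linked_alt_texts and the sample markup are hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the alt text of every <img> that has an <a> ancestor.
sub linked_alt_texts {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @alts;
    for my $img ($tree->look_down(_tag => 'img')) {
        # look_up walks from the element towards the root;
        # skip images with no enclosing <a>.
        next unless $img->look_up(_tag => 'a');

        my $alt = $img->attr('alt');
        push @alts, $alt if defined $alt and length $alt;
    }
    $tree->delete;    # free the tree explicitly, as in the example above
    return @alts;
}

# Sample markup invented for this sketch.
my $sample = '<p><a href="foo.html"><img src="foo.gif" alt="blah"></a>'
           . '<img src="bar.gif" alt="nenene"></p>';
print "$_\n" for linked_alt_texts($sample);
```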

Re: Extracting ALT text from image links
by Matts (Deacon) on Jun 25, 2002 at 15:13 UTC
    Ooh, lots of different solutions. Here's one using XML::LibXML:

    #!/usr/bin/perl -w
    use strict;
    use XML::LibXML;

    my $file = $ARGV[0] || die "Usage: $0 [uri|filename]\n";
    my $doc = XML::LibXML->new->parse_html_file($file);
    print "Alt tags in $file:\n";
    foreach my $alt ($doc->findnodes('//img/@alt')) {
        print "Alt tag: ", $alt->nodeValue, "\n";
    }
    print "Done\n";

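
The XPath expression //img/@alt matches the alt attribute of every image in the document. To restrict it to images that sit inside a link, the expression could be tightened to //a//img/@alt. The following is a minimal sketch under that assumption; the function name linked_alt_texts and the sample markup are hypothetical, and recover => 2 is used here only to keep the parser quiet about sloppy real-world HTML.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the alt text of every <img> nested inside an <a>.
sub linked_alt_texts {
    my ($html) = @_;

    # Parse from a string instead of a file; recover => 2 recovers silently.
    my $doc = XML::LibXML->load_html(string => $html, recover => 2);

    # //a//img selects img elements at any depth below an a element.
    return grep { length }
           map  { $_->nodeValue }
           $doc->findnodes('//a//img/@alt');
}

# Sample markup invented for this sketch.
my $sample = '<html><body><a href="foo.html"><img src="foo.gif" alt="blah"></a>'
           . '<img src="bar.gif" alt="nenene"></body></html>';
print "Alt tag: $_\n" for linked_alt_texts($sample);
```
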
Re: Extracting ALT text from image links
by Jenda (Abbot) on Jun 25, 2002 at 20:05 UTC

    For completeness' sake ... this time using HTML::Parser:

    use HTML::Parser;

    $p = HTML::Parser->new(
        api_version     => 3,
        start_h         => [\&start, "tagname, attr"],
        end_h           => [\&end, "tagname"],
        marked_sections => 1,
    );

    {
        my $in_link = 0;

        sub start {
            my ($tagname, $attr) = @_;
            if ($tagname eq 'a') {
                $in_link = 1;
            }
            elsif ($in_link and $tagname eq 'img' and exists $attr->{alt}) {
                print "IMG: $attr->{src} = $attr->{alt}\n";
            }
        }

        sub end {
            $in_link = 0 if ($_[0] eq 'a');
        }
    }

    $p->parse('sadf dsfg<a href="foo.html"><iMg src="foo.gif" alt="blah"></a> <img src="bar.gif" alt="nenene"> sdf');
    $p->eof();

      Jenda
