HTML::TokeParser Tutorial

by crazyinsomniac (Prior)
on Jul 24, 2001 at 11:16 UTC


NAME

HTML::TokeParser Tutorial (a.k.a. The CPAN Nodelet Faker)


DESCRIPTION

Want to parse HTML the right (and easy) way? Well, read this tutorial and you can!

(I'd like to thank damian1301 and derek3000 for asking for help, which made me read the pod, and eventually write this.)

The CPAN Nodelet Faker (What's It Do?)

My example program, The CPAN Nodelet Faker, teaches you how to use HTML::TokeParser and, along the way, fetches the 20 most recently added modules listed at http://search.cpan.org/recent.

You can download the source code (without the line numbers, ready to run), as well as this tutorial and sample input/output from http://crazyinsomniac.perlmonk.org/perl/htmltokeparsertutorial

Why Didn't I just use HTML::LinkExtor?

This is an HTML::TokeParser tutorial. Besides, HTML::TokeParser will fit most, if not all, of your HTML parsing needs. And, anyway, HTML::LinkExtor is built on top of HTML::Parser, just like HTML::TokeParser.


HTML::TokeParser

My comments begin with # and are italicized.

DESCRIPTION (mostly verbatim from the pod)

HTML::TokeParser - Alternative HTML::Parser interface

# What's an n worth to ya -- why couldn't he just call it TokenParser? # Maybe he's a hesher, who knows?

The HTML::TokeParser is an alternative interface to the HTML::Parser class. It basically turns the HTML::Parser inside out. You associate a file (or any IO::Handle object or string) with the parser at construction time and then repeatedly call $parser->get_token to obtain the tags and text found in the parsed document. No need to make a subclass to make the parser do anything.

Calling the methods defined by the HTML::Parser base class will be confusing, so don't do that. Use the following methods instead:

FUNCTIONS

$p=HTML::TokeParser->new($filename || FILEHANDLE ||\$filecontents);
The object constructor argument is either a file name, a file handle object, or the complete document to be parsed. If the argument is a plain scalar, then it is taken as the name of a file to be opened and parsed. If the file can't be opened for reading, then the constructor will return an undefined value and $! will tell you why it failed.

If the argument is a reference to a plain scalar, then this scalar is taken to be the literal document to parse. The value of this scalar should not be changed before all tokens have been extracted.

Otherwise the argument is taken to be some object that the HTML::TokeParser can read() from when it needs more data. Typically it will be a file handle of some kind. The stream will be read() until EOF, but not closed.
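
# A minimal sketch of the two constructor forms you'll use most. The sample markup and the missing file name are made up for illustration; only new and get_token come from the pod above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Document held in memory: pass a REFERENCE to the scalar.
my $html = '<p>Hello, <b>monks</b>!</p>';    # made-up sample markup
my $p = HTML::TokeParser->new(\$html)
    or die "Can't create parser: $!";

# A plain scalar is treated as a file name. If the file can't be
# opened, the constructor returns undef and $! says why.
my $missing = HTML::TokeParser->new('/no/such/file.html');    # hypothetical path
print "open failed: $!\n" unless defined $missing;

# The in-memory parser is ready to hand out tokens.
my $first = $p->get_token;
print "first token type: $first->[0]\n";
```

# Forgetting the backslash is the classic mistake: the whole page is then taken as a file name, and the constructor returns undef.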

$p->get_token
This method will return the next token found in the HTML document, or undef at the end of the document. The token is returned as an array reference. The first element of the array will be a (mostly) single character string denoting the type of this token: ``S'' for start tag, ``E'' for end tag, ``T'' for text, ``C'' for comment, ``D'' for declaration, and ``PI'' for processing instructions. The rest of the array is the same as the arguments passed to the corresponding HTML::Parser v2 compatible callbacks (see the HTML::Parser manpage). In summary, returned tokens look like this:
  ["S",  $tag, $attr, $attrseq, $text]
  ["E",  $tag, $text]
  ["T",  $text, $is_data]
  ["C",  $text]
  ["D",  $text]
  ["PI", $token0, $text]

Where $attr is a hash reference, $attrseq is an array reference and the rest are plain scalars.

# $text contains the ``raw'' html, and in the case of text, is actual text;-)

# $is_data is a ``boolean'' and corresponds to $is_cdata in [HTML::Parser] and if it is set to ``false'', it means that $text contains encoded entities (see [HTML::Entities])

# If you're not clear on what a token is: a token is any html tag (start tag, end tag, declaration, comment or otherwise) or any run of text in between tags. Basically, the tags chop the document into pieces, and every piece, including the tags themselves, is a token.
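
# To see the token shapes for yourself, here is a small sketch that walks a made-up one-line document and unpacks each token according to its type code. The sample markup and variable names are mine, not from the pod.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample: a comment, one element with an attribute, some text.
my $html = '<!-- nav --><p class="intro">Tom &amp; Jerry</p>';
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

my @types;    # token type codes, in document order
while (my $token = $p->get_token) {
    my ($type, @rest) = @$token;    # copy, so the token itself is untouched
    push @types, $type;

    if ($type eq 'S') {
        my ($tag, $attr, $attrseq, $text) = @rest;
        print "start tag <$tag>, attributes: @$attrseq, raw: $text\n";
    }
    elsif ($type eq 'E') {
        my ($tag, $text) = @rest;
        print "end tag </$tag>, raw: $text\n";
    }
    elsif ($type eq 'T') {
        my ($text, $is_data) = @rest;
        print "text: $text\n";      # get_token does not decode entities
    }
    else {
        print "other token of type $type\n";
    }
}
print "types seen: @types\n";
```

# Note that the text token still carries &amp; undecoded; get_text and get_trimmed_text (below) do the decoding for you.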

$p->unget_token($token,...)
If you find out you have read too many tokens you can push them back, so that they are returned the next time $p->get_token is called.

# Basically, HTML::TokeParser keeps an internal ``cursor'' of where in the file it is, and you can use this method to back up.

# Recommended usage: ``as a last resort'', because there are easier ways to parse HTML without the need to ``seek'' through the file
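
# If you do need it, the usual pattern is ``peek, then put back''. A minimal sketch on made-up markup:

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<h1>Title</h1><p>Body</p>';    # made-up sample markup
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

my $peeked     = $p->get_token;     # look at the next token ...
my $peeked_tag = $peeked->[1];
$p->unget_token($peeked);           # ... and push it back unchanged

my $again = $p->get_token;          # the same token comes out again
print "peeked at <$peeked_tag>, then read <$again->[1]>\n";
```

# Push the token back exactly as you received it. If you have shifted elements off it (as the main loop later in this tutorial does), put them back first.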

$p->get_tag([$tag, ...])
This method returns the next start or end tag (skipping any other tokens), or undef if there are no more tags in the document. If one or more arguments are given, then we skip tokens until one of the specified tag types is found. For example:
   $p->get_tag("font", "/font");

will find the next start or end tag for a font-element.

The tag information is returned as an array reference in the same form as for $p->get_token above, but the type code (first element) is missing. A start tag will be returned like this:

  [$tag, $attr, $attrseq, $text]

The tag name of an end tag is prefixed with ``/'', i.e. an end tag is returned like this:

  ["/$tag", $text]

# Use with *caution*, as this is a dangerous method, one that could force you to use unget_token($token,...), as mentioned above

# Recommended usage: if you want to start grabbing stuff only after you encounter BODY (or some other tag)
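
# A sketch of that usage: skip ahead to BODY, then hop from anchor to anchor. The page below is invented for the example; note that get_tag tokens have no type code, so the attribute hash sits at index 1.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up page with two real links and one named anchor.
my $html = '<html><head><title>x</title></head><body>'
         . '<a href="/one">One</a> <a name="top">anchor</a> '
         . '<a href="/two">Two</a></body></html>';
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

$p->get_tag('body');                  # ignore everything before <body>

my @links;
while (my $tag = $p->get_tag('a')) {  # next <a ...> start tag, or undef
    my $href = $tag->[1]{href};       # [$tag, $attr, $attrseq, $text]
    next unless defined $href;        # skip <a name=...>
    push @links, [ $href, $p->get_trimmed_text('/a') ];
}

print "$_->[0] => $_->[1]\n" for @links;
```

# This is shorter than a get_token loop, but every get_tag call silently throws away whatever tokens it skips, which is exactly why it deserves caution.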

$p->get_text([$endtag])
This method returns all text found at the current position. It will return a zero length string if the next token is not text. The optional $endtag argument specifies that any text occurring before the given tag is to be returned. Any entities will be converted to their corresponding character.

The $p->{textify} attribute is a hash that defines how certain tags can be treated as text. If the name of a start tag matches a key in this hash then this tag is converted to text. The hash value is used to specify which tag attribute to obtain the text from. If this tag attribute is missing, then the upper case name of the tag enclosed in brackets is returned, e.g. ``[IMG]''. The hash value can also be a subroutine reference. In this case the routine is called with the start tag token content as its argument and the return value is treated as the text.

The default $p->{textify} value is:

  {img => "alt", applet => "alt"}

This means that <IMG> and <APPLET> tags are treated as text, and that the text to substitute can be found in the ALT attribute.

# don't use the $p->{textify} ``technique'', as it is just a bad idea, except in extremely rare cases
# however, if you do use something like:

 $p->{textify}= {img => \&ttextify }

# note that ttextify will receive an actual list in @_ (the token's contents, passed by value), as opposed to an arrayref
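
# For those rare cases, here is a sketch of both flavours: the default alt-based substitution and a subroutine reference. The markup is made up. Because I'd rather not rely on the exact position of each argument, the callback simply picks the attribute hash out of @_; that is my choice, not something the pod prescribes.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up markup: one image with an alt attribute, one without.
my $html = '<p>Logo: <img src="camel.png" alt="Camel"> and <img src="x.png"> done</p>';

# 1. Default textify: alt text, or [IMG] when alt is missing.
my $p1 = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$p1->get_tag('p');
my $default = $p1->get_text('/p');
print "default: $default\n";

# 2. Custom textify: a sub that receives the token contents as a list.
my $p2 = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$p2->{textify} = {
    img => sub {
        my ($attr) = grep { ref $_ eq 'HASH' } @_;   # find the attribute hash
        my $src = defined $attr->{src} ? $attr->{src} : '?';
        return "[image: $src]";
    },
};
$p2->get_tag('p');
my $custom = $p2->get_text('/p');
print "custom:  $custom\n";
```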

$p->get_trimmed_text([$endtag])
Same as $p->get_text above, but will collapse any sequences of white space to a single space character. Leading and trailing white space is removed.

# useful when you have a bunch of text separated by nonsensical whitespace (like in our second trigger, the colspan=2 cell)
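
# A quick side-by-side of get_text and get_trimmed_text on the colspan=2 fragment shown in the TRIGGERS section. Two parsers are used so that each call starts from the same position; the variable names are mine.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = "<tr><td colspan=2>\n115 distributions have been uploaded\n"
         . " since 15th July 2001 \n</td></tr>";

# get_text keeps the line breaks and stray spaces as they are.
my $p1 = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$p1->get_tag('td');
my $raw = $p1->get_text;

# get_trimmed_text collapses runs of white space and trims both ends.
my $p2 = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$p2->get_tag('td');
my $trimmed = $p2->get_trimmed_text;

print "raw:     [$raw]\n";
print "trimmed: [$trimmed]\n";
```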


TRIGGERS (there are only two)

The first trigger looks like:

 <a href="/search?dist=cyrillic-2.08">cyrillic-2.08</a>

We're looking for a ``S''tart tag that is called ``a'' and whose href attribute begins with /search?dist=

The second trigger looks like:

 <tr><td colspan=2>
115 distributions have been uploaded
 since 15th July 2001 
</td></tr>

We're looking for a ``S''tart tag, that is called ``td'', which has a ``colspan'' attribute whose value is ``2''

The catch phrase is distributions have been uploaded
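
# Both triggers can be tried offline before touching the network. The sketch below wraps the two fragments above in an invented table (the FAQ link and the surrounding rows are assumptions, there only to prove that non-matching tags are ignored). It tests the colspan value through the attribute hash, which is one possible way to do it.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The two trigger fragments, embedded in made-up surrounding markup.
my $html = <<'HTML';
<table>
<tr><td><a href="/search?dist=cyrillic-2.08">cyrillic-2.08</a></td></tr>
<tr><td><a href="/faq.html">FAQ</a></td></tr>
<tr><td colspan=2>
115 distributions have been uploaded
 since 15th July 2001
</td></tr>
</table>
HTML

my (@dists, $summary);
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

while (my $token = $p->get_token) {
    my ($type, $tag, $attr) = @$token;
    next unless $type eq 'S';                 # only start tags matter here

    if ($tag eq 'a'
        and defined $attr->{href}
        and $attr->{href} =~ m{^/search\?dist=})          # first trigger
    {
        push @dists, $p->get_trimmed_text('/a');
    }
    elsif ($tag eq 'td'
        and defined $attr->{colspan}
        and $attr->{colspan} eq '2')                      # colspan=2 trigger
    {
        my $text = $p->get_trimmed_text;
        $summary = $text if $text =~ /distributions have been uploaded/;
    }
}

print "dist: $_\n" for @dists;
print "summary: $summary\n" if defined $summary;
```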


LINE-by-LINE CODE EXPLANATION

Lines 1-5: self explanatory (see perlman if you don't understand)

Lines 6-8: unbuffer output (autoflush)

Line 9: $cpanurl is the url of the recently added CPAN modules

Lines 11-13: Declare the array that will contain the latest 20 modules. Initialize the scalar that will contain the number of modules that were added, along with the date. Attempt to ``download'' the page, and load its contents into $rawHTML using LWP::Simple::get.

Line 15: check to make sure get($cpanurl) returned something. We don't wanna create an entire HTML::TokeParser object, if we have no data to feed it.

Line 18: create a new HTML::TokeParser object ($tp). The die statement is left-over, from when I passed it a filename, but it doesn't hurt much, and something can always go wrong.

---Lines 22-77:START like Line 21 says, a generic HTML::TokeParser loop;º)

Line 25: dereference $token, shift the first value (tag type), save it to $ttype.

Line 27: check to see if we have a start tag (as if you couldn't tell)

Line 29: since it was a start tag, $token is supposed to have 4 more values for us (which for clarity, we've named $tag, $attr, $attrseq, $rawtxt)

Line 31: check to see if we have an anchor(link)

Lines 32-36: since we have an anchor, fetch the value of href, as well as the text in between the opening and closing anchor tag. Since there can be other tokens in between (ex: <a href=""> ... <B>...</a>), even though this particular page won't have any, we use the explicit $tp->get_trimmed_text("/a");

Lines 40-42: push onto @newest20 an array reference, containing the value of the href attribute of our anchor, as well as the text enclosed by the anchor, but only if the href attribute contains our first trigger (/search?dist=)

Line 44: Since our $tag was not an anchor, we test to see if it is a ``td'' with a colspan of 2 (our second trigger, the colspan=2 cell).

Line 48: Since we do have a $tag that fits the general description, we go ahead and get the trimmed text up until the next tag. (Comments follow in the code, of the same importance as those on Lines 36-39)

Lines 58-59: if the trimmed text ($p_text) contains the catch phrase from the colspan=2 trigger, we assign it to $lastupdated, thus completing half of our task.

Lines 61-73: if it's not a start tag, check to see if it's any other tag we recognize, and do nothing with that information, since for this particular program, we don't need to.

Line 75: break out of the while loop, if we got our latest 20 modules.

---Lines 22-77:END the end of the generic HTML::TokeParser loop.

Lines 79-80: at this point we don't need $rawHTML or $tp anymore, and since they're not going out of scope till the end of the program, we explicitly undef.

Line 82: output the number of distributions that have been uploaded, but only if we were able to extract that information ($lastupdated contains something).

Lines 84-91: loop through @newest20 perl style, and output html anchors to the modules.

Lines 93-94: It never hurts to be explicit (end of the program).
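
# The difference between the explicit get_trimmed_text("/a") discussed for the anchor branch and the no-argument call used on Line 48 is easiest to see on an anchor that has another tag inside it. A small sketch on made-up markup:

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<a href="/x">plain <b>bold</b> tail</a>';   # made-up markup

# No argument: stop at the very next tag, here <b>.
my $p1 = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$p1->get_tag('a');
my $upto_next_tag = $p1->get_trimmed_text;

# Explicit end tag: keep collecting text until </a>, skipping <b> and </b>.
my $p2 = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$p2->get_tag('a');
my $upto_close = $p2->get_trimmed_text('/a');

print "no argument: [$upto_next_tag]\n";
print "with '/a':   [$upto_close]\n";
```

# The same mechanism is why passing /td on Line 48 would be risky on a page with sloppy markup: the parser keeps reading until it really sees that end tag, however far away it is.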


LINE NUMBERED CODE LISTING

   1: #!/usr/bin/perl -w
   2: 
   3: use     strict          ;    # fun with whitespace
   4: use     LWP::Simple;         # what's that? {provides get($url), just `perldoc`}
   5: require HTML::TokeParser;    # Why? because
   6: 
   7: $|=1;                        # un buffer
   8: 
   9: my $cpanurl = 'http://search.cpan.org/recent';
  10: 
  11: my @newest20;                # the top 20 
  12: my $lastupdated = '';        # $n distributions have been uploaded since $date
  13: my $rawHTML = get($cpanurl); # attempt to d/l the page to mem
  14: 
  15: die "LWP::Simple messed up $!" unless ($rawHTML);
  16:                              # Habit.  if it's empty, TokeParser would notice
  17: 
  18: my $tp = HTML::TokeParser->new(\$rawHTML) || die "Can't open: $!";
  19: 
  20: 
  21: # And now -- a generic HTML::TokeParser loop
  22: 
  23: while (my $token = $tp->get_token)
  24: {
  25:     my $ttype = shift @{ $token };
  26: 
  27:     if($ttype eq "S")    # start tag?
  28:     {
  29:         my($tag, $attr, $attrseq, $rawtxt) = @{ $token };
  30: 
  31:         if($tag eq "a")
  32:         {
  33:             my $a_href = $attr->{'href'};
  34:             my $a_encl = $tp->get_trimmed_text("/$tag");
  35: 
  36: # be sure you understand what get_trimmed_text or get_text are doing
  37: # calling either (as well as get_tag) can drastically change
  38: # the cursor position
  39: # in general calling the no argument version, is preferable here
  40: 
  41:             push ( @newest20 , [ $a_href, $a_encl ] )
  42:             if( defined $a_href and $a_href =~ /\/search\?dist\=/ );
  43:         }
  44:         elsif( ($tag eq "td") and ($rawtxt =~ /colspan=2/m) )
  45:         {
  46:           # as opposed to checking the hash like exists $attr->{colspan}
  47: 
  48:             my $p_text = $tp->get_trimmed_text;  # p for potential
  49: 
  50: # fetches the "trimmed" up until the next "token"
  51: # passing /td to get_trimmed_text is not advisable, because
  52: # TokeParser would slurp all the text until the next closing /td
  53: # which would in effect cause us to skip halfway down the file
  54: # missing our target links (and pretty much all of them)
  55: # we could always call unget_token, but this is hard.
  56: # like swimming up river (but not as enjoyable)
  57: 
  58:             $lastupdated = $p_text
  59:             if($p_text =~ /distributions have been uploaded/m);
  60:         }
  61:     } # since we know what we're looking for, no need for the rest of these
  62:     elsif($ttype eq "T") # text?
  63:     {
  64:     }
  65:     elsif($ttype eq "C") # comment?
  66:     {
  67:     }
  68:     elsif($ttype eq "E") # end tag?
  69:     {
  70:     }
  71:     elsif($ttype eq "D") # declaration?
  72:     {
  73:     }
  74: 
  75:     last if(scalar @newest20 == 20); # we disappear once we get 20
  76: 
  77: } # endof while (my $token = $tp->get_token)
  78: 
  79: undef $rawHTML; # no more raw html
  80: undef $tp;      # destroy the HTML::TokeParser object (don't need it no more)
  81: 
  82: print "<H5> $lastupdated </H5>\n" if($lastupdated); # just in case we miss it
  83: 
  84: for my $arayref (@newest20)
  85: {
  86:     print "<A HREF='http://search.cpan.org",
  87:           $arayref->[0],     # the link straight from href
  88:           "'>",
  89:           $arayref->[1],     # the link text
  90:           "</A><BR>\n";
  91: }
  92: 
  93: exit;
  94: __END__


Song in A minor

    AM came from out the maze
    Hitch-hiked on a 56k
    Scratched his head, then tickled his 'board
    Scratched his ass, and then was bored
    He said, hey baby, PLEASE! do my work for me
    She said, no way baby, i'm not that lonely
    And the perled monks go: doo doo doo..
    Crazy came from planet x
    Saw some monk showin' his pecks
    Scratched his head, then pounded his 'board
    Checked politely, consider this node
    He said, hey troll, take a walk on to /dev/null
    Troll said, what, hey i'm not dumb
    And the pereld monks go: dasright R TT FF MM.
