PerlMonks  

Re: Parse... then what? (HTML Parsing problems)

by THRAK (Monk)
on Aug 20, 2001 at 16:27 UTC ( [id://106198]=note )


in reply to Parse... then what? (HTML Parsing problems)

Chady,

I'm with ichimunki on this one: use HTML::TokeParser. Here's a basic working snippet based on a parser I'm working on. It may be of help to you:
#!/usr/local/bin/perl -w
###########################################################
# includes ################################################
###########################################################
use strict;
use HTML::TokeParser;

#################
### Variables ###
#################
my $file_in = 'test.html';

##################
### Parse HTML ###
##################
my $p = HTML::TokeParser->new($file_in) || die "Can't open: $!";

# Each token is an array ref whose first element is the token type.
while (my $token = $p->get_token) {
    my $token_type = $token->[0];
    start($token->[1], $token->[4]) if ($token_type eq 'S');  # Start Tag
    end($token->[1], $token->[2])   if ($token_type eq 'E');  # End Tag
    text($token->[1])               if ($token_type eq 'T');  # Text
    comment($token->[1])            if ($token_type eq 'C');  # Comment
    declaration($token->[1])        if ($token_type eq 'D');  # Declaration
}

###########################################################
# SUB's ###################################################
###########################################################

#############
### DTD's ###
#############
sub declaration {
    my ($declaration) = @_;
    print "DEC: $declaration\n";
}

################
### Comments ###
################
sub comment {
    my ($comment) = @_;
    print "CMT: $comment\n";
}

#####################
### Text Entities ###
#####################
sub text {
    my ($text) = @_;
    return if ($text =~ /^(\s+)$/);  # skip blank lines
    $text =~ s/\s+/ /g;              # kill off big chunks of whitespace
    $text =~ s/\n//g;                # keep text split across lines together
    print "TEXT: $text\n";
}

##################
### Start Tags ###
##################
sub start {
    my ($tag, $origtext) = @_;
    chomp $origtext;
    print "ST: $tag = $origtext\n";
}

################
### End Tags ###
################
sub end {
    my ($tag, $origtext) = @_;
    chomp $origtext;
    print "ET: $tag = $origtext\n";
}
You'll need to add whatever logic it takes to grab the tags you want, either in the parsing while loop or in one of the subroutines.
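As a rough illustration of that kind of tag-grabbing logic, here is a minimal sketch that pulls the href out of every anchor tag. The extract_links name and the sample markup are made up for this example and are not part of the parser above. It also assumes the document is passed in as a string (HTML::TokeParser accepts a reference to a scalar as the literal document) rather than read from test.html.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not from the snippet above): return the href
# of every <a> start tag found in an HTML string.
sub extract_links {
    my ($html) = @_;

    # A reference to a scalar is treated as the document text itself.
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't parse document";

    my @links;
    while (my $token = $p->get_token) {
        # Start-tag tokens look like ["S", $tag, \%attr, \@attrseq, $origtext];
        # tag and attribute names arrive lowercased.
        next unless $token->[0] eq 'S' && $token->[1] eq 'a';

        my $href = $token->[2]{href};
        push @links, $href if defined $href;   # skip anchors without an href
    }
    return @links;
}

# Made-up sample input, for illustration only.
my $sample = '<p><a href="/one">One</a>, <a name="x">no link</a>, <a href="/two">Two</a></p>';
print "LINK: $_\n" for extract_links($sample);
```

The same test on the tag name could just as well go inside the start() sub; keeping it in the loop simply avoids a sub call for tags you don't care about.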

-THRAK
www.polarlava.com
