Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Reverse engineering HTML

by Corion (Patriarch)
on Jun 14, 2001 at 17:52 UTC ( [id://88415]=note: print w/replies, xml ) Need Help??


in reply to Reverse engineering HTML

I've ditched Perl for parsing HTML in favour of HTML-tidy and XSL stylesheets when it comes to extraction of data from HTML.

HTML-tidy is a tool that tries to convert ugly HTML into well-formed XHTML, and it does a good job on it. You might want to preprocess your HTML with it, as it removes a lot of the ugly special cases that make interpreting HTML such a pain.

XSL stylesheets (I use Saxon as the interpreter) provide an easy way to transform XML (and XHTML is a special case of XML) into other ASCII formatted files, using a regular-expression like method (although the syntax is not really the syntax of regular expressions).

If you're not afraid to include the two system calls (HTML-tidy promises a Perl API, and there are XSL-APIs for Perl as well), this might make your work a little bit easier.

Replies are listed 'Best First'.
Re: Re: Reverse engineering HTML
by THRAK (Monk) on Jun 14, 2001 at 21:06 UTC
    I have to give a big ++ to Corion for this advice. If you have malformed HTML, running it through Tidy will definately make it far more useable. Although there is currently not a Perl implementation of it (WHAH!), it is very easy to incorporate via a Perl system call. If you have a lot of pages to process, you can build a Perl looping structure and process them one after another. If this is part of an inline process, you can run each file through before you Parse or do whatever with it. I'm currently implementing such an inline Tidy & Perl HTML::Parser process into an existing PHP process. If you have any question, feel free to contact me.

    -THRAK
    www.polarlava.com

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://88415]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (5)
As of 2024-04-19 02:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found