Re: Reverse engineering HTML


	PerlMonks


by Corion (Patriarch)

on Jun 14, 2001 at 17:52 UTC ( [id://88415]=note )


in reply to Reverse engineering HTML

When it comes to extracting data from HTML, I've ditched parsing it in Perl in favour of HTML-tidy and XSL stylesheets.

HTML-tidy is a tool that tries to convert ugly HTML into well-formed XHTML, and it does a good job of it. You might want to preprocess your HTML with it, as it removes a lot of the ugly special cases that make interpreting HTML such a pain.

XSL stylesheets (I use Saxon as the interpreter) provide an easy way to transform XML (and XHTML is a special case of XML) into other text formats. Each template in a stylesheet declares a pattern and fires for every node that matches it, much like a regular expression fires on matching text (although the patterns are written in XPath-style syntax, not regular expression syntax).

If you're not afraid to include two system() calls to external programs, this might make your work a little bit easier. (HTML-tidy promises a Perl API, and there are XSL APIs for Perl as well, so you may be able to avoid shelling out.)
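
Here is a minimal sketch of what the two calls could look like, extracting every link from a page. Treat the details as assumptions rather than a tested recipe: the tidy flags (-asxhtml, -numeric, --doctype omit, -output) are those of current tidy builds, the Saxon invocation uses the -s: / -xsl: syntax of recent Saxon releases (older ones take the source and stylesheet as plain positional arguments), and the SAXON_JAR environment variable and the jar name are made up for the example. The h: prefix is needed because tidy puts its XHTML output into the XHTML namespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Input page comes from the command line; the jar location is an
# assumption (hypothetical SAXON_JAR variable, hypothetical file name).
my $html_file = shift @ARGV or die "usage: $0 page.html\n";
my $saxon_jar = $ENV{SAXON_JAR} || 'saxon.jar';

# Stylesheet: print "URL<TAB>link text" for every <a href="...">.
my $stylesheet = <<'XSL';
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>

  <!-- Override the built-in rule that copies all text to the output -->
  <xsl:template match="text()"/>

  <!-- Fires for every link, wherever it sits in the document -->
  <xsl:template match="h:a[@href]">
    <xsl:value-of select="@href"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
XSL

# Write the stylesheet to a temporary file that is removed on exit.
my ($xsl_fh, $xsl_file) = tempfile(SUFFIX => '.xsl', UNLINK => 1);
print {$xsl_fh} $stylesheet;
close $xsl_fh or die "Cannot write $xsl_file: $!\n";

# Temporary file for the cleaned-up XHTML.
my (undef, $xhtml_file) = tempfile(SUFFIX => '.xhtml', UNLINK => 1);

# Call 1: HTML-tidy turns the page into well-formed XHTML.
# -numeric plus --doctype omit keeps the XSLT processor from needing
# the XHTML DTD to resolve named entities such as &nbsp;.
system('tidy', '-quiet', '-asxhtml', '-numeric', '--doctype', 'omit',
       '-output', $xhtml_file, $html_file);
die "Could not run tidy: $!\n" if $? == -1;
# tidy exits with 1 for mere warnings, 2 for errors it could not fix.
die "tidy could not repair $html_file\n" if ($? >> 8) >= 2;

# Call 2: Saxon applies the stylesheet and prints the result to STDOUT.
system('java', '-jar', $saxon_jar, "-s:$xhtml_file", "-xsl:$xsl_file") == 0
    or die "Saxon failed on $xhtml_file\n";
```

Run it as something like `perl links.pl page.html > links.txt` (the script name is arbitrary). To extract something else, only the second template needs to change.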