PerlMonks |
What an ugly mess... I pity you... :-)
I'm curious as to why there are multiple <HTML> tags in the same document. Assuming that's not an artifact that you created, I would split this huge document into several parts, using these tags as 'delimiters', and handle each piece separately (since multiple <HTML> tags have no value). Within those individual pieces, it might be easier to see structure. In this case, the person probably used a one-column table to get some visual effect, but it's otherwise useless from what I can tell.

Programmatically, if all you care about is extracting the information from the page, it might just be easier to use lynx to get the text version, possibly intelligently adding <P>, <A>, and <UL> tags and ignoring the rest of the formatting. That at least gives you a starting point where you have not lost any of the content and can begin anew with the HTML design.
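To make the two ideas above concrete, here is a rough sketch in Python (the same approach carries over to Perl with split and a regex). The function names split_on_html_tags and text_to_paragraphs are made up for illustration, and the sketch assumes the opening <HTML> tags are reliable delimiters and that the text version separates paragraphs with blank lines. It only re-adds <P> tags; <A> and <UL> would need more work.

```python
import html
import re
from typing import List

# Matches an opening <HTML> tag in any case, with optional attributes.
HTML_OPEN = re.compile(r"<html\b[^>]*>", re.IGNORECASE)


def split_on_html_tags(document: str) -> List[str]:
    """Split a document into pieces, each starting at an opening <HTML> tag."""
    starts = [m.start() for m in HTML_OPEN.finditer(document)]
    if not starts:
        return [document]
    pieces = []
    # Keep any non-blank content that precedes the first <HTML> tag.
    if document[:starts[0]].strip():
        pieces.append(document[:starts[0]])
    for begin, end in zip(starts, starts[1:] + [len(document)]):
        pieces.append(document[begin:end])
    return pieces


def text_to_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks of plain text in <P> tags.

    Assumption: the text dump uses blank lines between paragraphs.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    return "\n".join(
        "<P>" + html.escape(" ".join(block.split())) + "</P>"
        for block in blocks
        if block.strip()
    )
```

The text fed to text_to_paragraphs could come from something like `lynx -dump page.html > page.txt`, run separately on each piece produced by split_on_html_tags.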
Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain

In reply to Re: Reverse engineering HTML by Masem