perlquestion
larsen
I suppose you have faced, at least once, the problem of
reverse engineering extremely poorly written HTML.
In this case the issue is not simply [cpan://HTML::Parse|parsing]
HTML, but guessing the structure (if there is one)
behind what was written by some absent-minded HTML coder
(artificial, human, or a messy combination of both).<p>
Here is an example of what I'm dealing with:
<code>
</tr>
</table><html>
<body bgcolor="#FFFFFF">
<table border="0" cellspacing="0" cellpadding="0"> <link rel="stylesheet" href="/stile.css" type="text/css">
<tr>
<td>
<span class="span">
<font color="#FF0000" size="1" face="Verdana, Arial">29/05/2001 15:30</font>
<font size="1" face="Verdana, Arial">(ACOI - associazione Chirurghi ospedalieri italiani)
<br>
<a href="http://www.immedia.it/published/20010529/2001052914683.shtml" target="ImmediaPress"><font size="1" face="Verdana, Arial">
<b>IN DUEMILA DA TUTTO IL MONDO AL CONGRESSO
DELL'ASSOCIAZIONE DEI CHIRURGHI OSPEDALIERI ITALIANI </b></font>
</a>
<br>
<!-- <font face="verdana, arial, helvetica" size="1">
(IMMEDIAPRESS) Modena e' stata per quattro giorni una citta' internazionale grazie...
</font> -->
</font>
</td>
</tr>
<tr>
<td>
<img src="http://www.immedia.it/images/line_home.gif" width=308 height=1 border=0 alt=""><br>
</td>
</tr>
<tr><td height="6"></td></tr>
</table>
</body>
</html><html>
<body bgcolor="#FFFFFF">
</code>...and so on.<p>
At the moment I'm using Adobe GoLive to dig through the HTML, since
its tree view is the best I'm aware of.
Are there common techniques or general principles
for dealing with such problems? Thank you.
<p><small>2001-06-16 Edit by [Corion] : Fixed link</small></p>