PerlMonks
Parsing incorrect html
by seki (Monk)
on Jun 07, 2017 at 14:18 UTC ( [id://1192280]=perlquestion )
seki has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need to parse some (externally generated) HTML and, ideally, get the contents of the body to produce some new content.

(For the curious: the current processing in production, done by an outsourced resource in India, is to collect some files and concatenate them into a single one (yes, with all the individual doctype, html, head and body tags). We are lucky that it even displays something readable in a browser!)

So I thought about using HTML::TreeBuilder, but some of the individual files are themselves not well-formed: their content is already a document wrapped inside another file (sure, when you only have a hammer, everything looks like a nail...). As a result, my attempt to get the body gives a weird result: the two bodies seem to be mixed into a single item.

How would you proceed to get the content of the inner HTML document? Would you use another package? I have looked through the options of HTML::Parser, which TreeBuilder uses, but did not see anything relevant.
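One direction that might work is to skip tree building and drive HTML::Parser directly, since its tokenizer reports tags in the order they appear without trying to repair the nesting. The following is only a minimal sketch under assumptions: the sample markup in `$html` is hypothetical (it is not the real production input), and the variable names `$buffer`, `$inner` and `$open_bodies` are made up for illustration. It restarts the buffer at every `<body>` and keeps whatever has been collected when the first `</body>` arrives, which is the innermost body as long as the body tags are properly nested.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample: one complete document wrapped inside another.
my $html = <<'HTML';
<html><head><title>outer</title></head>
<body><h1>outer</h1>
<html><head><title>inner</title></head>
<body><p>inner</p></body></html>
</body></html>
HTML

my $buffer      = '';  # raw markup collected since the most recent <body>
my $inner;             # content of the innermost body, once found
my $open_bodies = 0;   # number of <body> tags currently open

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $text) = @_;
        if ($tag eq 'body') {
            # A new body starts: forget what the enclosing body collected.
            $open_bodies++;
            $buffer = '';
            return;
        }
        $buffer .= $text if $open_bodies;
    }, 'tagname, text' ],
    end_h => [ sub {
        my ($tag, $text) = @_;
        if ($tag eq 'body') {
            # The first </body> seen closes the innermost body.
            $inner = $buffer unless defined $inner;
            $open_bodies-- if $open_bodies;
            return;
        }
        $buffer .= $text if $open_bodies;
    }, 'tagname, text' ],
    # Plain text, comments and declarations are copied through unchanged.
    default_h => [ sub { $buffer .= $_[0] if $open_bodies }, 'text' ],
);

$parser->parse($html);
$parser->eof;

print "$inner\n";
```

The captured string could then be handed to HTML::TreeBuilder if a tree is still wanted. This sketch does not cope with files that contain several sibling documents one after another; that case would need one buffer per body.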
The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian