http://qs321.pair.com?node_id=260301


in reply to Having HTML::Parser problem

From the HTML::Parser docs:

$p->unbroken_text( $bool )

By default, blocks of text are given to the text handler as soon as possible (but the parser makes sure to always break text at the boundary between whitespace and non-whitespace so single words and entities always can be decoded safely). This might create breaks that make it hard to do transformations on the text. When this attribute is enabled, blocks of text are always reported in one piece. This will delay the text event until the following (non-text) event has been recognized by the parser.

Note that the offset argspec will give you the offset of the first segment of text and length is the combined length of the segments. Since there might be ignored tags in between, these numbers can't be used to directly index in the original document file.
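
For illustration, here is a minimal sketch of how this might look in practice. It assumes the version 3 API with a text handler that takes `dtext`. The sample HTML, the `@chunks` array, and the way the input is split across `parse` calls are all made up for the demo and are not from the original question.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical collector for whatever the text handler receives.
my @chunks;

my $p = HTML::Parser->new(
    api_version => 3,
    # 'dtext' asks for the text with entities decoded.
    text_h      => [ sub { push @chunks, shift }, 'dtext' ],
);

# Report each run of text in one piece instead of as soon as possible.
$p->unbroken_text(1);

# Feed the document in arbitrary pieces, as might happen when reading
# from a socket or a file in fixed-size blocks.
$p->parse('<p>Hello wo');
$p->parse('rld &amp; every');
$p->parse('one</p>');
$p->eof;

print "[$_]\n" for @chunks;
```

With `unbroken_text` enabled, the text between the two tags should reach the handler as a single chunk. If you comment out the `unbroken_text(1)` line, the same input will most likely arrive as several smaller chunks, which is the behaviour the docs describe as the default.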

90% of every Perl application is already written.
dragonchild