http://qs321.pair.com?node_id=812638


in reply to Wierd behaviour with HTML::Entities::decode_entities()

I'm not very experienced with either of these modules, but, as ikegami points out, some of your code seems strange. For example, if you're going to put the result of decode_entities in another scalar anyway, why use the hard-to-read decode_entities(my $new = $old) rather than the more natural my $new = decode_entities $old? Have you looked at $decodedParsedContentWithDecodeEntities? I'd take a look at that, because you're seeing unexpected behaviour, and it's good to know what's happening every step of the way.

Also note that the Text::Sentence documentation says:

The split_sentences function takes a scalar containing ascii text as an argument and returns an array of sentences that the text has been split into.
That is, it says it expects ASCII text, which you're explicitly not giving it.
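A quick way to check for this is to look for characters outside the ASCII range before passing the text on. This is only a sketch in plain Perl: the helper name non_ascii_chars and the sample strings are made up for illustration and don't come from your code.

```perl
use strict;
use warnings;

# Hypothetical helper: return every non-ASCII character in $text
# as a "U+XXXX" code point label, in order of appearance.
sub non_ascii_chars {
    my ($text) = @_;
    return map { sprintf('U+%04X', ord($_)) } $text =~ /([^\x00-\x7F])/g;
}

my $encoded = 'Caf&eacute; au lait.';   # still entity-encoded: pure ASCII
my $decoded = "Caf\x{E9} au lait.";     # the same text with the entity decoded

print "encoded: @{[ non_ascii_chars($encoded) ]}\n";
print "decoded: @{[ non_ascii_chars($decoded) ]}\n";
```

An empty list means the string is plain ASCII. Anything else tells you exactly which characters the sentence splitter was not written to expect.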

I'm also puzzled about how you can get the gor : BLAH line at all. It seems that you're printing lines of the form word : words (why?), where the left-hand side is a word that occurs in the right-hand side, but gor doesn't appear in the right-hand side of the output that you displayed.

UPDATE: For that matter, have you looked at $decodedParsedContent itself? A quick look at the non-XS part of the source for HTML::Entities reveals that it's just substituting decimal, then hexadecimal, then named entities. One can imagine a strange scenario where, say, the expansion of a hexadecimal entity creates a decimal entity; it's possible, though (I imagine) unlikely, that you're seeing that here.
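To make that scenario concrete, here is a simplified sketch of a decoder that runs three substitution passes in the order described. It is not the module's actual source: the sub name three_pass_decode, the regexes, and the tiny %named table are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical, much-reduced table of named entities.
my %named = (amp => '&', lt => '<', gt => '>', quot => '"');

# Simplified three-pass decoder: decimal, then hexadecimal, then named.
sub three_pass_decode {
    my ($s) = @_;
    $s =~ s/&#(\d+);/chr($1)/ge;                                  # pass 1: decimal
    $s =~ s/&#[xX]([0-9a-fA-F]+);/chr(hex($1))/ge;                # pass 2: hexadecimal
    $s =~ s/&(\w+);/exists $named{$1} ? $named{$1} : "&$1;"/ge;   # pass 3: named
    return $s;
}

# The hex pass turns "&#x26;" into "&", which completes a decimal entity
# only after the decimal pass has already run, so it stays undecoded.
print three_pass_decode('&#x26;#65;'), "\n";   # prints "&#65;"

# The decimal pass turns "&#38;" into "&", which completes a hex entity
# that the following hex pass then decodes as well.
print three_pass_decode('&#38;#x41;'), "\n";   # prints "A"
```

In this sketch the outcome depends on which kind of entity produces which: a later pass can decode something an earlier pass created, but not the other way round. Whether the real module behaves the same way is something you'd have to check against its source.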