Re: Wierd behaviour with HTML::Entities::decode_entities()

by JadeNB (Chaplain)
on Dec 14, 2009 at 04:16 UTC


in reply to Wierd behaviour with HTML::Entities::decode_entities()

I'm not very experienced with either of these modules, but, as ikegami points out, some of your code seems strange. For example, if you're going to put the result of decode_entities in another scalar anyway, why use the hard-to-read decode_entities(my $new = $old) rather than the more natural my $new = decode_entities $old? Have you looked at $decodedParsedContentWithDecodeEntities? I'd take a look at that, because, well, you have unexpected behaviour, and it's good to know what's happening every step of the way.
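To show why both call forms end up with the same value, here is a minimal sketch. decode_demo is a hypothetical stand-in, not part of HTML::Entities; it only handles &amp; and is assumed to mimic the documented calling convention (modify in place in void context, otherwise return a decoded copy).

```perl
use strict;
use warnings;

# Hypothetical stand-in (NOT the real decode_entities): it mimics only the
# calling convention. In void context it edits its argument in place;
# otherwise it returns a decoded copy and leaves the argument alone.
sub decode_demo {
    if (defined wantarray) {
        my ($copy) = @_;
        $copy =~ s/&amp;/&/g;
        return $copy;
    }
    $_[0] =~ s/&amp;/&/g;    # @_ aliases the caller's variable
    return;
}

my $old = 'fish &amp; chips';

# Form 1: copy first, then decode the copy in place (void context).
decode_demo(my $new1 = $old);

# Form 2: let the function hand back the decoded copy.
my $new2 = decode_demo($old);

print "$new1\n";    # fish & chips
print "$new2\n";    # fish & chips
print "$old\n";     # fish &amp; chips (untouched either way)
```

Both forms leave $old alone; the second simply says what it does.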

Also note that the Text::Sentence documentation says:

    The split_sentences function takes a scalar containing ascii text as an argument and returns an array of sentences that the text has been split into.

That is, it says it expects ASCII text, which you're explicitly not giving it.

I'm also puzzled how you can get the gor : BLAH line at all. It seems that you're printing lines of the form word : words (why?), where the left-hand side is a word from the right-hand side, but gor doesn't appear in the right-hand side of the output that you displayed.

UPDATE: For that matter, have you looked at $decodedParsedContent itself? A quick look at the non-XS part of the source for HTML::Entities reveals that it's just substituting decimal, then hexadecimal, then named entities. One can imagine a strange scenario where, say, the expansion of a hexadecimal entity creates a decimal entity; it's possible, though (I imagine) unlikely, that you're seeing that here.
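As a rough illustration of how a fixed substitution order could decode more than intended, here is a simplified sketch. decode_sequential and decode_single_pass are hypothetical names, and the four-entry entity table is an assumption for the demo; the code is modelled on the ordering described above and is not copied from the HTML::Entities source.

```perl
use strict;
use warnings;

# Tiny named-entity table, for illustration only.
my %entity2char = (amp => '&', quot => '"', lt => '<', gt => '>');

# Hypothetical sketch: three separate substitutions in a fixed order
# (decimal, then hexadecimal, then named). Text produced by an earlier
# substitution is scanned again by the later ones.
sub decode_sequential {
    my ($s) = @_;
    $s =~ s/&\#(\d+);?/chr($1)/eg;
    $s =~ s/&\#[xX]([0-9a-fA-F]+);?/chr(hex $1)/eg;
    $s =~ s/(&(\w+);?)/exists $entity2char{$2} ? $entity2char{$2} : $1/eg;
    return $s;
}

# Hypothetical sketch: one substitution covering all three forms, so the
# result of an expansion is never scanned again within the same call.
sub decode_single_pass {
    my ($s) = @_;
    $s =~ s{&(?:\#(\d+)|\#[xX]([0-9a-fA-F]+)|(\w+));?}{
        defined $1 ? chr($1)
      : defined $2 ? chr(hex $2)
      : exists $entity2char{$3} ? $entity2char{$3} : $&
    }eg;
    return $s;
}

my $input = '&#38;quot;';    # "&" written as a decimal entity, then "quot;"

print decode_sequential($input), "\n";                         # "
print decode_single_pass($input), "\n";                        # &quot;
print decode_single_pass(decode_single_pass($input)), "\n";    # "
```

In this sketch the sequential version does in one call what the single-pass version needs two calls to do, which is the kind of order-dependent surprise I have in mind.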

Replies are listed 'Best First'.
Re^2: Wierd behaviour with HTML::Entities::decode_entities()
by ikegami (Patriarch) on Dec 14, 2009 at 06:22 UTC

    One can imagine a strange scenario where,

    If the scenario you describe is possible, it's a bug in HTML::Entities. And it's not what the OP is seeing. The OP claims he needs to do too many decodings, not too few.

    I think you're thinking of double-encoding, where "foo" was accidentally encoded as

    &amp;quot;foo&amp;quot;
    when it should have been encoded as
    &quot;foo&quot;
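To make double-encoding concrete, here is a minimal sketch that encodes a string twice and then counts how many decoding passes it takes to get back to the original. encode_basic and decode_basic are hypothetical helpers that handle only four characters; they are stand-ins for illustration, not the HTML::Entities functions.

```perl
use strict;
use warnings;

# Minimal tables for the demo; the real module knows far more entities.
my %char2entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
my %entity2char = reverse %char2entity;

# Hypothetical helper: escape the four characters above.
sub encode_basic {
    my ($s) = @_;
    $s =~ s/([&<>"])/$char2entity{$1}/g;
    return $s;
}

# Hypothetical helper: undo exactly one level of encoding.
sub decode_basic {
    my ($s) = @_;
    $s =~ s/(&\w+;)/exists $entity2char{$1} ? $entity2char{$1} : $1/eg;
    return $s;
}

my $once  = encode_basic('"foo"');    # &quot;foo&quot;
my $twice = encode_basic($once);      # &amp;quot;foo&amp;quot;

# Decode until nothing changes, printing every intermediate step.
my ($text, $passes) = ($twice, 0);
while (1) {
    my $next = decode_basic($text);
    last if $next eq $text;
    $text = $next;
    $passes++;
    print "pass $passes: $text\n";
}
# pass 1: &quot;foo&quot;
# pass 2: "foo"
```

If the input was encoded twice, needing two decoding passes is expected, and printing each intermediate value shows where the extra layer comes from.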
      If the scenario you describe is possible, it's a bug in HTML::Entities. And it's not what the OP is seeing. The OP claims he needs to do too many decodings, not too few.

      I don't think that I'm claiming what you think I'm claiming. :-) What I meant was that, say, […] (or even just […]) would be interpreted (incorrectly) as […] by two passes of decode_entities, but not by one. This gives “unexpected decoding” after the second pass, but it's not a bug in HTML::Entities.

      (UPDATE: I meant what I meant, but it wasn't quite what I said. A better example is a, which becomes a after one pass of the decoder and then (incorrectly) a after another. This is very particular to the ordering I mentioned earlier (first decimal, then hexadecimal, then named entities are expanded). This is the ordering in the pure-Perl decode_entities_old in HTML::Entities; I have no idea if the XS version also behaves this way. Perhaps you thought that I was mentioning that, say, &#amp;quot; " would be incorrectly converted to "? You're right, it seems to me that that is what will happen, and that it is a bug.)

      On the other hand, I couldn't, and can't, think of a way that this would give the behaviour that the OP is seeing. The kind of double-encoding you mentioned sounds far more likely—and the remedy, I think, is the same, to look at the intermediate steps along the way to see where something's going wrong. (Actually, I guess that's so generic that it's true for just about any problem.)

        This is very particular to the ordering I mentioned earlier (first decimal, then hexadecimal, then named entities are expanded)

        No, it isn't.

        Different nesting order:

        >perl -MHTML::Entities -le"print decode_entities '"'; "
        >perl -MHTML::Entities -le"print decode_entities '"'; "

        Different sibling order:

        [ Can't find a valid example ]

        Or I still don't understand. Please give an example where ordering matters.
