comment on

I'm not very experienced with either of these modules, but, as ikegami points out, some of your code seems strange—for example, if you're going to put the result of decode_entries in another scalar anyway, why use the hard-to-read decode_entities(my $new = $old) rather than the more natural my $new = decode_entities $old? Have you looked at $decodedParsedContentWithDecodeEntities? I'd take a look at that, because, well, you have unexpected behaviour, and it's good to know what's happening every step of the way.

Also note that the Text::Sentence documentation says:

The split sentences function takes a scalar containing ascii text as an argument and returns an array of sentences that the text has been split into.

—that is, it mentions that it's expecting ASCII text, which you're explicitly not giving it.

I'm also puzzled how you can get the gor : BLAH line at all. It seems that you're printing lines of the form word : words (why?), with the left-hand side a word in the right-hand side, but gor doesn't appear in the right-hand side of the output that you displayed.

UPDATE: For that matter, have you looked at $decodedParsedContent itself? A quick look at the non-XS part of the source for HTML::Entities reveals that it's just substituting decimal, then hexadecimal, then named entities. One can imagine a strange scenario where, say, the expansion of a hexadecimal entity creates a decimal entity; it's possible, though (I imagine) unlikely, that you're seeing that here.

In reply to Re: Wierd behaviour with HTML::Entities::decode_entities() by JadeNB
in thread Wierd behaviour with HTML::Entities::decode_entities() by Anonymous Monk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Clear questions and runnable code get the best and fastest answer
	PerlMonks