Looking at two recent frontpaged nodes, I see some ugly critters where some non-ASCII characters should be. The nodes are "Seven habits of highly careful coders" and "Yet Another Perl/PHP/CF/NET Comparison Question", both under the header "New Meditations". The characters are intended to be just curly single and double quotes, but for each character you can see two characters there, which appear to be the bytes representing these characters in UTF-8. They look like this, on the frontpage:
- When he asks “How can I be more careful?”, We usually answer. “That is up to you to figure out” After some thought I’m not sure this is the right approach.
- I'm concerned that they claim they can code Perl but havenÂ’t even heard of CPAN.
Now the odd thing is that if you go look at each node on its own page, it looks just fine:
- When he asks “How can I be more careful?”, We usually answer. “That is up to you to figure out” After some thought I’m not sure this is the right approach.
- I'm concerned that they claim they can code Perl but haven’t even heard of CPAN.
So it looks to me like the data is just fine in the database.
Now, one can only guess what is happening, but a possibility to look into is that a plain byte string, which Perl treats as ISO-Latin-1, gets concatenated with something that Perl has flagged as a UTF-8 string. Whenever that happens, perl will "promote" the ISO-Latin-1 string to UTF-8, turning each of the bytes with value >= 128 into two bytes. If those bytes were already UTF-8 to begin with, the text ends up encoded twice, which would match what shows up on the frontpage.
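To make that guess concrete, here is a minimal sketch of the suspected mechanism. It assumes the node text comes back from storage as raw UTF-8 bytes and the surrounding page text is a decoded character string; the variable names are made up for illustration, and this is not the site's actual code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The curly apostrophe U+2019 as three raw UTF-8 bytes (an unflagged byte string).
my $quote_bytes = "\xE2\x80\x99";

# A character string that Perl has flagged as UTF-8: U+201C, a curly double quote.
my $flagged = decode('UTF-8', "\xE2\x80\x9C");

# Concatenation upgrades $quote_bytes as if it were Latin-1, so each of its
# three bytes becomes a separate character.
my $joined = $flagged . $quote_bytes;

# Encoding for output then writes each of those three characters as two bytes.
my $out = encode('UTF-8', $joined);

printf "%d characters, %d bytes on output\n", length($joined), length($out);
```

The apostrophe goes in as three bytes and comes out as six, which a browser then shows as several junk characters instead of one quote.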
A possible fix, which is on the safe side because it's applicable everywhere, is to turn every non-ASCII character into an entity: either a named entity, as produced by HTML::Entities, or a numerical entity like &#165;, where the number is nothing but the character's ordinal code in the Unicode/Latin-1 character set.
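As a rough sketch of the numerical variant: the helper below, to_numeric_entities, is a hypothetical name of my own and not part of any module. It assumes its input is already a character string, not raw UTF-8 bytes. HTML::Entities' encode_entities would be the route for named entities.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace every character outside ASCII with a numeric
# entity built from its ordinal value.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# A Latin-1 character: the yen sign, ordinal 165.
print to_numeric_entities("price: \x{A5}100"), "\n";

# A character beyond Latin-1: decode the UTF-8 bytes first, then convert.
print to_numeric_entities(decode('UTF-8', "haven\xE2\x80\x99t")), "\n";
```

The output is plain ASCII, so it should survive whatever promotion or re-encoding happens further down the line.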
n.b. These characters in the above posts are actually not in the ISO-Latin-1 repertoire. They are in the Windows character set, though, which is compatible with ISO-Latin-1 plus a few extra printable characters. So, to follow the rules, a numerical entity for one of them should use the character's ordinal value in Unicode, not its byte value in the Windows character set.
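To illustrate the difference between the two numbers, here is a small sketch that assumes "the Windows character set" means code page 1252, which Encode knows as 'cp1252'. It takes the byte that cp1252 uses for the curly apostrophe and looks up its Unicode ordinal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte 0x92 is the curly apostrophe in cp1252.
my $windows_byte = "\x92";

# Decoding maps it to the real Unicode character, U+2019.
my $unicode_char = decode('cp1252', $windows_byte);

printf "Windows byte value: &#%d;\n", ord($windows_byte);
printf "Unicode ordinal:    &#%d;\n", ord($unicode_char);
```

Only the second form names the character by its Unicode ordinal, so that is the one to emit.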
Update: The author of my first example fixed up his node, thereby removing my evidence. :( Well, I found another one here.