
(OP here -- sorry, I should have obtained a username before starting this)

Sorry also for the delay, I stopped to do more tests based on the useful information you've all given me. The "wide character" error happens when I try to decode a string that already contains what I have been calling "unicode" extended characters. (Remember I said this is all a great mystery to me and I'd really like it to all go away forever? That includes my incorrect terminology.) That is, if the string already has characters such as \x{103} in it, trying to decode it produces that error. This turns out to be because one of my data sources sends extended characters in one format and another sends them in a different format (this is an API that has to merge data from several sources into a single output stream).

Or, more concretely: One of my data sources sends lower-case-a-with-breve ă as \xc4\x83, which is the kind that does need translating for my purposes, and the other data source sends it as \x{103}, which for my purposes is already translated into the format I need. decode('UTF-8') works properly on the former and errors on the latter, which seems to be correct behavior based on what you've said. I didn't realize the two data sources were doing it differently (neither of them has any documentation of what they do, alas) and I picked the wrong horse for my previous test.
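For anyone who finds this thread later, here is a minimal sketch of the two cases described above. It assumes the core Encode module; the variable names are made up for illustration, and the exact wording of the error differs between Encode versions, so the code only reports that the second decode failed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample values standing in for the two data sources.
my $bytes = "\xc4\x83";   # two octets: the UTF-8 encoding of U+0103
my $chars = "\x{103}";    # one character: already decoded text

# Decoding the octets yields a single character, U+0103.
my $decoded = decode('UTF-8', $bytes);
printf "decoded length: %d, code point: U+%04X\n",
    length($decoded), ord($decoded);

# Decoding text that is already characters dies, so trap it with eval.
my $again = eval { decode('UTF-8', $chars) };
my $error = $@;
print "second decode failed: $error" unless defined $again;
```

The first printf should report a length of 1 and U+0103. The practical upshot is to call decode() only on the source that delivers raw UTF-8 octets, and to leave the other source's strings alone.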

The reason I was calling the former of those "UTF-8" and the latter "Unicode" was because of pages like https://www.utf8-chartable.de/unicode-utf8-table.pl?start=256&names=-&utf8=string-literal, where I look up characters for when I have to translate them by brute force ... I'm still not sure of the correct term for the longer Unicode encoding where ă is \x{103} (AKA U+0103).
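On the terminology, as far as I can tell: U+0103 is the Unicode code point, \x{103} is Perl's escape for the character with that code point, and \xc4\x83 is that same character encoded as UTF-8 octets. A small sketch of the relationship, again assuming the core Encode module and made-up variable names:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{103}";                # the character U+0103 (a code point)
my $bytes = encode('UTF-8', $char);   # the same character as UTF-8 octets

printf "code point: U+%04X\n", ord($char);

# unpack 'H*' renders the octets as lowercase hex digits.
my $hex = unpack('H*', $bytes);
print "UTF-8 octets: $hex\n";
```

This should show c483 for the octets, which matches the "UTF-8 (hex.)" style column on the chart page linked above.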

Anyway, thank you! It does look like decode() does the right thing, when its user isn't dumb.

In reply to Re^4: UTF-8 and Unicode the hard way by Anonymous Monk
in thread UTF-8 and Unicode the hard way by Anonymous Monk
