Encoding Hell

kettle has asked for the wisdom of the Perl Monks concerning the following question:

I am using Perl's LWP user agent to retrieve data from the web and am having an awful time dealing with Asian characters and other multibyte monstrosities.

Basically, I think it all boils down to what is going on in about three lines of code:

#$ua "a new user agent"
#$req "a request to a website"
#$res "a response to the request"
$res = $ua->get($req);
if($res->is_success){
  $out = $res->content;
  print $out."\n";
}

My api works fine for Latin-character-based languages, but fouls up horribly when it encounters Asian characters: Japanese or Chinese kanji, or Korean Hangul. Instead of returning a nice string of readable characters, $out (or $res, I'm not sure which) returns a string of octets corresponding to the individual bytes of these multibyte characters. This alone would not be too awful, but in addition, those octets which happen to have alter-egos in the ASCII character set, i.e. those which can be properly represented as a single ASCII octet, are being converted to their single-byte representation. This means that a single kanji character gets split into two octets, say \304\211 (fictional!), and the \211 gets converted to a double quote " so that the final output of my script looks like \304". When I try to convert this back to multibyte characters I obviously get garbage.

I'd like to know: at what point is perl carrying out this conversion process, and how can I intervene to make it do what I want, that is, either not interpret ANY octets or, ideally, print out sensible output? I will keep looking for a solution on my own, but I would greatly appreciate any thoughts. Encoding Hell.


Replies are listed 'Best First'.
Re: Encoding Hell by Anonymous Monk on Aug 08, 2006 at 08:30 UTC
Did you even look at Google's Terms of Service? From the looks of it you 'send automated queries', which they don't want you to do.
Re: Encoding Hell by rhesa (Vicar) on Aug 08, 2006 at 13:25 UTC
Assuming the web server on the other end is properly configured, it will tell you which encoding to use in the HTTP headers. This may look like `Content-Type: text/html; charset=utf-8` or some variation thereof. If the charset value is specified, you can pass it on to Encode to decode the content properly. Alternatively, you can sometimes find this indication in the `<head>` section of the HTML document. In that case, look for a line like this: `<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />`. See Encode for details on converting between character sets. See HTTP::Response and HTTP::Headers for access to the HTTP headers.
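To make the header approach concrete, here is a minimal sketch that uses only the core Encode module. The helper names `charset_from_content_type` and `decode_body` are made up for illustration, the sample bytes are the two characters 日本 hand-encoded in Shift_JIS, and the UTF-8 fallback is an assumption rather than anything the server promises.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: extract the charset parameter from a Content-Type value.
sub charset_from_content_type {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    return $1 if $content_type =~ /charset\s*=\s*["']?([\w.:-]+)/i;
    return undef;
}

# Hypothetical helper: turn raw octets into a Perl character string,
# using the declared charset or an assumed fallback.
sub decode_body {
    my ($content_type, $octets, $fallback) = @_;
    my $charset = charset_from_content_type($content_type) || $fallback || 'UTF-8';
    return decode($charset, $octets);   # dies if Encode does not know $charset
}

# Stand-ins for $res->header('Content-Type') and $res->content
my $header = 'text/html; charset=Shift_JIS';
my $octets = "\x93\xFA\x96\x7B";        # 日本 in Shift_JIS

my $text = decode_body($header, $octets);
printf "charset=%s, %d characters\n",
    charset_from_content_type($header), length($text);
```

With a real response you would pass `$res->header('Content-Type')` and `$res->content` instead of the hard-coded values. Pages that declare their charset only in a `<meta>` tag would need a second lookup in the document body, which this sketch does not attempt.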
Re: Encoding Hell by graff (Chancellor) on Aug 09, 2006 at 03:58 UTC
You said: "Instead of returning a nice string of readable characters, $out (or $res I'm not sure which) returns a string of octets corresponding to the individual bytes for these multibyte characters... I'd like to know: at what point is perl carrying out this conversion process..."

The point is: perl is not doing any conversion -- it is giving you the "raw" binary byte stream from the source, without doing any kind of "interpretation" of it. Whatever display tool you are using to view the data as it arrives (and just what are you using to view the data?), it's that tool which is applying the "conversion" (the interpretation of the octet stream) that you find so confusing.

The right track, as indicated by rhesa, is to figure out what character encoding is being used for a given chunk of input content, and use Encode so that perl will apply the correct interpretation to the data and, depending on what sort of display tool you use, convert it to the appropriate character set for viewing. Something like this:

use Encode;
...
my $inp_enc = ...;  # whatever it happens to be
my $out_enc = ':utf8';
# or: my $out_enc = ':encoding(big5)';
# (or whatever your display tool expects)
binmode STDOUT, $out_enc;
...
print decode( $inp_enc, $res->content ) if ( $res->is_success );

(updated to fix a discrepancy in the variable names)

The way that works is: the decode call converts the content to perl-internal utf8 encoding; then, whatever mode was set for STDOUT, the print will automatically do the right thing (or try to) -- converting utf8 to something else if need be -- as the content is written to that file handle. (Of course, if you want to output a non-unicode encoding because of your display tool, understand that you will get lots of encoding errors, and nothing worth looking at, if you try printing, say, Chinese text when STDOUT is set to, say, cp1251. That's the problem with non-unicode character sets: they tend to be language-specific.)
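The decode-then-print flow described here can be tried without a network connection or a terminal. The sketch below assumes the input is Shift_JIS and writes through a `:encoding(UTF-8)` layer into an in-memory buffer instead of STDOUT, so the bytes that come out can be inspected directly. The sample octets stand in for `$res->content`.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $inp_enc = 'Shift_JIS';             # assumed source encoding
my $octets  = "\x93\xFA\x96\x7B";      # stand-in for $res->content (日本)

# Step 1: interpret the raw octets as characters.
my $text = decode($inp_enc, $octets);

# Step 2: let a PerlIO encoding layer convert characters to output bytes.
# An in-memory file handle is used here only so the result can be examined.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory handle: $!";
print {$fh} $text;
close($fh);

printf "%d input bytes, %d characters, %d output bytes\n",
    length($octets), length($text), length($buffer);
```

Swapping the in-memory handle for `binmode STDOUT, ':encoding(UTF-8)'` gives the same conversion on real output. The layer name has to match what the display tool expects.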
Re^2: Encoding Hell by kettle (Beadle) on Aug 10, 2006 at 02:11 UTC
"Whatever display tool you are using to view the data as it arrives (and just what are you using to view the data?), it's that tool which is applying the "conversion" (the interpretation of the octet stream) that you find so confusing." This is not precisely true - and I never said I found it confusing... It does matter that whatever one uses to view the data be set to the same encoding that the output has been set to, but this is not the whole story. The byte stream must also be decoded properly, i.e. it must match the encoding at the source - otherwise perl makes assumptions about the input byte stream. After that one can make changes according to one's 'display tool', but leaving a shift-jis encoded byte stream as is, and then expecting the unicode decoding of this stream to work properly is not Ok. It is clear from the code that this is understood but the wording of this post unnecessarily obfuscates the fact that perl has default settings which are not always appropriate. I don't really know why this post turned so negative; but I guess it must be my fault. Anyway the problem as mentioned a ways above, is long solved, so I guess I shan't be harking back again.	[reply]
Re^3: Encoding Hell by graff (Chancellor) on Aug 10, 2006 at 17:44 UTC
"The byte stream must also be decoded properly ..."

That's the point that rhesa and I were making, and which was absent in the OP code.

"... otherwise perl makes assumptions about the input byte stream."

Well, if you want to put it in those terms, you could say "perl assumes that whatever byte stream comes in, that is what will be printed" (unless your script specifically applies some other interpretation or conversion, either using Encode or via a PerlIO encoding layer on the output file handle).

"leaving a shift-jis encoded byte stream as is, and then expecting the unicode decoding of this stream to work properly is not Ok"

I'm not sure what you're talking about here. If you know you have shift-jis data, and you want to convert it to unicode, that's definitely okay, so long as you actually apply some process to do that (perl won't do it "implicitly").

(update: I just remembered something: in case you happen to be running Perl 5.8.0 on a Red Hat 9 system, then there is a good chance that your defaults include a "locale" setting, which, on that combination of Perl/OS versions, caused Perl to make an implicit ("default") attempt to coerce input/output data between unicode and the encoding implied by the locale. This murdered countless applications and was fixed in later versions of Perl. If this is your situation, it's long past time to upgrade.)

"It is clear from the code that this is understood but the wording of this post unnecessarily obfuscates the fact that perl has default settings which are not always appropriate."

Again, this is a bit hard to follow... which code are you referring to here? Which wording is obfuscating? Of course default settings are not always appropriate -- that's why there are alternatives to default settings...

"I don't really know why this post turned so negative;"

Me neither. That first reply (and its subthread) really threw me. If anything I said seemed negative, I apologize for that -- I generally try to keep my tone neutral, but of course I don't always succeed.

(updated to fix typos)
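The difference between leaving Shift-JIS bytes as they are and converting them explicitly can be shown with Encode alone. This is only a sketch with hand-picked sample bytes (日本 in Shift_JIS). Note that the last octet, 0x7B, is also the ASCII character `{`, which is the kind of overlap described in the original question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $sjis = "\x93\xFA\x96\x7B";   # 日本 as Shift_JIS octets

# Without a decode step, Perl treats each octet as one character with a
# code point in 0..255, so encoding to UTF-8 re-encodes every high byte.
my $mangled = encode('UTF-8', $sjis);

# With an explicit decode, the octets become two characters first and
# are then written out as proper UTF-8.
my $converted = encode('UTF-8', decode('Shift_JIS', $sjis));

printf "no decode:   %s\n", join ' ', map { sprintf '%02X', ord } split //, $mangled;
printf "with decode: %s\n", join ' ', map { sprintf '%02X', ord } split //, $converted;
```

The same mangling occurs if undecoded octets are printed to a file handle that has a `:utf8` or `:encoding(...)` layer, which is why the decode step has to match the source encoding.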

