Perl not recognizing Chinese

grsampson has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to use Perl to excerpt lines of Chinese poetry from web pages where they are embedded in lots of HTML. According to my copy of the "Programming Perl" book, any version from 5.6 on should deal with Unicode happily -- the Perl on my Mac is many versions later than that. But when I run the script I've written over one of these web pages, where Chinese graphs ("characters") should be printed out I just see question marks. Odder still, there seem to be exactly three question marks per Chinese graph; so far as I know, Unicode uses two bytes per character.

I'm not even sure whether this is a Perl question; I am wondering whether Chinese has been encoded on the web page in some way other than via Unicode. But however it has been encoded, my web browser (Firefox) and my text editor (BBEdit) seem to recognise it fine. I am really at a loss as to how to approach this problem.

I should add that my Perl status is probably "intermediate". I have used the language a fair amount, for real tasks rather than just playing, but have never needed to move beyond the core language -- I have never used "pragmas", for instance.

Any advice much appreciated!

Re: Perl not recognizing Chinese by choroba (Cardinal) on Sep 19, 2018 at 15:09 UTC
Without seeing the code, we can only guess. But let me correct one of your assumptions that's definitely wrong: "Unicode uses two bytes per character". For characters like ř, it's true, but for Chinese, it's not. UTF-8 is a "variable-length" encoding.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    use open ':encoding(UTF-8)', ':std';
    use Encode;

    chomp( my $chinese = <> );
    say length $chinese;

    my $octets = encode('UTF-8' => $chinese);
    say length $octets;

Where the input contains (UTF-8 encoded):

    焚书坑儒

Output:

    4
    12
Re: Perl not recognizing Chinese by haukex (Archbishop) on Sep 19, 2018 at 18:56 UTC
It would be best if you could find out the actual encoding of the page, i.e. whether it's UTF-8, UTF-16 (LE or BE), etc. If you're not sure, you could also post the URL here. Then, depending on how you're loading that data into Perl (which HTTP client etc.) you may need to additionally decode the data. As choroba said, please show your code (see SSCCE).

Also, you said you get question marks on output - you also need to tell Perl how to encode its output to the console, e.g. via `use open qw/:std :utf8/;`. However, I would suggest first checking whether the strings have been decoded properly, using `Dump($string)` from Devel::Peek.
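A minimal sketch of those checks, assuming the page turns out to be UTF-8 (the thread does not confirm the actual encoding). The byte string `$octets` is a hypothetical stand-in for whatever an HTTP client hands back undecoded. Each Chinese character here occupies three octets, which would be consistent with the three question marks per graph reported in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

# Hypothetical raw input: the UTF-8 octets for the two characters
# U+711A U+4E66, as they might arrive from a web page before decoding.
my $octets = "\xE7\x84\x9A\xE4\xB9\xA6";

# Turn octets into Perl characters (assumes the source really is UTF-8).
my $chars = decode('UTF-8', $octets);

# Tell Perl to encode characters as UTF-8 when writing to the console.
binmode(STDOUT, ':encoding(UTF-8)');

printf "octets: %d\n", length $octets;   # octets: 6
printf "chars:  %d\n", length $chars;    # chars:  2
print "$chars\n";

# Dump writes the internal representation to STDERR; a properly decoded
# string containing these characters should list UTF8 among its FLAGS.
Dump($chars);
```

If the two lengths come out equal for text that contains Chinese, the string was most likely never decoded.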
Re: Perl not recognizing Chinese by beech (Parson) on Sep 19, 2018 at 22:23 UTC
Hi

This looks like Chinese to me

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use WWW::Mechanize;
    use Data::Dump qw/ dd /;
    use Encode qw/ encode /;
    my $ua = WWW::Mechanize->new;
    $ua->get(q{http://www.google.cn/});
    dd( $ua->text );
    my $tr = $ua->find_link( url_regex => qr/translate/i )->text;
    dd( $tr );
    dd( encode('UTF-8', $tr ) );
    __END__
    "Google google.com.hk\x{8BF7}\x{6536}\x{85CF}\x{6211}\x{4EEC}\x{7684}\x{7F51}\x{5740} \x{7FFB}\x{8BD1}\xA92011 - ICP\x{8BC1}\x{5408}\x{5B57}B2-20070004\x{53F7}"
    "\x{7FFB}\x{8BD1}"
    "\xE7\xBF\xBB\xE8\xAF\x91"

`"%E7%BF%BB%E8%AF%91"` spells translate

perlunitut: Unicode in Perl
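To see how the percent-escaped form relates to the octets and characters in that output, here is a rough sketch using only core modules. The variable names are illustrative, and it assumes the escapes represent UTF-8 octets, as the output above suggests.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The percent-escaped string quoted above.
my $escaped = '%E7%BF%BB%E8%AF%91';

# Replace each %XX escape with the single octet it stands for.
(my $octets = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Decode the octets into Perl characters (assumes UTF-8).
my $text = decode('UTF-8', $octets);

# Encode characters as UTF-8 on output.
binmode(STDOUT, ':encoding(UTF-8)');
printf "%d octets, %d characters: %s\n",
    length($octets), length($text), $text;
```

The six octets decode to the two characters U+7FFB U+8BD1, matching the `"\x{7FFB}\x{8BD1}"` line in the output above.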

