PerlMonks  

Perl not recognizing Chinese

by grsampson (Initiate)
on Sep 19, 2018 at 14:36 UTC

grsampson has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to use Perl to excerpt lines of Chinese poetry from web pages where they are embedded in lots of HTML. According to my copy of the "Programming Perl" book, any version from 5.6 on should deal with Unicode happily -- the Perl on my Mac is many versions later than that. But when I run the script I've written over one of these web pages, where Chinese graphs ("characters") should be printed out I just see question marks. Odder still, there seem to be exactly three question marks per Chinese graph; so far as I know, Unicode uses two bytes per character.

I'm not even sure whether this is a Perl question; I am wondering whether Chinese has been encoded on the web page in some way other than via Unicode. But however it has been encoded, my web browser (Firefox) and my text editor (BBEdit) seem to recognise it fine. I am really at a loss as to how to approach this problem.

I probably should add that my Perl status is probably "intermediate". I have used the language a fair amount, for real tasks rather than just playing, but have never needed to move beyond the core language -- I have never used "pragmas", for instance.

Any advice much appreciated!

Replies are listed 'Best First'.
Re: Perl not recognizing Chinese
by choroba (Cardinal) on Sep 19, 2018 at 15:09 UTC
    Without seeing the code, we can only guess. But let me correct one of your assumptions that's definitely wrong:
    Unicode uses two bytes per character

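    For characters like ř, it's true in UTF-8, but for Chinese, it's not. UTF-8 is a "variable-length" encoding, and most Chinese characters take three bytes in it, which matches the three question marks you see per character.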

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use open ':encoding(UTF-8)', ':std';
    use Encode;

    chomp( my $chinese = <> );
    say length $chinese;

    my $octets = encode('UTF-8' => $chinese);
    say length $octets;

    Where the input contains (UTF-8 encoded):

    焚书坑儒
    

    Output:

    4
    12
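To make the variable-length point concrete, here is a minimal sketch that reports how many octets individual characters occupy once encoded as UTF-8. The helper name `utf8_width` is made up for this illustration; only `Encode::encode` comes from the reply above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # the literals in this source file are UTF-8
use Encode qw(encode);

# Hypothetical helper: number of octets one character needs in UTF-8.
sub utf8_width {
    my ($char) = @_;
    return length encode('UTF-8', $char);
}

binmode STDOUT, ':encoding(UTF-8)';
for my $char ('a', 'ř', '焚') {
    printf "%s U+%04X %d byte(s)\n", $char, ord($char), utf8_width($char);
}
```

Under these assumptions the ASCII letter needs one octet, ř needs two, and 焚 needs three, which is why four Chinese characters became twelve octets above.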

Re: Perl not recognizing Chinese
by haukex (Archbishop) on Sep 19, 2018 at 18:56 UTC

    It would be best if you could find out the actual encoding of the page, i.e. whether it's UTF-8, UTF-16 (LE or BE), etc. If you're not sure, you could also post the URL here. Then, depending on how you're loading that data into Perl (which HTTP client etc.) you may need to additionally decode the data. As choroba said, please show your code (see SSCCE).
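As a rough sketch of the "additionally decode the data" step: if your HTTP client hands you raw octets plus the Content-Type header, you can pull the charset out of the header and pass it to `Encode::decode`. The helper name `decode_body` and the UTF-8 fallback are assumptions made for this example. It also ignores any charset declared in an HTML `<meta>` tag, and some clients already do the decoding for you (for instance `decoded_content` on an LWP response).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: decode a raw body using the charset named in the
# Content-Type header. Falls back to UTF-8 (an assumption, not a rule)
# when no charset is declared or Encode does not know it.
sub decode_body {
    my ($content_type, $octets) = @_;
    my ($charset) = ($content_type // '') =~ /charset\s*=\s*"?([\w.:-]+)/i;
    $charset = 'UTF-8' unless defined $charset && find_encoding($charset);
    return decode($charset, $octets);
}

# The three octets E7 BF BB are the UTF-8 form of U+7FFB.
my $text = decode_body('text/html; charset=utf-8', "\xE7\xBF\xBB");
printf "%d character(s), first is U+%04X\n", length($text), ord($text);
```

After decoding, `length` counts characters rather than octets, so regexes and `substr` operate on whole Chinese characters.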

    Also, you said you get question marks on output - you also need to tell Perl how to encode its output to the console, e.g. via use open qw/:std :utf8/;. However, I would suggest first checking whether the strings have been decoded properly, using Dump($string) from Devel::Peek.
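A small sketch of that check, assuming the input really is UTF-8: compare the raw octets with the decoded string, dump the decoded string with `Devel::Peek`, and only then print through an encoding layer. The helper `describe` is a made-up name for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

my $octets = "\xE7\xBF\xBB\xE8\xAF\x91";   # raw UTF-8 bytes, e.g. as read from a socket
my $string = decode('UTF-8', $octets);     # the same text as a character string

# Hypothetical helper: summarise a scalar without printing its contents.
sub describe {
    my ($s) = @_;
    return sprintf('length=%d utf8_flag=%s',
        length($s), utf8::is_utf8($s) ? 'on' : 'off');
}

print STDERR "raw:     ", describe($octets), "\n";
print STDERR "decoded: ", describe($string), "\n";
Dump($string);                             # Devel::Peek writes its report to STDERR

binmode STDOUT, ':encoding(UTF-8)';        # encode characters on the way out
print "$string\n";
```

If the "decoded" line still shows one unit of length per octet, the data was never decoded, and no output layer will fix that.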

Re: Perl not recognizing Chinese
by beech (Parson) on Sep 19, 2018 at 22:23 UTC

    Hi

    This looks like Chinese to me

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use WWW::Mechanize;
    use Data::Dump qw/ dd /;
    use Encode qw/ encode /;

    my $ua = WWW::Mechanize->new;
    $ua->get(q{http://www.google.cn/});
    dd( $ua->text );

    my $tr = $ua->find_link( url_regex => qr/translate/i )->text;
    dd( $tr );
    dd( encode('UTF-8', $tr ) );
    __END__
    "Google google.com.hk\x{8BF7}\x{6536}\x{85CF}\x{6211}\x{4EEC}\x{7684}\x{7F51}\x{5740} \x{7FFB}\x{8BD1}\xA92011 - ICP\x{8BC1}\x{5408}\x{5B57}B2-20070004\x{53F7}"
    "\x{7FFB}\x{8BD1}"
    "\xE7\xBF\xBB\xE8\xAF\x91"

    "%E7%BF%BB%E8%AF%91" is the percent-encoded UTF-8 form of 翻译, which spells "translate"
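To see how the percent-encoded form relates to the octets in the dump above, here is a minimal sketch that undoes the percent-escapes by hand and then decodes the result as UTF-8. The helper name `percent_decode_utf8` is invented for this example; in real code a module such as URI::Escape would be the usual choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn each %XX escape back into one octet, then
# decode the octets as UTF-8 (assumes the URL was escaped from UTF-8).
sub percent_decode_utf8 {
    my ($escaped) = @_;
    (my $octets = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $octets);
}

binmode STDOUT, ':encoding(UTF-8)';
my $word = percent_decode_utf8('%E7%BF%BB%E8%AF%91');
my @code_points = map { sprintf('U+%04X', ord($_)) } split(//, $word);
printf "%s (%s)\n", $word, join(' ', @code_points);
```

The six escapes become six octets, which decode to the two characters U+7FFB and U+8BD1, matching the `"\x{7FFB}\x{8BD1}"` line in the output above.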

    perlunitut: Unicode in Perl

Node Type: perlquestion [id://1222648]
Approved by dorko
Front-paged by Corion