PerlMonks  

Re: A Character Set Enquiry

by pc88mxer (Vicar)
on Jul 10, 2008 at 21:06 UTC


in reply to A Character Set Enquiry

Perl doesn't have a preferred character set. The preferred representation for text inside a Perl program is a sequence of code points, which is independent of any character set. When you export your text data to a file or database, you need to set up the correct mapping between code points and bytes for that file or database. This mapping is called an encoding.
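To make the code point versus encoding distinction concrete, here is a minimal sketch using the core Encode module. The sample character (U+00E9) and the two encodings are my own choices for illustration, not something taken from the original question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character, written as a code point: U+00E9 (e with acute accent).
my $text = "\x{e9}";

# The same code point maps to different bytes under different encodings.
my $utf8_bytes   = encode("UTF-8", $text);        # two bytes: C3 A9
my $latin1_bytes = encode("ISO-8859-1", $text);   # one byte:  E9

# Helper for this example only: show each byte as two hex digits.
sub hex_dump { join(' ', map { sprintf("%02X", ord($_)) } split(//, $_[0])) }

printf "code point: U+%04X\n", ord($text);
printf "UTF-8:      %s\n", hex_dump($utf8_bytes);
printf "Latin-1:    %s\n", hex_dump($latin1_bytes);

# Decoding each byte string with its own encoding gives back the same text.
my $same = decode("UTF-8", $utf8_bytes) eq decode("ISO-8859-1", $latin1_bytes);
print "same text after decoding: ", ($same ? "yes" : "no"), "\n";
```

The point is that $text itself never "is" UTF-8 or Latin-1; only the byte strings produced by encode are.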

Ideally, this is how your program would operate:

1) It reads the UTF-8 byte stream and decodes it into code-points.

2) It encodes the code-points back into UTF-8 for storage into the database.

3) When reading from the database, it decodes the data returned from the database back into code-points.

4) When printing the data to the user, it encodes the code-points via the encoding suitable for display to the user's screen.

So there are a lot of places where the handling of the text can get screwed up. In fact, it is possible that your data is stored correctly in the database, but it is only when you print it out that it doesn't look right. You'll have to debug each step of the process to determine where your text is not being handled correctly.

Here is generally how to handle each of the four situations above:

    use Encode;

    # case 1 - reading from a file
    open(F, "<:utf8", ...);   # or use binmode

    # case 2 - storing text into a database
    $sth = $dbh->prepare("INSERT INTO ... VALUES (?)");
    $sth->execute( encode("utf8", $text) );

    # case 3 - reading from a database
    # (fetchrow_array is a statement-handle method in DBI)
    my @vals = $sth->fetchrow_array;
    @vals = map { decode("utf8", $_) } @vals;

    # case 4 - writing to a file or STDOUT
    binmode STDOUT, ":utf8";
    print $text;

You should also consult your database documentation to see if it's doing any encoding translation under the hood.
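If you want to exercise the four cases without a real file or database, the following is a rough sketch that uses in-memory filehandles and a plain hash as a stand-in for the database. The hash %fake_db and the sample word are hypothetical. I have also used the stricter ":encoding(UTF-8)" layer and "UTF-8" encoding name here instead of ":utf8" and "utf8", which is a substitution on my part; the strict forms reject malformed input, the lax ones do not.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for an input file: the UTF-8 bytes for "caf" + U+00E9 (5 bytes).
my $file_bytes = "caf\xc3\xa9";

# Case 1: read through a decoding layer, so $text holds code points.
open(my $in, "<:encoding(UTF-8)", \$file_bytes) or die "open for read: $!";
my $text = <$in>;
close($in);

# Case 2: encode before handing the value to the (fake) database.
my %fake_db = (name => encode("UTF-8", $text));

# Case 3: decode what comes back out.
my $from_db = decode("UTF-8", $fake_db{name});

# Case 4: write through an encoding layer, here into an in-memory buffer.
my $out_bytes = '';
open(my $out, ">:encoding(UTF-8)", \$out_bytes) or die "open for write: $!";
print {$out} $from_db;
close($out);

print "characters: ", length($text), ", bytes written: ", length($out_bytes), "\n";
```

If every step is handled correctly, the text is 4 characters long at every point inside the program, and the 5 bytes that come out are identical to the 5 bytes that went in.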

A useful routine I've used a lot to debug these problems is:

    sub ord_dump { join(' ', map { ord($_) } split(//, $_[0])); }

    print ord_dump($text), "\n";
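As a sketch of what the dump tells you, here is the same routine applied to one character in three states: correctly decoded, never decoded, and encoded twice. The sample data is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

sub ord_dump { join(' ', map { ord($_) } split(//, $_[0])); }

my $bytes   = "\xc3\xa9";                # UTF-8 bytes for U+00E9, not yet decoded
my $doubled = encode("UTF-8", $bytes);   # bytes mistakenly encoded a second time
my $decoded = decode("UTF-8", $bytes);   # one character, as it should be

print "decoded:        ", ord_dump($decoded), "\n";   # 233
print "undecoded:      ", ord_dump($bytes),   "\n";   # 195 169
print "double-encoded: ", ord_dump($doubled), "\n";   # 195 131 194 169
```

A single value above 127 where you expect one accented character usually means the text is decoded. Pairs such as 195 169 suggest a missing decode, and runs such as 195 131 194 169 suggest the data was encoded twice somewhere along the way.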

Replies are listed 'Best First'.
Re^2: A Character Set Enquiry
by moritz (Cardinal) on Jul 10, 2008 at 21:21 UTC
    Perl doesn't have a preferred character set.

    Not quite true. If you read binary data and try to treat it as text data (for example with uc or lc), it's treated as Latin-1.

    In fact, it is possible that your data is stored correctly in the database, but it is only when you print it out that it doesn't look right.

    Very unlikely if he dumped UTF-8 data into a Latin-1 database and then converted it to UTF-8.

      By default, arbitrary data with the utf8 flag on will be treated as Unicode characters (equivalent to Latin-1 through code point 255). But by default, without the flag on, it is treated as specified by the C locale, which is pretty much just ASCII. Try it (remove the -CO if you have a non-UTF-8 terminal):
      $ perl -CO -wle'print lc "\xc9"; print lc substr "\x{100}\xc9", 1'
      This outputs É then é.
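The flag-dependent behaviour described in the last reply can also be checked without depending on the terminal, by printing code points instead of characters. This is a minimal sketch that assumes Perl's default settings: no "use locale", no "use feature 'unicode_strings'" and no "use v5.12" or later, any of which changes the first result.

```perl
use strict;
use warnings;

my $native  = "\xc9";       # one character, stored without the utf8 flag
my $flagged = "\xc9";
utf8::upgrade($flagged);    # same character, now stored with the utf8 flag

# The two strings are equal as far as eq is concerned...
print "equal: ", ($native eq $flagged ? "yes" : "no"), "\n";

# ...but lc treats them differently under the default settings.
printf "flag off: lc gives %d\n", ord(lc $native);    # 201, unchanged
printf "flag on:  lc gives %d\n", ord(lc $flagged);   # 233, lowercased
```

Here utf8::upgrade plays the role that substr on a string containing "\x{100}" plays in the one-liner: it forces the internal representation with the flag on without changing the character.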
