Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Perl Character Set

by indiansummersky (Beadle)
on Nov 01, 2005 at 18:56 UTC ( [id://504694]=perlquestion: print w/replies, xml ) Need Help??

indiansummersky has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Can anyone helpme with this MySQL/DBI.pm issue? Server, database, relevant table and relevant column all have character set = latin1 (default collation). This handles normal US-ASCII characters hansomely but when it comes to characters like Â, Ã and Ä (see execute argument below), the result in the table row is garbled. Each character ends up as an individual shape of some sort such as: ┬ ├ ─

Note that entering characters like Â, Ã and Ä directly into the table produces the correct results. I know the issue has something to do with how Perl, or the DBI in particular, handles non-standard (albeit latin1) characters.

The basic code follows. Many thanks!
# Connect to database my $dbh = DBI->connect("DBI:mysql:database",'username','password') or +print DBI->errstr; # Prepare and execute query my $query = $dbh->prepare_cached('INSERT INTO test (text) VALUES (?)') + or print $dbh->errstr; $query->execute('A Â Ã Ä') or print $query->errstr; # Disconnect $dbh->disconnect;

Replies are listed 'Best First'.
Re: Perl Character Set
by fizbin (Chaplain) on Nov 02, 2005 at 03:11 UTC

    Let me guess what the translation is:
    À ->└ Lower right box corner
    Á ->┴ Middle box bottom
     ->┬ Middle box top
    à->├ Left middle box side
    Ä ->─ Center box bar horizontal
    This is exactly the translation you would expect if the database is being filled with data in iso-latin-1 and the program extracting it is interpreting data in cp437. My guess is that you're looking at the database with the command-line mysql tool running inside a windows cmd window. (On a US version of windows)

    Here's the truth: You're going to have to deal with character encodings; you can't just close your eyes and ignore them. So here's the short version, so that you have to deal with as little as possible:

    The windows cmd window doesn't display things in iso-latin-1. Instead, it uses an old DOS character set that is compatible with iso-latin-1 essentially only for ascii. In short, perl's doing everything fine, but the dos box displays characters in its own special non-standard way. To see that this issue is the dos box, and not your script, at the cmd window say: type myscript.pl, where script contains iso-latin-1 characters.

    To "fix" this problem, at the C:\> prompt, type chcp 850. Then re-run the command-line tool you have to investigate the database.

    Update: Never mind; code page 850 alone isn't enough to get the DOS box to display things in iso-latin-1. This might work after saying chcp 850, assuming that your perl is version 5.8 or better:

    C:\>mydatabaseexamcommand | perl -MEncode=from_to -pe "from_to($_,q[la +tin1],q[cp850])"
    Another option is to run your database command and redirect everything into a file. Then open that file with notepad or wordpad or whatever your favorite Windows editor is. Yet another option is to examine the database with mysql's own graphical "mysql browser" tool.

    --
    @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: Perl Character Set
by JamesNC (Chaplain) on Nov 02, 2005 at 01:18 UTC
    On MySQL's site you can read the docs for your database. http://dev.mysql.com/doc/refman/5.0/en/configure-options.html shows how to config for different encodings. You don't say what datatype you are using, but I use varbinary for any data I think might need unicode support. You really just have to hanker down with the docs. Also, there are other ways to be more specific about the way you bind to your place holders where you tell DBI what datatype to use ( look for :sql_type in the perldoc on DBI, for example you might do something like this for your query:
    use DBI qw/:sql_types/; #... $query->bind_param(1, 'A Â Ã Ä', SQL_VARBINARY );
    JamesNC
Re: Perl Character Set
by Aristotle (Chancellor) on Nov 01, 2005 at 19:02 UTC

    What encoding is your source code in?

    Do you get the same results if you insert user input into the database? What encoding is the user input in?

    Makeshifts last the longest.

      Sorry, I'm not sure about encoding. All I can say is that the perl code is save as plain text and user input is via a basic html <input type="text"> box inside a <form> with no enctype specified. The HTML is in latin1 ISO-8859-1.

      As for entering directly into the Database, as I said, entering from the console produces prefect results, therefore it is nothing to do with the database as such, just the way in which the data gets to it via Perl and DBI.

        The point about the source code is to do with your test code. What ends up in the database depends on a) the encoding of the source file and b) the locale your perl runs under. If Perl assumes the source is in a locale-specific 8-bit encoding, it will take the string’s bytes verbatim from the source code and put them in a non-UTF-8 string. Which means the bytes you have in your source would end up in the database.

        But it’s a different question what happens when you stuff user input there. That too depends on what locale Perl is running under (mostly, whether it’s UTF-8), but the other factor is which encoding the user’s input had, which is clearly not necessarily the same as the encoding of your source code.

        And whether entering text directly into the database from the console works correctly depends on what encoding your terminal emulator produces your input as, and what locale the commandline database utility runs under.

        (Can you tell why I so wish everything was UTF-8 already? Layers upon layers, this is an onion of opportunities to screw up…)

        Makeshifts last the longest.

        The DB console application may be running with a different encoding than your Perl application, and/or it may be telling the DB explicitly what encoding it is using.

        I would suggest (1)setting your HTML form's enctype explicitly, and (2)learning how to tell your DB that you are speaking a specific encoding. The latter might be a SQL statement, an option to your DBI->connect, or possibly some combination of the two.

        It would help, a little, to know which DBD you're using.

        <-radiant.matrix->
        A collection of thoughts and links from the minds of geeks
        The Code that can be seen is not the true Code
        "In any sufficiently large group of people, most are idiots" - Kaa's Law

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://504694]
Approved by davido
Front-paged by Tanktalus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (7)
As of 2024-04-19 15:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found