PerlMonks  

Converting everything (MySql, perl, CGI, website) to UTF-8

by jfrm (Monk)
on Mar 16, 2018 at 08:01 UTC ( [id://1211014]=perlmeditation )

In order to deal with Japanese orders, I recently had to convert my whole system to UTF-8. A day or two's job, I thought. Two and a half weeks later, I'm finally there. There is a lot of material on PerlMonks and the internet in general about this, but it is hard to understand and even harder to implement. Most of the advice I read was along the lines of RTFM or did not give the whole story. It's pretty clear this is a common problem, too. I wanted to give something back to the community, as PerlMonks has helped me a lot, so I thought I would share some insights that I hope will be practical and useful.

There is a lot out there telling you to use decode/encode and giving lectures on the internal representation of UTF-8 in Perl and whatnot. In the end I've only had to use decode in one place, where data comes in from elsewhere. If you get all the other stuff right, I believe you shouldn't need many instances of decode/encode, if any.

Our system involves a local website using MySQL, a live website, static webpages, generated webpages, various text files and CGI website forms. Every one of these needs changes before the whole thing works. Here are the things that I needed to do:

Checklist of changes to make

* Firstly, every script file is converted to UTF-8 format. Easy.

* Every script needs this at the top: use utf8; This tells perl that the script's own source is in UTF-8, so a £ typed into the script will be read as a UTF-8 £. It's no good putting this only in the calling script: the pragma is lexically scoped, so it covers the file it appears in and not any other scripts that are pulled in with require...

* Ideally each database table should be converted to UTF-8. This turns out to be difficult and time-consuming, because any table with foreign keys won't convert unless you first delete the foreign keys. For tables that won't easily convert, you can convert only the fields that might hold UTF-8 encoded characters. BLOB fields are also a problem unless the whole table is UTF-8. I had to convert problem BLOB fields to TEXT fields and then convert those to UTF-8 (a two-step process; doing both in one step fails).

* Rose::DB (or whatever database layer you are using) needs to be told that incoming data from the database is in UTF-8. For Rose::DB, add this to the connector in DB.pm and then regenerate: connect_options => {mysql_enable_utf8 => 1}

* binmode(STDOUT, ":utf8"); # Put this at the top of a script - it tells perl to encode everything printed to STDOUT as UTF-8. STDOUT is a single global filehandle, so setting this once in the opening script, before anything is printed, also covers code in any requires.

* Webpages must have this in the head section: <meta http-equiv="content-type" content="text/html; charset=UTF-8">

* use CGI qw(-utf8); to treat incoming CGI parameters as UTF-8. Getting this working was subtle - test carefully.

* When outputting a CGI webpage, the first thing to do is to output the HTTP header, and this needs to be told about UTF-8 too. Personally I found that print header(-type=>'text/html', -cookie=>'', -charset=>'utf-8'); gave problems with cookies, so I ended up outputting it directly: print "Content-type: text/html; charset=utf-8\n$cookie\n\n";

* use open ':encoding(utf8)'; # tells perl to treat all files as UTF-8. In fact, I was more careful with this and did not use it in general, because some files that I have to deal with have not been given to me in UTF-8 format. Instead, I have specifically opened each file that needed it with open($fh, '<:encoding(UTF-8)', $filename);. Careful - this can fail if the $filename variable is not also in UTF8!
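As a rough illustration of how several of the checklist items fit together in one script, here is a minimal sketch. It assumes a throwaway temporary file (created with File::Temp) standing in for a real data file, and the sample string is made up for the example. The database and CGI items are left out because they need a live server.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                # this source file itself is UTF-8
use File::Temp qw(tempfile);

# Encode everything printed to STDOUT as UTF-8 (set once, before any output).
binmode(STDOUT, ':encoding(UTF-8)');

# Made-up sample text containing a pound sign and some Japanese.
my $text = "£5 日本語";

# Write the text to a temporary file through an explicit UTF-8 layer.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} "$text\n";
close $out or die "Cannot close $filename: $!";

# Read it back, again naming the encoding on this one open call
# instead of relying on a global "use open".
open(my $in, '<:encoding(UTF-8)', $filename)
    or die "Cannot open $filename: $!";
my $line = <$in>;
close $in;
chomp $line;

# length() counts characters here, not bytes, because the data was decoded.
printf "%s (%d characters)\n", $line, length $line;
```

If the layers are set up correctly, the line read back is identical to the original string and length reports characters rather than the larger byte count.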

Identifying Errors

In doing this, you will make mistakes and see weird characters appearing in unexpected places. I developed my own personal understanding of how to deal with them. These are my own notes for practical situations, so please bear with me if the explanations are not exactly correct - it was about fixing stuff, not about being a perl rocket scientist.

  • You see £ displayed as 'Â£'
    • If the £ sign is coming from the database, is stored correctly there, and the webpage correctly displays UTF-8 characters from elsewhere (e.g. write Japanese text into the Perl script and print it), then the data is not being retrieved from the database as UTF-8 (presumably it is being assumed to be Latin1).
    • The £ is within a UTF-8 encoded Perl script but use utf8; is not set at the top of the script.
    • The £ is displayed correctly in a form initially, but when the form is saved/updated, the £ then displays as 'Â£'. Use the -utf8 CGI pragma to treat incoming parameters as UTF-8: use CGI ('-utf8');
  • £ is displayed on a webpage as �
    • This happens when the HTTP header's Content-Type does not declare UTF-8 and the meta tag likewise omits the charset, e.g. <meta http-equiv="Content-Type" content="text/html" />
  • £ or other characters are being displayed as a diamond with ? inside it
    • StackOverflow:...usually the sign of an invalid (non-UTF-8) character showing up in an output (like a page) that has been declared to be UTF-8. Can be fixed by putting the following at the top of script: binmode(STDOUT, ":utf8");
  • Error message: Wide character in print
    • Means a print statement (to STDOUT or a file) is sending a character above U+00FF to a filehandle that has no encoding layer, so perl is still assuming Latin1 output. To fix, add '>:encoding(UTF-8)' to the open statement or call binmode(STDOUT, ":utf8");
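The symptoms in this list can be reproduced in isolation, which may help when working out which one you are looking at. The following is only a sketch using the core Encode module; the variable names are made up for the example, and an in-memory filehandle stands in for a real output file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode(STDOUT, ':encoding(UTF-8)');

my $pound = "\x{A3}";                    # the character £ (U+00A3)
my $bytes = encode('UTF-8', $pound);     # its UTF-8 form: bytes C2 A3

# Symptom: UTF-8 bytes treated as Latin-1 turn into two characters.
my $mojibake = decode('ISO-8859-1', $bytes);

# Symptom: a lone Latin-1 byte A3 is not valid UTF-8, so a default
# decode substitutes U+FFFD, the diamond with a question mark.
my $latin1_byte = "\xA3";
my $replaced    = decode('UTF-8', $latin1_byte);

# Symptom: printing a character above U+00FF to a handle that has
# no encoding layer triggers "Wide character in print".
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open(my $raw, '>', \my $buffer)
        or die "Cannot open in-memory file: $!";
    print {$raw} "\x{65E5}";             # a Japanese character, no layer set
    close $raw;
}

printf "UTF-8 bytes read as Latin-1: %s\n", $mojibake;
printf "Latin-1 byte read as UTF-8:  U+%04X\n", ord $replaced;
print  "Warning captured: $_" for @warnings;
```

Note that £ itself is below U+0100, so printing it without a layer does not warn; perl silently writes the single Latin-1 byte, which is how the invalid byte ends up in a page declared as UTF-8.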

Replies are listed 'Best First'.
Re: Converting everything (MySql, perl, CGI, website) to UTF-8
by choroba (Cardinal) on Mar 16, 2018 at 08:33 UTC
  • You can use <li> to make bullets, they look much nicer than asterisks (see this post for an example).

  • use UTF;

    You probably mean

    use utf8;

  • Note that in MySQL, you can configure the encoding used for:
    • storing the data
    • sending the data
    • getting the data
    plus various defaults in case any of the specific ones wasn't specified.

  • Note that :utf8 and :encoding(UTF-8) layers aren't identical. The latter is more strict and does more checks.

  • this can fail if the $filename variable is not also in UTF8

    The encoding of the contents is in no way related to the encoding of the filename. Moreover, at least in a web app, the filename on your system should be controlled by you, not a user input, so you should always know what encoding the filename uses.
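On the point above about :utf8 versus :encoding(UTF-8), the difference is visible in the core Encode module, where the two spellings resolve to two different encodings. This is only a sketch: it shows the name lookup and that a strict decode can be made to fail loudly on a bad byte, not every case where the two behave differently. The sample input is made up.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode FB_CROAK LEAVE_SRC);

# "UTF-8" (with the hyphen) is the strict encoding, "utf8" is the lax one.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# A Latin-1 pound sign followed by ASCII digits: not valid UTF-8.
my $bad = "\xA3" . "100";

# With FB_CROAK the strict decoder dies instead of quietly substituting
# a replacement character; LEAVE_SRC keeps $bad unmodified.
my $decoded = eval { decode('UTF-8', $bad, FB_CROAK | LEAVE_SRC) };
my $error   = $@;

print "UTF-8 resolves to: $strict_name\n";
print "utf8  resolves to: $lax_name\n";
print "Strict decode of bad input failed: $error" if $error;
```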

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thanks for correction and formatting advice; I wrote the post in a hurry but have tidied it a bit today.

      This is a list of problems that arose in the real world that, regrettably, I find myself inhabiting. Information comes into my programs from all manner of different places, and in some cases I can assure you that the open command crashed until I first converted the filename, viz: my $utf8filename = encode('utf8', $payfile);. Perhaps there was a better way, but sometimes you just need to kill the obscure error expediently.
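For anyone hitting the same thing, the workaround described here can be sketched as follows. This assumes a filesystem that stores filenames as UTF-8 bytes (typical on Linux); the directory and filename are invented for the example, and $payfile simply reuses the name from the reply.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# A filename held as a character string (it contains two Japanese characters).
my $payfile = "$dir/pay_\x{65E5}\x{672C}.txt";

# Convert the name to UTF-8 bytes before handing it to open().
# The layer in the mode only governs the file's contents, not its name.
my $utf8filename = encode('UTF-8', $payfile);

open(my $out, '>:encoding(UTF-8)', $utf8filename)
    or die "Cannot create file: $!";
print {$out} "\x{A3}100\n";
close $out or die "Cannot close file: $!";

# Read the file back using the same encoded name.
open(my $in, '<:encoding(UTF-8)', $utf8filename)
    or die "Cannot reopen file: $!";
my $contents = <$in>;
close $in;
chomp $contents;

print "File exists\n" if -e $utf8filename;
```

Encoding the name explicitly keeps the filename and the file contents as two separate decisions, which is consistent with the correction above that the one is unrelated to the other.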

Re: Converting everything (MySql, perl, CGI, website) to UTF-8
by davebaker (Pilgrim) on Oct 13, 2020 at 01:09 UTC

    It's also helpful to include accept-charset="utf-8" inside the form tag in any forms on any HTML pages you've created. That way, if your server were to be configured to send "Content-Type: text/html; charset=ISO-8859-1" in the HTTP response header of the HTML page containing the form (accidentally, or for whatever other reason, such as its being the default for Apache), the text entered by the user into the form nevertheless would be encoded by the user's browser as UTF-8 when it's submitted. Without an accept-charset attribute inside the form tag, the browser in that scenario would encode the text using whatever encoding is specified in the HTTP response header sent by Apache or other web server that serves up the HTML page.

    Also, if the HTML page with the form has a meta tag that specifies a character set other than UTF-8, such as <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> or <meta charset=ISO-8859-1>, having accept-charset="utf-8" inside the form tag would cause the browser to use UTF-8 encoding for the submitted text. (It's true that most browsers ignore such meta tags when a web server already has sent charset=UTF-8 in the header, but a copy of the web page might have been saved to the user's computer and thereafter used to submit text in a form, in which case there would be no headers from a server to prevent the meta tags from controlling the page's character set and corresponding encoding of submitted text.)
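A minimal sketch of a generated page that declares UTF-8 in all three places (HTTP header, meta tag, and the form's accept-charset) might look like this. The form action and field name are hypothetical, and the header is printed by hand in the style used in the original post rather than through CGI.pm.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical script that would receive the submitted form.
my $action = '/cgi-bin/order.pl';

my $page = <<"HTML";
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Order form</title>
</head>
<body>
<form method="post" action="$action" accept-charset="utf-8">
<input type="text" name="customer_name">
<input type="submit" value="Send">
</form>
</body>
</html>
HTML

# Header first, then a blank line, then the page body.
my $response = "Content-Type: text/html; charset=utf-8\n\n" . $page;
print $response;
```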

Re: Converting everything (MySql, perl, CGI, website) to UTF-8
by davebaker (Pilgrim) on Sep 28, 2020 at 19:37 UTC

    Fabulous contribution. Thanks!

    Regarding "every script file is converted to UTF-8 format" -- I am thinking that conversion of the text of Perl scripts from ISO-8859-1 isn't necessary in order to work with Unicode characters in the other ways you've described, unless one wants to be able to directly "type" a Unicode character into the script, e.g.

    my $default_name = 'Ǣsop';

    Do you agree?

      utf8 enables not only Unicode string constants, but also Unicode identifiers (variables, methods, subroutines, …).

      If you use Unicode only in String constants (and only occasionally), then it is even possible to stay in ASCII and use charnames' \N{CHARNAME} sequences, so that e.g.
      my $default_name = "\N{LATIN CAPITAL LETTER AE WITH MACRON}sop";

      would be equivalent to your example. With Perl v5.16 or later, you don't even need to explicitly use charnames for these sequences.

      This approach also avoids problems with several (mostly web) frontends of git (or other) repositories that don't handle Unicode well.
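A small sketch of the \N{CHARNAME} approach, keeping the source file pure ASCII. One thing to watch: \N{...} is only interpreted inside double-quoted strings, so a single-quoted string keeps the literal backslash sequence. The explicit use charnames line is only needed on perls older than v5.16 and is included here for safety.

```perl
use strict;
use warnings;
use charnames ':full';                   # optional from Perl v5.16 onwards

binmode(STDOUT, ':encoding(UTF-8)');

# Double quotes: the named character is interpolated.
my $named = "\N{LATIN CAPITAL LETTER AE WITH MACRON}sop";

# The same character written as a code point, for comparison.
my $by_codepoint = "\x{01E2}sop";

# Single quotes: no interpolation, the text stays as typed.
my $single_quoted = '\N{LATIN CAPITAL LETTER AE WITH MACRON}sop';

printf "named:         %s (%d characters)\n", $named, length $named;
printf "by code point: %s\n", $by_codepoint;
printf "single-quoted: %s\n", $single_quoted;
```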
