http://qs321.pair.com?node_id=846207

sewa has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm expecting the following code to simply lowercase Ü (using Perl 5.8.8):
use strict;
use warnings;
use locale;
use POSIX qw(locale_h);

binmode( STDOUT, ":utf8" );

my $loc = setlocale(LC_CTYPE);
print "LC_CTYPE=$loc\n";

my $accented_char = "\x{00dc}"; #Upper case U with DIAERESIS
print "accented char=$accented_char\n";

my $lowercased = lc( $accented_char );

print "lowercased=$lowercased\n";
But it prints:
LC_CTYPE=en_US.UTF-8
accented char=Ü
lowercased=Ü
Based on perldoc for lc, I believe this should work, but it doesn't. Interestingly, accepting the input on stdin (with character encoding set to UTF-8 in the terminal) lowercases Ü correctly:
use strict;
use warnings;
use Encode;

binmode( STDIN, ":utf8" );
binmode( STDOUT, ":utf8" );

while( my $char = <> ) {
    chomp $char;
    my $lc_char = lc( $char );
    print "lowercased $char=$lc_char\n";
}
Any idea as to why the first script wouldn't work? Many thanks.

Replies are listed 'Best First'.
Re: lowercasing accented characters
by Anonymous Monk on Jun 24, 2010 at 01:18 UTC
    Just realized what was wrong: setting the locale explicitly in the first script fixes the problem. The script now looks like:
    use strict;
    use warnings;
    use locale;
    use POSIX qw(locale_h);

    binmode( STDOUT, ":utf8" );

    setlocale(LC_CTYPE, "german");

    my $accented_char = "\x{00dc}"; #Upper case U with DIAERESIS
    print "accented char=$accented_char\n";

    my $lowercased = lc( $accented_char );

    print "lowercased=$lowercased\n";
    The question I have now, however, is why I need to set the locale in one case (the script above) but not in the other (the second script in my first post, which reads from stdin).
    According to http://perldoc.perl.org/functions/lc.html, my best guess is that the UTF-8 flag is set on the string when it is read from stdin, but not when the string is instantiated in the code itself. Is there a way to confirm this?
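
One way to check that guess is utf8::is_utf8, which reports whether Perl's internal UTF8 flag is set on a string. The following is only a minimal sketch: flag_state is a hypothetical helper name, and utf8::decode is used here as a rough stand-in for what the :utf8 layer does when reading from stdin.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether the internal UTF8 flag is set.
sub flag_state {
    my ($str) = @_;
    return utf8::is_utf8($str) ? "on" : "off";
}

# A literal with a code point below 0x100 is stored as a single byte.
my $literal = "\x{00dc}";

# Raw UTF-8 bytes for U+00DC, decoded in place into one character.
# Assumption: this approximates input read through a :utf8 layer.
my $decoded = "\xC3\x9C";
utf8::decode($decoded);

printf "literal: flag %s\n", flag_state($literal);
printf "decoded: flag %s\n", flag_state($decoded);
```

Both strings hold the same single character, U+00DC, but the literal should report the flag as off and the decoded copy as on. The core module Devel::Peek offers a more detailed view of the same information through its Dump function.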
Re: lowercasing accented characters
by ikegami (Patriarch) on Jun 24, 2010 at 06:38 UTC

    The behaviour of lc and such can vary based on how the string is stored internally. This is a bug, but it can't be fixed due to historical reasons. You can work around the problem by switching the internal storage format of the string.

    use open ':std', ':encoding(UTF-8)'; # UTF-8 terminal
    my $s = "\xDC";
    utf8::upgrade( $s ); # Use Unicode semantics
    print lc($s), "\n";

    Perl 5.12 has a pragma to control the behaviour of lc.

    use open ':std', ':encoding(UTF-8)'; # UTF-8 terminal
    use feature 'unicode_strings'; # Or "use 5.012;"
    my $s = "\xDC";
    print lc($s), "\n";
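
To see the dependence on internal storage directly, the sketch below lowercases the same character under both storage formats and prints the resulting code points. It assumes neither "use locale" nor the unicode_strings feature is in effect, and that the input contains only code points below 0x100 (utf8::downgrade dies otherwise). lc_both_ways is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase copies of a string stored each way.
sub lc_both_ways {
    my ($str) = @_;

    my $bytes = $str;
    utf8::downgrade($bytes);   # one byte per character, UTF8 flag off

    my $chars = $str;
    utf8::upgrade($chars);     # UTF-8 internally, UTF8 flag on

    return ( lc($bytes), lc($chars) );
}

my ( $from_bytes, $from_chars ) = lc_both_ways("\xDC");

printf "downgraded: U+%04X\n", ord $from_bytes;
printf "upgraded:   U+%04X\n", ord $from_chars;
```

Under those assumptions the downgraded copy should stay at U+00DC while the upgraded copy becomes U+00FC, even though both copies compare equal before lc is applied.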