lowercasing accented characters

sewa has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm expecting the following code to simply lowercase Ü (using Perl 5.8.8):

use strict;
use warnings;
use locale;
use POSIX qw(locale_h);
binmode( STDOUT, ":utf8" );
my $loc = setlocale(LC_CTYPE);
print "LC_CTYPE=$loc\n";
my $accented_char = "\x{00dc}"; # Upper case U with DIAERESIS
print "accented char=$accented_char\n";
my $lowercased = lc( $accented_char );
print "lowercased=$lowercased\n";

But it prints:

LC_CTYPE=en_US.UTF-8
accented char=Ü
lowercased=Ü

Based on perldoc for lc, I believe this should work, but it doesn't. Interestingly, accepting the input on stdin (with character encoding set to UTF-8 in the terminal) lowercases Ü correctly:

use strict;
use warnings;
use Encode;
binmode( STDIN,  ":utf8" );
binmode( STDOUT, ":utf8" );
while( my $char = <> ) {
  chomp $char;
  my $lc_char = lc( $char );
  print "lowercased $char=$lc_char\n";
}

Any idea as to why the first script wouldn't work? Many thanks.

Re: lowercasing accented characters by Anonymous Monk on Jun 24, 2010 at 01:18 UTC
Just realized what was wrong: setting the locale correctly in the first script fixes the problem. The script now looks like:

use strict;
use warnings;
use locale;
use POSIX qw(locale_h);
binmode( STDOUT, ":utf8" );
setlocale(LC_CTYPE, "german");
my $accented_char = "\x{00dc}"; # Upper case U with DIAERESIS
print "accented char=$accented_char\n";
my $lowercased = lc( $accented_char );
print "lowercased=$lowercased\n";

The question I have now, however, is why I need to set the locale in one case (the script above) but not the other (the second script that reads from stdin in my first post). According to http://perldoc.perl.org/functions/lc.html, my best guess is that the UTF-8 flag is set on the string when it is read from stdin, but not when the string is instantiated in the code itself. Is there a way to confirm this?
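One way to check the guess about the UTF-8 flag is to ask Perl directly with `utf8::is_utf8`. The following is only a sketch: instead of reading from stdin, it approximates input through a `:utf8` layer by decoding the UTF-8 bytes of Ü with `utf8::decode`, which is an assumption about what the second script ends up holding in `$char`.

```perl
use strict;
use warnings;

# A literal whose code points are all <= 0xFF is normally stored as
# plain bytes, so the internal UTF-8 flag is off.
my $literal = "\x{00dc}";

# Stand-in for input read through a :utf8 layer (an assumption made for
# this sketch): take the two UTF-8 bytes of U+00DC and decode them in place.
my $decoded = "\xC3\x9C";
utf8::decode($decoded);

# Both variables hold the same one-character string...
print "same string: ", ( $literal eq $decoded ? "yes" : "no" ), "\n";

# ...but they differ in internal storage, which utf8::is_utf8 reports.
printf "literal: UTF-8 flag %s\n", utf8::is_utf8($literal) ? "on" : "off";
printf "decoded: UTF-8 flag %s\n", utf8::is_utf8($decoded) ? "on" : "off";
```

For a byte-level look at the same thing, the core module Devel::Peek provides `Dump($string)`, which prints the scalar's flags to STDERR.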
Re: lowercasing accented characters by ikegami (Patriarch) on Jun 24, 2010 at 06:38 UTC
The behaviour of `lc` and such can vary based on how the string is stored internally. This is a bug, but it can't be fixed due to historical reasons. You can work around the problem by switching the internal storage format of the string.

use open ':std', ':encoding(UTF-8)';  # UTF-8 terminal
my $s = "\xDC";
utf8::upgrade( $s );  # Use Unicode semantics
print lc($s), "\n";

Perl 5.12 has a pragma to control the behaviour of `lc`.

use open ':std', ':encoding(UTF-8)';  # UTF-8 terminal
use feature 'unicode_strings';  # Or "use 5.012;"
my $s = "\xDC";
print lc($s), "\n";
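To make the storage dependence visible without relying on the terminal's encoding, the results can be compared as code points. This is a minimal sketch; `code_points` is a hypothetical helper introduced here, and the script deliberately enables neither `use locale` nor `unicode_strings`.

```perl
use strict;
use warnings;

# Hypothetical helper: render a string as "U+XXXX" tokens so the result
# does not depend on how the terminal displays non-ASCII characters.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $native   = "\xDC";       # stored internally as a single byte
my $upgraded = "\xDC";
utf8::upgrade($upgraded);    # same character, UTF-8 internal storage

# The two strings are equal as far as eq is concerned.
print "equal as strings: ", ( $native eq $upgraded ? "yes" : "no" ), "\n";

# lc can nevertheless give different answers for them.
print "lc(native):   ", code_points( lc $native ),   "\n";
print "lc(upgraded): ", code_points( lc $upgraded ), "\n";
```

The upgraded copy should come back as U+00FC (ü). With neither `use locale` nor `unicode_strings` in effect, the native copy is expected to stay at U+00DC, which is the inconsistency described above.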