PerlMonks  

lowercasing accented characters

by sewa (Initiate)
on Jun 24, 2010 at 01:01 UTC (#846207)

sewa has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I expect the following code (run under Perl 5.8.8) to simply lowercase an accented character:
use strict;
use warnings;
use locale;
use POSIX qw(locale_h);

binmode( STDOUT, ":utf8" );

my $loc = setlocale(LC_CTYPE);
print "LC_CTYPE=$loc\n";

my $accented_char = "\x{00dc}"; # Upper case U with DIAERESIS
print "accented char=$accented_char\n";

my $lowercased = lc( $accented_char );

print "lowercased=$lowercased\n";
But it prints:
LC_CTYPE=en_US.UTF-8
accented char=Ü
lowercased=Ü
Based on perldoc for lc, I believe this should work, but it doesn't. Interestingly, accepting the input on stdin (with character encoding set to UTF-8 in the terminal) lowercases correctly:
use strict;
use warnings;
use Encode;

binmode( STDIN, ":utf8" );
binmode( STDOUT, ":utf8" );

while( my $char = <> ) {
    chomp $char;
    my $lc_char = lc( $char );
    print "lowercased $char=$lc_char\n";
}
Any idea as to why the first script wouldn't work? Many thanks.

Replies are listed 'Best First'.
Re: lowercasing accented characters
by Anonymous Monk on Jun 24, 2010 at 01:18 UTC
    I just realized what was wrong: setting the locale correctly in the first script fixes the problem. The script now looks like this:
    use strict;
    use warnings;
    use locale;
    use POSIX qw(locale_h);

    binmode( STDOUT, ":utf8" );

    setlocale(LC_CTYPE, "german");

    my $accented_char = "\x{00dc}"; # Upper case U with DIAERESIS
    print "accented char=$accented_char\n";

    my $lowercased = lc( $accented_char );

    print "lowercased=$lowercased\n";
    The question I have now, however, is why I need to set the locale in one case (the script above) but not in the other (the second script in my first post, which reads from stdin).
    Based on http://perldoc.perl.org/functions/lc.html, my best guess is that the UTF-8 flag is set on the string when it is read from stdin, but not when the string is created by a literal in the code itself. Is there a way to confirm this?
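One way to check that guess is the core function utf8::is_utf8, which reports whether a string's internal UTF-8 flag is on. The following is only a minimal sketch: it stands in for stdin with an in-memory filehandle opened through a :utf8 layer, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# A literal whose code point is below 0x100 is stored as one byte, flag off.
my $literal = "\x{00dc}";
print "literal flagged: ", ( utf8::is_utf8($literal) ? "yes" : "no" ), "\n";

# Stand-in for stdin: the UTF-8 bytes of U+00DC read through a :utf8 layer.
my $bytes = "\xC3\x9C\n";
open my $fh, '<:utf8', \$bytes or die "open failed: $!";
my $read = <$fh>;
chomp $read;
close $fh;
print "read flagged: ", ( utf8::is_utf8($read) ? "yes" : "no" ), "\n";

# Both hold the same single character; only the internal storage differs.
print "equal: ", ( $read eq $literal ? "yes" : "no" ), "\n";
```

If the guess is right, this should report the flag as off for the literal and on for the string that came through the layer, even though the two compare as equal.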
Re: lowercasing accented characters
by ikegami (Patriarch) on Jun 24, 2010 at 06:38 UTC

    The behaviour of lc and such can vary based on how the string is stored internally. This is a bug, but it can't be fixed due to historical reasons. You can work around the problem by switching the internal storage format of the string.

    use open ':std', ':encoding(UTF-8)'; # UTF-8 terminal
    my $s = "\xDC";
    utf8::upgrade( $s ); # Use Unicode semantics
    print lc($s), "\n";
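To see the storage dependence side by side, here is a small sketch that lowercases the same character twice, once as stored by the literal and once after utf8::upgrade. It assumes no locale pragma and no unicode_strings feature are in effect, and it prints code points instead of the characters themselves so the result does not depend on the terminal encoding.

```perl
use strict;
use warnings;

my $native   = "\xDC";    # stored as a single byte, UTF-8 flag off
my $upgraded = "\xDC";
utf8::upgrade($upgraded); # same character, now stored as UTF-8 internally

# The two strings are equal as far as eq is concerned.
printf "same string: %s\n", ( $native eq $upgraded ? "yes" : "no" );

# lc leaves the byte-stored copy alone but lowercases the upgraded copy.
printf "lc native:   U+%04X\n", ord lc $native;    # U+00DC
printf "lc upgraded: U+%04X\n", ord lc $upgraded;  # U+00FC
```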

    Perl 5.12 has a pragma to control the behaviour of lc.

    use open ':std', ':encoding(UTF-8)'; # UTF-8 terminal
    use feature 'unicode_strings'; # Or "use 5.012;"
    my $s = "\xDC";
    print lc($s), "\n";
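The pragma is lexically scoped, so it can be limited to one block. A hedged sketch, assuming Perl 5.12 or later: the same byte-stored string is lowercased outside and inside a block that enables unicode_strings, with no utf8::upgrade call.

```perl
use strict;
use warnings;

my $s = "\xDC";        # byte-stored, never upgraded

my $without = lc $s;   # feature not enabled here

my $with;
{
    use feature 'unicode_strings';   # in effect only inside this block
    $with = lc $s;
}

printf "without feature: U+%04X\n", ord $without;  # U+00DC
printf "with feature:    U+%04X\n", ord $with;     # U+00FC
```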