PerlMonks  

Unexpected utf8 in hash keys

by kappa (Chaplain)
on Feb 20, 2008 at 11:22 UTC ( [id://668987]=perlquestion )

kappa has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow monks! I'm seeing strange things going on with my hash keys.
    #! /usr/bin/perl
    use strict;
    use warnings;
    use utf8;

    sub U {
        return utf8::is_utf8($_[0]) ? 'is_utf8' : 'not_utf8';
    }

    my %s = (
        MaxAccountSize1   => 1,
        'MaxAccountSize2' => 1,
        2                 => 1,
    );

    foreach (sort keys %s) {
        print "'$_' " . U($_) . " => '" . $s{$_} . "' " . U($s{$_}) . "\n";
    }
It looks like, in the presence of use utf8, hash keys that are stringified from barewords by the => operator get the utf8 flag. Quoted string literals, by contrast, get the flag only if they contain characters with high codes -- in full accordance with the docs. It cost us a lot of blood and sweat to debug why some perfectly ASCII strings would suddenly get the flag.

Is there any rationale behind this decision? Is it a bug? Does anyone know what the performance penalty of utf8 hash keys -- even if they contain only ASCII chars -- is?

--kap

Replies are listed 'Best First'.
Re: Unexpected utf8 in hash keys
by kappa (Chaplain) on Feb 20, 2008 at 11:27 UTC
    Fat comma is really at fault:
    use utf8;
    sub ff { die if utf8::is_utf8($_[0]); }
    ff(asd => 1);
    --kap

      encoding produces the opposite result.

      #use encoding 'UTF-8';  # or 'utf8'
      #use utf8;
      use Encode qw( is_utf8 );

      sub ff { print is_utf8($_[0]) ? 1 : 0, "\n"; }

                                   # none  enco  utf8  both
                                   # ----  ----  ----  ----
      ff("asd");                   # 0     1     0     1
      ff('asd');                   # 0     1     0     1
      ff(qw(asd));                 # 0     1     0     1
      ff(asd => 1);                # 0     0     1     1
      { no strict; ff(asd); }      # 0     0     1     1
      ff(-asd);                    # 0     0     1     1
      ff(asd::);                   # 0     0     1     1

      After some investigation, barewords in general (anything parsed by force_word) are affected.

      use utf8;

      sub ff { print utf8::is_utf8($_[0]) ? 1 : 0, "\n"; }

      ff("asd");                   # 0
      ff('asd');                   # 0
      ff(qw(asd));                 # 0
      ff(asd => 1);                # 1
      { no strict; ff(asd); }      # 1
      ff(-asd);                    # 1
      ff(asd::);                   # 1

      I think function names (including built-ins) and keywords are also affected.

      Gotta go. No time to devise a fix atm.

      I'd expect both of the following calls to output the same thing, so I think it's a bug.

      use utf8;

      sub ff { print utf8::is_utf8($_[0]) ? 1 : 0, "\n"; }

      ff(asd => 1);   # 1
      ff("asd");      # 0

      Tested in 5.8.8 and 5.10.0.

Re: Unexpected utf8 in hash keys
by graff (Chancellor) on Feb 20, 2008 at 13:57 UTC
    It's definitely worthwhile to know and understand the trickiness demonstrated so clearly by ikegami's various tests, and I would agree that some of his results point to "actionable" inconsistencies that should probably be treated as bugs. BUT... when you say:

    It cost us a lot of blood and sweat to debug why some perfectly ASCII strings would suddenly get the flag.

    Does this mean you were using the utf8 flag to determine whether or not a string contains wide characters? That is not what the flag is for, and you shouldn't be using it that way. To test for wide characters in a string, use a regex:

    if ( /[^[:ascii:]]/ ) { ... }
    # which is equivalent to
    if ( /[^\x00-\x7f]/ ) { ... }

    The purpose of the utf8 flag, as I understand it, is to answer the question: if there happen to be non-ASCII bytes in this string, are they to be interpreted as utf8 characters, or not? The treatment of an all-ASCII string should be the same regardless of whether the utf8 flag is set.
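
A minimal sketch of the distinction drawn above, for anyone who wants to see it run. The helper name has_non_ascii and the sample strings are made up for illustration; utf8::upgrade is used here only as a convenient way to switch the flag on for an all-ASCII string, and is not how the barewords in this thread acquired it.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a string really contains characters
# above 0x7F, regardless of how perl stores it internally.
sub has_non_ascii {
    my ($str) = @_;
    return $str =~ /[^\x00-\x7f]/ ? 1 : 0;
}

my $plain   = 'MaxAccountSize1';
my $flagged = 'MaxAccountSize1';
utf8::upgrade($flagged);     # same characters, internal utf8 flag now on

my $latin1 = "caf\x{e9}";    # non-ASCII content, stored as bytes with the flag off

for my $case ( [ plain => $plain ], [ flagged => $flagged ], [ latin1 => $latin1 ] ) {
    my ( $name, $str ) = @$case;
    printf "%-8s flag=%d non_ascii=%d\n",
        $name, ( utf8::is_utf8($str) ? 1 : 0 ), has_non_ascii($str);
}

# The flag does not change what the string is: both compare equal and
# both address the same hash entry.
my %h = ( $plain => 'found' );
print "equal: ", ( $plain eq $flagged ? 'yes' : 'no' ), "\n";
print "lookup via flagged key: $h{$flagged}\n";
```

The flagged string reports the flag on while containing nothing but ASCII, and the Latin-1 string reports the flag off while containing a non-ASCII character, which is why the flag is a poor proxy for "has wide characters".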
      Yes, that was the wrong way to do it -- an attempt based on the wrong guess that perl would not set the utf8 flag on ASCII strings.
      --kap
Re: Unexpected utf8 in hash keys
by pc88mxer (Vicar) on Feb 20, 2008 at 14:13 UTC
    Actually, the issue really is the use utf8 which allows you to use utf8 in your program identifiers. For instance:

    use strict;
    use warnings;

    my %hash = ( asd => 1 );

    sub ff { print utf8::is_utf8($_[0]) ? 1 : 0, "\n"; }

    eval {
        use utf8;
        ff(%hash);    # now prints 0
    };

    I suppose it's reasonable that perl encodes barewords as utf8 if you use utf8 even if only ascii characters are involved.

    Update: from the utf8 documentation:

    The "use utf8" pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based platforms). ... Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are useful for their own purposes, but they are not really part of the "pragmatic" effect.
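
The lexical scoping mentioned in that documentation excerpt can be seen with a small sketch like the one below. It assumes the file is saved as UTF-8 and that perl is run without any switches or environment settings that change source decoding; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal below is two
# bytes (0xC3 0xA9) on disk.

# Outside the pragma's scope the parser treats them as two separate bytes.
my $len_without = length('é');

my $len_with;
{
    use utf8;    # in effect only until the end of this block
    # Inside the scope the same two bytes are decoded as one character.
    $len_with = length('é');
}

# The pragma's effect ended with the block above.
my $len_after = length('é');

print "without use utf8: $len_without\n";
print "with use utf8:    $len_with\n";
print "after the block:  $len_after\n";
```

Under those assumptions the three lengths should come out as 2, 1 and 2. This shows only what the pragma does to string literals; it says nothing about how barewords are flagged.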
      We actually use lots of non-ASCII strings so we need use utf8.
      --kap
Re: Unexpected utf8 in hash keys
by doc_faustroll (Scribe) on Feb 20, 2008 at 18:19 UTC
    I may be reading this too quickly and not appreciating or understanding your intent, or what exactly you are using the use utf8 pragma for or why. My hunch is that you might benefit from not enquiring as to when the utf8 flag is set or not. Can you tell me your purposes? We can play around with the internals of Perl and how it represents strings all day long, but I gather you have more than academic interest here? This may be a place where solid pragmatism trumps. Have you read perlunitut?
Re: Unexpected utf8 in hash keys
by Juerd (Abbot) on Feb 21, 2008 at 01:42 UTC

    Out of curiosity, why does the flag bother you?

      We are in the process of converting a huge production codebase from 5.6 + Unicode::String to 5.8. This is as painful as it can get and we log this flag in a lot of places just to find codepaths that need attention.
      --kap
Re: Unexpected utf8 in hash keys
by creamygoodness (Curate) on Aug 27, 2009 at 11:12 UTC
    Does anyone know what the performance penalty of utf8 hash keys -- even if they contain only ASCII chars -- is?
    Benchmarking script: Results for vanilla custom-compiled Perl 5.10.0 on Mac OS X:
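
The benchmarking script and its results are not reproduced in this copy of the thread. What follows is not that script, only a rough sketch of how such a comparison could be set up with the core Benchmark module. The key names, the key count and the helper lookup_all are all made up for illustration, and utf8::upgrade stands in for whatever put the flag on the keys in real code. No timings are given here because they depend on the perl build and the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two lists of lookup keys with identical ASCII content.
my @plain_keys   = map { "MaxAccountSize$_" } 1 .. 1000;
my @flagged_keys = @plain_keys;
utf8::upgrade($_) for @flagged_keys;    # flag on, characters unchanged

# One table, populated with the unflagged keys.
my %table = map { $_ => 1 } @plain_keys;

# Hypothetical helper: fetch every key in the list and sum the values.
sub lookup_all {
    my ($keys) = @_;
    my $sum = 0;
    $sum += $table{$_} for @$keys;
    return $sum;
}

# A negative count tells Benchmark to run each case for at least that
# many CPU seconds instead of a fixed number of iterations.
cmpthese( -1, {
    plain_keys   => sub { lookup_all( \@plain_keys ) },
    flagged_keys => sub { lookup_all( \@flagged_keys ) },
} );
```

Both cases hit the same 1000 entries, so any difference in the reported rates comes from how perl handles the flagged keys during lookup.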

Node Type: perlquestion [id://668987]
Approved by Corion
Front-paged by Corion