PerlMonks  

Behaviour of Encode::decode_utf8 on ASCII

by jbert (Priest)
on Feb 14, 2007 at 19:26 UTC ( [id://600050]=perlquestion )

jbert has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks. My reading of the docs for the decode function in the Encode module suggests that the utf8 flag on the resulting string should be off if the input string was plain ASCII, i.e. I'd expect the following:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Encode;

    my $octets = "abcd";
    my $ustr = Encode::decode('utf8', $octets);
    print Encode::is_utf8($ustr) ? "is" : "isn't", " tagged as a unicode string\n";
to print "isn't tagged as a unicode string".

But it doesn't: I get "is tagged as a unicode string" as the output.

What I believe to be the relevant section of the docs is:

    Here is how Encode takes care of the utf8 flag.

    * When you encode, the resulting utf8 flag is always off.
    * When you decode, the resulting utf8 flag is on unless you can unambiguously represent data.

    Here is the definition of dis-ambiguity. After $utf8 = decode('foo', $octet);,

    When $octet is...                The utf8 flag in $utf8 is
    ---------------------------------------------
    In ASCII only (or EBCDIC only)   OFF
    In ISO-8859-1                    ON
    In any other Encoding            ON
    ---------------------------------------------
I'm sure I've seen this not-tagging-ASCII behaviour in older versions of Encode (I think 2.01), too.

I'm running perl 5.8.8, Encode 2.12.

Can anyone else confirm that this is a bug, or have I got my encodings and flags in a twist?

Re: Behaviour of Encode::decode_utf8 on ASCII
by Joost (Canon) on Feb 14, 2007 at 19:41 UTC
    Note that the docs do not specify for what encoding this should work. This is arguably a bug, but it's probably a bug in the docs; I would expect decode('utf8',$string) to just flag any input as utf8 since that's by far the most efficient way of "decoding" utf8.

    It shouldn't really matter anyway. ASCII => utf8.

    Update: it does the same for me. perl 5.8.8, Encode 2.12.

      There's also this in the docs:
      CAVEAT: When you run $string = decode("utf8", $octets), then $string may not be equal to $octets. Though they both contain the same data, the utf8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines). See "The UTF-8 flag" below.
      and the difference does matter from a performance point of view.

      UTF-8 tagged values in perl are contagious - concatenation with an untagged value will result in a tagged value (all well and good). But the regex engine on unicode strings is slower than on byte strings.
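
The contagion described above can be seen with a minimal sketch like the following. The variable names and the sample bytes are illustrative assumptions, not taken from the application under discussion.

```perl
use strict;
use warnings;
use Encode ();

# Decode a UTF-8 byte string containing a two-byte character ("e" with an
# acute accent); the decoded result carries the utf8 flag.
my $tagged = Encode::decode('utf8', "caf\xc3\xa9");

# A plain ASCII literal starts out untagged.
my $plain = "abc";

# Concatenating an untagged value with a tagged one yields a tagged value.
my $joined = $plain . $tagged;

printf "plain:  %s\n", utf8::is_utf8($plain)  ? "tagged" : "untagged";
printf "tagged: %s\n", utf8::is_utf8($tagged) ? "tagged" : "untagged";
printf "joined: %s\n", utf8::is_utf8($joined) ? "tagged" : "untagged";
```

Once a single tagged value enters a chain of concatenations, every string derived from it takes the character-semantics path.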

      Basically with this change in behaviour you can lose performance in a utf8-aware-and-correct application which has the vast majority of its inputs in ASCII, since the previously uncommon case of handling unicode strings is now the 100% case.

      This isn't theoretical, I'm fighting a significant CPU cost increase, which adds up over many servers.

        I understand your concern, but I'm still trying to understand why the OP question comes up in the context of your app.

        Are you getting actual utf8 data from a file handle that does not use the ":utf8" PerlIO layer (so that perl begins by assuming it's just a raw byte stream)? And if that's the case, are you trying to work out a way to use "byte-semantics" regexen where possible, and "character-semantics" only when necessary?

        If that's your situation, here's an easy, low-cpu-load method to check whether a raw byte string needs to be tagged as utf8:

        if ( length($string) > $string =~ tr/\x00-\x7f// ) { $string = decode( 'utf8', $string ); }
        (updated as per fenLisesi's reply -- thanks!)

        Or, given that the original string is not tagged as a perl-internal utf8 scalar value (utf8 flag is off), this might be just as good or better:

        if ( $string =~ /[\x80-\xff]/ ) { $string = decode( 'utf8', $string ); }
        I'm not actually sure whether one way is faster than the other, or whether the relative speed would depend on your data; "length()" and "tr///" are both pretty fast whereas a regex match is slower, but tr always processes the whole string, whereas that regex match will stop at the first non-ascii byte.
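
Either check above could be wrapped in a small helper so the rest of the application does not have to repeat it. This is only a sketch: `decode_if_needed` is a hypothetical name, and it assumes the input is a raw, untagged byte string as described in this reply.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode only when the byte string contains bytes
# outside the ASCII range, so pure-ASCII input stays untagged.
sub decode_if_needed {
    my ($octets) = @_;

    # tr/// with an empty replacement list only counts matching bytes;
    # if every byte is ASCII there is nothing to decode.
    return $octets if length($octets) == ($octets =~ tr/\x00-\x7f//);

    return Encode::decode('utf8', $octets);
}

my $ascii  = decode_if_needed("abcd");
my $accent = decode_if_needed("caf\xc3\xa9");

printf "ascii:  %s\n", utf8::is_utf8($ascii)  ? "tagged" : "untagged";
printf "accent: %s\n", utf8::is_utf8($accent) ? "tagged" : "untagged";
```

The regex variant (`/[\x80-\xff]/`) could be swapped into the same helper; which is faster would have to be measured on real data.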
Re: Behaviour of Encode::decode_utf8 on ASCII
by ikegami (Patriarch) on Feb 14, 2007 at 19:48 UTC

    5.8.0 and 5.8.8 both return the same result for me.

    is tagged as a unicode string
    This is perl, v5.8.0 built for MSWin32-x86-multi-thread
    Binary build 806 provided by ActiveState Corp.
    Built 00:45:44 Mar 31 2003
    Encode 1.83

    This is perl, v5.8.8 built for MSWin32-x86-multi-thread
    Binary build 817 [257965] provided by ActiveState
    Built Mar 20 2006 17:54:25
    Encode 2.12

    It would make no sense for it not to be tagged. When one asks to decode a string of bytes (UTF8 off) to a string of chars (UTF8 on), it makes no sense that the same call sometimes returns a string of chars and sometimes returns a string of bytes.

    I'd say the bug is in the docs.

      Except there is the issue of efficiency (see my other post above). Representing a string of characters which all happen to lie within the ASCII range as an untagged byte string allows the byte-oriented regex engine to be used.

      It's a very similar idea to using machine words to hold integers up to a certain value, and then switching to a different representation for bignums. It doesn't make a difference to correctness, but it does make a difference to performance.
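
One way to get that compact representation back after decoding is sketched below. `decode_prefer_bytes` is a hypothetical wrapper, not part of Encode; it relies on `utf8::downgrade`, which changes only the internal representation and not the characters, and it restricts itself to ASCII-only strings so that regex semantics cannot change.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: decode as usual, then drop the utf8 flag when the
# decoded text turns out to be ASCII-only.
sub decode_prefer_bytes {
    my ($octets) = @_;
    my $str = Encode::decode('utf8', $octets);

    # Only ASCII-only strings are downgraded; anything else keeps its flag.
    utf8::downgrade($str) unless $str =~ /[^\x00-\x7f]/;

    return $str;
}

my $small = decode_prefer_bytes("abcd");
my $wide  = decode_prefer_bytes("caf\xc3\xa9");

printf "small: %s\n", utf8::is_utf8($small) ? "tagged" : "untagged";
printf "wide:  %s\n", utf8::is_utf8($wide)  ? "tagged" : "untagged";
```

This still pays for the decode and the extra scan on every call, so it may or may not recover the lost performance in a given application.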

        To repeat what I said elsewhere, IMO this is a bug that should be reported.

        I'd do it for you, but using perlbug on win32 is a real PITA.

        ---
        $world=~s/war/peace/g

Re: Behaviour of Encode::decode_utf8 on ASCII
by brycen (Monk) on Mar 30, 2010 at 20:21 UTC
Re: Behaviour of Encode::decode_utf8 on ASCII
by mrajcok (Initiate) on Mar 09, 2010 at 15:10 UTC
    I just ran into this (still existing) bug rt://34259. I'm using Encode v2.39. In my testing, it appears that whenever decode() is called, the UTF8 flag always gets set, no matter what characters are in the string, and no matter which character set is specified -- even when specifying 'ascii' as the character set:
    use Encode;
    my $t = "abc";
    my $d = decode('ascii',$t);
    printf "is_utf8=%d\n", (utf8::is_utf8($d) ? 1 : 0);
    output:
    is_utf8=1
    I did find a workaround: it appears that utf8::decode($string) does NOT set the UTF8 flag if $string only contains ASCII characters.
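
A minimal sketch of that workaround follows; the sample strings are assumptions chosen for illustration. Note that `utf8::decode` modifies its argument in place and returns false if the bytes are not well-formed UTF-8, so real code would want to check the return value.

```perl
use strict;
use warnings;

my $ascii  = "abc";            # ASCII-only bytes
my $accent = "caf\xc3\xa9";    # UTF-8 bytes with one two-byte character

# utf8::decode converts in place; the flag is turned on only when the
# input actually contains multi-byte UTF-8 sequences.
my $ok_ascii  = utf8::decode($ascii);
my $ok_accent = utf8::decode($accent);

printf "ascii:  is_utf8=%d\n", (utf8::is_utf8($ascii)  ? 1 : 0);
printf "accent: is_utf8=%d\n", (utf8::is_utf8($accent) ? 1 : 0);
```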
