Behaviour of Encode::decode

jbert has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks. My reading of the docs for the decode function in the Encode module suggests that the utf8 flag on the resulting string should be off if the input string was plain ASCII. That is, I'd expect the following:

#!/usr/bin/perl
use warnings;
use strict;
use Encode;

my $octets = "abcd";

my $ustr = Encode::decode('utf8', $octets);

print Encode::is_utf8($ustr)
      ? "is" : "isn't",
      " tagged as a unicode string\n";

to print "isn't tagged as a unicode string".

But I don't; I get "is tagged as a unicode string" as the output.

What I believe to be the relevant section of the docs is:

Here is how Encode takes care of the utf8 flag.

    * When you encode, the resulting utf8 flag is always off.
    * When you decode, the resulting utf8 flag is on unless you can
      unambiguously represent data. Here is the definition of
      dis-ambiguity.

      After $utf8 = decode('foo', $octet);,

        When $octet is...   The utf8 flag in $utf8 is
        ---------------------------------------------
        In ASCII only (or EBCDIC only)            OFF
        In ISO-8859-1                              ON
        In any other Encoding                      ON
        ---------------------------------------------

I'm sure I've seen this not-tagging-ASCII behaviour in older versions of Encode (I think 2.01), too.

I'm running perl 5.8.8, Encode 2.12.

Can anyone else confirm that this is a bug, or have I got my encodings and flags in a twist?


Re: Behaviour of Encode::decode_utf8 on ASCII by Joost (Canon) on Feb 14, 2007 at 19:41 UTC
Note that the docs do not specify for what encoding this should work. This is arguably a bug, but it's probably a bug in the docs; I would expect decode('utf8', $string) to just flag any input as utf8, since that's by far the most efficient way of "decoding" utf8. It shouldn't really matter anyway. ASCII => utf8. Update: it does the same for me. perl 5.8.8, Encode 2.12. "What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: Behaviour of Encode::decode_utf8 on ASCII by jbert (Priest) on Feb 14, 2007 at 19:46 UTC
There's also this in the docs: `CAVEAT: When you run $string = decode("utf8", $octets), then $string may not be equal to $octets. Though they both contain the same data, the utf8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines). See "The UTF-8 flag" below.` And the difference does matter from a performance point of view. UTF-8 tagged values in perl are contagious: concatenation with an untagged value will result in a tagged value (all well and good). But the regex engine is slower on unicode strings than on byte strings. Basically, with this change in behaviour you can lose performance in a utf8-aware-and-correct application which has the vast majority of its inputs in ASCII, since the previously uncommon case of handling unicode strings is now the 100% case. This isn't theoretical: I'm fighting a significant CPU cost increase, which adds up over many servers.
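The "contagious" behaviour described above can be seen in a small sketch. The variable names and sample strings here are made up for illustration, and it only uses Encode::decode and utf8::is_utf8:

```perl
use strict;
use warnings;
use Encode ();

# Plain ASCII byte string: the utf8 flag is off.
my $bytes = "id=42;";

# Octets containing a multi-byte UTF-8 sequence decode to a flagged string.
my $chars = Encode::decode('UTF-8', "caf\xc3\xa9");

# Concatenating an unflagged and a flagged string gives a flagged result.
my $joined = $bytes . $chars;

printf "bytes:  %s\n", utf8::is_utf8($bytes)  ? "flagged" : "unflagged";
printf "chars:  %s\n", utf8::is_utf8($chars)  ? "flagged" : "unflagged";
printf "joined: %s\n", utf8::is_utf8($joined) ? "flagged" : "unflagged";
```

Once one flagged value enters a pipeline, everything joined to it downstream ends up flagged as well, which is why the flag on all-ASCII input matters for the regex cost mentioned above.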
Re^3: Behaviour of Encode::decode_utf8 on ASCII by graff (Chancellor) on Feb 15, 2007 at 06:11 UTC
I understand your concern, but I'm still trying to understand why the OP question comes up in the context of your app. Are you getting actual utf8 data from a file handle that does not use the ":utf8" PerlIO layer (so that perl begins by assuming it's just a raw byte stream)? And if that's the case, are you trying to work out a way to use "byte-semantics" regexen where possible, and "character-semantics" only when necessary? If that's your situation, here's an easy, low-cpu-load method to check whether a raw byte string needs to be tagged as utf8: `if ( length($string) > $string =~ tr/\x00-\x7f// ) { $string = decode( 'utf8', $string ); }` (updated as per fenLisesi's reply; thanks!) Or, given that the original string is not tagged as a perl-internal utf8 scalar value (utf8 flag is off), this might be just as good or better: `if ( $string =~ /[\x80-\xff]/ ) { $string = decode( 'utf8', $string ); }` I'm not actually sure whether one way is faster than the other, or whether the relative speed would depend on your data; "length()" and "tr///" are both pretty fast whereas a regex match is slower, but tr always processes the whole string, whereas that regex match will stop at the first non-ASCII byte.
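As a rough sketch, the regex variant of that check could be wrapped in a helper. The name decode_utf8_if_needed is hypothetical (it is not part of Encode), and it assumes the input is an unflagged byte string:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Encode: decode only when the octets
# contain at least one byte outside the ASCII range. Assumes the input
# is a raw byte string (utf8 flag off).
sub decode_utf8_if_needed {
    my ($octets) = @_;
    return $octets unless $octets =~ /[\x80-\xff]/;
    return Encode::decode('utf8', $octets);
}

my $ascii = decode_utf8_if_needed("abcd");          # returned untouched
my $cafe  = decode_utf8_if_needed("caf\xc3\xa9");   # decoded to 4 characters

printf "ascii flagged: %d\n", utf8::is_utf8($ascii) ? 1 : 0;
printf "cafe  flagged: %d\n", utf8::is_utf8($cafe)  ? 1 : 0;
```

Pure-ASCII input never reaches decode(), so it stays an unflagged byte string whatever decode() itself does with the flag.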
Re^4: Behaviour of Encode::decode_utf8 on ASCII by jbert (Priest) on Feb 15, 2007 at 08:13 UTC
Re^5: Behaviour of Encode::decode_utf8 on ASCII by graff (Chancellor) on Feb 15, 2007 at 09:37 UTC
Re^4: Behaviour of Encode::decode_utf8 on ASCII by fenLisesi (Priest) on Feb 15, 2007 at 09:47 UTC
Re: Behaviour of Encode::decode_utf8 on ASCII by ikegami (Patriarch) on Feb 14, 2007 at 19:48 UTC
5.8.0 and 5.8.8 both return the same result for me. `is tagged as a unicode string` `This is perl, v5.8.0 built for MSWin32-x86-multi-thread Binary build 806 provided by ActiveState Corp. Built 00:45:44 Mar 31 2003 Encode 1.83` `This is perl, v5.8.8 built for MSWin32-x86-multi-thread Binary build 817 [257965] provided by ActiveState Built Mar 20 2006 17:54:25 Encode 2.12` It would make no sense for it not to be tagged. When one asks to decode a string of bytes (UTF8 off) to a string of chars (UTF8 on), it makes no sense that the same call sometimes returns a string of chars and sometimes returns a string of bytes. I'd say the bug is in the docs.
Re^2: Behaviour of Encode::decode_utf8 on ASCII by jbert (Priest) on Feb 14, 2007 at 20:02 UTC
Except there is the issue of efficiency (see my other post above). Representing a string of characters which all happen to lie within the ASCII range as an untagged byte string allows the byte-oriented regex engine to be used. It's a very similar idea to using machine words to hold integers up to a certain value, and then switching to a different representation for bignums. It doesn't make a difference to correctness, but it does make a difference to performance.
Re^3: Behaviour of Encode::decode_utf8 on ASCII by demerphq (Chancellor) on Feb 16, 2007 at 23:14 UTC
To repeat what I said elsewhere, IMO this is a bug that should be reported. I'd do it for you, but using perlbug on win32 is a real PITA. --- $world=~s/war/peace/g
Re: Behaviour of Encode::decode_utf8 on ASCII by brycen (Monk) on Mar 30, 2010 at 20:21 UTC
A related node is here: http://www.perlmonks.org/?node_id=831664. Bryce Nesbitt, Berkeley Electronic Press, Berkeley CA
Re: Behaviour of Encode::decode_utf8 on ASCII by mrajcok (Initiate) on Mar 09, 2010 at 15:10 UTC
I just ran into this (still existing) bug rt://34259. I'm using Encode v2.39. In my testing, it appears that whenever decode() is called, the UTF8 flag always gets set, no matter what characters are in the string, and no matter which character set is specified -- even when specifying 'ascii' as the character set: `use Encode; my $t = "abc"; my $d = decode('ascii',$t); printf "is_utf8=%d\n", (utf8::is_utf8($d) ? 1 : 0);` output: `is_utf8=1` I did find a workaround: it appears that utf8::decode($string) does NOT set the UTF8 flag if $string only contains ASCII characters.
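A minimal sketch of that workaround, with made-up sample strings. Note that utf8::decode works in place on its argument rather than returning the decoded string:

```perl
use strict;
use warnings;

# Pure ASCII: utf8::decode leaves the utf8 flag off.
my $ascii = "abc";
utf8::decode($ascii);

# Octets with a multi-byte UTF-8 sequence: decoded in place, flag turned on.
my $cafe = "caf\xc3\xa9";
utf8::decode($cafe);

printf "ascii: is_utf8=%d length=%d\n", (utf8::is_utf8($ascii) ? 1 : 0), length($ascii);
printf "cafe:  is_utf8=%d length=%d\n", (utf8::is_utf8($cafe)  ? 1 : 0), length($cafe);
```

One caveat with this route: utf8::decode only handles UTF-8 input and signals malformed data through its return value, so it is not a drop-in replacement for Encode::decode with other encodings or CHECK modes.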
