Re^2: Best Way to Get Length of UTF-8 String in Bytes?

Thank you, ikegami.

Here's what I had tried before posting my inquiry:

#!perl

use strict;
use warnings;
use open qw( :utf8 :std );
use utf8;

# 'China' in Simplified Chinese
#          中        国
# Unicode  U+4E2D    U+56FD
# UTF-8    E4 B8 AD  E5 9B BD

my $text = '中国';
my $length_in_characters = length $text;
print "Length of text '$text' in characters is $length_in_characters\n";

{
    use bytes;
    my $length_in_bytes = length $text;
    print "Length of text '$text' in bytes is $length_in_bytes\n";
}

{
    require Encode;
    my $bytes = Encode::encode_utf8($text);
    my $length_in_bytes = length $bytes;
    print "Length of text '$bytes' in bytes is $length_in_bytes\n";
}

And here's its output:

Length of text '中国' in characters is 2
Length of text 'дёе›Ѕ' in bytes is 6
Length of text 'дёе›Ѕ' in bytes is 6

(I couldn't use <code> tags here due to the Chinese characters in both the script and its output.)

Jim

Re^3: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 24, 2011 at 03:19 UTC
Are you trying to suggest you could use bytes? That would be incorrect. bytes does not give UTF-8, it gives the internal storage format of the string. That may be utf8 (similar to UTF-8) or just bytes. Here's an example of it giving the incorrect answer:

#!perl

use strict;
use warnings;
use open qw( :encoding(cp437) :std );
use utf8;

my $text = chr(0xC9);
my $length_in_characters = length $text;
print "Length of text '$text' in characters is $length_in_characters\n";

{
    use bytes;
    my $length_in_bytes = length $text;
    print "Length of text '$text' in bytes is $length_in_bytes\n";
}

{
    require Encode;
    my $bytes = Encode::encode_utf8($text);
    my $length_in_bytes = length $bytes;
    print "Length of text '$bytes' in bytes is $length_in_bytes\n";
}

Length of text 'Й' in characters is 1
Length of text 'Й' in bytes is 1
"\x{00c3}" does not map to cp437 at a.pl line 22.
"\x{0089}" does not map to cp437 at a.pl line 22.
Length of text '\x{00c3}\x{0089}' in bytes is 2

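To make the distinction above concrete, here is a minimal sketch that stores the same one-character string in both internal formats and compares the two ways of counting. The helper names utf8_byte_length and internal_byte_length are made up for this illustration; utf8::upgrade and utf8::downgrade are used only to force each storage format.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: byte count of the string once encoded as UTF-8.
sub utf8_byte_length {
    my ($text) = @_;
    return length( Encode::encode_utf8($text) );
}

# Hypothetical helper: what length() reports under "use bytes",
# i.e. the size of perl's internal buffer for the string.
sub internal_byte_length {
    my ($text) = @_;
    use bytes;
    return length($text);
}

my $downgraded = chr(0xC9);      # U+00C9, one character
utf8::downgrade($downgraded);    # make sure it is stored in the 8-bit format

my $upgraded = chr(0xC9);        # the same character...
utf8::upgrade($upgraded);        # ...stored in the utf8 format instead

for my $case ( [ downgraded => $downgraded ], [ upgraded => $upgraded ] ) {
    my ( $label, $string ) = @$case;
    printf "%-10s characters=%d internal=%d encoded=%d\n",
        $label, length($string),
        internal_byte_length($string), utf8_byte_length($string);
}
```

Both strings compare equal and have a character length of 1, yet the internal count differs (1 versus 2) while the encoded count is 2 in both cases.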
Re^4: Best Way to Get Length of UTF-8 String in Bytes? by tchrist (Pilgrim) on Apr 24, 2011 at 05:53 UTC
I don’t know what all that Microsoft noise was for, nor the `use utf8` either for that matter, but we’re all perfectly familiar with “the Unicode bug” thank you very much. And we are also aware of how unlikely it is to be a problem for Jim given the data samples he displayed.

% perl -CS -E 'say chr(0xe9)' | perl -CS -nE 'require bytes; say bytes::length($_); chomp; say bytes::length($_)'
3
2
% perl -E '$x = "\x{e9}\x{3b1}"; require bytes; say bytes::length($x); chop $x; say bytes::length($x)'
4
2
% perl -E '$x = "\N{U+E9}"; require bytes; say bytes::length($x)'
2

As you can plainly see, it’s only your own isolated little byte constants that can switch internal representation. All you have to do is ever once have a code point greater than 255 anywhere in the string and it stops being a byte string. You also won’t have a problem if you’ve read in the utf8 from something whose encoding layer is set to utf8. So if he has either of those in his program (which it looks like he does) he can ignore Chicken Little. It won’t bother him. I’ll bet.

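The first one-liner above can be replayed as a script. This is only a sketch: it uses an in-memory filehandle instead of a pipe, and the helper name lengths is made up for the illustration. It reports, for a line read through a decoding layer, the character count, the internal byte count, and the byte count after an explicit encode.

```perl
use strict;
use warnings;
use bytes ();     # load bytes::length without enabling byte semantics
use Encode ();

# Raw UTF-8 bytes for "é\n" (C3 A9 0A), as they would sit in a file.
my $raw = "\xC3\xA9\n";

# Read them back through a decoding layer via an in-memory filehandle.
open my $fh, '<:encoding(UTF-8)', \$raw
    or die "Cannot open in-memory file: $!";
my $line = <$fh>;
close $fh;

# Hypothetical helper: (characters, internal bytes, bytes after encoding).
sub lengths {
    my ($string) = @_;
    return (
        length($string),
        bytes::length($string),
        length( Encode::encode_utf8($string) ),
    );
}

my $chomped = $line;
chomp $chomped;

printf "with newline: characters=%d internal=%d encoded=%d\n", lengths($line);
printf "chomped:      characters=%d internal=%d encoded=%d\n", lengths($chomped);
```

For this input the internal and encoded counts agree (3 with the newline, 2 without), matching the 3 and 2 printed by the one-liner. The chr(0xC9) example earlier in the thread is a case where they do not agree.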
Re^5: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 24, 2011 at 06:00 UTC
"I don’t know what all that Microsoft noise was for"

My terminal uses cp437, and the garbage from encoding UTF-8 was there in the OP's output too. It just looks a bit different on my terminal (`'дёе›Ѕ` vs `\x{00c3}\x{0089}`).

"nor the use utf8 either for that matter"

Are you suggesting I should have made irrelevant changes to the OP's code?

"And we are also aware of how unlikely it is to be a problem for Jim given the data samples he displayed."

What do you mean unlikely? I'd say it's impossible since those characters are above U+00FF. But so what? He's not going to deal with only those two characters.

I don't get it. In one breath, you say he should handle NFD. In the next, you say I should only concern myself with the characters he posted.

Re^5: Best Way to Get Length of UTF-8 String in Bytes? by John M. Dlugosz (Monsignor) on Apr 24, 2011 at 11:29 UTC
I would agree, the perl implementation is documented to use UTF-8 encoding for one of the two options, and 8-bit chars for the other. It is also explained when each occurs and how they are handled during concatenation, with various options. Certainly it is less problematic and more maintainable to not count on any subtle details that might shift the meaning.

Hmm, just what is the 8-bit form? If it's "whatever was read in", it might include characters encoded in multiple bytes, using some other code page. So, I would be inclined to feel safe treating the internal length in bytes as the UTF-8 length if I read in the string from a file using UTF-8 encoding, or it was a string literal in a program whose source file used utf8.

I think there is also a utility function somewhere to tell you which mode a string is in. In fact, wouldn't the UTF-8 encoder just check that flag first and realize it's a no-op? So using it would be efficient, if you don't mind copying the string.
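The utility function alluded to above is presumably utf8::is_utf8, which reports whether a string is currently stored in the utf8 internal format. A small sketch, with a made-up helper name storage_format, shows it on the kinds of strings discussed in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: describe how perl currently stores a string.
sub storage_format {
    my ($string) = @_;
    return utf8::is_utf8($string) ? 'utf8' : '8-bit';
}

my $latin = chr(0xE9);                 # only a code point below 256
my $wide  = chr(0xE9) . chr(0x3B1);    # includes a code point above 255

my $chopped = $wide;
chop $chopped;                         # drops U+03B1, leaving only U+00E9

print "latin:   ", storage_format($latin),   "\n";
print "wide:    ", storage_format($wide),    "\n";
print "chopped: ", storage_format($chopped), "\n";
```

Note that $latin and $chopped hold the same single character but are stored differently, which is why a byte count taken from the internal buffer depends on the string's history rather than only on its contents.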

