morgon has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
say I have an utf-8 encoded string that I want to truncate to a certain number of bytes.
So I am looking for the longest substring that - when encoded as utf-8 takes less than a certain number of bytes.
What is the best way to do this?
Many thanks!
Re: truncate string to byte count
by vr (Curate) on Feb 27, 2019 at 23:55 UTC
|
I wrote this before more experienced monks said it's unworthy XY-problem (maybe limited length, in bytes, of buffer of some sort?), but let it be FWIW :)
Straightforward one would be:
use strict;
use warnings;
use feature 'say';
use utf8;
use Encode qw/ encode decode _utf8_off /;
my $input = 'Test Ршзефф 号召力打了';
my $byte_limit = 25;
my $limited = decode( 'utf8',
substr(
encode( 'utf8', $input ),
0, $byte_limit ),
Encode::FB_QUIET | Encode::LEAVE_SRC );
binmode STDOUT, 'utf8';
say $limited;
25th byte is in the middle of 3d Chinese character, thus discarded. Obvious complications would be what if some characters can't be present at the end of line (word), what if diacritics (i.e. inseparable parts of graphemes) are thrown out, or invisible things such as joiners are left dangling, etc. Third (unused) import can be used to modify input in-place instead of "encode", e.g. for performance. The LEAVE_SRC is also for phantom of performance, isn't necessary. FB_QUIET returns valid decoded part. | [reply] |
|
Hey, wait! Nobody said it was an "unworthy" post, it was mentioned that it looks like an X/Y problem, which it kind of does.
That's why requests for details were thrown out there.
You're able to answer any way you want. You do not need to precede your answer in such a way... the more experienced Monks love answers that appear to go around the 'norm' :D
(Hell, I will even answer a homework question periodically when I'm bored/angry/frustrated whatever just to get my mind off of things, and sometimes more experienced Monks wouldn't do that even. Each to their own!)
| [reply] |
Re: truncate string to byte count
by haukex (Archbishop) on Feb 28, 2019 at 09:36 UTC
|
use warnings;
use strict;
use Test::More tests=>12;
my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is utf8cut($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is utf8cut($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7;
is utf8cut($in, 6), "\N{U+CF} \N{U+2764}";
is utf8cut($in, $_), "\N{U+CF} " for 5,4,3;
is utf8cut($in, 2), "\N{U+CF}";
is utf8cut($in, $_), "" for 1,0;
sub utf8cut {
my ($str, $bytelen) = @_;
utf8::encode($str);
$str = substr $str, 0, $bytelen;
$str =~ s/(?: [\xC0-\xDF] | [\xE0-\xEF] [\x80-\xBF]?
| [\xF0-\xF7] [\x80-\xBF]{0,2} )\z//x;
utf8::decode($str);
return $str;
}
Updates 1 & 2: As per replies, fixed by removing the code which did special handling when !utf8::is_utf8($str) (and the corresponding tests), which I had originally added to the code as an ill-conceived afterthought. | [reply] [d/l] [select] |
|
This utf8cut is buggy. It can give suffers from The Unicode Bug. It's output is dependent on how a string is stored internally, which is a bug.
For example, passing a string consisting of characters 80 and 80 with a second argument of 2 will can result in "\x80" (correct) and "\x80\x80" (incorrect).
| [reply] [d/l] [select] |
|
For example, passing a string consisting of characters 80 and 80 with a second argument of 2 will can [sic] result in "\x80" (correct) and "\x80\x80" (incorrect).
The way you've worded this makes it sound like the output is not deterministic, which is certainly not the case. Also, "a string consisting of characters 80 and 80" is not specific enough for a test case. But please feel free to provide some actual test code that demonstrates the bug you are trying to explain, or better yet, show how you would've coded it to (at least in your view) "correctly" handle the different strings "\x80\x80" and "\N{U+80}\N{U+80}".
| [reply] [d/l] [select] |
|
Re: truncate string to byte count
by Your Mother (Archbishop) on Feb 28, 2019 at 01:33 UTC
|
First draft. Seems the right idea if not a final. Probably needs some kind of a "binary truncation" to avoid doing a crap ton of work if given a couple megabytes of character data but only looking for the first 255 bytes.
Update, redacted the update, it was wrong and only up for 30 seconds. :P
use 5.16.0;
use strict;
use utf8;
use open ":std", ":encoding(utf8)";
use Encode;
# my $str = "艾捷克, 萨塔尔, 胡西它尔.";
my $str = "艾捷"; # Shortened string for terse output example.
my $max_bytes = shift || length encode(utf8 => $str);
while ( $max_bytes )
{
my $length;
$str =~ s/.\z// while ( $max_bytes < ( $length = length encode(utf8 => $str) ) );
$length ||= 0;
say <<"";
Max -> $max_bytes
Actual -> $length
String -> $str
$max_bytes--;
}
__END__
Max -> 6
Actual -> 6
String -> 艾捷
Max -> 5
Actual -> 3
String -> 艾
Max -> 4
Actual -> 3
String -> 艾
Max -> 3
Actual -> 3
String -> 艾
Max -> 2
Actual -> 0
String ->
Max -> 1
Actual -> 0
String ->
| [reply] |
Re: truncate string to byte count
by hippo (Bishop) on Feb 27, 2019 at 23:32 UTC
|
| [reply] |
|
It is not an XY-Problem.
It's the actual problem at hand.
The backgound is a program that implements a (passive) check for a monitoring system where you need to write your message to a pipe but the monitoring system imposes a byte-limit on the message.
| [reply] |
|
| [reply] [d/l] |
|
| [reply] |
Re: truncate string to byte count
by ikegami (Patriarch) on Feb 28, 2019 at 20:43 UTC
|
sub truncate_utf8 {
my ($utf8, $len) = @_;
$len += 0; # Make sure $len is a number.
return $utf8 =~ s/^.{0,$len}\K(?![\x80-\xBF]).*//sr;
}
or
sub truncate_utf8 {
my ($utf8, $len) = @_;
return $utf8 if length($utf8) <= $len;
my $next = substr($utf8, $len, length($utf8)-$len, '');
$next = chop($utf8) while (ord($next) & 0xC0) == 0x80;
return $utf8;
}
Both of these take text that is already encoded using UTF-8.
Update: Fixed typo mentioned by haukex.
Update: Made clear what the input should be. | [reply] [d/l] [select] |
|
use warnings;
use strict;
use Test::More tests=>15;
use open qw/:std :utf8/;
my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is truncate_utf8($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is truncate_utf8($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7;
is truncate_utf8($in, 6), "\N{U+CF} \N{U+2764}";
is truncate_utf8($in, $_), "\N{U+CF} " for 5,4,3;
is truncate_utf8($in, 2), "\N{U+CF}";
is truncate_utf8($in, $_), "" for 1,0;
$in = "\xE4b";
utf8::downgrade($in); # make sure this really is a non-UTF8 string
is truncate_utf8($in, 2), "\xE4b";
is truncate_utf8($in, 1), "\xE4";
is truncate_utf8($in, 0), "";
sub truncate_utf8 {
my ($utf8, $len) = @_;
$len += 0; # Make sure $len is a number.
return $utf8 =~ s/^.{0,$len}\K(?![\x80-\xBF]).*//sr;
}
__END__
1..15
ok 1
not ok 2
# Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤ '
not ok 3
# Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤ '
not ok 4
# Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤ '
not ok 5
# Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤ '
not ok 6
# Failed test at x.pl line 10.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤'
not ok 7
# Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï '
not ok 8
# Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ '
# expected: 'Ï '
not ok 9
# Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤'
# expected: 'Ï '
not ok 10
# Failed test at x.pl line 12.
# got: '� '
# expected: '�'
not ok 11
# Failed test at x.pl line 13.
# got: '�'
# expected: ''
ok 12
ok 13
ok 14
ok 15
# Looks like you failed 10 tests of 15.
sub truncate_utf8 {
my ($utf8, $len) = @_;
return $utf8 if length($utf8) <= $len;
my $next = substr($utf8, $len, length($utf8)-$len, '');
$next = chop($utf8) while (ord($next) & 0xC0) == 0x80;
return $utf8;
}
use warnings;
use strict;
use Test::More tests=>15;
use open qw/:std :utf8/;
my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is truncate_utf8($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is truncate_utf8($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7;
is truncate_utf8($in, 6), "\N{U+CF} \N{U+2764}";
is truncate_utf8($in, $_), "\N{U+CF} " for 5,4,3;
is truncate_utf8($in, 2), "\N{U+CF}";
is truncate_utf8($in, $_), "" for 1,0;
$in = "\xE4b";
utf8::downgrade($in); # make sure this really is a non-UTF8 string
is truncate_utf8($in, 2), "\xE4b";
is truncate_utf8($in, 1), "\xE4";
is truncate_utf8($in, 0), "";
sub truncate_utf8 {
my ($utf8, $len) = @_;
return $utf8 if length($utf8) <= $len;
my $next = substr($utf8, $len, length($utf8)-$len, '');
$next = chop($utf8) while (ord($next) & 0xC0) == 0x80;
return $utf8;
}
__END__
1..15
ok 1
not ok 2
# Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤ '
not ok 3
# Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤ '
not ok 4
# Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤ '
not ok 5
# Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤ '
not ok 6
# Failed test at x.pl line 10.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï ❤'
not ok 7
# Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ 🐪'
# expected: 'Ï '
not ok 8
# Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤ '
# expected: 'Ï '
not ok 9
# Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
# got: 'Ï ❤'
# expected: 'Ï '
not ok 10
# Failed test at x.pl line 12.
# got: '� '
# expected: '�'
not ok 11
# Failed test at x.pl line 13.
# got: '�'
# expected: ''
ok 12
ok 13
ok 14
ok 15
# Looks like you failed 10 tests of 15.
| [reply] [d/l] [select] |
|
is truncate_utf8($in, ...), "...";
with
is truncate_utf8(encode_utf8($in), ...), encode_utf8("...");
| [reply] [d/l] [select] |
|
|
|
Re: truncate string to byte count
by stevieb (Canon) on Feb 27, 2019 at 23:27 UTC
|
Actual examples please... example strings and "certain number of bytes".
Are you for example trying to carve up a utf-8 string into several blocks of say three bytes each?
| [reply] |
Re: truncate string to byte count
by karlgoethebier (Abbot) on Mar 01, 2019 at 12:44 UTC
|
| [reply] [d/l] |
Re: truncate string to byte count
by harangzsolt33 (Hermit) on Feb 28, 2019 at 01:42 UTC
|
Well, every Utf8-encoded character takes up 16 bits, so you just simply divide by 2 and make sure the result is an even number. If it is not, then subtract one, and then you have an index where it is safe to split the string. I don't understand why is this such a huge problem? | [reply] |
|
| [reply] |
|
> I don't understand why is this such a huge problem?
The (text-)string commands in Perl operate on a character and not byte basis. A string carries an internal utf8 flag which determines how it's handled.
Saying so, some commands like unpack or vec are supposed to operate on raw bit vectors and might be useful here.
*) i.e. variable byte length character
| [reply] |
|
| [reply] |
|
|