Re: truncate string to byte count

TIMTOWTDI...

use warnings;
use strict;
use Test::More tests=>12;

my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is utf8cut($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is utf8cut($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7;
is utf8cut($in,  6), "\N{U+CF} \N{U+2764}";
is utf8cut($in, $_), "\N{U+CF} " for 5,4,3;
is utf8cut($in,  2), "\N{U+CF}";
is utf8cut($in, $_), "" for 1,0;

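sub utf8cut {
    my ($str, $bytelen) = @_;
    utf8::encode($str);                 # characters -> UTF-8 octets
    $str = substr $str, 0, $bytelen;    # cut at the byte limit
    # Drop an incomplete multi-byte sequence left at the end by the cut:
    # a 2-, 3- or 4-byte lead byte followed by too few continuation bytes.
    $str =~ s/(?: [\xC0-\xDF] | [\xE0-\xEF] [\x80-\xBF]?
        | [\xF0-\xF7] [\x80-\xBF]{0,2} )\z//x;
    utf8::decode($str);                 # octets -> characters
    return $str;
}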

Updates 1 & 2: As per replies, fixed by removing the code which did special handling when !utf8::is_utf8($str) (and the corresponding tests), which I had originally added to the code as an ill-conceived afterthought.
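
For anyone wondering why branching on utf8::is_utf8($str) is a trap, here is a minimal sketch (my own illustration, not the code that was removed). It assumes only core utf8:: functions and Test::More. The same two characters are stored both ways internally; the flag differs, but the strings compare equal and utf8::encode yields the same octets for both.

```perl
use warnings;
use strict;
use Test::More;

# The same two characters (U+0080 twice), stored two different ways internally.
my $s = "\x80\x80";
utf8::upgrade(   my $u = $s );   # internal UTF-8 storage, UTF8 flag on
utf8::downgrade( my $d = $s );   # internal byte storage, UTF8 flag off

ok  utf8::is_utf8($u), 'upgraded copy has the UTF8 flag';
ok !utf8::is_utf8($d), 'downgraded copy does not';
is $u, $d, 'yet Perl compares them as equal strings';

# utf8::encode works on characters, so both copies yield the same octets.
utf8::encode( my $ub = $u );
utf8::encode( my $db = $d );
is $ub, "\xC2\x80\xC2\x80", 'upgraded copy encodes to 4 bytes';
is $db, "\xC2\x80\xC2\x80", 'downgraded copy encodes to the same 4 bytes';

done_testing;
```

So anything that treats the flag as meaning "this is text" versus "this is bytes" will give different answers for equal inputs, while always calling utf8::encode does not.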

Re^2: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 20:25 UTC
This `utf8cut` is buggy. It suffers from The Unicode Bug. Its output is dependent on how a string is stored internally, which is a bug. For example, passing a string consisting of characters `80` and `80` with a second argument of 2 will can result in `"\x80"` (correct) and `"\x80\x80"` (incorrect).
Re^3: truncate string to byte count by haukex (Archbishop) on Feb 28, 2019 at 20:54 UTC
> For example, passing a string consisting of characters `80` and `80` with a second argument of 2 will can [sic] result in `"\x80"` (correct) and `"\x80\x80"` (incorrect).

The way you've worded this makes it sound like the output is not deterministic, which is certainly not the case. Also, "a string consisting of characters `80` and `80`" is not specific enough for a test case.

But please feel free to provide some actual test code that demonstrates the bug you are trying to explain, or better yet, show how you would've coded it to (at least in your view) "correctly" handle the different strings `"\x80\x80"` and `"\N{U+80}\N{U+80}"`.
Re^4: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 21:09 UTC
> But please feel free to provide some actual test code that demonstrates the bug

Code that suffers from The Unicode Bug is code that returns different results for equal strings. This is easily demonstrated using the following:

my $s = "\x80\x80";
utf8::upgrade( my $u = $s );
utf8::downgrade( my $d = $s );
is($u, $d);
is(utf8cut($u,2), utf8cut($d,2));

> better yet, show how you would've coded it to (at least in your view) "correctly" handle the different strings `"\x80\x80"` and `"\N{U+80}\N{U+80}"`.

Perl considers those the same value, and any code that doesn't is by definition suffering from The Unicode Bug.
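
To make that check runnable on its own, here is a sketch that pairs it with utf8cut as it stands in the updated post at the top of the thread (copied from there; the test descriptions and the third assertion are my additions). With the updated code, which always calls utf8::encode, both storage formats should give the same result.

```perl
use warnings;
use strict;
use Test::More;

# utf8cut as posted at the top of this thread, after the updates.
sub utf8cut {
    my ($str, $bytelen) = @_;
    utf8::encode($str);
    $str = substr $str, 0, $bytelen;
    $str =~ s/(?: [\xC0-\xDF] | [\xE0-\xEF] [\x80-\xBF]?
        | [\xF0-\xF7] [\x80-\xBF]{0,2} )\z//x;
    utf8::decode($str);
    return $str;
}

# Equal strings, different internal storage.
my $s = "\x80\x80";
utf8::upgrade(   my $u = $s );
utf8::downgrade( my $d = $s );

is $u, $d, 'inputs are equal strings';
is utf8cut($u, 2), utf8cut($d, 2), 'same result for either storage format';
is utf8cut($d, 2), "\x80", 'one character (two UTF-8 bytes) fits in 2 bytes';

done_testing;
```

A version that branches on utf8::is_utf8 would be expected to fail the second assertion, since the downgraded copy would be cut as raw bytes instead of as encoded characters.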

