Re^2: truncate string to byte count

sub truncate_utf8 { my ($utf8, $len) = @_; $len += 0; # Make sure $len is a number. return $utf8 =~ s/^.{0,$len)\K(?![\x80-\xBF]).*//sr; }
[download]

Unmatched ) in regex, and if I fix that:

use warnings;
use strict;
use Test::More tests=>15;
use open qw/:std :utf8/;

my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is truncate_utf8($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is truncate_utf8($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7;
is truncate_utf8($in,  6), "\N{U+CF} \N{U+2764}";
is truncate_utf8($in, $_), "\N{U+CF} " for 5,4,3;
is truncate_utf8($in,  2), "\N{U+CF}";
is truncate_utf8($in, $_), "" for 1,0;

$in = "\xE4b";
utf8::downgrade($in); # make sure this really is a non-UTF8 string
is truncate_utf8($in, 2), "\xE4b";
is truncate_utf8($in, 1), "\xE4";
is truncate_utf8($in, 0), "";

sub truncate_utf8 {
   my ($utf8, $len) = @_;
   $len += 0;  # Make sure $len is a number.
   return $utf8 =~ s/^.{0,$len}\K(?![\x80-\xBF]).*//sr;
}

__END__

1..15
ok 1
not ok 2
#   Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084; '
not ok 3
#   Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084; '
not ok 4
#   Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084; '
not ok 5
#   Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084; '
not ok 6
#   Failed test at x.pl line 10.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084;'
not ok 7
#   Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī '
not ok 8
#   Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; '
#     expected: 'Ī '
not ok 9
#   Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084;'
#     expected: 'Ī '
not ok 10
#   Failed test at x.pl line 12.
#          got: '&#65533; '
#     expected: '&#65533;'
not ok 11
#   Failed test at x.pl line 13.
#          got: '&#65533;'
#     expected: ''
ok 12
ok 13
ok 14
ok 15
# Looks like you failed 10 tests of 15.
[download]

sub truncate_utf8 { my ($utf8, $len) = @_; return $utf8 if length($utf8) <= $len; my $next = substr($utf8, $len, length($utf8)-$len, ''); $next = chop($utf8) while (ord($next) & 0xC0) == 0x80; return $utf8; }
[download]

use warnings;
use strict;
use Test::More tests=>15;
use open qw/:std :utf8/;

my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is truncate_utf8($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}";
is truncate_utf8($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7;
is truncate_utf8($in,  6), "\N{U+CF} \N{U+2764}";
is truncate_utf8($in, $_), "\N{U+CF} " for 5,4,3;
is truncate_utf8($in,  2), "\N{U+CF}";
is truncate_utf8($in, $_), "" for 1,0;

$in = "\xE4b";
utf8::downgrade($in); # make sure this really is a non-UTF8 string
is truncate_utf8($in, 2), "\xE4b";
is truncate_utf8($in, 1), "\xE4";
is truncate_utf8($in, 0), "";

sub truncate_utf8 {
   my ($utf8, $len) = @_;
   return $utf8 if length($utf8) <= $len;
   my $next = substr($utf8, $len, length($utf8)-$len, '');
   $next = chop($utf8) while (ord($next) & 0xC0) == 0x80;
   return $utf8;
}

__END__

1..15
ok 1
not ok 2
#   Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084; '
not ok 3
#   Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084; '
not ok 4
#   Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084; '
not ok 5
#   Failed test at x.pl line 9.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084; '
not ok 6
#   Failed test at x.pl line 10.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī &#10084;'
not ok 7
#   Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; &#128042;'
#     expected: 'Ī '
not ok 8
#   Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084; '
#     expected: 'Ī '
not ok 9
#   Failed test at x.pl line 11.
Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo
+rmatter/TAP.pm line 112.
#          got: 'Ī &#10084;'
#     expected: 'Ī '
not ok 10
#   Failed test at x.pl line 12.
#          got: '&#65533; '
#     expected: '&#65533;'
not ok 11
#   Failed test at x.pl line 13.
#          got: '&#65533;'
#     expected: ''
ok 12
ok 13
ok 14
ok 15
# Looks like you failed 10 tests of 15.
[download]

Comment on Re^2: truncate string to byte count Select or Download Code

Replies are listed 'Best First'.
Re^3: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 21:13 UTC
Tests 1 to 11 are incorrect because they don't provide UTF-8. Replace `is truncate_utf8($in, ...), "...";` [download] with `is truncate_utf8(encode_utf8($in), ...), encode_utf8("...");` [download]	[reply] [d/l] [select]
Re^4: truncate string to byte count by haukex (Archbishop) on Feb 28, 2019 at 21:17 UTC
Tests 1 to 12 are incorrect because they don't provide UTF-8. I think you're making assumptions about what the OP's input data looks like... Update: Tests 1 to 11 ... Replace ... It is uncool to update a node in a way that renders replies confusing or meaningless. Update 2: How about "is is uncool to make someone chase down all of your ninja edits and wonder when the ninja editing will be done." You made several other ninja edits (like this one) to your nodes and I had to update my replies several times while composing them.	[reply]
Re^5: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 21:19 UTC
No. "say I have an utf-8 encoded string"	[reply]
Re^5: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 21:25 UTC
1. There was no reply. 2. Neither noticing that test 12 is an empty string nor adding a bug fix to the bug explanation render the future reply confusing or meaningless.	[reply]


Syntactic Confectionery Delight
	PerlMonks