truncate string to byte count

morgon has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: truncate string to byte count by vr (Curate) on Feb 27, 2019 at 23:55 UTC
I wrote this before more experienced monks said it's unworthy XY-problem (maybe limited length, in bytes, of buffer of some sort?), but let it be FWIW :) Straightforward one would be: use strict; use warnings; use feature 'say'; use utf8; use Encode qw/ encode decode _utf8_off /; my $input = 'Test Ршзефф 号召力打了'; my $byte_limit = 25; my $limited = decode( 'utf8', substr( encode( 'utf8', $input ), 0, $byte_limit ), Encode::FB_QUIET \| Encode::LEAVE_SRC ); binmode STDOUT, 'utf8'; say $limited; 25th byte is in the middle of 3d Chinese character, thus discarded. Obvious complications would be what if some characters can't be present at the end of line (word), what if diacritics (i.e. inseparable parts of graphemes) are thrown out, or invisible things such as joiners are left dangling, etc. Third (unused) import can be used to modify input in-place instead of "encode", e.g. for performance. The LEAVE_SRC is also for phantom of performance, isn't necessary. FB_QUIET returns valid decoded part.	[reply]
Re^2: truncate string to byte count by stevieb (Canon) on Feb 27, 2019 at 23:59 UTC
Hey, wait! Nobody said it was an "unworthy" post, it was mentioned that it looks like an X/Y problem, which it kind of does. That's why requests for details were thrown out there. You're able to answer any way you want. You do not need to precede your answer in such a way... the more experienced Monks love answers that appear to go around the 'norm' :D (Hell, I will even answer a homework question periodically when I'm bored/angry/frustrated whatever just to get my mind off of things, and sometimes more experienced Monks wouldn't do that even. Each to their own!)	[reply]
Re: truncate string to byte count by haukex (Archbishop) on Feb 28, 2019 at 09:36 UTC
TIMTOWTDI... use warnings; use strict; use Test::More tests=>12; my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is utf8cut($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is utf8cut($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7; is utf8cut($in, 6), "\N{U+CF} \N{U+2764}"; is utf8cut($in, $_), "\N{U+CF} " for 5,4,3; is utf8cut($in, 2), "\N{U+CF}"; is utf8cut($in, $_), "" for 1,0; sub utf8cut { my ($str, $bytelen) = @_; utf8::encode($str); $str = substr $str, 0, $bytelen; $str =~ s/(?: [\xC0-\xDF] \| [\xE0-\xEF] [\x80-\xBF]? \| [\xF0-\xF7] [\x80-\xBF]{0,2} )\z//x; utf8::decode($str); return $str; } [download] Updates 1 & 2: As per replies, fixed by removing the code which did special handling when `!utf8::is_utf8($str)` (and the corresponding tests), which I had originally added to the code as an ill-conceived afterthought.	[reply] [d/l] [select]
Re^2: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 20:25 UTC
This `utf8cut` is buggy. It can give suffers from The Unicode Bug. It's output is dependent on how a string is stored internally, which is a bug. For example, passing a string consisting of characters `80` and `80` with a second argument of 2 will can result in `"\x80"` (correct) and `"\x80\x80"` (incorrect).	[reply] [d/l] [select]
Re^3: truncate string to byte count by haukex (Archbishop) on Feb 28, 2019 at 20:54 UTC
For example, passing a string consisting of characters `80` and `80` with a second argument of 2 will can [sic] result in `"\x80"` (correct) and `"\x80\x80"` (incorrect). The way you've worded this makes it sound like the output is not deterministic, which is certainly not the case. Also, "a string consisting of characters `80` and `80`" is not specific enough for a test case. But please feel free to provide some actual test code that demonstrates the bug you are trying to explain, or better yet, show how you would've coded it to (at least in your view) "correctly" handle the different strings `"\x80\x80"` and `"\N{U+80}\N{U+80}"`.	[reply] [d/l] [select]
Re^4: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 21:09 UTC
Re: truncate string to byte count by Your Mother (Archbishop) on Feb 28, 2019 at 01:33 UTC
First draft. Seems the right idea if not a final. Probably needs some kind of a "binary truncation" to avoid doing a crap ton of work if given a couple megabytes of character data but only looking for the first 255 bytes. Update, redacted the update, it was wrong and only up for 30 seconds. :P use 5.16.0; use strict; use utf8; use open ":std", ":encoding(utf8)"; use Encode; # my $str = "艾捷克, 萨塔尔, 胡西它尔."; my $str = "艾捷"; # Shortened string for terse output example. my $max_bytes = shift \|\| length encode(utf8 => $str); while ( $max_bytes ) { my $length; $str =~ s/.\z// while ( $max_bytes < ( $length = length encode(utf8 => $str) ) ); $length \|\|= 0; say <<""; Max -> $max_bytes Actual -> $length String -> $str $max_bytes--; } __END__ Max -> 6 Actual -> 6 String -> 艾捷 Max -> 5 Actual -> 3 String -> 艾 Max -> 4 Actual -> 3 String -> 艾 Max -> 3 Actual -> 3 String -> 艾 Max -> 2 Actual -> 0 String -> Max -> 1 Actual -> 0 String ->	[reply]
Re: truncate string to byte count by hippo (Bishop) on Feb 27, 2019 at 23:32 UTC
say I have an utf-8 encoded string that I want to truncate to a certain number of bytes. That's an XY Problem if ever I heard one. What are you really trying to do?	[reply]
Re^2: truncate string to byte count by morgon (Priest) on Feb 28, 2019 at 10:02 UTC
It is not an XY-Problem. It's the actual problem at hand. The backgound is a program that implements a (passive) check for a monitoring system where you need to write your message to a pipe but the monitoring system imposes a byte-limit on the message.	[reply]
Re^3: truncate string to byte count by karlgoethebier (Abbot) on Mar 01, 2019 at 10:03 UTC
Sounds like it is Nagios. Can�t you modify the output length of your checks in the database? Regards, Karl �The Crux of the Biscuit is the Apostrophe� `perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'`Help	[reply] [d/l]
Re^2: truncate string to byte count by stevieb (Canon) on Feb 27, 2019 at 23:34 UTC
That's better than my response (very long day) ;)	[reply]
Re: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 20:43 UTC
A valid cut is one that isn't followed by a continuation byte (0b10xx_xxxx). `sub truncate_utf8 { my ($utf8, $len) = @_; $len += 0; # Make sure $len is a number. return $utf8 =~ s/^.{0,$len}\K(?![\x80-\xBF]).//sr; }` [download] or `sub truncate_utf8 { my ($utf8, $len) = @_; return $utf8 if length($utf8) <= $len; my $next = substr($utf8, $len, length($utf8)-$len, ''); $next = chop($utf8) while (ord($next) & 0xC0) == 0x80; return $utf8; }` [download] Both of these take text that is already encoded using UTF-8. Update: Fixed typo mentioned by haukex. Update*: Made clear what the input should be.	[reply] [d/l] [select]
Re^2: truncate string to byte count by haukex (Archbishop) on Feb 28, 2019 at 21:04 UTC
`sub truncate_utf8 { my ($utf8, $len) = @_; $len += 0; # Make sure $len is a number. return $utf8 =~ s/^.{0,$len)\K(?![\x80-\xBF]).//sr; }` [download] `Unmatched ) in regex`, and if I fix that: use warnings; use strict; use Test::More tests=>15; use open qw/:std :utf8/; my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is truncate_utf8($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is truncate_utf8($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7; is truncate_utf8($in, 6), "\N{U+CF} \N{U+2764}"; is truncate_utf8($in, $_), "\N{U+CF} " for 5,4,3; is truncate_utf8($in, 2), "\N{U+CF}"; is truncate_utf8($in, $_), "" for 1,0; $in = "\xE4b"; utf8::downgrade($in); # make sure this really is a non-UTF8 string is truncate_utf8($in, 2), "\xE4b"; is truncate_utf8($in, 1), "\xE4"; is truncate_utf8($in, 0), ""; sub truncate_utf8 { my ($utf8, $len) = @_; $len += 0; # Make sure $len is a number. return $utf8 =~ s/^.{0,$len}\K(?![\x80-\xBF]).//sr; } __END__ 1..15 ok 1 not ok 2 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤ ' not ok 3 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤ ' not ok 4 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤ ' not ok 5 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤ ' not ok 6 # Failed test at x.pl line 10. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤' not ok 7 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ' not ok 8 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ ' # expected: '� ' not ok 9 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤' # expected: '� ' not ok 10 # Failed test at x.pl line 12. # got: '� ' # expected: '�' not ok 11 # Failed test at x.pl line 13. # got: '�' # expected: '' ok 12 ok 13 ok 14 ok 15 # Looks like you failed 10 tests of 15. [download] `sub truncate_utf8 { my ($utf8, $len) = @_; return $utf8 if length($utf8) <= $len; my $next = substr($utf8, $len, length($utf8)-$len, ''); $next = chop($utf8) while (ord($next) & 0xC0) == 0x80; return $utf8; }` [download] use warnings; use strict; use Test::More tests=>15; use open qw/:std :utf8/; my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is truncate_utf8($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is truncate_utf8($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7; is truncate_utf8($in, 6), "\N{U+CF} \N{U+2764}"; is truncate_utf8($in, $_), "\N{U+CF} " for 5,4,3; is truncate_utf8($in, 2), "\N{U+CF}"; is truncate_utf8($in, $_), "" for 1,0; $in = "\xE4b"; utf8::downgrade($in); # make sure this really is a non-UTF8 string is truncate_utf8($in, 2), "\xE4b"; is truncate_utf8($in, 1), "\xE4"; is truncate_utf8($in, 0), ""; sub truncate_utf8 { my ($utf8, $len) = @_; return $utf8 if length($utf8) <= $len; my $next = substr($utf8, $len, length($utf8)-$len, ''); $next = chop($utf8) while (ord($next) & 0xC0) == 0x80; return $utf8; } __END__ 1..15 ok 1 not ok 2 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤ ' not ok 3 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤ ' not ok 4 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤ ' not ok 5 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤ ' not ok 6 # Failed test at x.pl line 10. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ❤' not ok 7 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ 🐪' # expected: '� ' not ok 8 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤ ' # expected: '� ' not ok 9 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: '� ❤' # expected: '� ' not ok 10 # Failed test at x.pl line 12. # got: '� ' # expected: '�' not ok 11 # Failed test at x.pl line 13. # got: '�' # expected: '' ok 12 ok 13 ok 14 ok 15 # Looks like you failed 10 tests of 15. [download]	[reply] [d/l] [select]
Re^3: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 21:13 UTC
Tests 1 to 11 are incorrect because they don't provide UTF-8. Replace `is truncate_utf8($in, ...), "...";` [download] with `is truncate_utf8(encode_utf8($in), ...), encode_utf8("...");` [download]	[reply] [d/l] [select]
Re^4: truncate string to byte count by haukex (Archbishop) on Feb 28, 2019 at 21:17 UTC
Re^5: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 21:19 UTC
Re^5: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 21:25 UTC
Re: truncate string to byte count by stevieb (Canon) on Feb 27, 2019 at 23:27 UTC
Actual examples please... example strings and "certain number of bytes". Are you for example trying to carve up a utf-8 string into several blocks of say three bytes each?	[reply]
Re: truncate string to byte count by karlgoethebier (Abbot) on Mar 01, 2019 at 12:44 UTC
See also. Probably totally useless 😎 �The Crux of the Biscuit is the Apostrophe� `perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'`Help	[reply] [d/l]
Re: truncate string to byte count by harangzsolt33 (Chaplain) on Feb 28, 2019 at 01:42 UTC
Well, every Utf8-encoded character takes up 16 bits, so you just simply divide by 2 and make sure the result is an even number. If it is not, then subtract one, and then you have an index where it is safe to split the string. I don't understand why is this such a huge problem?	[reply]
Re^2: truncate string to byte count by Your Mother (Archbishop) on Feb 28, 2019 at 01:55 UTC
Because that's not remotely right�? UTF-8 is a variable length encoding with a minimum of 8 bits per character. Characters with higher code points will take up to 32 bits.	[reply]
Re^2: truncate string to byte count by LanX (Saint) on Feb 28, 2019 at 11:50 UTC
> I don't understand why is this such a huge problem? The (text-)string commands in Perl operate on a character and not byte basis. A string carries an internal utf8 flag which determines how it's handled. Saying so, some commands like `unpack` or `vec` are supposed to operate on raw bit vectors and might be useful here. *) i.e. variable byte length character Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply]
Re^2: truncate string to byte count by ikegami (Patriarch) on Feb 28, 2019 at 20:18 UTC
You might be thinking of UTF-16, but that's also wrong. A character encoded using UTF-16 results in 2 or 4 bytes depending on the character.	[reply]