http://qs321.pair.com?node_id=1230633

morgon has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

say I have an utf-8 encoded string that I want to truncate to a certain number of bytes.

So I am looking for the longest substring that - when encoded as utf-8 takes less than a certain number of bytes.

What is the best way to do this?

Many thanks!

Replies are listed 'Best First'.
Re: truncate string to byte count
by vr (Curate) on Feb 27, 2019 at 23:55 UTC

    I wrote this before more experienced monks said it's unworthy XY-problem (maybe limited length, in bytes, of buffer of some sort?), but let it be FWIW :)

    Straightforward one would be:

    use strict;
    use warnings;
    use feature 'say';
    
    use utf8;
    use Encode qw/ encode decode _utf8_off /;
    
    my $input = 'Test Ршзефф 号召力打了';
    my $byte_limit = 25;
    
    my $limited = decode( 'utf8',
        substr( 
            encode( 'utf8', $input ), 
        0, $byte_limit ),
    Encode::FB_QUIET | Encode::LEAVE_SRC );
        
    binmode STDOUT, 'utf8';
    say $limited;
    

    25th byte is in the middle of 3d Chinese character, thus discarded. Obvious complications would be what if some characters can't be present at the end of line (word), what if diacritics (i.e. inseparable parts of graphemes) are thrown out, or invisible things such as joiners are left dangling, etc. Third (unused) import can be used to modify input in-place instead of "encode", e.g. for performance. The LEAVE_SRC is also for phantom of performance, isn't necessary. FB_QUIET returns valid decoded part.

      Hey, wait! Nobody said it was an "unworthy" post, it was mentioned that it looks like an X/Y problem, which it kind of does.

      That's why requests for details were thrown out there.

      You're able to answer any way you want. You do not need to precede your answer in such a way... the more experienced Monks love answers that appear to go around the 'norm' :D

      (Hell, I will even answer a homework question periodically when I'm bored/angry/frustrated whatever just to get my mind off of things, and sometimes more experienced Monks wouldn't do that even. Each to their own!)

Re: truncate string to byte count
by haukex (Archbishop) on Feb 28, 2019 at 09:36 UTC

    TIMTOWTDI...

    use warnings; use strict; use Test::More tests=>12; my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is utf8cut($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is utf8cut($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7; is utf8cut($in, 6), "\N{U+CF} \N{U+2764}"; is utf8cut($in, $_), "\N{U+CF} " for 5,4,3; is utf8cut($in, 2), "\N{U+CF}"; is utf8cut($in, $_), "" for 1,0; sub utf8cut { my ($str, $bytelen) = @_; utf8::encode($str); $str = substr $str, 0, $bytelen; $str =~ s/(?: [\xC0-\xDF] | [\xE0-\xEF] [\x80-\xBF]? | [\xF0-\xF7] [\x80-\xBF]{0,2} )\z//x; utf8::decode($str); return $str; }

    Updates 1 & 2: As per replies, fixed by removing the code which did special handling when !utf8::is_utf8($str) (and the corresponding tests), which I had originally added to the code as an ill-conceived afterthought.

      This utf8cut is buggy. It can give suffers from The Unicode Bug. It's output is dependent on how a string is stored internally, which is a bug.

      For example, passing a string consisting of characters 80 and 80 with a second argument of 2 will can result in "\x80" (correct) and "\x80\x80" (incorrect).

        For example, passing a string consisting of characters 80 and 80 with a second argument of 2 will can [sic] result in "\x80" (correct) and "\x80\x80" (incorrect).

        The way you've worded this makes it sound like the output is not deterministic, which is certainly not the case. Also, "a string consisting of characters 80 and 80" is not specific enough for a test case. But please feel free to provide some actual test code that demonstrates the bug you are trying to explain, or better yet, show how you would've coded it to (at least in your view) "correctly" handle the different strings "\x80\x80" and "\N{U+80}\N{U+80}".

Re: truncate string to byte count
by Your Mother (Archbishop) on Feb 28, 2019 at 01:33 UTC

    First draft. Seems the right idea if not a final. Probably needs some kind of a "binary truncation" to avoid doing a crap ton of work if given a couple megabytes of character data but only looking for the first 255 bytes.

    Update, redacted the update, it was wrong and only up for 30 seconds. :P

    use 5.16.0;
    use strict;
    use utf8;
    use open ":std", ":encoding(utf8)";
    use Encode;
    
    # my $str = "艾捷克, 萨塔尔, 胡西它尔.";
    my $str = "艾捷"; # Shortened string  for terse output example.
    
    my $max_bytes = shift || length encode(utf8 => $str);
    
    while ( $max_bytes )
    {
        my $length;
        $str =~ s/.\z// while ( $max_bytes < ( $length = length encode(utf8 => $str) ) );
        $length ||= 0;
        say <<"";
            Max -> $max_bytes
         Actual -> $length
         String -> $str
    
         $max_bytes--;
    }
    __END__
            Max -> 6
         Actual -> 6
         String -> 艾捷
    
            Max -> 5
         Actual -> 3
         String -> 艾
    
            Max -> 4
         Actual -> 3
         String -> 艾
    
            Max -> 3
         Actual -> 3
         String -> 艾
    
            Max -> 2
         Actual -> 0
         String -> 
    
            Max -> 1
         Actual -> 0
         String -> 
    
Re: truncate string to byte count
by hippo (Bishop) on Feb 27, 2019 at 23:32 UTC
    say I have an utf-8 encoded string that I want to truncate to a certain number of bytes.

    That's an XY Problem if ever I heard one. What are you really trying to do?

      It is not an XY-Problem.

      It's the actual problem at hand.

      The backgound is a program that implements a (passive) check for a monitoring system where you need to write your message to a pipe but the monitoring system imposes a byte-limit on the message.

        Sounds like it is Nagios. Can‘t you modify the output length of your checks in the database? Regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

      That's better than my response (very long day) ;)

Re: truncate string to byte count
by ikegami (Patriarch) on Feb 28, 2019 at 20:43 UTC

    A valid cut is one that isn't followed by a continuation byte (0b10xx_xxxx).

    sub truncate_utf8 { my ($utf8, $len) = @_; $len += 0; # Make sure $len is a number. return $utf8 =~ s/^.{0,$len}\K(?![\x80-\xBF]).*//sr; }
    or
    sub truncate_utf8 { my ($utf8, $len) = @_; return $utf8 if length($utf8) <= $len; my $next = substr($utf8, $len, length($utf8)-$len, ''); $next = chop($utf8) while (ord($next) & 0xC0) == 0x80; return $utf8; }

    Both of these take text that is already encoded using UTF-8.

    Update: Fixed typo mentioned by haukex.
    Update: Made clear what the input should be.

      sub truncate_utf8 { my ($utf8, $len) = @_; $len += 0; # Make sure $len is a number. return $utf8 =~ s/^.{0,$len)\K(?![\x80-\xBF]).*//sr; }

      Unmatched ) in regex, and if I fix that:

      use warnings; use strict; use Test::More tests=>15; use open qw/:std :utf8/; my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is truncate_utf8($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is truncate_utf8($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7; is truncate_utf8($in, 6), "\N{U+CF} \N{U+2764}"; is truncate_utf8($in, $_), "\N{U+CF} " for 5,4,3; is truncate_utf8($in, 2), "\N{U+CF}"; is truncate_utf8($in, $_), "" for 1,0; $in = "\xE4b"; utf8::downgrade($in); # make sure this really is a non-UTF8 string is truncate_utf8($in, 2), "\xE4b"; is truncate_utf8($in, 1), "\xE4"; is truncate_utf8($in, 0), ""; sub truncate_utf8 { my ($utf8, $len) = @_; $len += 0; # Make sure $len is a number. return $utf8 =~ s/^.{0,$len}\K(?![\x80-\xBF]).*//sr; } __END__ 1..15 ok 1 not ok 2 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084; ' not ok 3 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084; ' not ok 4 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084; ' not ok 5 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084; ' not ok 6 # Failed test at x.pl line 10. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084;' not ok 7 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï ' not ok 8 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; ' # expected: 'Ï ' not ok 9 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084;' # expected: 'Ï ' not ok 10 # Failed test at x.pl line 12. # got: '&#65533; ' # expected: '&#65533;' not ok 11 # Failed test at x.pl line 13. # got: '&#65533;' # expected: '' ok 12 ok 13 ok 14 ok 15 # Looks like you failed 10 tests of 15.
      sub truncate_utf8 { my ($utf8, $len) = @_; return $utf8 if length($utf8) <= $len; my $next = substr($utf8, $len, length($utf8)-$len, ''); $next = chop($utf8) while (ord($next) & 0xC0) == 0x80; return $utf8; }
      use warnings; use strict; use Test::More tests=>15; use open qw/:std :utf8/; my $in = "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is truncate_utf8($in, 11), "\N{U+CF} \N{U+2764} \N{U+1F42A}"; is truncate_utf8($in, $_), "\N{U+CF} \N{U+2764} " for 10,9,8,7; is truncate_utf8($in, 6), "\N{U+CF} \N{U+2764}"; is truncate_utf8($in, $_), "\N{U+CF} " for 5,4,3; is truncate_utf8($in, 2), "\N{U+CF}"; is truncate_utf8($in, $_), "" for 1,0; $in = "\xE4b"; utf8::downgrade($in); # make sure this really is a non-UTF8 string is truncate_utf8($in, 2), "\xE4b"; is truncate_utf8($in, 1), "\xE4"; is truncate_utf8($in, 0), ""; sub truncate_utf8 { my ($utf8, $len) = @_; return $utf8 if length($utf8) <= $len; my $next = substr($utf8, $len, length($utf8)-$len, ''); $next = chop($utf8) while (ord($next) & 0xC0) == 0x80; return $utf8; } __END__ 1..15 ok 1 not ok 2 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084; ' not ok 3 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084; ' not ok 4 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084; ' not ok 5 # Failed test at x.pl line 9. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084; ' not ok 6 # Failed test at x.pl line 10. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï &#10084;' not ok 7 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; &#128042;' # expected: 'Ï ' not ok 8 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084; ' # expected: 'Ï ' not ok 9 # Failed test at x.pl line 11. Wide character in print at /opt/perl5.28/lib/site_perl/5.28.1/Test2/Fo +rmatter/TAP.pm line 112. # got: 'Ï &#10084;' # expected: 'Ï ' not ok 10 # Failed test at x.pl line 12. # got: '&#65533; ' # expected: '&#65533;' not ok 11 # Failed test at x.pl line 13. # got: '&#65533;' # expected: '' ok 12 ok 13 ok 14 ok 15 # Looks like you failed 10 tests of 15.

        Tests 1 to 11 are incorrect because they don't provide UTF-8. Replace

        is truncate_utf8($in, ...), "...";
        with
        is truncate_utf8(encode_utf8($in), ...), encode_utf8("...");
Re: truncate string to byte count
by stevieb (Canon) on Feb 27, 2019 at 23:27 UTC

    Actual examples please... example strings and "certain number of bytes".

    Are you for example trying to carve up a utf-8 string into several blocks of say three bytes each?

Re: truncate string to byte count
by karlgoethebier (Abbot) on Mar 01, 2019 at 12:44 UTC

    See also. Probably totally useless 😎

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Re: truncate string to byte count
by harangzsolt33 (Chaplain) on Feb 28, 2019 at 01:42 UTC
    Well, every Utf8-encoded character takes up 16 bits, so you just simply divide by 2 and make sure the result is an even number. If it is not, then subtract one, and then you have an index where it is safe to split the string. I don't understand why is this such a huge problem?

      Because that's not remotely right…?

      UTF-8 is a variable length encoding with a minimum of 8 bits per character. Characters with higher code points will take up to 32 bits.
      > I don't understand why is this such a huge problem?

      The (text-)string commands in Perl operate on a character and not byte basis. A string carries an internal utf8 flag which determines how it's handled.

      Saying so, some commands like unpack or vec are supposed to operate on raw bit vectors and might be useful here.

      *) i.e. variable byte length character

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      You might be thinking of UTF-16, but that's also wrong. A character encoded using UTF-16 results in 2 or 4 bytes depending on the character.