http://qs321.pair.com?node_id=1230646


in reply to truncate string to byte count

Well, every UTF-8-encoded character takes up 16 bits, so you simply divide by 2 and make sure the result is an even number. If it is not, subtract one, and then you have an index where it is safe to split the string. I don't understand why is this such a huge problem?

Re^2: truncate string to byte count
by Your Mother (Archbishop) on Feb 28, 2019 at 01:55 UTC

    Because that's not remotely right…?

    UTF-8 is a variable-length encoding with a minimum of 8 bits per character. Characters with higher code points take up to 32 bits (4 bytes).
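
To make the variable width concrete, here is a minimal sketch (in Python rather than Perl, purely for illustration; the helper name `utf8_len` is hypothetical) that counts how many bytes each character occupies once encoded as UTF-8. The sample characters are arbitrary picks from different code point ranges.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode ch as UTF-8."""
    return len(ch.encode("utf-8"))

# Illustrative samples: ASCII, Latin-1 range, rest of the BMP, astral plane.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s)")
```

The four samples come out at 1, 2, 3 and 4 bytes respectively, so no fixed divisor maps a byte count to a character count.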
Re^2: truncate string to byte count
by LanX (Saint) on Feb 28, 2019 at 11:50 UTC
    > I don't understand why is this such a huge problem?

    Perl's (text-)string functions operate on characters*, not bytes. A string carries an internal utf8 flag which determines how it's handled.

    That said, some functions like unpack or vec are meant to operate on raw bit vectors and might be useful here.
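
The character-versus-byte distinction is what makes truncation fiddly: the cut has to be made on the encoded bytes, but it must not land inside a multi-byte character. A rough sketch of one way to do that, written in Python for illustration rather than as the Perl answer to the original question (the function name `truncate_utf8` is hypothetical, and it assumes the byte budget applies to the UTF-8 encoding):

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    max_bytes = max(max_bytes, 0)  # treat a negative budget as zero
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. If data[cut] is one,
    # cutting here would split a character, so step back to its lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")
```

With the string "a", U+00E9, U+20AC, U+1F600 (10 bytes in total), a budget of 5 bytes keeps only the first two characters (3 bytes), because the third character would need bytes 4 through 6.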

    *) i.e. variable byte length character

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery
    FootballPerl is like chess, only without the dice

Re^2: truncate string to byte count
by ikegami (Patriarch) on Feb 28, 2019 at 20:18 UTC

    You might be thinking of UTF-16, but that's also wrong. A character encoded using UTF-16 results in 2 or 4 bytes depending on the character.
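
The same kind of check can be sketched for UTF-16 (again in Python for illustration; `utf16_len` is a hypothetical helper, and the "utf-16-le" codec is used so that no byte order mark is added to the count):

```python
def utf16_len(ch: str) -> int:
    """Return the number of bytes needed to encode ch as UTF-16 (little-endian, no BOM)."""
    return len(ch.encode("utf-16-le"))

# Illustrative samples: two BMP characters and one outside the BMP.
for ch in ["a", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: {utf16_len(ch)} byte(s)")
```

The first two samples take 2 bytes each, while the last one needs a surrogate pair and takes 4, so even for UTF-16 dividing by 2 does not reliably give a safe split point.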