PerlMonks  

Re: Seeking Perl docs about how UTF8 flag propagates

by demerphq (Chancellor)
on May 17, 2023 at 13:41 UTC ( [id://11152253] )


in reply to Seeking Perl docs about how UTF8 flag propagates

My understanding has always been that the substring of a string should share its utf8'ness, and that once turned on it stays on until explicitly turned off. Anything else I consider a bug.

Having said that, it is good form to treat it as an uncertain value, and when you need to care you should ensure the variable is in the form you want/need. utf8::upgrade() and utf8::downgrade() and the related functions in Encode are your friends here. Just remember that while utf8::upgrade() should always work, utf8::downgrade() may not be possible (the string may contain characters above 0xFF), and you may want to use utf8::encode() instead, depending on what you are doing.
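A minimal sketch of that difference, using made-up sample strings ($latin and $wide are just illustrative names). It relies on the optional second argument to utf8::downgrade(), which makes it return false instead of dying:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample data: one string that fits in 8 bits, one that does not.
my $latin = "caf\x{E9}";        # every code point is <= 0xFF
my $wide  = "\x{263A} smile";   # contains a code point above 0xFF

# upgrade changes only the internal storage, never the string's value.
my $up = $latin;
utf8::upgrade($up);
say utf8::is_utf8($up) ? "upgraded" : "downgraded";
say $up eq $latin ? "same string" : "different string";

# With a true second argument, downgrade returns false instead of dying.
my $down = $wide;
my $ok   = utf8::downgrade($down, 1);
say $ok ? "downgrade ok" : "downgrade not possible";

# encode always succeeds, but the result is a different string:
# the UTF-8 bytes of the original characters.
my $bytes = $wide;
utf8::encode($bytes);
say length($wide), " characters, ", length($bytes), " bytes";
```

Note that the last step is not a storage change at all: after utf8::encode() the variable holds bytes, so whether that is what you want depends on what you do with it next.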

---
$world=~s/war/peace/g

Re^2: Seeking Perl docs about how UTF8 flag propagates
by ikegami (Patriarch) on May 17, 2023 at 19:18 UTC
    utf8::is_utf8

    Indicates which internal storage format is used by a scalar.

    USE: Debugging XS modules.

    utf8::upgrade

    Changes a scalar to use the upgraded string format (if it's not already) without changing the string.

    my $s = ...;
    my $t = $s;
    utf8::upgrade( $t );
    say utf8::is_utf8( $t ) ?1:0;  # 1
    say $s eq $t ?1:0;             # 1

    USE: Working around instances of The Unicode Bug.

    utf8::downgrade

    Changes a scalar to use the downgraded string format (if it's not already) without changing the string. Dies if it can't.

    my $s = ...;
    my $t = $s;
    utf8::downgrade( $t );         # Might croak
    say utf8::is_utf8( $t ) ?1:0;  # 0
    say $s eq $t ?1:0;             # 1

    USE: Working around instances of The Unicode Bug.
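For anyone unsure what The Unicode Bug looks like in practice, here is a small sketch. It assumes the code is compiled without the unicode_strings feature, without a locale, and without the /u, /a or /l regex modifiers, so the pattern gets the default /d rules. The variable names are only illustrative.

```perl
use strict;
use warnings;
use feature 'say';   # note: this does NOT enable unicode_strings

# The same one-character string in both storage formats.
my $down = "\xE9";           # U+00E9, stored downgraded
my $up   = $down;
utf8::upgrade($up);          # same character, stored upgraded

# The two scalars compare equal...
my $same = $down eq $up ? 1 : 0;

# ...yet under the default /d rules the storage format leaks into \w.
my $down_is_word = $down =~ /\w/ ? 1 : 0;
my $up_is_word   = $up   =~ /\w/ ? 1 : 0;

say "equal: $same, downgraded matches \\w: $down_is_word, upgraded matches \\w: $up_is_word";
```

Calling utf8::upgrade() or utf8::downgrade() on the operand first forces one behaviour or the other, which is the workaround use mentioned above.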

    utf8::encode

    Encodes a string using utf8.

    Expects a string of arbitrary characters in either storage format.

    Produces a string of 8-bit characters in the downgraded format.

    USE: You should probably be encoding using the standard UTF-8 encoding instead of the Perl-specific utf8 encoding.
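A rough side-by-side sketch, with a made-up sample string. It uses Encode's encode() with the hyphenated name 'UTF-8', which selects the strict standard encoding, next to the in-place utf8::encode():

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw( encode );

my $text = "r\x{E9}sum\x{E9} \x{2713}";   # hypothetical sample text

# Standard UTF-8 via Encode; returns a new string of bytes.
my $strict = encode( 'UTF-8', $text );

# Perl's own utf8 variant; works in place, so encode a copy.
my $lax = $text;
utf8::encode( $lax );

say length($text), " characters";
say length($strict), " bytes via Encode";
say length($lax), " bytes via utf8::encode";
say $strict eq $lax ? "identical bytes" : "different bytes";
```

For ordinary text like this the two produce the same bytes. They are only expected to differ for input that standard UTF-8 does not allow, such as surrogates or code points beyond U+10FFFF.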

    utf8::decode

    Decodes a string encoded using utf8, in place. Returns false and leaves the string unchanged if it can't.

    Expects a string of 8-bit characters in either storage format.

    Produces a string of characters in the upgraded format.

    USE: utf8 is a Perl-specific encoding. Are you sure the text isn't encoded using the standard UTF-8 encoding?
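A small sketch of decoding both ways, with made-up byte strings. utf8::decode() reports failure through its return value, while Encode's decode() can be asked to die on malformed input by passing Encode::FB_CROAK (combined here with Encode::LEAVE_SRC so the source string is left alone):

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw( decode );

my $good = "\xE2\x9C\x93";   # the UTF-8 bytes for U+2713
my $bad  = "\xFF\xFE";       # not valid UTF-8

# utf8::decode works in place and returns true or false.
my $chars   = $good;
my $ok_good = utf8::decode($chars);
my $copy    = $bad;
my $ok_bad  = utf8::decode($copy);

say "good input decoded: ", ( $ok_good ? "yes" : "no" ), ", length ", length($chars);
say "bad input decoded: ",  ( $ok_bad  ? "yes" : "no" );

# Strict standard UTF-8: decode() dies on malformed input with FB_CROAK.
my $strict = eval {
    decode( 'UTF-8', $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
};
say defined $strict ? "strict decode succeeded" : "strict decode died";
```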

    Encode::is_utf8

    Indicates which internal storage format is used by a scalar.

    USE: You might as well use the equivalent built-in utf8::is_utf8.

    Encode::_utf8_on

    Mostly equivalent to the following:

    utf8::decode( $_ ) if !utf8::is_utf8( $_ );

    The difference is that it produces a corrupt scalar if the string isn't valid utf8.

    USE: Do not use as it introduces The Unicode Bug.

    Encode::_utf8_off

    Equivalent to the following:

    utf8::encode( $_ ) if utf8::is_utf8( $_ );

    USE: Do not use as it introduces The Unicode Bug.

Re^2: Seeking Perl docs about how UTF8 flag propagates
by raygun (Scribe) on May 17, 2023 at 21:15 UTC
    My understanding has always been that the substring of a string should share its utf8'ness, and that once turned on it stays on until explicitly turned off.

    It makes sense to me that once turned on in a particular string, it should stay on (e.g., if the string is modified via $str =~ s///). Functions that give you substrings (e.g., substr, split) create new strings for these, so there is no "staying on" to be done. The flag value would have to be intentionally propagated from one string to another.

    Even a basic assignment creates a new string, but one would hope one of the properties of an assignment operator is that it duplicates both a variable's data and its metadata. (Yet even this fairly straightforward fact is not documented in perlop.)
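For anyone who wants to check this on their own perl, here is a rough sketch with a made-up sample string. It deliberately makes no claim about what the flag will be on each derived string, since that is exactly the undocumented part; it just prints the flag and then normalizes it.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample string, forced into the upgraded format.
my $str = "abc,d\x{E9}f";
utf8::upgrade($str);

# New strings derived from it in three different ways.
my $copy  = $str;                  # plain assignment
my $sub   = substr($str, 0, 3);    # substring
my @parts = split /,/, $str;       # split

# Inspect the flag on each; the values are whatever your perl reports.
for my $piece ($str, $copy, $sub, @parts) {
    say utf8::is_utf8($piece) ? 1 : 0, " (length ", length($piece), ")";
}

# If the format matters for what comes next, set it rather than assume it.
utf8::upgrade($_) for $copy, $sub, @parts;
```

After the last line every derived string is in the upgraded format regardless of what the operators above did, and none of their values has changed.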

    Anything else I consider a bug.
    By that logic, the behavior hv points out in Re^7: Seeking Perl docs about how UTF8 flag propagates is a bug. But since no documentation supports your expectation, I'm not sure you could make a case for that.
    Having said that, it is good form to treat it as an uncertain value
    Yeah, that's what I'm doing now, in response to this thread. Thanks to everyone who's chimed in.