http://qs321.pair.com?node_id=11112591


in reply to Re^2: Lost in encodings
in thread Lost in encodings

And how can you know if the console is using utf-8? Could be Windows and CP850

I wouldn't claim I know. But since length 'Kü' is 3 while it displays as 'Kü', I just guessed that a multibyte encoding is in place. CP850 is a one-byte encoding and should behave differently: there the length would be 2.
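To make the byte counts concrete, here is a minimal sketch using the core Encode module. The variable names are only for illustration, and it assumes the script itself is saved as UTF-8 (hence `use utf8`).

```perl
use strict;
use warnings;
use utf8;                       # assumption: this source file is saved as UTF-8
use Encode qw(encode);

my $chars = "Kü";               # character string: 2 characters

my $utf8_bytes  = encode('UTF-8', $chars);   # ü becomes two bytes (C3 BC)
my $cp850_bytes = encode('cp850', $chars);   # ü becomes one byte  (81)

printf "characters:  %d\n", length($chars);
printf "UTF-8 bytes: %d\n", length($utf8_bytes);
printf "CP850 bytes: %d\n", length($cp850_bytes);
```

A length of 3 for undecoded input is therefore what UTF-8 would produce, while CP850 would give 2.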

As for Devel::Peek: Those commands will show me the hex codes in ASCII

Devel::Peek will also issue several lines of data which are totally useless unless you're debugging XS code or Perl itself. A decent print unpack 'H*',$data does the same with less fuss.
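As a rough illustration of the unpack approach, assuming $data holds raw bytes (the two sample strings below are made up for the example, not taken from the OP's program):

```perl
use strict;
use warnings;

# Hypothetical sample data: 'Kü' as it would arrive in two encodings.
my $utf8_bytes  = "K\xC3\xBC";   # UTF-8:  ü is C3 BC
my $cp850_bytes = "K\x81";       # CP850:  ü is 81

my $hex_utf8  = unpack 'H*', $utf8_bytes;
my $hex_cp850 = unpack 'H*', $cp850_bytes;

print "$hex_utf8\n";    # 4bc3bc
print "$hex_cp850\n";   # 4b81
```

The hex string shows which bytes are there, but it does not say whether Perl treats the scalar as bytes or as characters.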

Re^4: Lost in encodings
by Your Mother (Archbishop) on Feb 07, 2020 at 21:22 UTC

    For completeness–

    perl -Mutf8 -CSD -E 'say length "Kü"' # 2
      Yes but only if it's a character-string, i.e. the utf8 flag is set.

      But the OP said the flag is not set.

      edit

      not sure what -CSD means.

      update

      got it perlrun

      The -C flag controls some of the Perl Unicode features.

      As of 5.8.1, the -C can be followed either by a number or a list of option letters. The letters, their numeric values, and effects are as follows; listing the letters is equal to summing the numbers.

      I     1   STDIN is assumed to be in UTF-8
      O     2   STDOUT will be in UTF-8
      E     4   STDERR will be in UTF-8
      S     7   I + O + E
      i     8   UTF-8 is the default PerlIO layer for input streams
      o    16   UTF-8 is the default PerlIO layer for output streams
      D    24   i + o
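The "listing the letters is equal to summing the numbers" rule can be sketched like this. The helper name c_flags_to_number is made up for illustration, and the table only covers the letters quoted above; perlrun documents a few more.

```perl
use strict;
use warnings;

# Letter values as quoted from perlrun above (not the complete list).
my %c_flag = (I => 1, O => 2, E => 4, S => 7, i => 8, o => 16, D => 24);

# Hypothetical helper: turn a letter list such as 'SD' into its numeric form.
# The values are bit flags, so OR-ing them avoids double counting
# when letters overlap (for example 'SI').
sub c_flags_to_number {
    my ($letters) = @_;
    my $number = 0;
    for my $letter (split //, $letters) {
        die "unknown -C letter '$letter'\n" unless exists $c_flag{$letter};
        $number |= $c_flag{$letter};
    }
    return $number;
}

print c_flags_to_number('SD'), "\n";   # 31, i.e. 7 + 24
```

So -CSD and -C31 should ask for the same thing. At run time the active setting should be reflected in ${^UNICODE}.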

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

        Quite right. Those are the default I/O UTF-8 flags. tchrist included some of them in his recommendations on that monster SO post. I included the one-liner just as an example of what's correct. If length is giving bytes instead of characters, it's broken already, and that step, or one before it, is the problem. If the OP included an SSCCE, I'd be more helpful. Possibly… :P

Re^4: Lost in encodings
by LanX (Saint) on Feb 07, 2020 at 21:13 UTC
    It's verbose but will include the utf8 flag plus the dump showing the codepoints in hex. °

    Which is more helpful for us than the OP's copy and paste.

    > A decent print unpack 'H*',$data does the same with less fuss.

    True, but does unpack tell me "why" it went wrong? ;)

    Update

    > but displays as 'Kü',

    Provided code areas in the monastery are encoded in utf8. I vividly remember problems here. *


    *) the monastery is using windows-1252

    °)

    --- Testing: Täst:  "T\xE4st"
    SV = PVMG(0x29c3c98) at 0x29c1fa8
      REFCNT = 1
      FLAGS = (SMG,POK,pIOK,pNOK,pPOK,UTF8)
      IV = 0
      NV = 0
      PV = 0x24c9c68 "T\303\244st"\0 [UTF8 "T\x{e4}st"]
      CUR = 5
      LEN = 10
      MAGIC = 0x2ab1b38
        MG_VIRTUAL = &PL_vtbl_utf8
        MG_TYPE = PERL_MAGIC_utf8(w)
        MG_LEN = 4
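    A dump like the one above can be produced with a few lines. This is only a sketch: utf8::upgrade is used here to force the internal UTF-8 representation, which is an assumption about how the test string was built, and addresses, SV type and the MAGIC section will differ between runs and Perl versions.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

my $str = "T\xE4st";     # 4 characters; ä given as code point E4
utf8::upgrade($str);     # same characters, now stored internally as UTF-8

Dump($str);              # writes the dump to STDERR
```

    The lines to look at are FLAGS (is UTF8 listed?) and PV (the raw bytes, followed by the decoded view in brackets when the flag is set).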
    
Re^4: Lost in encodings
by LanX (Saint) on Feb 08, 2020 at 18:15 UTC
    Hi again Harald

    > A decent print unpack 'H*',$data does the same with less fuss.

    Actually, why should I bother to spot the non-ASCII between all the hex-codes? °

    Please compare

    DB<50> $data = 'Künzler'

    DB<51> print unpack 'H*',$data
    4b816e7a6c6572                      # ORLY?

    DB<52> use Data::Dump qw/pp dd/

    DB<53> dd $data
    "K\x81nzler"                        # <---

    DB<54> use Devel::Peek

    DB<55> Dump $data
    SV = PVNV(0xd9adb8) at 0x351ac30
      REFCNT = 1
      FLAGS = (POK,IsCOW,pIOK,pNOK,pPOK)
      IV = 0
      NV = 0
      PV = 0x355bcd8 "K\201nzler"\0     # <---
      CUR = 7
      LEN = 10
      COW_REFCNT = 2

    DB<56>

    Hint: this time not UTF8, did you notice easily?
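    For what it's worth, the byte 0x81 is what CP850 uses for ü. A minimal sketch of the decoding step, assuming the console really was CP850 (the dump alone doesn't prove that):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $data  = "K\x81nzler";             # the 7 bytes shown in the dump
my $chars = decode('cp850', $data);   # assumption: console input was CP850

printf "%d characters, second one is U+%04X\n",
    length($chars), ord(substr($chars, 1, 1));
```

    After decoding, the second character is U+00FC (ü), and the string is a proper character string.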

    Devel::Peek is core and shows several relevant pieces of information in one command.

    It has some minor disadvantages, but if the OP had shown us the output we'd have known immediately that his code is correct, except for the debugger settings.

    Telling people explicitly not to use it is pretty surprising ...


    °) yes I know that ASCII is below 0x80 and how to spot utf8 multi-bytes. But do others?

    And normally I use a water heater when I need tea and don't start to collect decent wood in the forest. ;-)