Read (sysread) binary data into utf8 string

by vr (Curate)
on Apr 03, 2017 at 16:53 UTC (#1186852=perlquestion)

vr has asked for the wisdom of the Perl Monks concerning the following question:

A "binary" file for us:

C:\>perl -e "print qq(\xB5)" > data.bin


use strict;
use warnings;
use feature 'say';
use Encode qw/ _utf8_off _utf8_on is_utf8 /;
use utf8;
use Devel::Peek;

my $s1 = ' ';       # a space (anything)
_utf8_on( $s1 );    # or assign not-ascii above, instead
my $s2 = $s1;

open my $fh, '<', 'data.bin';
binmode $fh;

sysread $fh, $s1, 1;
Dump $s1;

seek $fh, 0, 0;
$s2 = do { local $/; <$fh> };
Dump $s2;
SV = PVMG(0xc149ec) at 0xc20dec
  REFCNT = 1
  FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0xc15a1c "\302\265"\0 [UTF8 "\x{b5}"]
  CUR = 2
  LEN = 10
  MAGIC = 0xc13ffc
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = -1
SV = PV(0x3f9f6c) at 0xc20f0c
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0xc2e6a4 "\265"\0
  CUR = 1
  LEN = 10

Not sure if it's a bug or not.

From the sysread documentation: "Note that if the filehandle has been marked as :utf8, Unicode characters are read instead of bytes (the LENGTH, OFFSET, and the return value of sysread are in Unicode characters)."

Does this imply that, if the filehandle has not been marked, OFFSET is treated as bytes? If so, could the utf8 string become invalid?

I think that if OFFSET is 0, the string's utf8-ness should match the file's IO encoding layer, i.e. the read should produce the same result as the slurp above, regardless of the content of the original scalar. And if OFFSET is not zero, then what? Perhaps it should be documented more clearly, including which combinations should never be used.

BTW, it looks like it's about this bug: Tk passes the file name as utf8, and this parameter is (rather recklessly) re-used (!) to receive the file content.

Replies are listed 'Best First'.
Re: Read (sysread) binary data into utf8 string
by shmem (Chancellor) on Apr 03, 2017 at 23:38 UTC


    shmem [qurx] ~ > perl -CO
    use Encode qw/ _utf8_on /;
    use Devel::Peek;
    $s = " ";
    _utf8_on( $s );
    substr $s, 0, 1, "\x{b5}";
    Dump $s;
    print length $s, $/;
    print "\$s: '$s'\n";
    __END__
    SV = PVMG(0xf64ff0) at 0xedd8c8
      REFCNT = 1
      FLAGS = (SMG,POK,pPOK,UTF8)
      IV = 0
      NV = 0
      PV = 0xee2620 "\302\265"\0 [UTF8 "\x{b5}"]
      CUR = 2
      LEN = 10
      MAGIC = 0xf66010
        MG_VIRTUAL = &PL_vtbl_utf8
        MG_TYPE = PERL_MAGIC_utf8(w)
        MG_LEN = -1
    1
    $s: 'µ'

    So this has nothing to do with IO layers; sysread apparently does substr, which just does the right thing. Magic.

    "\x{b5}" is an ISO-8859-1 (a.k.a. latin1) character and a valid Unicode code point, whose UTF-8 encoding is 0xC2 0xB5 (0302 0265 in octal): MICRO SIGN (µ).

    print chr hex 'b5' eq "\x{00b5}";
    __END__
    1

    Your file handle was set to raw, so bytes are read. Since perl places the byte into a string that is already flagged as UTF-8, it converts the byte from its single-byte representation into UTF-8 and happily stores the resulting 2-byte encoding in the PV slot, to satisfy the utf8-ness.

    Similar to what happens here (reversed):

    use Encode qw( from_to _utf8_on );
    use Devel::Peek;
    $s = "\x{b5}";
    Dump $s;
    from_to($s, 'latin1', 'utf8');
    _utf8_on $s;
    Dump $s;
    __END__
    SV = PV(0x234fa90) at 0x2375870
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x2372400 "\265"\0
      CUR = 1
      LEN = 10
      COW_REFCNT = 2
    SV = PV(0x234fa90) at 0x2375870
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x249b2c0 "\302\265"\0 [UTF8 "\x{b5}"]
      CUR = 2
      LEN = 10

    Except that the 'magic' bits are missing (since no magic is involved).

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Read (sysread) binary data into utf8 string
by ikegami (Pope) on Apr 04, 2017 at 04:39 UTC

    As a side note, if you want to switch the storage format of a scalar, it is best to use the builtins utf8::upgrade($s); (switch to UTF8=1 format) or utf8::downgrade($s); (switch to UTF8=0 format) rather than _utf8_on and _utf8_off (which require extra work to avoid creating bad scalars).

    utf8::upgrade( my $s1 = ' ' );
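
To illustrate the point about storage formats, here is a minimal sketch (the variable names are made up for this example). It assumes only the core utf8:: functions, and shows that upgrading or downgrading changes how a string is stored, not which characters it contains:

```perl
use strict;
use warnings;

my $bytes = "\xB5";          # one character, stored as one byte (UTF8=0)
my $wide  = $bytes;
utf8::upgrade($wide);        # same character, now stored as two bytes (UTF8=1)

# The two scalars still compare equal and have the same length in characters.
print "equal\n"       if $bytes eq $wide;
print "same length\n" if length($bytes) == length($wide);
print utf8::is_utf8($wide) ? "upgraded\n" : "not upgraded\n";

utf8::downgrade($wide);      # back to the one-byte-per-character format
print utf8::is_utf8($wide) ? "still upgraded\n" : "downgraded\n";
```

By contrast, _utf8_on only flips the flag without re-encoding the buffer, which is why it can produce a scalar whose content no longer means the same thing.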
Re: Read (sysread) binary data into utf8 string
by ikegami (Pope) on Apr 04, 2017 at 04:45 UTC

    [ In the following, a character refers to a string element. For example, string $str has length($str) characters, which can be obtained using substr($str, $index, 1). Whether that character represents a byte, a Unicode Code Point, or something else is of no consequence. ]

    • OFFSET is (unconditionally) a number of characters. Usually, one passes the length of SCALAR.

      sysread($fh, $buf, BLOCK_SIZE, length($buf))
    • LENGTH is (unconditionally) the maximum number of characters to add to SCALAR. For a non-binary handle, this may result in more bytes read than LENGTH.

    • The return value is (unconditionally) the number of characters added to SCALAR. For a non-binary handle, this may be less than the number of bytes read.
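
The three points above can be tried out with a small sketch like the following. The scratch file name and variable names are assumptions made for this illustration only; the handle is raw and the buffer is deliberately stored in the UTF8=1 format, which is the combination the original question is about:

```perl
use strict;
use warnings;

# Hypothetical scratch file, created only for this demonstration.
my $file = 'sysread_demo.bin';
open my $out, '>:raw', $file or die "open for write: $!";
print {$out} "\xB5\xB6";
close $out;

# A buffer that already holds one non-ASCII character in the UTF8=1 format
# (one character, two bytes internally).
my $buf = "\xE9";
utf8::upgrade($buf);

open my $in, '<:raw', $file or die "open for read: $!";
# OFFSET is given in characters (length $buf is 1 here), not in internal bytes.
my $got = sysread $in, $buf, 2, length $buf;
defined $got or die "sysread: $!";
close $in;
unlink $file;

printf "returned %d, length is now %d, content %s the expected characters\n",
    $got, length $buf, ($buf eq "\xE9\xB5\xB6" ? 'matches' : 'does not match');
```

If the description above holds, the existing character is kept intact and the bytes read from the file are appended as characters with the same ordinals, whatever the internal storage of the buffer.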

Re: Read (sysread) binary data into utf8 string
by vr (Curate) on Apr 04, 2017 at 10:21 UTC

    Maybe there's a problem in Compress::Zlib, or at least an inconsistency:

    use strict;
    use warnings;
    use feature 'say';
    use utf8;
    use Compress::Zlib;

    my ( $a, $b, $c, $d, $i );
    $a = $b = $c = $d = compress( 'foo' );
    utf8::upgrade( $c );
    utf8::upgrade( $d );

    say 'ok' if $a eq $d;
    say uncompress( $a );
    say uncompress( $c );

    $, = ' ';
    $i = Compress::Zlib::inflateInit();
    say $i-> inflate( $b );
    $i = Compress::Zlib::inflateInit();
    say $i-> inflate( $d );
    ok
    foo
    foo
    foo stream end
    data error

    P.S. I mean that this is the source of the bug linked to in the OP. It is not related to read behavior.
