PerlMonks  

Re^6: Standard handles inherited from a utf-8 enabled shell

by BrowserUk (Patriarch)
on Mar 21, 2012 at 19:24 UTC


in reply to Re^5: Standard handles inherited from a utf-8 enabled shell
in thread Standard handles inherited from a utf-8 enabled shell

I won't believe this until I've seen it, reproduced as a minimal example

As I said: agreed. But can you think of anything else that might fit with the symptoms described and the apparent solution?

I couldn't, and all my attempts to re-create the situation also failed:

perl -CO -e" system q[ \perl64\bin\perl.exe -e\" print pack 'B8', '11111111'; \" | od -t x1 ]"
0000000 ff
0000001

perl -CO -e" system q[\perl64\bin\perl.exe -CO -e\"print pack 'B8', '11111111'; \" | od -t x1 ]"
0000000 c3 bf
0000002

I would have expected the first of those to produce the same od output as the second, had the second instance of perl inherited the stdout characteristics of its parent.

But I'm on Windows, and disproving the possibility here doesn't disprove it for other platforms; hence my asking the question.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^7: Standard handles inherited from a utf-8 enabled shell
by repellent (Priest) on Mar 22, 2012 at 04:41 UTC
    That's not how I see it. I see the system-ed perl as an autonomous process (unknowing of its parent process) with its STDOUT filehandle set with different encodings.

    In both cases, we're printing out a string with one character at codepoint U+00FF.

    The second system-ed perl has its output encoding set to UTF-8 (via -CO). What octets do we send out into the cruel world for U+00FF character encoded in UTF-8? Ans: c3 bf.

    The first system-ed perl has its output "set" to byte/Latin-1 encoding (the default). What octets do we send out into the cruel world for U+00FF character encoded in Latin-1? Ans: ff.
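The two answers above can be checked without any child process. What follows is only a minimal sketch, assuming the core Encode module is available; the helper name hex_octets is made up for illustration. It encodes the same one-character string both ways and dumps the resulting octets, much as od -t x1 does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One-character string whose single character has ordinal 255 (0xFF).
my $str = pack 'B8', '11111111';

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', unpack '(H2)*', $octets;
}

# Same input string, two different output encodings.
my $latin1 = hex_octets( encode('ISO-8859-1', $str) );
my $utf8   = hex_octets( encode('UTF-8',      $str) );

print "Latin-1: $latin1\n";
print "UTF-8:   $utf8\n";
```

The input never changes; only the encoding chosen at output time differs, which is what the -CO switch on the child selects.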

    The first case did not print c3 bf, despite the parent's perl -CO, because the system-ed perl's print did not go through the parent's PerlIO layers.
      I see the system-ed perl as an autonomous process (unknowing of its parent process)

      As you are probably aware, system is equivalent to fork followed by exec.

      You are also probably aware that fork preserves open file descriptors. This is why, to create a daemon, it is necessary to fork twice: you fork once, close the standard handles in the child, and then fork a second time. Only then does the second child become dissociated from the terminal and a true daemon.

      What you may not be familiar with is that the various forms of exec are front ends for execve, and that execve() also preserves open file descriptors (except those marked close-on-exec).

      To quote the above man page: "By default, file descriptors remain open across an execve()."

      You can prove this to yourself. Run this one-liner (suitably adjusted):

      perl -e"system qq[ $^X -e\"\$n=123; print \$n\" ];"
      123

      And you'll see the output 123

      Now try this modified version:

      C:\test>perl -e"close STDOUT; system qq[ $^X -e\"\$n=123; print \$n\" ];"

      Where did the output disappear to?

      So bang goes the autonomous process theory.
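The one-liners above depend on Windows shell quoting. As a rough, more portable sketch of the same point (an assumption-laden variant, not the original test): instead of closing STDOUT, the parent points it at a temporary file, and the child, which is told nothing about that file, simply prints. The variable names and the use of File::Temp are choices made for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file that will stand in for "wherever the parent's STDOUT goes".
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close $tmp_fh;

# Save the real STDOUT, then redirect the parent's STDOUT to the temp file.
open my $saved, '>&', \*STDOUT or die "Cannot dup STDOUT: $!";
open STDOUT, '>', $tmp_name    or die "Cannot redirect STDOUT: $!";

# The child is given no file name; it just prints to its own STDOUT.
system $^X, '-e', 'print 123';

# Put the parent's original STDOUT back.
close STDOUT;
open STDOUT, '>&', $saved or die "Cannot restore STDOUT: $!";

# Read back whatever ended up in the temp file.
open my $in, '<', $tmp_name or die "Cannot read $tmp_name: $!";
my $captured = do { local $/; <$in> };
close $in;
$captured = '' unless defined $captured;

print "child wrote: '$captured'\n";
```

If the child's output shows up in the file, the child must have received the parent's redirected standard output rather than a fresh one of its own.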

      In both cases, we're printing out a string with one character at codepoint U+00FF.

      No. The return value from pack 'B8', ... is not a character, nor a codepoint, and has absolutely nothing to do with Unicode.

      It is a byte! An 8-bit unsigned number bit pattern stored in an 8-bit unit of memory and nothing else.

      No interpretation of the meaning (nor even signedness) is placed (nor could be) upon that value until you do something with it!
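A small sketch of that last point, using only core pack/unpack and utf8::is_utf8: the same eight bits come back as different numbers depending on which interpretation is asked for. The variable names are illustrative only.

```perl
use strict;
use warnings;

# The value under discussion: eight set bits.
my $byte = pack 'B8', '11111111';

my $length   = length $byte;                        # string length as perl reports it
my $bits     = unpack 'B8', $byte;                  # the raw bit pattern
my $unsigned = unpack 'C',  $byte;                  # read as an unsigned 8-bit value
my $signed   = unpack 'c',  $byte;                  # read as a signed 8-bit value
my $flagged  = utf8::is_utf8($byte) ? 'yes' : 'no'; # perl's internal UTF-8 flag

print "length:    $length\n";
print "bits:      $bits\n";
print "unsigned:  $unsigned\n";
print "signed:    $signed\n";
print "utf8 flag: $flagged\n";
```

Nothing in the stored value says which of the two numeric readings is the right one; that choice is made by whatever consumes it.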

      The second system-ed perl has its output set ...

      You're right that the interpretation applied to the 8-bit value is not preserved across the fork/exec pair, but not because of your reasoning.

      The important part is that the OS cannot preserve what it has no knowledge of. There is no concept of encoding attached to the file descriptors.
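One way to see that the encoding belongs to the Perl-level handle and not to the descriptor is the sketch below. It assumes a temporary file and two Perl handles onto the same open file, one raw and one with an :encoding(UTF-8) layer; the handle names are invented for the example.

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp qw(tempfile);

# First handle: raw bytes, no encoding layer.
my ($raw, $name) = tempfile(UNLINK => 1);
binmode $raw;

# Second handle: a dup of the first, so both refer to the same open file,
# but this one gets a UTF-8 encoding layer on the Perl side.
open my $enc, '>&', $raw or die "Cannot dup handle: $!";
binmode $enc, ':encoding(UTF-8)';

# Flush after every print so the writes reach the file in program order.
$_->autoflush(1) for $raw, $enc;

my $byte = pack 'B8', '11111111';
print {$raw} $byte;    # written through the raw handle
print {$enc} $byte;    # written through the UTF-8 handle

close $enc;
close $raw;

# Dump what actually landed in the file.
open my $in, '<:raw', $name or die "Cannot read $name: $!";
my $octets = do { local $/; <$in> };
close $in;

my $hex = join ' ', unpack '(H2)*', $octets;
print "$hex\n";
```

The same value goes to the same file twice and comes out as different octets, so the difference can only live in the layers perl attached to each handle.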

      It is also likely, though I haven't confirmed this, that Perl reopens the standard handles when it starts.

      The bottom line -- for this thread, rather than this subthread -- is that the OP must have omitted some details from his scenario.



        You mentioned (emphasis mine):

          No interpretation of the meaning (nor even signedness) is placed (nor could be) upon that number until you do something with it!

        I agree with this, but I believe we have different assumptions on what is meant by interpretation. Look, I need a way to refer to that number, because that is fundamental. I call that number a "character". The value of that number is what I call the "codepoint value". Bear with me: forget "Unicode" for now, and grant me the use of those words. At any time, you may s/character|codepoint/_that_number_/gi.

        Before that sentence, you mentioned:

          It is a byte! An 8-bit bit pattern stored in an 8-bit unit of memory and nothing else.

        Well, that number is 255 == ord(pack 'B8', '11111111'). Saying it's a (single) byte means you've established the number of bits for it is 8. That, to me, is giving the number an interpretation(*). This observation is very important when it comes to the subject of encoding, especially when we're to print that character (i.e. that number).

        If you want to print a string, you should avoid any preconceived notion of how many bits the string "has" prior to deciding which encoding to use. I find thinking in terms of characters (i.e. those numbers) and what their codepoint values (i.e. the number values) are, helps tremendously in my handling of strings up to the point where they are encoded using print. That is my thought process, and the message I was trying to deliver.

        (*) I am aware of the details of how perl stores that number in memory, but am not as well versed in them as you are. I would like to reiterate that this discussion is about print and encoding, and that the ordinal of the character is what matters here.
          The important part is that the OS cannot preserve what it has no knowledge of.

        Agreed.

          There is no concept of encoding attached to the file descriptors.

        And that's the thing: the concept of encoding alone does not make sense without the concept of characters (what we're encoding). And those characters can only exist within the process (e.g. numbers in Perl's "string"). Our computer "systems" (e.g. web browser, text editor, terminal, program, etc.) do this decode-incoming-octets-then-output-octets-already-encoded dance between each other to hand off characters.

        When Perl warns you about "Wide character in print", what it's really saying is: Please be explicit about the encoding so that I can tell the next "system" about my characters accurately, using only octets.
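As a hedged illustration of that warning: the sketch below prints a character with an ordinal above 255 to an in-memory handle, first with no encoding declared and then with the octets produced explicitly by Encode. The in-memory handles and the __WARN__ hook are only there to keep the example self-contained.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character whose ordinal (0x263A) cannot fit in a single octet.
my $smiley = "\x{263A}";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# 1. No encoding declared on the handle: perl has to pick octets on its own.
open my $plain, '>', \my $guessed or die "open failed: $!";
print {$plain} $smiley;
close $plain;

# 2. Encoding made explicit before printing: only octets reach the handle.
open my $explicit, '>', \my $declared or die "open failed: $!";
print {$explicit} encode('UTF-8', $smiley);
close $explicit;

my $declared_hex = join ' ', unpack '(H2)*', $declared;

printf "warnings seen: %d\n", scalar @warnings;
print  "first warning: $warnings[0]" if @warnings;
print  "explicit octets: $declared_hex\n";
```

Only the first print should trigger the warning; in the second, the choice of encoding has already been made by the time print sees the data.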

          The bottom line -- for this thread, rather than this subthread -- is that the OP must have omitted some details from his scenario.

        Agreed.
