
Re^4: How to sanely handle unicode in perl?

by Sec (Monk)
on Mar 20, 2015 at 16:56 UTC (#1120770)

in reply to Re^3: How to sanely handle unicode in perl?
in thread How to sanely handle unicode in perl?

I do not assume Unicode. I just want to handle data correctly. Perl is apparently unable to output data in the way its environment requires.

The frustrating part is that perl looks like it is equipped to work. It is _able_ to do output conversion on the fly. It is just not able to do it correctly without user intervention.

Re^5: How to sanely handle unicode in perl?
by Your Mother (Archbishop) on Mar 20, 2015 at 19:10 UTC

    \xc3\xb6 is not the right byte(s) for an ö from a Latin-1 terminal, it is the UTF-8 encoding. Meaning it can only be issued by a UTF-8 encoded source (and still mean ö). So what you are asking to do sanely strikes me as… strange. If it is coming from a Latin-1 encoding source it would be \xf6. To do encoding properly you have to know what you are receiving, decode it with that, and know what your output layer is, encode it to that. It's not easy but it's not magical either. Without the right steps at the right layers it's literally guesswork and impossible to do robustly.
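
The decode-then-encode steps described above can be sketched roughly as follows. This is only an illustration, not the code from the thread: it assumes the input really is UTF-8 and that the output side wants Latin-1 (ISO-8859-1), and the variable names are made up for the example. It uses the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: these are the raw bytes received from a UTF-8 source.
# "ö" (U+00F6) is the two bytes \xc3\xb6 in UTF-8.
my $utf8_bytes = "\xc3\xb6";

# Step 1: decode using the encoding of the *input* to get Perl characters.
my $chars = decode('UTF-8', $utf8_bytes);

# Step 2: encode to whatever the *output* layer expects.
# Assumption: a Latin-1 terminal, where "ö" is the single byte \xf6.
my $latin1_bytes = encode('ISO-8859-1', $chars);

printf "characters: %d, code point: U+%04X\n", length($chars), ord($chars);
printf "latin-1 bytes: %s\n",
    join ' ', map { sprintf '\\x%02x', ord } split //, $latin1_bytes;
# prints:
#   characters: 1, code point: U+00F6
#   latin-1 bytes: \xf6
```

If either assumption is wrong (the input is not UTF-8, or the terminal is not Latin-1), the same two steps apply with different encoding names; the point is that both ends have to be known rather than guessed.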

      Please check the source. I explicitly state that the pipe that produces \xc3\xb6 is utf-8. So what you wrote does not apply to my code.

      In fact choroba found out that it works as intended if I prepend ":raw" to the encoding. (Which is unintuitive to me, but kind of makes sense in retrospect)

        Maybe you misunderstand my point. If you run that code in a Latin-1 terminal you are sending UTF-8 and expecting it to act properly. It makes no sense and can't work without goofy and unrealistic hoops.

Re^5: How to sanely handle unicode in perl?
by soonix (Canon) on Mar 21, 2015 at 22:34 UTC
    I do not assume unicode.
    I think you misparsed that sentence:
    “Code that assumes Unicode gives a fig about POSIX locales is broken.”
    This is not
    (Code that assumes Unicode) gives a fig about POSIX locales is broken.
    but rather
    Code that assumes (Unicode gives a fig about POSIX locales) is broken.
    Update: perhaps I should point out that we seem to share the same native language
