PerlMonks  

You are confused. The only requirement for regexes or any other string operation is that the strings are correctly flagged as utf-8 (internal multi-byte format) or not (internal 1-byte format).

My guess is that your $internal_format_string isn't flagged as utf-8. You can do print utf8::is_utf8($internal_format_string) to check for the utf-8 flag. Note: this says NOTHING about the actual encoding since there is no way to reliably determine the encoding aside from reading this flag. 2)
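To make the flag check concrete, here is a minimal sketch. The variable names are made up for illustration; the only calls used are utf8::is_utf8 and length, both core perl. It shows the same character once as raw UTF-8 octets (unflagged) and once as a character string (flagged).

```perl
use strict;
use warnings;

# The same smiley (U+263A) in two different forms.
my $octets = "\xE2\x98\xBA";   # three raw bytes: the UTF-8 encoding of U+263A
my $chars  = "\x{263A}";       # one character; perl stores it as internal utf8

# utf8::is_utf8 only reports the internal flag, nothing about the real encoding.
my $octets_flagged = utf8::is_utf8($octets) ? 1 : 0;   # 0
my $chars_flagged  = utf8::is_utf8($chars)  ? 1 : 0;   # 1

printf "octets: flagged=%d length=%d\n", $octets_flagged, length($octets);
printf "chars:  flagged=%d length=%d\n", $chars_flagged,  length($chars);
```

The unflagged string has length 3 because perl sees three separate 1-byte characters, which is why a regex written for the single character will not match it.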

Also, the way to convert a string that's in utf-8 encoding but not flagged is to use Encode::decode("utf8", $octets), NOT encode_utf8, since that does exactly the opposite of what you think: it turns a character string into utf-8 octets.
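A small sketch of the difference, assuming the input really is UTF-8 octets (the variable names are illustrative only). decode turns octets into characters; calling encode_utf8 on octets that are already UTF-8 encodes them a second time.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

my $octets = "\xE2\x98\xBA";             # UTF-8 bytes for U+263A, not flagged

# Right direction: octets in, characters out.
my $text = decode("utf8", $octets);      # one character, flagged

# encode_utf8 goes the other way: characters in, octets out.
my $round_trip = encode_utf8($text);     # the original three bytes again

# The classic mistake: encoding something that is already encoded.
# Each of the three bytes is treated as a Latin-1 character and encoded again.
my $double = encode_utf8($octets);       # six bytes of mojibake

printf "decoded length:        %d\n", length($text);
printf "round-trip length:     %d\n", length($round_trip);
printf "double-encoded length: %d\n", length($double);
```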

Also also, if you're reading or writing unicode/utf8 data from a handle, you must* set the ":utf8" layer first, using binmode or open. This includes STDIN, STDOUT and STDERR - unless you're using the -C perlrun flag.

If you make sure to set the IO layers correctly, you shouldn't have to worry about anything else, though it might still help to upgrade to the latest perl (5.8.8, currently).
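Here is a rough sketch of setting the layers, using an in-memory filehandle so it runs without touching the disk. That in-memory handle is only a stand-in for a real file; the same ':utf8' layer argument works with an ordinary filename in open, or with binmode on an already open handle.

```perl
use strict;
use warnings;

binmode(STDOUT, ':utf8');                # STDOUT needs the layer too

my $text = "\x{263A}\n";                 # one character plus a newline

# Writing: the layer converts characters to UTF-8 bytes on the way out.
my $buffer = '';
open(my $out, '>:utf8', \$buffer) or die "cannot open for writing: $!";
print {$out} $text;
close($out) or die "cannot close: $!";

my $bytes_written = length($buffer);     # three bytes for the smiley, one for "\n"

# Reading: the layer converts the bytes back to a flagged character string.
open(my $in, '<:utf8', \$buffer) or die "cannot open for reading: $!";
my $line = <$in>;
close($in);
chomp $line;

printf "bytes in buffer: %d, characters read: %d\n", $bytes_written, length($line);
print "$line\n";
```

On newer perls the ':encoding(UTF-8)' layer does the same job and additionally validates the input, so it is worth considering for data you don't control.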

* update: well, ok, not MUST, but it does make life a whole lot easier.

2) update 2: in other words, perl relies on you, the programmer, to correctly identify the encoding of any incoming or outgoing string (via IO layers, i.e. binmode() or open() arguments) and literal strings (using the utf8 pragma to signal utf8-encoded scripts).

If you correctly specify those encodings, perl will internally convert those strings to either "utf8" (which is more or less identical to the UTF-8 Unicode encoding, at least on non-EBCDIC systems) or whatever default 1-byte encoding your system uses, and it will set a flag for each string to signal which of those two encodings is used.

The intention is that the programmer should normally not have to care at all which of the two encodings is used. All relevant string operations check that 1-bit flag to see how the string(s) in question should be interpreted and return a correctly flagged result in one of the two encodings.
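To illustrate that the internal form shouldn't matter, here is a small sketch (names are illustrative). It uses utf8::upgrade, which changes only the internal representation of a string, and then checks that comparison, length, a regex match and concatenation behave the same either way.

```perl
use strict;
use warnings;

my $plain    = "caf\xE9";                # stored in the 1-byte format, unflagged
my $upgraded = $plain;
utf8::upgrade($upgraded);                # same characters, now stored as utf8, flagged

# Same characters, so the strings compare equal and have the same length.
my $same_text   = ($plain eq $upgraded) ? 1 : 0;
my $same_length = (length($plain) == length($upgraded)) ? 1 : 0;

# A regex on the character matches both internal forms.
my $plain_matches    = ($plain    =~ /^caf\xE9$/) ? 1 : 0;
my $upgraded_matches = ($upgraded =~ /^caf\xE9$/) ? 1 : 0;

# Mixing the two forms: perl upgrades as needed and flags the result.
my $joined = $plain . "\x{263A}";

printf "equal=%d same_length=%d regex=%d/%d joined_length=%d joined_flagged=%d\n",
    $same_text, $same_length, $plain_matches, $upgraded_matches,
    length($joined), (utf8::is_utf8($joined) ? 1 : 0);
```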

Then, whenever the string is sent out to an IO handle, it gets converted to the requested output encoding (see the binmode/open remark above).

Now, the unicode support in perl is relatively new, so there are probably still bugs in it, but most bugs I've seen in real-world programs were due to misunderstanding of the above: directly and wrongly messing with the utf8 flag of strings (see Encode's _utf8_on and _utf8_off if you're curious), using the bytes pragma, or using old modules that try to handle utf-8 encoded text without setting the utf-8 flag.

Oh, and there are still a few unicode-related bugs in DBD::mysql, but it's getting better :-)


In reply to Re: The unicode / utf8 struggle, part 2: regexes by Joost
in thread The unicode / utf8 struggle, part 2: regexes by isync
