Locale Responsibilities

aecooper has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm afraid this is more a design question than anything else...

I have written a Perl Gtk+ application that allows a user to browse a Monotone database (Monotone is an SCM application). It does this by using a subprocess, mtn, that can receive commands on stdin and write back data on stdout.

I have written an OO module, Monotone::AutomateStdio (MAS for short), that provides a nice OO interface to all of Monotone's command line functionality. IPC::Open3 is used to start and access the subprocess, and sysread is used to read from it.

Using MAS, I have built up a GUI application that allows the user to browse the database.

The database can contain UTF-8 data, not only in the files it holds but also in any change log comments.

Ok, my issue was that the UTF8 data was not displaying correctly under Gtk+. I tried using Encode.pm's decode_utf8() routine and this fixed it.

Some questions if I may:

1) Am I right in assuming that, for data read from the subprocess to be stamped as UTF-8, I would need to tell perl via a PerlIO layer, and that otherwise the data is read in as binary?

2) Design question. I am thinking that the responsibility for the UTF8 decoding should be in the application rather than the library as one can always fetch binary data from the database (e.g. a jpeg file under cm control), and the app does determine the type of file fetched in order to handle it correctly (i.e. not dumping out binary data to the screen etc). If this raw data was output to the screen then the terminal takes care of it and displays it correctly (I have tested this). Or would people expect the library to put the data into the correct form and stamp it as UTF8?

3) What does decode_utf8 do beyond checking for UTF-8 compliance and setting the utf8 flag? Does it pack 4 octets per 32 bits for binary and one character per 32 bits for utf8 data?

4) If the data read in is in binary format, then why did I have to use `use bytes` when searching it with an re (including when searching binary data)? At the time this made sense, but now that I'm having to convert the data to UTF-8 I'm wondering: if it isn't already in utf8 then surely it's binary, so why the need for `use bytes`?

As you can tell I am a bit befuddled by all of this. Can you help?

Many thanks in advance,

Tony.

Re: Locale Responsibilities by ikegami (Patriarch) on May 17, 2009 at 01:25 UTC
> What does decode_utf8 do beyond checking for UTF-8 compliance and setting the utf8 flag? Does it pack 4 octets per 32 bits for binary and one character per 32 bits for utf8 data?

`decode_utf8` converts bytes `"\xC3\xA9"` into character `"\xE9"`.

Internally, the string returned is the utf8 representation (a Perl-specific superset of UTF-8) of the character with the `UTF8` flag on. For example, character `"\xE9"` is stored as the two bytes `"\xC3\xA9", UTF8=1`.

> If the data read in is in binary format, then why did I have to use `use bytes` when searching it with an re (including when searching binary data)?

You don't.

> At the time this made sense, but now that I'm having to convert the data to UTF-8 I'm wondering: if it isn't already in utf8 then surely it's binary, so why the need for `use bytes`?

You're unclear as to whether you're talking about the internal or external encoding. Perhaps "Re: Decoding, Encoding string, how to? (internal encoding)" would help.
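To make the bytes-versus-characters distinction above concrete, here is a minimal sketch using only `Encode`'s `decode_utf8` and `encode_utf8`. The variable names are illustrative, and the octets are hard-coded rather than read from a real `mtn` subprocess.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Two octets as they might arrive from sysread: the UTF-8 encoding of U+00E9.
my $bytes = "\xC3\xA9";

# Decoding turns the two octets into a single character.
my $chars = decode_utf8($bytes);

printf "bytes: length=%d\n", length $bytes;                        # length=2
printf "chars: length=%d ord=U+%04X\n", length $chars, ord $chars; # length=1 ord=U+00E9

# Encoding goes the other way and gives back the original octets.
my $round_trip = encode_utf8($chars);
print "round trip ok\n" if $round_trip eq $bytes;
```

The point of the sketch is that `length` and `ord` see one character after decoding, regardless of how perl stores the string internally.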
Re^2: Locale Responsibilities by aecooper (Acolyte) on May 24, 2009 at 15:41 UTC
If I don't use the `use bytes` pragma I get:

    Malformed UTF-8 character (unexpected non-continuation byte 0xf4, 1 byte after start byte 0xf1, expected 4 bytes) in pattern match (m//) at /home/aecoope/code/monotone.ca/mtn-browse/lib/perl/FindFiles.pm line 601.

when searching binary data with an re.

Tony.
Re^3: Locale Responsibilities by ikegami (Patriarch) on May 25, 2009 at 17:15 UTC
No, your incorrect use of `_utf8_on` (or an equivalent, such as the `:utf8` PerlIO layer) is causing that. `use bytes` kinda fixes your earlier bug.

    $ perl -MEncode=_utf8_on -e'$s = "\xF1\xF4"; _utf8_on($s); "" =~ /$s/'
    Malformed UTF-8 character (unexpected non-continuation byte 0xf4, immediately after start byte 0xf1) in regexp compilation at -e line 1.
    Malformed UTF-8 character (1 byte, need 4, after start byte 0xf4) in regexp compilation at -e line 1.
    Malformed UTF-8 character (unexpected non-continuation byte 0xf4, immediately after start byte 0xf1) in regexp compilation at -e line 1.
    Malformed UTF-8 character (1 byte, need 4, after start byte 0xf4) in regexp compilation at -e line 1.

    $ perl -MEncode=_utf8_on -e'$s = "\xF1\xF4"; "" =~ /$s/'
    $
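One possible way to avoid flagging arbitrary octets as UTF-8 is to decode with validation and fall back to treating the data as bytes when decoding fails. The sketch below is only an illustration under that assumption: `try_decode_utf8` is a hypothetical helper, not part of Monotone::AutomateStdio, and it relies on `Encode::decode` with the `FB_CROAK` check, which dies on malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return the decoded text, or undef if the octets are
# not well-formed UTF-8 (for example, a JPEG fetched from the database).
sub try_decode_utf8 {
    my ($octets) = @_;
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $text;    # undef when decode() died
}

my $binary = "\xF1\xF4";       # the bytes from the error message above
my $log    = "caf\xC3\xA9";    # UTF-8 octets for a four-character word

my $decoded_binary = try_decode_utf8($binary);
my $decoded_log    = try_decode_utf8($log);

print defined $decoded_binary ? "binary decoded\n" : "binary left as bytes\n";
printf "log text has %d characters\n", length $decoded_log;    # 4

# Undecoded octets can be searched directly; no `use bytes` is needed.
my $found = "\x00\xF1\xF4\x00" =~ /\Q$binary\E/ ? 1 : 0;
print "pattern found in binary data\n" if $found;
```

With this approach the binary string never gets the `UTF8` flag, so the regex engine has no malformed sequence to complain about.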

