PerlMonks  

Locale Responsibilities

by aecooper (Acolyte)
on May 16, 2009 at 17:58 UTC ( [id://764436]=perlquestion )

aecooper has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm afraid this is more a design question than anything else...

I have written a Perl Gtk+ application that lets a user browse a Monotone database (Monotone is an SCM application). It does this by running a subprocess, mtn, that receives commands on stdin and writes data back on stdout.

I have written an OO module, Monotone::AutomateStdio (MAS for short), that provides a nice OO interface to all of Monotone's command line functionality. IPC::Open3 is used to start and access the subprocess, and sysread is used to read from it.
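For readers following along, here is a minimal sketch of that arrangement. It is not the MAS code: a child perl that echoes its input stands in for mtn (an assumption for illustration), and the point is only that whatever sysread returns is plain bytes with the utf8 flag off.

```perl
use strict;
use warnings;
use IPC::Open3;
use Symbol qw(gensym);

# Hypothetical stand-in for `mtn`: a child perl that echoes stdin to stdout.
my @cmd = ($^X, '-e', 'binmode STDIN; binmode STDOUT; $| = 1; print while <STDIN>');

my $err = gensym;    # open3 needs a real glob for the child's stderr
my $pid = open3(my $in, my $out, $err, @cmd);
binmode $in;         # send raw bytes
binmode $out;        # receive raw bytes; no decoding happens here

print {$in} "caf\xC3\xA9\n";    # the UTF-8 encoding of "café", as bytes
close $in;                      # EOF lets the child finish

my $data = '';
while (1) {
    my $n = sysread($out, my $chunk, 4096);
    die "sysread failed: $!" unless defined $n;
    last if $n == 0;            # EOF
    $data .= $chunk;
}
waitpid($pid, 0);

printf "read %d bytes, utf8 flag %s\n",
    length($data), utf8::is_utf8($data) ? 'on' : 'off';
```

The string in $data still has to be decoded before it is handed to Gtk+ as text.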

Using MAS, I have built up a GUI application that allows the user to browse the database.

The database can contain UTF-8 data, not only in the files it holds but also in any change log comments.

Ok, my issue was that the UTF8 data was not displaying correctly under Gtk+. I tried using Encode.pm's decode_utf8() routine and this fixed it.

Some questions if I may:

1) Am I right in assuming that, for data read from the subprocess to arrive in my app already stamped as UTF-8, I would need to tell perl via a PerlIO layer, and that otherwise the data is read in as binary?
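To make question 1 concrete, here is a small sketch of the difference a PerlIO layer makes. It uses in-memory filehandles instead of a subprocess (an assumption to keep it self-contained) and reads the same octets once through :raw and once through :encoding(UTF-8). Note that this applies to buffered reads such as readline; as far as I know, sysread on a handle carrying a :utf8 layer was deprecated in Perl 5.24 and became fatal in 5.30, so code built on sysread would decode explicitly instead.

```perl
use strict;
use warnings;

my $bytes = "caf\xC3\xA9\n";    # UTF-8 encoded "café" plus newline, as raw octets

# Without a decoding layer, readline returns the octets unchanged.
open(my $raw, '<:raw', \$bytes) or die "open: $!";
my $as_bytes = <$raw>;
close $raw;

# With an :encoding(UTF-8) layer, readline returns decoded characters.
open(my $dec, '<:encoding(UTF-8)', \$bytes) or die "open: $!";
my $as_chars = <$dec>;
close $dec;

printf "raw: length %d; decoded: length %d\n",
    length($as_bytes), length($as_chars);
```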

2) Design question. I am thinking that the responsibility for the UTF-8 decoding should lie with the application rather than the library, since one can always fetch binary data from the database (e.g. a JPEG file under CM control), and the app already determines the type of file fetched in order to handle it correctly (i.e. not dumping binary data to the screen). If the raw data is written to a terminal, the terminal takes care of it and displays it correctly (I have tested this). Or would people expect the library to put the data into the correct form and stamp it as UTF-8?
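One way the application-side option in question 2 could look, as a sketch only: the library keeps returning octets, and the application attempts a strict decode when it believes the data is text. The helper name octets_to_text and the sample data are hypothetical, not part of MAS or the application.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical application-side helper: the library hands back octets,
# and the application decides whether they are text.
sub octets_to_text {
    my ($octets) = @_;
    # Strict decode: dies on malformed UTF-8; LEAVE_SRC keeps $octets intact.
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $text;    # undef means "not valid UTF-8, treat as binary"
}

my $log_comment = "Fixed caf\xC3\xA9 menu";    # valid UTF-8 octets
my $jpeg_header = "\xFF\xD8\xFF\xE0";          # start of a JPEG file, not UTF-8

for my $octets ($log_comment, $jpeg_header) {
    my $text = octets_to_text($octets);
    print defined $text
        ? "text, " . length($text) . " characters\n"
        : "binary, " . length($octets) . " bytes\n";
}
```

Keeping the decode step out of the library means binary fetches pass through untouched, at the cost of every caller having to remember to decode text.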

3) What does decode_utf8 do, beyond checking for UTF-8 compliance and setting the utf8 flag? Does it pack 4 octets per 32 bits for binary and one character per 32 bits for UTF-8 data?

4) If the data read in is in binary format, then why did I have to use `use bytes` when searching it with a regex (including when searching binary data)? At the time this made sense, but now that I'm having to convert the data to UTF-8 I'm wondering: if it isn't already in UTF-8 then surely it's binary, so why the need for use bytes?

As you can tell I am a bit befuddled by all of this. Can you help?

Many thanks in advance,

Tony.

Replies are listed 'Best First'.
Re: Locale Responsibilities
by ikegami (Patriarch) on May 17, 2009 at 01:25 UTC

    What does decode_utf8 do, beyond checking for UTF-8 compliance and setting the utf8 flag? Does it pack 4 octets per 32 bits for binary and one character per 32 bits for UTF-8 data?

    decode_utf8 converts bytes "\xC3\xA9" into character "\xE9".

    Internally, the string returned is the utf8 representation (a Perl-specific superset of UTF-8) of the character with the UTF8 flag on. For example, character "\xE9" is stored as the two bytes "\xC3\xA9", UTF8=1.
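    The conversion described above can be checked with a short script. This is only an illustrative sketch: it compares lengths and code points before and after decode_utf8, then uses utf8::encode on a copy to show the two bytes that represent the character.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $octets = "\xC3\xA9";              # two bytes: the UTF-8 encoding of U+00E9
my $chars  = decode_utf8($octets);    # one character: "\xE9"

printf "octets: length %d, utf8 flag %d\n",
    length($octets), utf8::is_utf8($octets) ? 1 : 0;
printf "chars:  length %d, utf8 flag %d, ord 0x%X\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0, ord($chars);

# Encode a copy back to UTF-8 octets to see the bytes behind the character.
my $internal = $chars;
utf8::encode($internal);
printf "encoded bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $internal;
```

    So there is no repacking into 32-bit units: the decoded string is shorter in characters, not wider in storage.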

    If the data read in is in binary format, then why did I have to use `use bytes` when searching it with a regex (including when searching binary data)?

    You don't.

    At the time this made sense, but now that I'm having to convert the data to UTF-8 I'm wondering: if it isn't already in UTF-8 then surely it's binary, so why the need for use bytes?

    You're unclear as to whether you're talking about the internal or external encoding. Perhaps "Re: Decoding, Encoding string, how to? (internal encoding)" would help.

      If I don't use the use bytes pragma I get:

      Malformed UTF-8 character (unexpected non-continuation byte 0xf4, 1 byte after start byte 0xf1, expected 4 bytes) in pattern match (m//) at /home/aecoope/code/monotone.ca/mtn-browse/lib/perl/FindFiles.pm line 601.

      when searching binary data with a regex.

      Tony.

        No, your incorrect use of _utf8_on (or equivalent such as the :utf8 PerlIO layer) is causing that. use bytes kinda fixes your earlier bug.
        $ perl -MEncode=_utf8_on -e'$s = "\xF1\xF4"; _utf8_on($s); "" =~ /$s/'
        Malformed UTF-8 character (unexpected non-continuation byte 0xf4, immediately after start byte 0xf1) in regexp compilation at -e line 1.
        Malformed UTF-8 character (1 byte, need 4, after start byte 0xf4) in regexp compilation at -e line 1.
        Malformed UTF-8 character (unexpected non-continuation byte 0xf4, immediately after start byte 0xf1) in regexp compilation at -e line 1.
        Malformed UTF-8 character (1 byte, need 4, after start byte 0xf4) in regexp compilation at -e line 1.
        $ perl -MEncode=_utf8_on -e'$s = "\xF1\xF4"; "" =~ /$s/'
        $
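        For comparison, a sketch of the same kind of search done on data that was never flagged as UTF-8. The blob contents are made up for illustration; the pattern reuses the \xF1\xF4 byte pair from the transcript above. With the data left as plain bytes, the match needs neither use bytes nor _utf8_on.

```perl
use strict;
use warnings;

# Binary data kept as plain bytes: no _utf8_on, no :utf8 layer, no `use bytes`.
my $blob    = "\x00\x01\xF1\xF4\xFF\x10";
my $pattern = "\xF1\xF4";    # the byte pair from the transcript above

# \Q...\E quotes the bytes so none are treated as regex metacharacters.
if ($blob =~ /\Q$pattern\E/) {
    printf "found at byte offset %d; utf8 flag is %s\n",
        $-[0], utf8::is_utf8($blob) ? 'on' : 'off';
}
```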
