http://qs321.pair.com?node_id=242679

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

To the order, I am having a bit of trouble reading Unicode files using while (<FILEHANDLE>) {}. I have tried using ActiveState Perl 5.005 and 5.8.0 without success. I believe the latter supports Unicode, but I do not understand in what context. If I open the files in question and save them as ANSI/ASCII, this resolves the problem, but it is not a viable workaround for this application. Any advice/code snippets greatly appreciated. Cheers, mmccar

Replies are listed 'Best First'.
Re: reading unicode files
by dakkar (Hermit) on Mar 13, 2003 at 14:09 UTC

    There is no such thing as "Unicode files".

    Unicode is:

    • a list of characters, each of which has a number of properties, such as a name and a number, categories to which it belongs, and so on
    • a series of algorithms (for bidirectionality, collation, normalization, etc.)
    • some Transformation Formats (the "TF" in UTF) for serializing character streams
    • and some more things I forgot...

    The things of interest here are the transformation formats. The best known are utf-8 and utf-16:

    utf-8
    maps the characters defined in ASCII into their usual representation as bytes with values under 128, allowing non-Unicode-aware programs to not make a mess of it (for example, utf-8-encoded strings can still be 0-terminated, the path separator doesn't change, and so on)
    utf-16
    is more compact for East Asian scripts (a kanji usually takes 3 bytes in utf-8, but 2 bytes in utf-16), but you can't use 0 to terminate your strings, because, for example, the character 'A' gets encoded as the two bytes 0 and 65. Moreover, utf-16 is sensitive to endianness: those two bytes appear as 0, 65 in big-endian order and as 65, 0 in little-endian order
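To make the byte-level difference concrete, here is a small sketch using the Encode module that ships with Perl 5.8. The helper name byte_values and the sample kanji (U+65E5) are just illustrative choices, not anything from the original post.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the encoded bytes of a string as space-separated decimal values.
# (byte_values is a hypothetical helper name used only for this illustration.)
sub byte_values {
    my ($encoding, $string) = @_;
    return join ' ', unpack 'C*', encode($encoding, $string);
}

my $kanji = "\x{65E5}";    # U+65E5, a common kanji chosen as a sample

print "A in utf-8:        ", byte_values('UTF-8',    'A'),    "\n";  # 65
print "A in utf-16 (BE):  ", byte_values('UTF-16BE', 'A'),    "\n";  # 0 65
print "A in utf-16 (LE):  ", byte_values('UTF-16LE', 'A'),    "\n";  # 65 0
print "kanji in utf-8:    ", byte_values('UTF-8',    $kanji), "\n";  # three bytes
print "kanji in utf-16BE: ", byte_values('UTF-16BE', $kanji), "\n";  # two bytes
```

The zero byte in the utf-16 forms of 'A' is exactly what breaks 0-terminated strings, and the BE/LE pair shows the endianness issue.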

    In the following, I'll assume you have utf-8-encoded files.

    For Perl 5.005, you just have to handle them as binary files, i.e. you don't have support for Unicode strings.

    Perl 5.8 does have support; you just have to tell it which encoding your files are in:

    open UTF8FILE, '<:utf8', 'filename' or die "Cannot open filename: $!";
    while (<UTF8FILE>) {
        # \p{Devanagari} matches any character in the Devanagari script
        /\p{Devanagari}/ and print "A Devanagari character!\n";
    }
    close UTF8FILE;
    This script opens the file assuming it is in utf-8, and prints a message for each line that contains a character in the Devanagari script.

    Things to look at (in the 5.8 docs):

    • perldoc -f open
    • perldoc perluniintro
    • perldoc perlunicode

    And also The UTF-8 and Unicode FAQ

    -- 
            dakkar - Mobilis in mobile
    
      dakkar, Firstly, thanks for the prompt reply. Secondly, it appears that the file in question is not UTF-8. If I open it with MS Notepad and go to save, the default file type is simply 'Unicode'; UTF-8 appears as a separate option in the drop-down list. So I am guessing that Notepad uses UTF-16 or UCS-2 when 'Unicode' is selected (I am not sure what UCS-2 is, but I read that Notepad uses it). Do you know how I can find out how my file is encoded? I have tried specifying UTF-16 and UCS-2 as the IO layers with the open function, but I get: Can't locate PerlIO/UTF.pm

        The easiest (for me) way to decide if your file is utf-16 or ucs-2 (see below) is to look at it, using something like:

        C:\> more < thefile
        If you see something (like smileys or whitespace) between each Latin letter, it's one of the two encodings above; otherwise it isn't. (This assumes you have Latin letters in your file.)
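If eyeballing the output is not convincing, a rough alternative is to check the first bytes of the file for a byte order mark (BOM). The sketch below is only a heuristic: the function names are made up for this example, it assumes the file actually starts with a BOM (many utf-8 files do not), and it does not try to tell utf-16 apart from utf-32 or ucs-2.

```perl
use strict;
use warnings;

# Guess an encoding name from the first bytes of a file.
# Returns undef (in scalar context) when no known BOM is present.
# (guess_encoding_from_bom is a hypothetical helper, not a library function.)
sub guess_encoding_from_bom {
    my ($head) = @_;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;
}

# Read the first few raw bytes of a file, with no decoding applied.
sub read_head {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    read $fh, my $head, 4;
    close $fh;
    return defined $head ? $head : '';
}

# Report a guess for every file named on the command line.
for my $path (@ARGV) {
    my $guess = guess_encoding_from_bom(read_head($path));
    printf "%s: %s\n", $path, defined $guess ? $guess : 'no BOM found';
}
```

Saved under a name of your choosing, it would be run as something like perl thescript.pl thefile.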

        To read them: (I was not very clear)

        open FILE,'<:encoding(utf-16)','filename';
        or whatever encoding you want. The :utf8 spec is a sort of shorthand for the full :encoding() spec...
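For a fuller picture, here is a self-contained sketch that writes a small test file and reads it back through the :encoding(UTF-16) layer. The file layout (a FF FE byte order mark followed by little-endian utf-16 with CRLF line ends) is my assumption about what Notepad's 'Unicode' option produces; the sample text and variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a sample file: BOM (FF FE) followed by UTF-16LE text with CRLF line ends.
# This layout is an assumption about Notepad's "Unicode" save format.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xFF\xFE", encode('UTF-16LE', "first line\r\nsecond line\r\n");
close $tmp or die "Cannot close $path: $!";

# Read it back. The UTF-16 layer uses the BOM to pick the byte order.
open my $in, '<:encoding(UTF-16)', $path or die "Cannot open $path: $!";
my @lines;
while (my $line = <$in>) {
    $line =~ s/\r?\n\z//;    # strip LF or CRLF, whichever survived decoding
    push @lines, $line;
}
close $in;

print scalar(@lines), " lines read: ", join(' | ', @lines), "\n";
```

The line ending is stripped by hand rather than with chomp because, depending on platform and layer order, the decoded line may still end in a carriage return.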

        ucs-2 is a degenerate form of Unicode encoding, since it cannot represent characters beyond the first 2^16. It is more or less compatible with utf-16 for those first 2^16 characters, so you might not notice the difference. Anyway, don't use it to write new files (please ;-) )
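The difference shows up only for characters above the first 2^16: utf-16 encodes those as a pair of 16-bit units (a surrogate pair), which ucs-2 has no way to express. A small sketch, with the two sample characters picked arbitrarily for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro = "\x{20AC}";     # EURO SIGN, inside the first 2^16 characters
my $clef = "\x{1D11E}";    # MUSICAL SYMBOL G CLEF, beyond the first 2^16

# Hex dump of the big-endian utf-16 form of each character.
my $euro_hex = unpack 'H*', encode('UTF-16BE', $euro);
my $clef_hex = unpack 'H*', encode('UTF-16BE', $clef);

print "U+20AC  in utf-16: $euro_hex\n";    # 20ac     (one 16-bit unit)
print "U+1D11E in utf-16: $clef_hex\n";    # d834dd1e (two 16-bit units)
```

For the first character the utf-16 and ucs-2 bytes are identical, which is why the two encodings are easy to confuse.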

        -- 
                dakkar - Mobilis in mobile
        
Re: reading unicode files
by diotalevi (Canon) on Mar 13, 2003 at 13:50 UTC

    There is a series of documentation pages in 5.8.0 that cover this. The best answer I can give you is to start with the main 'man perl' page and go through each of the Unicode documents. Be sure to also read the open() entry in perlfunc, which shows how the utf8 discipline is applied to a file while opening it, and the binmode() entry, which shows how it can be applied to a previously opened filehandle. But definitely use 5.8.
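As a minimal sketch of the binmode() route on 5.8, the following opens a file with no layer at all and switches the handle to utf8 afterwards. The temporary file and its contents are made up so the example can run on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the raw utf-8 bytes for "café" (the é is the two bytes C3 A9).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "caf\xC3\xA9\n";
close $tmp or die "Cannot close $path: $!";

# Open without any layer, then apply utf8 to the already open handle.
open my $in, '<', $path or die "Cannot open $path: $!";
binmode $in, ':utf8';
my $line = <$in>;
close $in;
chomp $line;

# With the layer applied, the two bytes C3 A9 count as one character.
print "characters: ", length($line), "\n";    # 4, not 5
```

Without the binmode call, length would report 5, because Perl would treat each byte as a separate character.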


    Seeking Green geeks in Minnesota