PerlMonks  

reading unicode files

by Anonymous Monk
on Mar 13, 2003 at 13:40 UTC (#242679, perlquestion)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

To the order, I am having a bit of trouble reading Unicode files using while (<FILEHANDLE>) {}. I have tried ActiveState Perl 5.005 and 5.8.0 without success. I believe the latter supports Unicode, but I do not understand in what context. Opening the files in question and saving them as ANSI/ASCII resolves the problem, but that is not a viable workaround for this application. Any advice or code snippets greatly appreciated. Cheers, mmccar

Replies are listed 'Best First'.
Re: reading unicode files
by dakkar (Hermit) on Mar 13, 2003 at 14:09 UTC

    There is no such thing as "Unicode files".

    Unicode is:

    • a list of characters, each of which has a number of properties, such as a name and a number, categories to which it belongs, and so on
    • a series of algorithms (for bidirectionality, collation, normalization, etc.)
    • some Transfer Formats for serializing character streams
    • and some more things I forgot...

    The things of interest here are the transfer formats; the best known are utf-8 and utf-16:

    utf-8
    maps the characters defined in ASCII into their usual representation as bytes with values under 128, allowing non-Unicode-aware programs to not make a mess of it (for example, utf-8-encoded strings can still be 0-terminated, the path separator doesn't change, and so on)
    utf-16
    is more compact for East Asian scripts (a kanji usually takes 3 bytes in utf-8, but 2 bytes in utf-16), but you can't use 0 to terminate your strings, because for example the character 'A' gets encoded as the two bytes 0 and 65 (in big-endian order). Moreover, utf-16 is sensitive to endianness: the same 'A' becomes 65 and 0 in little-endian order
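    The byte counts above can be checked with the Encode module that ships with Perl 5.8. This is only a minimal sketch: the sample kanji (U+6F22) and the variable names are chosen for illustration, and the explicit UTF-16BE and UTF-16LE encodings are used so that no byte order mark is added to the output.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample characters (illustrative choices): a kanji from the
# Basic Multilingual Plane and the ASCII letter 'A'.
my $kanji  = "\x{6F22}";
my $letter = 'A';

# Number of bytes each transfer format needs for the kanji.
my $kanji_utf8_len  = length(encode('UTF-8',    $kanji));
my $kanji_utf16_len = length(encode('UTF-16BE', $kanji));

# The same letter in both byte orders of utf-16.
my $a_utf16be = encode('UTF-16BE', $letter);
my $a_utf16le = encode('UTF-16LE', $letter);

print "kanji: $kanji_utf8_len bytes in utf-8, $kanji_utf16_len bytes in utf-16\n";
printf "'A' in utf-16, big-endian:    %s\n", join(' ', unpack('C*', $a_utf16be));
printf "'A' in utf-16, little-endian: %s\n", join(' ', unpack('C*', $a_utf16le));
```

    The embedded 0 byte in both utf-16 forms of 'A' is what breaks 0-terminated string handling, and the swapped byte order is the endianness problem.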

    In the following, I'll assume you have utf-8-encoded files.

    For Perl 5.005, you just have to handle them as binary files, i.e. you don't have support for Unicode strings.

    Perl 5.8 does have support; you just have to tell it which encoding your files are in:

    open UTF8FILE, '<:utf8', 'filename' or die "Cannot open: $!";
    while (<UTF8FILE>) {
        # \p{Devanagari} matches any character in the Devanagari script
        /\p{Devanagari}/ and print "A Devanagari character!\n";
    }
    close UTF8FILE;

    This script opens the file assuming it is in utf-8, and prints a message for each line that contains a character in the Devanagari script.

    Things to look at (in the 5.8 docs):

    • perldoc -f open
    • perldoc perluniintro
    • perldoc perlunicode

    And also The UTF-8 and Unicode FAQ

    -- 
            dakkar - Mobilis in mobile
    
      dakkar, firstly, thanks for the prompt reply. Secondly, it appears that the file in question is not UTF-8. If I open it with MS Notepad and go to save, the default file type is simply 'Unicode'; UTF-8 is a separate option in the drop-down list. So I am guessing that Notepad uses UTF-16 or UCS-2 when 'Unicode' is selected (not sure what UCS-2 is, but I read that Notepad uses it?). Do you know how I can find out how my file is encoded? I have tried specifying UTF-16 and UCS-2 as the IO layers with the open function, but I get: Can't locate PerlIO/UTF.pm

        The easiest (for me) way to decide if your file is utf-16 or ucs-2 (see below) is to look at it, using something like:

        C:\> more < thefile
        If you see something (like smileys, or whitespace) between each Latin letter, it's one of the two encodings above; otherwise it isn't (this assumes you have Latin letters in your file).

        To read them (I was not very clear):

        open FILE,'<:encoding(utf-16)','filename';
        or whatever encoding you want. The :utf8 spec is a sort of shorthand for the full :encoding() spec...

        ucs-2 is a degenerate form of Unicode encoding, since it cannot represent characters beyond the first 2^16. It is more-or-less compatible with utf-16 for those, so you might not notice the difference. Anyway, don't use it to write new files (please ;-) )
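        As a complement to looking at the file by eye, the first bytes can be checked for a byte order mark (BOM). This is only a rough sketch: guess_encoding_from_bom is a hypothetical helper name, it assumes the file was saved with a BOM (files without one yield no answer), and it does not tell utf-32 apart from utf-16.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding name from the byte order mark
# at the start of a file. Returns nothing when no BOM is recognised.
# Caveat: a utf-32 little-endian BOM also starts with FF FE and would
# be reported as UTF-16LE here.
sub guess_encoding_from_bom {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                 # raw bytes, no newline translation
    read $fh, my $head, 3;       # the longest BOM checked is 3 bytes
    close $fh;
    $head = '' unless defined $head;

    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return;
}

# Usage: perl thisscript.pl thefile
if (@ARGV) {
    my $enc = guess_encoding_from_bom($ARGV[0]);
    print defined($enc)
        ? "$ARGV[0] starts with a $enc byte order mark\n"
        : "$ARGV[0] has no recognised byte order mark\n";
}
```

        The returned name can then be interpolated into the layer passed to open, for example "<:encoding($enc)".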

        -- 
                dakkar - Mobilis in mobile
        
Re: reading unicode files
by diotalevi (Canon) on Mar 13, 2003 at 13:50 UTC

    There is a series of documentation pages in 5.8.0 that cover this. The best answer I can give you is to start with the main 'man perl' page and go through each of the Unicode documents. Be sure to also visit the open entry in perlfunc, where the utf8 discipline is applied to a file while opening it. Also visit the binmode() function, where the utf8 discipline can be applied to a previously opened filehandle. But definitely use 5.8.
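    The binmode() route mentioned above might look like the following sketch for Perl 5.8. The helper name count_chars_utf8 is hypothetical, and the file is assumed to be utf-8; counting characters is just a simple way to show that the handle now yields characters rather than bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: switch an already-open filehandle to utf-8
# decoding with binmode(), then count the characters it contains
# (line endings excluded).
sub count_chars_utf8 {
    my ($fh) = @_;
    binmode $fh, ':utf8';
    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        $count += length $line;   # length counts characters, not bytes
    }
    return $count;
}

# Usage: perl thisscript.pl thefile
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print count_chars_utf8($fh), " characters\n";
    close $fh;
}
```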


    Seeking Green geeks in Minnesota

Node Type: perlquestion [id://242679]
Approved by diotalevi
Front-paged by IlyaM