PerlMonks  

Re: reading unicode files

by dakkar (Hermit)
on Mar 13, 2003 at 14:09 UTC (#242689)


in reply to reading unicode files

There is no such thing as "Unicode files".

Unicode is:

  • a list of characters, each of which has a number of properties, such as a name and a number, categories to which it belongs, and so on
  • a series of algorithms (for bidirectionality, collation, normalization, etc.)
  • some Transformation Formats (the "UTF" in utf-8) for serializing character streams
  • and some more things I forgot...

The things of interest here are the transformation formats; the best known are utf-8 and utf-16:

utf-8
maps the characters defined in ASCII into their usual representation as bytes with values under 128, allowing non-Unicode-aware programs to not make a mess of it (for example, utf-8-encoded strings can still be 0-terminated, the path separator doesn't change, and so on)
utf-16
is more compact for East Asian scripts (a kanji usually takes 3 bytes in utf-8, but 2 bytes in utf-16), but you can't use 0 to terminate your strings, because, for example, the character 'A' gets encoded as the two bytes 0 and 65 (in big-endian order). Moreover, utf-16 is sensitive to endianness: the same 'A' is 65 then 0 in little-endian order
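To make the comparison concrete, here is a minimal sketch for Perl 5.8 that dumps the bytes of 'A' and of one kanji in each format. It assumes the core Encode module; `hex_bytes` is a helper name made up for this example, and U+6F22 is just one sample kanji.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the encoded form of $string as space-separated hex bytes.
# (hex_bytes is a hypothetical helper, not part of Encode.)
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $octets = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

my $kanji = "\x{6F22}";    # U+6F22, chosen only as a sample kanji

for my $encoding ('UTF-8', 'UTF-16BE', 'UTF-16LE') {
    printf "%-8s  A: %-5s  kanji: %s\n",
        $encoding, hex_bytes($encoding, 'A'), hex_bytes($encoding, $kanji);
}
```

'A' stays the single byte 41 in utf-8, but becomes 00 41 or 41 00 in utf-16 depending on the byte order, which is where the embedded zero bytes come from.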

In the following, I'll assume you have utf-8-encoded files.

With Perl 5.005 you have no support for Unicode strings, so you just have to handle the files as binary data.

Perl 5.8 does have support; you just have to tell it which encoding your files are in:

open UTF8FILE, '<:utf8', 'filename' or die "Can't open filename: $!";
while (<UTF8FILE>) {
    # \p{Devanagari} matches any character in the Devanagari script
    /\p{Devanagari}/ and print "A Devanagari character!\n";
}
close UTF8FILE;

This script opens the file assuming it is in utf-8, and prints a message for each line that contains a character in the Devanagari script.
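The difference between handling a file as bytes and telling Perl its encoding can be seen in a small self-contained sketch. It assumes Perl 5.8 with File::Temp; `make_sample` and `first_line` are names invented for this example, and the sample file holds a single Devanagari letter (U+0915).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a temporary utf-8 file holding one Devanagari letter and a newline.
# (Hypothetical helper, a stand-in for a real input file.)
sub make_sample {
    my ($tmp, $path) = tempfile(UNLINK => 1);
    binmode $tmp, ':utf8';
    print {$tmp} "\x{0915}\n";
    close $tmp or die "Can't close $path: $!";
    return $path;
}

# Read the first line of $path through the given PerlIO layer spec.
sub first_line {
    my ($path, $layers) = @_;
    open my $fh, "<$layers", $path or die "Can't open $path: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $path = make_sample();
printf "read as bytes: length %d\n", length(first_line($path, ':raw'));
printf "read as utf-8: length %d\n", length(first_line($path, ':utf8'));
```

Read with `:raw`, the letter is three separate bytes and `\p{Devanagari}` cannot match; read with `:utf8`, it is one character and the match succeeds.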

Things to look at (in the 5.8 docs):

  • perldoc -f open
  • perldoc perluniintro
  • perldoc perlunicode

And also The UTF-8 and Unicode FAQ

-- 
        dakkar - Mobilis in mobile

Re: Re: reading unicode files
by Anonymous Monk on Mar 13, 2003 at 15:06 UTC
    dakkar,

    Firstly, thanks for the prompt reply. Secondly, it appears my file is not in UTF-8. If I open it with MS Notepad and go to save, the default file type is simply 'Unicode'; UTF-8 is a separate option in the drop-down list. So I am guessing that Notepad uses UTF-16 or UCS-2 when 'Unicode' is selected (I'm not sure what UCS-2 is, but I read that Notepad uses it).

    Do you know how I can find out how my file is encoded? I have tried specifying UTF-16 and UCS-2 as the IO layers with the open function, but I get: Can't locate PerlIO/UTF.pm

      The easiest (for me) way to decide if your file is utf-16 or ucs-2 (see below) is to look at it, using something like:

      C:\> more < thefile
      If you see something (like smileys, or whitespace) between each pair of Latin letters, it's one of the two encodings above; otherwise it isn't (this assumes you have Latin letters in your file).
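A complementary check, not part of the advice above, is to look for a byte order mark at the start of the file. This is only a heuristic sketch: it assumes the file begins with a BOM (as files saved from Notepad as 'Unicode' normally do), it ignores UTF-32, and `guess_encoding_from_bom` and `read_head` are hypothetical helper names.

```perl
use strict;
use warnings;

# Guess an encoding from the first bytes of a file.
# Returns nothing (undef in scalar context) when no byte order mark is found.
sub guess_encoding_from_bom {
    my ($head) = @_;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return;
}

# Read the first few bytes of a file without any decoding.
sub read_head {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;
    my $head = '';
    read $fh, $head, 4;
    close $fh;
    return $head;
}

# Usage: perl thisscript.pl thefile
if (@ARGV) {
    my $guess = guess_encoding_from_bom(read_head($ARGV[0]));
    print defined $guess ? "$guess\n" : "no byte order mark found\n";
}
```

A file without a BOM tells you nothing this way, so the "look at it" test is still the fallback.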

      To read them (I was not very clear before):

      open FILE,'<:encoding(utf-16)','filename';
      or whatever encoding you want. The :utf8 spec is a sort of shorthand for the full :encoding() spec...
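A self-contained round trip may make the `:encoding()` layer easier to try out. This is a sketch assuming Perl 5.8 with File::Temp; `write_utf16_sample` and `read_utf16_lines` are made-up helper names, and the temporary file only stands in for the real one.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $text to a new temporary file as utf-16 and return its path.
# (Hypothetical helper, only here to give us something to open.)
sub write_utf16_sample {
    my ($text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    open my $out, '>:encoding(UTF-16)', $path or die "Can't write $path: $!";
    print {$out} $text;
    close $out or die "Can't close $path: $!";
    return $path;
}

# Read every line of a utf-16 file as decoded Perl characters.
sub read_utf16_lines {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-16)', $path or die "Can't read $path: $!";
    my @lines = <$in>;
    close $in;
    chomp @lines;
    return @lines;
}

my $path  = write_utf16_sample("caf\x{E9}\nna\x{EF}ve\n");
my @lines = read_utf16_lines($path);
printf "%d lines, first line has %d characters\n", scalar @lines, length $lines[0];
```

Note that the layer is spelled `:encoding(UTF-16)`; writing the encoding name directly as a layer is what produces the "Can't locate PerlIO/UTF.pm" error.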

      ucs-2 is a degenerate form of Unicode encoding, since it cannot represent characters beyond the first 2^16. It is more-or-less compatible with utf-16 for those first 2^16 characters, so you might not notice the difference. Anyway, don't use it to write new files (please ;-) )
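The limit is easy to see by encoding one character on each side of it. The sketch below assumes the core Encode module; `utf16be_hex` is a name invented here, and the two code points are arbitrary samples.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of a string's utf-16 (big-endian) encoding.
# (utf16be_hex is a hypothetical helper, not part of Encode.)
sub utf16be_hex {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode('UTF-16BE', $string);
}

printf "U+6F22  -> %s\n", utf16be_hex("\x{6F22}");     # inside the first 2^16
printf "U+1D11E -> %s\n", utf16be_hex("\x{1D11E}");    # beyond the first 2^16
```

The second character comes out as four bytes (a surrogate pair), which is what a fixed two-byte encoding like ucs-2 has no way to express.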

      -- 
              dakkar - Mobilis in mobile
      
