reading unicode files

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

To the order, I am having a bit of trouble reading unicode files using `while (<FILEHANDLE>) {}`. I have tried ActiveState Perl 5.005 and 5.8.0 without success. I believe the latter supports unicode, but I do not understand in what context. If I open the files in question and save them as ANSI/ASCII, this resolves the problem, but that is not a viable workaround for this application. Any advice/code snippets greatly appreciated. Cheers, mmccar

Re: reading unicode files by dakkar (Hermit) on Mar 13, 2003 at 14:09 UTC
There is no such thing as "Unicode files". Unicode is:

- a list of characters, each of which has a number of properties, such as a name and a number, categories to which it belongs, and so on
- a series of algorithms (for bidirectionality, collation, normalization, etc.)
- some Transfer Formats for serializing character streams
- and some more things I forgot...

The things of interest here are the transfer formats. The best known are `utf-8` and `utf-16`:

- `utf-8` maps the characters defined in ASCII into their usual representation as bytes with values under 128, allowing non-Unicode-aware programs to not make a mess of it (for example, `utf-8`-encoded strings can still be 0-terminated, the path separator doesn't change, and so on)
- `utf-16` is more compact for oriental scripts (a kanji usually takes 3 bytes in `utf-8`, but only 2 in `utf-16`), but you can't use 0 to terminate your strings, because for example the character 'A' gets encoded as the two bytes 0 and 65. Moreover, `utf-16` is sensitive to endianness

In the following, I'll assume you have `utf-8`-encoded files.

For Perl 5.005, you just have to handle them as binary files, i.e. you don't have support for Unicode strings.

Perl 5.8 does have support; you just have to tell it which encoding your files are in:

`open UTF8FILE,'<:utf8','filename' or die "Cannot open filename: $!"; while (<UTF8FILE>) { /\p{Devanagari}/ and print "A Devanagari character!\n"; } close UTF8FILE;`

This script would open the file assuming it is in `utf-8`, and print a message if it finds any character in the Devanagari script.

Things to look at (in the 5.8 docs): `perldoc -f open`, `perldoc perluniintro`, `perldoc perlunicode`. And also The UTF-8 and Unicode FAQ.

-- dakkar - Mobilis in mobile
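To make the byte-level comparison between the two transfer formats concrete, here is a minimal sketch using the core `Encode` module (shipped with Perl 5.8). The helper name `bytes_of` and the sample kanji U+65E5 are my own choices for illustration, not something from the post above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string and return its byte values.
sub bytes_of {
    my ($encoding, $string) = @_;
    return unpack 'C*', encode($encoding, $string);
}

my $kanji = "\x{65E5}";   # U+65E5, a common kanji, chosen only as an example

# 'A' is a single byte in UTF-8 but two bytes in UTF-16,
# and the order of those two bytes depends on endianness.
print join(' ', bytes_of('UTF-8',    'A')), "\n";   # 65
print join(' ', bytes_of('UTF-16BE', 'A')), "\n";   # 0 65
print join(' ', bytes_of('UTF-16LE', 'A')), "\n";   # 65 0

# The kanji needs three bytes in UTF-8 and two in UTF-16.
print scalar(() = bytes_of('UTF-8',    $kanji)), "\n";   # 3
print scalar(() = bytes_of('UTF-16BE', $kanji)), "\n";   # 2
```

The zero byte in the UTF-16 output is why 0-terminated string handling breaks on such data, and the swapped order between the BE and LE lines is the endianness sensitivity mentioned above.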
Re: Re: reading unicode files by Anonymous Monk on Mar 13, 2003 at 15:06 UTC
dakkar, Firstly, thanks for the prompt reply. Secondly, it appears that my file is not using UTF-8. If I open it with MS Notepad and go to save, the default file type is simply 'Unicode'; UTF-8 appears as a separate option in the drop-down list. So I am guessing that Notepad may use UTF-16 or UCS-2 when 'Unicode' is selected (not sure what UCS-2 is, but I read that Notepad uses this?). Do you know how I can find out how my file is encoded? I have tried specifying UTF-16 and UCS-2 as the IO layers with the open function but I get: Can't locate PerlIO/UTF.pm
Re: Re: Re: reading unicode files by dakkar (Hermit) on Mar 13, 2003 at 15:18 UTC
The easiest (for me) way to decide if your file is `utf-16` or `ucs-2` (see below) is to look at it, using something like:

`C:\> more < thefile`

If you see something (like smileys, or whitespace) between each latin letter, it's one of the two encodings above; otherwise it isn't (this assumes you have latin letters in your file).

To read them (I was not very clear):

`open FILE,'<:encoding(utf-16)','filename';`

or whatever encoding you want. The `:utf8` spec is a sort of shorthand for the full `:encoding()` spec...

`ucs-2` is a degenerate form of Unicode encoding, since it cannot represent characters beyond the first 2^16. It is more-or-less compatible with `utf-16` for those, so you might not notice the difference. Anyway, don't use it to write new files (please `;-)` )

-- dakkar - Mobilis in mobile
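Another way to answer "how is my file encoded?" is to look at its first bytes for a byte-order mark (BOM). The sketch below assumes the file starts with one, which, as far as I know, is the case for files saved by Notepad as 'Unicode' (UTF-16, little-endian). The sub names `guess_encoding_from_bom` and `read_head` are hypothetical, and a file without a BOM simply yields no answer here.

```perl
use strict;
use warnings;

# Guess an encoding name from the byte-order mark at the start of a file.
# Returns nothing when no BOM is found. UTF-32 is not handled in this sketch.
sub guess_encoding_from_bom {
    my ($head) = @_;                                   # first raw bytes of the file
    return 'UTF-8'  if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16' if $head =~ /^(?:\xFF\xFE|\xFE\xFF)/;
    return;
}

# Read the first three bytes of a file without any decoding layer.
sub read_head {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $head = '';
    read $fh, $head, 3;
    close $fh;
    return $head;
}

my $path = shift @ARGV;          # file name given on the command line, if any
if (defined $path) {
    my $encoding = guess_encoding_from_bom(read_head($path));
    die "No byte-order mark found in $path\n" unless defined $encoding;

    open my $in, "<:encoding($encoding)", $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$in>) {
        $line =~ s/^\x{FEFF}// if $. == 1;   # drop a BOM left in the text (UTF-8 case)
        $line =~ s/\r?\n\z//;                # tolerate Windows line endings
        $count++;
    }
    close $in;
    print "$path: $encoding, $count lines\n";
}
```

When reading, the `UTF-16` encoding (without a `BE` or `LE` suffix) uses the BOM to pick the byte order, so the same layer name should work for either endianness as long as the mark is present.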
Re: reading unicode files by diotalevi (Canon) on Mar 13, 2003 at 13:50 UTC
There is a series of documentation pages in 5.8.0 that cover this. The best answer I can give you is to start with the main 'man perl' page and go through each of the Unicode documents. Be sure to also visit perlfunc, where the utf8 discipline is applied to a file while opening it. Also visit the binmode() function, where the UTF8 mode can be applied to a previously opened file handle. But definitely use 5.8.

Seeking Green geeks in Minnesota
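As a rough illustration of the binmode() route, the sketch below switches an already-open handle to UTF-8 decoding. It assumes Perl 5.8 or later and uses an in-memory file purely to stay self-contained; the sub name `read_lines_as_utf8` is made up for this example. I use `:encoding(UTF-8)` here, while `:utf8` is the layer named in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a decoding layer to an open handle, then read it.
sub read_lines_as_utf8 {
    my ($fh) = @_;
    binmode $fh, ':encoding(UTF-8)';   # handle now returns characters, not bytes
    my @lines = <$fh>;
    chomp @lines;
    return @lines;
}

# In-memory "file" holding the two UTF-8 bytes of e-acute plus a newline.
my $bytes = "\xC3\xA9\n";
open my $fh, '<', \$bytes or die "Cannot open in-memory file: $!";
my @lines = read_lines_as_utf8($fh);
close $fh;

print length($lines[0]), "\n";   # 1
```

This is handy when the handle is opened elsewhere (STDIN, or a handle passed in by other code) and the encoding only becomes known afterwards.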
