Re: Re: reading unicode files

in reply to Re: reading unicode files
in thread reading unicode files

dakkar, firstly, thanks for the prompt reply. Secondly, it appears that my file is not UTF-8. If I open it in MS Notepad and go to save, the default file type is simply 'Unicode'; UTF-8 appears as a separate option in the drop-down list. So I am guessing that Notepad uses UTF-16 or UCS-2 when 'Unicode' is selected (I'm not sure what UCS-2 is, but I read that Notepad uses it?). Do you know how I can find out how my file is encoded? I have tried specifying UTF-16 and UCS-2 as the IO layers with the open function, but I get: Can't locate PerlIO/UTF.pm


Replies are listed 'Best First'.
Re: Re: Re: reading unicode files by dakkar (Hermit) on Mar 13, 2003 at 15:18 UTC

The easiest (for me) way to decide whether your file is `utf-16` or `ucs-2` (see below) is to look at it, using something like:

`C:\> more < thefile`

If you see something (like smileys, or whitespace) between each Latin letter, it's one of the two encodings above; otherwise it isn't. (This assumes you have Latin letters in your file.)

To read them (I was not very clear):

`open FILE,'<:encoding(utf-16)','filename';`

or whatever encoding you want. The `:utf8` spec is a sort of shorthand for the full `:encoding()` spec...

`ucs-2` is a degenerate form of Unicode encoding, since it cannot represent characters beyond the first 2^16. It is more-or-less compatible with `utf-16` for those, so you might not notice the difference. Anyway, don't use it to write new files (please `;-)` )

-- 
dakkar - Mobilis in mobile
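For the "how do I find out how my file is encoded" part, another option besides eyeballing the output of `more` is to look at the first bytes of the file: Notepad normally writes a byte-order mark (BOM) at the start. What follows is only a rough sketch under that assumption, not a general encoding detector. The names `guess_encoding_from_bom` and `read_lines` are made up for illustration, and a file without a BOM makes the second sub die. The `:raw` in front of `:encoding(...)` is there because on Windows the default CRLF translation would otherwise work on the raw bytes and can mangle UTF-16 line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the encoding from the byte-order mark (BOM).
# Returns 'UTF-8', 'UTF-16LE', 'UTF-16BE', or nothing if no BOM is found.
# Caveat: a UTF-32LE BOM also starts with FF FE; this sketch ignores UTF-32.
sub guess_encoding_from_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Can't open $path: $!";
    my $got = read $fh, my $head, 3;
    die "Can't read $path: $!" unless defined $got;
    close $fh;

    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return;
}

# Hypothetical helper: read all lines of a BOM-marked file as characters.
sub read_lines {
    my ($path) = @_;
    my $enc = guess_encoding_from_bom($path)
        or die "No BOM in $path; cannot tell the encoding from the first bytes\n";

    # Plain 'UTF-16' (no LE/BE) makes the decoder read the BOM itself,
    # pick the byte order from it, and drop it from the data.
    $enc = 'UTF-16' if $enc =~ /^UTF-16/;

    open my $in, "<:raw:encoding($enc)", $path or die "Can't open $path: $!";
    my @lines;
    while (my $line = <$in>) {
        $line =~ s/^\x{FEFF}// if $. == 1;   # the UTF-8 layer leaves the BOM in
        $line =~ s/\r?\n\z//;                # strip CRLF or LF by hand
        push @lines, $line;
    }
    close $in;
    return @lines;
}

# Usage: perl thisscript.pl thefile
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for read_lines($ARGV[0]);
}
```

If the sub reports no BOM, you are back to guessing, and the `more` trick above is as good a starting point as any.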


In Section Seekers of Perl Wisdom