Re: Reading CSV Files Containing UTF8 Characters

Re^2: Reading CSV Files Containing UTF8 Characters by shoness (Friar) on Nov 08, 2007 at 20:49 UTC
That was easy enough. Thanks! Of course it still doesn't work, because some characters don't appear to be UTF-8. I wrote this code to try to figure out what the encoding is, but for any file with the "special" characters I just get the "Didn't work" message.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode::Guess;

    undef $/;    # slurp mode: read each file in one go

    # Directory to scan: first argument, or the current directory
    my $dir = '.';
    if (@ARGV > 0) {
        $dir = $ARGV[0];
    }

    opendir DIR, $dir or die "Can't opendir '$dir': $!\n";
    my @files = grep /\.csv$/i, readdir(DIR);
    closedir DIR;

    Encode::Guess->add_suspects(qw(latin1 cp1252));    # What else?

    foreach my $file (@files) {
        # Read raw bytes so no decoding happens before the guess
        open my $fh, "<:raw", "$dir/$file" or die "Can't open '$dir/$file': $!\n";
        my $data = <$fh>;
        close $fh;

        # guess_encoding returns an object on success, otherwise a plain string
        my $enc = guess_encoding($data);
        if (ref $enc) {
            print "$file: " . $enc->name . "\n";
        }
        else {
            print "Didn't work for: $file\n";
        }
    }
    exit;

This file was generated on Windows by exporting from Outlook. Windows is set up for American English, but the keyboard is Danish. :-/

All the files that DO work are reported with "ascii" encoding (as I expect).
Re^3: Reading CSV Files Containing UTF8 Characters by graff (Chancellor) on Nov 09, 2007 at 04:31 UTC
Hmm... I thought Outlook was for something like email, so I wonder about the circumstance where it is used to "export" a csv file. If someone emailed you a csv file as an attachment, you would have to hope that the sender can enlighten you as to the character encoding they used.

If you can't get that from them, you would have to use Encode::Guess with more possibilities besides cp1252 and "latin1". (Alas, guessing is relatively unreliable when it comes to picking the "right" encoding among the various single-byte-latin alternatives.)

Or you'll have to inspect the data file yourself to see if you can deduce what the encoding is. Any decent hex-dump tool would suffice (to see what the byte values are for the non-ascii characters), along with knowledge of the language being used in the text, and some reference info from http://www.unicode.org/Public/MAPPINGS/ (it's an ftp-able directory of mapping tables that relate all the various non-unicode character sets to unicode).

My inclination would be: download those unicode mapping tables into a single directory, look at a hex-dump of your csv file to see which non-ascii byte values to look up, figure out what letter each byte value represents, and grep over the mapping tables to find the line that relates that byte value to that letter. The name of the mapping table containing that line represents the character encoding you need to use when opening the csv file.
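As a rough sketch of that inspection step, the script below lists every non-ascii byte value in a file and shows which character each one becomes under a few candidate encodings. It uses Encode's own tables rather than grepping the downloaded mapping files, which is a substitution on my part. The sub names (non_ascii_byte_counts, decode_byte, report) are made up for illustration, and the candidate list (cp1252, iso-8859-1, cp850, cp437) is only a guess at what a Danish Windows machine might produce.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Count each non-ascii byte value (0x80-0xFF) found in a raw byte string.
sub non_ascii_byte_counts {
    my ($data) = @_;
    my %count;
    $count{ ord $1 }++ while $data =~ /([\x80-\xFF])/g;
    return %count;
}

# Decode one byte under a candidate encoding and return the resulting
# Unicode code point as a number.
sub decode_byte {
    my ($byte, $encoding) = @_;
    my $octets = chr $byte;
    return ord decode($encoding, $octets);
}

# Print, for every non-ascii byte in the file, what each candidate
# encoding would turn it into.
sub report {
    my ($path, @candidates) = @_;
    open my $fh, '<:raw', $path or die "Can't open '$path': $!\n";
    my $data = do { local $/; <$fh> };    # slurp the whole file as bytes
    close $fh;

    my %count = non_ascii_byte_counts($data);
    for my $byte (sort { $a <=> $b } keys %count) {
        printf "byte 0x%02X (seen %d times)\n", $byte, $count{$byte};
        for my $enc (@candidates) {
            my $cp   = decode_byte($byte, $enc);
            my $name = charnames::viacode($cp) // '(no name)';
            printf "    %-12s U+%04X %s\n", $enc, $cp, $name;
        }
    }
}

# Candidate encodings are an assumption; extend the list as needed.
report($_, qw(cp1252 iso-8859-1 cp850 cp437)) for @ARGV;
```

Run it with one or more csv file names as arguments. Knowing the text is Danish, the candidate whose column shows the expected letters (such as æ, ø and å) in sensible places is the one to try when opening the file.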

