Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re: Reading CSV Files Containing UTF8 Characters

by Anonymous Monk
on Nov 08, 2007 at 15:38 UTC ( [id://649733]=note: print w/replies, xml ) Need Help??


in reply to Reading CSV Files Containing UTF8 Characters

open_mode => "< :encoding(UTF8)"
  • Comment on Re: Reading CSV Files Containing UTF8 Characters

Replies are listed 'Best First'.
Re^2: Reading CSV Files Containing UTF8 Characters
by shoness (Friar) on Nov 08, 2007 at 20:49 UTC
    That was easy enough. Thanks! Of course it still doesn't work because some characters don't appear to be UTF-8.

    I wrote this code to try to figure out what encoding it is, but for any file with the "special" characters I just get the "Didn't work" message.

    #!/usr/bin/perl use strict; use warnings; use Encode::Guess; undef $/; # slurp on my $dir = '.'; if (@ARGV > 0) { $dir = $ARGV[0]; } opendir DIR, $dir or die "Can't opendir '.': $!\n"; my @files = grep /\.csv$/i, readdir(DIR); closedir DIR; Encode::Guess->add_suspects(qw(latin1 cp1252)); # What else? foreach my $file (@files) { open my $fh, "<:raw", "$dir/$file" or die "Can't open $!\n"; my $data = <$fh>; close $fh; my $enc = guess_encoding($data); if (ref $enc) { print "$file: " . $enc->name . "\n"; } else { print "Didn't work for: $file\n"; } } exit;
    This file was generated on Windows by exporting from Outlook. The Windows is setup for American English, but the keyboard is Danish. :-/ All the files that DO work are reported with "ascii" encoding (as I expect).
      Hmm... I thought Outlook was for something like email, so I wonder about the circumstance where it is used to "export" a csv file. If someone emailed you a csv file as an attachment, you would have to hope that the sender can enlighten you as to the character encoding they used. If you can't get that from them, you would have to use Encode::Guess with more possibilities besides cp1252 and "latin1". (Alas, guessing is relatively unreliable when it comes to picking the "right" encoding among the various single-byte-latin alternatives.)

      Or you'll have to inspect the data file yourself to see if you can deduce what the encoding is. Any decent hex-dump tool would suffice (to see what the byte values are for the non-ascii characters), along with knowledge of the language being used in the text, and some reference info from http://www.unicode.org/Public/MAPPINGS/ (it's an ftp-able directory of mapping tables that relate all the various non-unicode character sets to unicode).

      My inclination would be: download those unicode mapping tables into a single directory, look at a hex-dump of your csv file to see which non-ascii byte values to look up, figure out what letter each byte value represents, and grep over the mapping tables to find the line that relates that byte value to that letter.

      The name of the mapping table containing that line represents the character encoding you need to use when opening the csv file.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://649733]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (7)
As of 2024-04-24 03:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found