Reading CSV Files Containing UTF8 Characters

by shoness (Friar)
on Nov 08, 2007 at 15:28 UTC (#649732)

shoness has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a set of CSV files that contain a scattering of UTF-8 characters. I'm using Tie::Handle::CSV to read the data in for processing, with Perl 5.8.8 on Windows.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use Tie::Handle::CSV;

    # Directory to scan: first argument, or the current directory.
    my $dir = '.';
    if (@ARGV > 0) {
        $dir = $ARGV[0];
    }

    opendir DIR, $dir or die "Can't opendir '$dir': $!\n";
    my @files = grep /\.csv$/i, readdir(DIR);
    closedir DIR;

    foreach my $file (@files) {
        my $csv = Tie::Handle::CSV->new(file => "$dir/$file");
        while (my $line = <$csv>) {
            # do nothing, just loop...
        }
    }
This will spin until hitting a line with an "extended" character, whereupon the CSV.pm will echo the offending line to STDERR and the program will die.

The filehandle is opened within the module, so I can't get at it with binmode(). What am I to do?

Will all this go away if I just move all these files up to Unix, where I have Perl 5.8.5 or thereabouts?

As always, your help is appreciated!

Replies are listed 'Best First'.
Re: Reading CSV Files Containing UTF8 Characters
by Anonymous Monk on Nov 08, 2007 at 15:38 UTC
    open_mode => "< :encoding(UTF8)"
      That was easy enough. Thanks! Of course it still doesn't work because some characters don't appear to be UTF-8.

      I wrote this code to try to figure out what encoding it is, but for any file with the "special" characters I just get the "Didn't work" message.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Encode::Guess;

      undef $/;    # slurp on

      # Directory to scan: first argument, or the current directory.
      my $dir = '.';
      if (@ARGV > 0) {
          $dir = $ARGV[0];
      }

      opendir DIR, $dir or die "Can't opendir '$dir': $!\n";
      my @files = grep /\.csv$/i, readdir(DIR);
      closedir DIR;

      Encode::Guess->add_suspects(qw(latin1 cp1252));    # What else?

      foreach my $file (@files) {
          # Read the raw bytes so that no decoding layer gets in the way.
          open my $fh, "<:raw", "$dir/$file" or die "Can't open '$dir/$file': $!\n";
          my $data = <$fh>;
          close $fh;

          # guess_encoding returns an object on success, a plain string otherwise.
          my $enc = guess_encoding($data);
          if (ref $enc) {
              print "$file: " . $enc->name . "\n";
          }
          else {
              print "Didn't work for: $file\n";
          }
      }
      exit;
      This file was generated on Windows by exporting from Outlook. Windows is set up for American English, but the keyboard is Danish. :-/ All the files that DO work are reported with "ascii" encoding (as I expect).
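
A minimal sketch, using only the core Encode and Encode::Guess modules, of two checks that might narrow this down. The helper name is_valid_utf8 and the sample byte strings are made up for illustration. The first check does a strict UTF-8 decode to confirm whether the bytes are UTF-8 at all. The second prints what guess_encoding returns when it fails: a failed guess is a plain string explaining why, not an object, so printing it is usually more informative than a fixed "Didn't work" message.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;    # exports guess_encoding

# Return 1 if the byte string is well-formed UTF-8, 0 otherwise.
# (Hypothetical helper, not part of any module.)
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input. Decode a copy,
    # because a CHECK argument may modify the source string in place.
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Made-up sample data: the letter "ae" as UTF-8 (two bytes) and as one
# single-byte-encoded byte (0xE6).
my %samples = (
    'two-byte form'    => "bl\xC3\xA6k",
    'single-byte form' => "bl\xE6k",
);

for my $name (sort keys %samples) {
    my $bytes = $samples{$name};
    printf "%-17s %s\n", $name,
        is_valid_utf8($bytes) ? 'valid UTF-8' : 'NOT valid UTF-8';

    # Suspects passed here apply to this call only.
    my $guess = guess_encoding($bytes, qw(latin1 cp1252));
    print '  guess: ', (ref $guess ? $guess->name : "failed ($guess)"), "\n";
}
```

For a real file, the bytes would come from a handle opened with "<:raw" and slurped, as in the script above.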
        Hmm... I thought Outlook was for something like email, so I wonder about the circumstance where it is used to "export" a csv file. If someone emailed you a csv file as an attachment, you would have to hope that the sender can enlighten you as to the character encoding they used. If you can't get that from them, you would have to use Encode::Guess with more possibilities besides cp1252 and "latin1". (Alas, guessing is relatively unreliable when it comes to picking the "right" encoding among the various single-byte-latin alternatives.)
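
To illustrate why guessing among the single-byte encodings is shaky, here is a small sketch using the core Encode module. The byte 0xE6 decodes to the same letter under both ISO-8859-1 and cp1252, so the data alone cannot tell the two apart. The helper decode_with_fallback is hypothetical: it tries strict UTF-8 first and otherwise falls back to an encoding that you choose, which is an assumption on your part rather than a detection.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Decode as strict UTF-8 when possible; otherwise decode with the given
# single-byte fallback. The cp1252 default is an assumption, not a guess
# derived from the data. Returns (text, name of the encoding used).
sub decode_with_fallback {
    my ($bytes, $fallback) = @_;
    $fallback = 'cp1252' unless defined $fallback;

    my $copy = $bytes;    # decode() with a CHECK argument may alter its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return ($text, 'UTF-8') if defined $text;
    return (decode($fallback, $bytes), $fallback);
}

# The same byte is a valid letter in both encodings.
for my $enc (qw(iso-8859-1 cp1252)) {
    printf "%-10s 0xE6 -> U+%04X\n", $enc, ord(decode($enc, "\xE6"));
}

# Made-up sample: a lone 0xE6 is not valid UTF-8, so the fallback is used.
my ($text, $used) = decode_with_fallback("bl\xE6k");
printf "decoded %d characters using %s\n", length($text), $used;
```

Pure ASCII data passes the strict UTF-8 test, so such files are reported as UTF-8 here, which is harmless because ASCII is a subset of it.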

        Or you'll have to inspect the data file yourself to see if you can deduce what the encoding is. Any decent hex-dump tool would suffice (to see what the byte values are for the non-ascii characters), along with knowledge of the language being used in the text, and some reference info from http://www.unicode.org/Public/MAPPINGS/ (it's an ftp-able directory of mapping tables that relate all the various non-unicode character sets to unicode).

        My inclination would be: download those unicode mapping tables into a single directory, look at a hex-dump of your csv file to see which non-ascii byte values to look up, figure out what letter each byte value represents, and grep over the mapping tables to find the line that relates that byte value to that letter.

        The name of the mapping table containing that line represents the character encoding you need to use when opening the csv file.
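
If no hex-dump tool is handy, the byte inventory can be sketched in Perl itself. This is an illustration, not the procedure above verbatim: instead of grepping the downloaded mapping tables, it asks the core Encode module how each non-ASCII byte reads under a few candidate encodings. The candidate list, the helper names, and the sample data are all assumptions, so extend or replace them as needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Candidate encodings to compare (an assumed list, not a complete one).
my @candidates = qw(cp1252 iso-8859-1 cp850 cp437);

# Count every byte >= 0x80 in a raw byte string.
# Returns a hash ref mapping byte value to number of occurrences.
sub high_byte_counts {
    my ($bytes) = @_;
    my %count;
    $count{ ord $1 }++ while $bytes =~ /([\x80-\xFF])/g;
    return \%count;
}

# Print each non-ASCII byte, how often it occurs, and the Unicode code
# point it maps to under each candidate encoding.
sub report {
    my ($bytes) = @_;
    my $count = high_byte_counts($bytes);
    for my $byte (sort { $a <=> $b } keys %$count) {
        my @readings = map {
            sprintf '%s=U+%04X', $_, ord(decode($_, chr $byte))
        } @candidates;
        printf "0x%02X x%d  %s\n", $byte, $count->{$byte}, join('  ', @readings);
    }
}

# Made-up sample row; real data would be slurped from a "<:raw" handle.
report("S\xF8ren,K\xF8benhavn\r\n");
```

Look up the printed code points against the letters you expect in the text: the candidate that yields the right letters for every byte is the one to use in the open mode.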
