http://qs321.pair.com?node_id=718823

andyford has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse some CSV data from a MS SQL dump, but it seems to contain some stuff that I can't see. I tried Text::CSV_XS first, but the parser can't even see the lines apparently. The following just gives no results at all.

use strict; use warnings; use Text::CSV_XS; use Data::Dumper; my $tr_csv = Text::CSV_XS->new({ binary => 1, eol => $/ }); open(my $tr,"<",'file.csv') or die "Failure opening data file: $!"; while (my $row = $tr_csv->getline($tr)) { print $tr_csv->status(); print @$row; print Data::Dumper->Dump($row); }

I tried to step back and just split the lines like so

use strict; use warnings; open my $tr,"<",'file.csv' or die "Failure opening data file: $!"; while (<$tr>) { my ($x,$y,$z) = split /,/; print "x: $x\n"; }
The result there is that a couple unexpected characters (a y-umlaut and a thing that looks kinda like a "p") show up at the beginning of just the first line, even though the data has nothing but ASCII in it. Also, the while loop runs one too many times, as though the file has an empty line at the end, but it doesn't.
x: ÿþDARK01DGBBHF1D x: DARK01JDM0HF1D x: 191WA357Z1F811 x: 1952AF2-L3A3567 x:
I tried to view the non-printing stuff with "set list" in vim, but that just shows the normal "$" eol characters, just one at the end of each line.

Update: solution! Brute force: strip the first two chars and remove all the CR's:

use strict; use warnings; use Text::CSV_XS; use File::Copy; my $tr_csv = Text::CSV_XS->new({ binary => 1, eol => $/ }); open my $tmp,'>','tmp.csv' or die "Failure opening temp file: $!"; open my $tr,'<','file.csv' or die "Failure opening data file: $!"; while (<$tr>) { if ($. == 1) { $_ = substr $_,2; } s/\c@//g; print $tmp $_; } close $tmp or die "Failure closing temp file: $!"; close $tr or die "Failure closing data file: $!"; move 'tmp.csv','file.csv'; open $tr,'<','file.csv' or die "Failure opening data file: $!"; while (my $row = <$tr>) { $tr_csv->parse($row); my ($x,$y,$z) = split /,/,$row; print "x: $x\n"; }

Replies are listed 'Best First'.
Re: Parsing MS SQL CSV export with Text::CSV_XS
by Limbic~Region (Chancellor) on Oct 22, 2008 at 19:19 UTC
    andyford,
    I have seen this very problem and I am willing to bet money you are wrong - there are characters at the beginning of the file that aren't not displayed however you are looking at it. Try this to see for sure:
    #!/usr/bin/perl use strict; use warnings; open(my $fh, '<', 'file.csv') or die $!; { local $/ = \1; for (1 .. 10) { # first 10 bytes of the file my $byte = <$fh>; print join("\t", ord($byte), $byte), "\n"; } }

    Cheers - L~R

        andyford,
        If I remember correctly, there was a way to tell the database to not output those two bytes but I can't remember how. I vaguely recall it had something to do with not telling it you were doing CSV but rather text or perhaps it was just changing the extension from .csv to .txt. Unfortunately, the problem was from a customer providing the data and they could never be bothered to do it consistently so I end up writing something that tested the first two bytes and only stripping them if they were ord() > 127.

        Cheers - L~R

        You have an encoding problem

        open my $fh, "<:encoding(utf16)", "file.csv";

        See for prove:

        $ od -t x1 xx.csv 0000000 ff fe 44 00 41 00 52 00 4b 00 0a 00 $ perl -we'open $a, "<:encoding(utf16)", "xx.csv" and print <$a>' | od + -t x1 0000000 44 41 52 4b 0a $
Re: Parsing MS SQL CSV export with Text::CSV_XS
by Tux (Canon) on Oct 22, 2008 at 19:16 UTC

    If you suspect errors, don't use ->status (), but ->error_diag ().

    If you expect weird characters, you can use my new module Data::Peek:

    use Text::CSV_XS; use Data::Peek; my $csv = Text::CSV_XS->new ({ binary => 1 }); $csv->parse (qq{"ab\nc\x{20ac}\x80"}); print DPeek, "\n" for $csv->fields; $csv->parse (qq{"ab\nc\x{20ac}\x80",\r"\n}); $csv->error_diag;

    Will return

    PV("ab\nc\342\202\254\302\200"\0) [UTF8 "ab\nc\x{20ac}\x{80}"] # CSV_XS ERROR: 2031 - EIF - CR char is first char of field, not part +of EOL

    Enjoy, Have FUN! H.Merijn

      You got me thinking. If I take out the getline and put the parse in the loop and add your error_diag(),

      use strict; use warnings; use Text::CSV_XS; my $tr_csv = Text::CSV_XS->new({ binary => 1, eol => $/ }); open(my $tr,"<",'file.csv') or die "Failure opening ERS data file: $!" +; while (my $row = <$tr>) { $tr_csv->parse($row); print $tr_csv->error_diag(),"\n"; }
      I can get some errors.
      2032EIF - CR char inside unquoted, not part of EOL436 2032EIF - CR char inside unquoted, not part of EOL435 2032EIF - CR char inside unquoted, not part of EOL433 2032EIF - CR char inside unquoted, not part of EOL425 2032EIF - CR char inside unquoted, not part of EOL425
      I guess that means it's worse than just having two extra characters at the beginning.

Re: Parsing MS SQL CSV export with Text::CSV_XS
by jfroebe (Parson) on Oct 22, 2008 at 18:50 UTC

    How exactly are you dumping the data in the first place? Are you using bcp? If so, what parameters are you using?

    Jason L. Froebe

    Blog, Tech Blog

Re: Parsing MS SQL CSV export with Text::CSV_XS
by Illuminatus (Curate) on Oct 22, 2008 at 19:06 UTC
    Emacs does a better job of displaying 'odd' characters than vim. You could also use 'od -h -a' to verify what you actually have. Can you import the file using Excel?

      I actually tried Excel and it gave great results, so at least MS can read the MS output, but then Excel told me it could only display the first 65,536 lines (I have ~160,000). Gave me a chuckle anyway.

      The od output looks intriguing:

      0000000 feff 0044 0041 0052 004b 0030 0031 0044 del ~ D nul A nul R nul K nul 0 nul 1 nul D nu +l
      I guess the "del" and "~" are some other interpretation of my y-umlaut and p-thingy?