problems parsing CSV

by helenwoodson (Acolyte)
on Oct 09, 2010 at 13:37 UTC

helenwoodson has asked for the wisdom of the Perl Monks concerning the following question:

I'm rather new to Perl, having only had an introductory course - and now I'm doing some scripts for a friend. It's been, ah, educational.

The input was a couple of files with comma-delimited records, except that some of the fields had embedded commas and were enclosed in double quotes. I'd never heard of CSV, so I reinvented the wheel and got it working. I later discovered Text::CSV and figured out how to install it - that worked on my test data, but it choked on the diacritics when I ran it on the million-plus records for the real deal. I poked around and found Text::Undiacritic and attempted to install it, then poked around some more and figured out how to force the install. Whew.

That took care of the diacritics. Now it's choking on records that look something like this:
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""

I think the problem is in the third field - it's enclosed in double quotes and contains a parenthesized portion which in turn contains a double-quoted string.

Worst case, I could fall back on my reinvention of the wheel, but that would be cheating.

I'd appreciate very much any help.

Replies are listed 'Best First'.
Re: problems parsing CSV
by Corion (Patriarch) on Oct 09, 2010 at 14:10 UTC

    Personally, I would try to clean up the problem at the source first, but I'm well aware that this is not always possible.

    You can identify your stray double quotes, as they are (likely) not preceded or followed by a comma. Also, the number of elements per row should remain constant. This will make it fairly easy to write a cleanup program, or at least a program that flags the invalid lines for manual correction:

    #!perl -w
    my $columncount;
    while (<>) {
        print "Line $. has invalid quotes: $_"
            if (/[^,]"[^,]/);

        # Let's try to fix that line automatically:
        s/([^,])"([^,])/$1""$2/g;

        my @columns = /([^",]+)|"([^"]+|"")"/g;
        $columncount ||= @columns;
        if ($columncount != 0+@columns) {
            print "Line $. has weird commas and needs even more manual intervention.";
        };
    };
Re: problems parsing CSV
by Khen1950fx (Canon) on Oct 09, 2010 at 21:18 UTC
    I used parser-xs.pl from Text::CSV_XS to check the data. I put the CSV in a file on the desktop called "check_this":
    #!perl
    # This script can be used as a base to parse unreliable CSV streams
    # Modify to your own needs
    #
    # (m)'08 [23 Apr 2008] Copyright H.M.Brand 2008-2010

    use strict;
    use warnings;

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new ({
        binary         => 1,
        blank_is_undef => 1,
        eol            => $/,
        });
    my $csa = Text::CSV_XS->new ({
        binary             => 1,
        allow_loose_quotes => 1,
        blank_is_undef     => 1,
        escape_char        => undef,
        });

    my $file = @ARGV ? shift : "/root/Desktop/check_this";
    open my $fh, "<", $file or die "$file: $!\n";

    my %err_eol = map { $_ => 1 } 2010, 2027, 2031, 2032;

    print STDERR "Reading $file with Text::CSV_XS $Text::CSV_XS::VERSION\n";
    while (1) {
        my $row = $csv->getline ($fh);
        unless ($row) {                 # Parsing failed
            # Could be end of file
            $csv->eof and last;

            # Diagnose and show what was wrong
            my @diag = $csv->error_diag;
            print STDERR "$file line $./$diag[2] - $diag[0] - $diag[1]\n";
            my $ep  = $diag[2] - 1;         # diag[2] is 1-based
            my $ein = $csv->error_input;    # The line scanned so far
            my $err = $ein . " ";
            substr $err, $ep + 1, 0, "*";   # Bad character marked between **
            substr $err, $ep,     0, "*";
            ($err = substr $err, $ep - 5, 12) =~ s/ +$//;
            print STDERR "    |$err|\n";

            REPARSE: {
                # Now retry with allowed options
                if ($csa->parse ($ein)) {
                    print STDERR "Accepted in allow mode ...\n";
                    $row = [ $csa->fields ];
                }
                else {      # Still fails
                    my @diag = $csa->error_diag;
                    if (exists $err_eol{$diag[0]}) {    # \r or \n inside field
                        print STDERR "  Extending line with next chunk\n";
                        $ein .= scalar <$fh>;
                        goto REPARSE;
                    }
                    print STDERR "  Also could not parse it in allow mode\n";
                    print STDERR "  $./$diag[2] - $diag[0] - $diag[1]\n";
                    print STDERR "  Line skipped\n";
                    next;
                }
            }
        }

        # Data was fine, print data properly quoted
        $csv->print (*STDOUT, $row);
    }
Re: problems parsing CSV
by Jenda (Abbot) on Oct 09, 2010 at 23:30 UTC

    Others have already commented on the stray-quote problem, so I'll get back to the diacritics: you probably want to enable "binary" data in the fields. See the "binary" attribute in the Text::CSV docs.
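
    A minimal sketch of what that looks like (only the binary attribute is set; everything else is left at the defaults):

    use strict;
    use warnings;
    use Text::CSV;

    # binary => 1 lets Text::CSV accept bytes outside plain ASCII
    # (accented characters and the like) inside quoted fields.
    my $csv = Text::CSV->new({ binary => 1 })
        or die "" . Text::CSV->error_diag();

    while (my $line = <>) {
        if ($csv->parse($line)) {
            my @fields = $csv->fields();
            # ... work with @fields here ...
        }
    }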

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: problems parsing CSV
by Anonymous Monk on Oct 09, 2010 at 13:56 UTC
    Looks like the GIGO principle at work
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Text::CSV;
    use Data::Dump::Streamer;

    Main(@ARGV);
    exit(0);

    sub Main {
        my $csv = Text::CSV->new;
        warn my $str = q!stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""!;
        $csv->parse($str) and Dump( [ $csv->fields ] );
        $csv->combine( qw' stuff more', 'foo (1994 "bar" only)', 1234, 1988, 3.0, "" )
            and warn $csv->string;
    }
    __END__
    stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,"" at temp.pl line 14.
    stuff,more,"foo (1994 ""bar"" only)",1234,1988,3, at temp.pl line 16.
      allow_loose_quotes / allow_loose_escapes seem to fix it:
      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Text::CSV;
      use Data::Dump::Streamer;

      Main(@ARGV);
      exit(0);

      sub Main {
          #~ my $csv = Text::CSV->new ( { auto_diag => 1, } );
          my $csv = Text::CSV->new ( { auto_diag => 1, allow_loose_quotes => 1, allow_loose_escapes => 1, } );
          warn "012345678901234567890123456789\n",
              my $str = q!stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""!;
          $csv->parse($str) and Dump([ $csv->fields ]);
          $csv->combine(qw' stuff more', 'foo (1994 "bar" only)', 1234, 1988, 3.0, "")
              and warn $csv->string;
      }
      __END__
      012345678901234567890123456789
      stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,"" at temp.pl line 15.
      $ARRAY1 = [
                  'stuff',
                  'more',
                  'foo (1994 "bar" only)',
                  1234,
                  1988,
                  '3.0',
                  ''
                ];
      stuff,more,"foo (1994 ""bar"" only)",1234,1988,3, at temp.pl line 18.
Re: problems parsing CSV
by ww (Archbishop) on Oct 09, 2010 at 14:26 UTC
    May we see your version of CSV parsing?

    Writing one is a fairly complex challenge; since you wrote one that apparently worked (within limits), comparing your work to Text::CSV might be enlightening.

      Thanks for all the suggestions. Some of it is over my head, so I will need to meditate on it.

      I'm afraid I wasn't clear. The example I gave was not real data, just an illustration of what the records that aren't being processed correctly look like.

      The input file is over a million records, and new data will be added from time to time. The problem records aren't all identical, but they are formatted like my previous example - similar enough that I was able to strip the offending portion with a substitution, since I'm not even using that part of the data. But that approach is ugly and slow.

      I've simplified the code to show the parsing attempt.

      #!/usr/bin/perl -w
      use strict;
      use Text::Undiacritic qw(undiacritic);
      use Text::CSV;

      my ( $tri, $chem, $year, $lbs, $gms, $rls, $csv, $err, @cols );

      # open the release report file for input
      #$rls = "rls.tst";
      $rls = "../ecodata/releases.txt";
      open( RLS, $rls ) || die "bad open $rls";

      # RLS: TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
      while( <RLS> ) {
          $_ = undiacritic($_);
          s/\(\d{4} and after \"acid aerosols\" only\)//g;
          $csv = Text::CSV->new();
          next if ($. == 1);
          if ($csv->parse($_)) {
              @cols = $csv->fields();
              $tri  = $cols[0];
              $chem = $cols[2];
              $year = $cols[4];
              $lbs  = $cols[5];
              $gms  = $cols[6];
          }
          else {
              $err = $csv->error_input;
              print "Failed to parse line: $err";
          }
      }
      close(RLS);

      Here is a tiny bit of the output before I put in the substitution - I'm sure there is a better way to do this:

      Failed to parse line: 00617BRSTLSTATE,"1394080382029","Sulfuric acid (1994 and after "acid aerosols" only)",7664-93-9,1994,500.0,""
      Failed to parse line: 00617BRSTLSTATE,"1394080382031","Hydrochloric acid (1995 and after "acid aerosols" only)",7647-01-0,1994,2842.0,""

      I hope this is more clear. Thanks so much for your help; now I must go meditate over what you've suggested.

        Removing diacritics from characters alters the data you're parsing. Are you supposed to do that? Probably not. You certainly shouldn't have to modify the data for any reason.

        What character encoding is the text in? ISO 8859-1 (Latin 1)? Windows-1252? UTF-8? If you don't know, I encourage you to find out. You really ought to know.

        I suspect your two problems - malformed CSV records and text that isn't in the ASCII character encoding - can be solved by using Text::CSV::Encoded and also enabling allow_loose_quotes, as others have recommended.
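
        An untested sketch of that combination, assuming the data turns out to be ISO 8859-1 (substitute whatever encoding the file actually uses):

        use strict;
        use warnings;
        use Text::CSV::Encoded;

        # The encodings here are assumptions; set encoding_in to the file's real encoding.
        my $csv = Text::CSV::Encoded->new({
            encoding_in        => 'iso-8859-1',   # encoding of the data on disk
            encoding_out       => 'utf8',         # encoding handed back to Perl
            allow_loose_quotes => 1,              # tolerate the stray embedded quotes
        }) or die "" . Text::CSV::Encoded->error_diag();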

        UPDATE: Using Text::CSV::Encoded may be overkill. Jenda's recommendation to set the binary attribute to true (1) may be all you really need. But I nonetheless still believe you ought to know what character encoding the text is in.

        I refactored your script. Most of the changes I made are ones recommended in Perl Best Practices by Damian Conway. The revised script is untested and intended merely to suggest coding improvements.

        #!/usr/bin/perl

        use strict;
        use warnings;

        use English qw( -no_match_vars );
        use Text::CSV;

        my $release_file = '../ecodata/releases.txt';

        open my $release_fh, '<', $release_file
            or die "Can't open release file $release_file: $OS_ERROR\n";

        my $csv = Text::CSV->new({ allow_loose_quotes => 1, binary => 1, });

        CSV_RECORD:
        while (my $csv_record = <$release_fh>) {
            # Skip header...
            next CSV_RECORD if $INPUT_LINE_NUMBER == 1;

            # SPECIAL CASE: Remove unwanted text...
            $csv_record =~ s/ \(\d{4} and after "acid aerosols" only\)//g;

            if (not $csv->parse($csv_record)) {
                my $csv_error_input = $csv->error_input();
                warn "Can't parse CSV record at line $INPUT_LINE_NUMBER"
                    . " of release file $release_file:\n$csv_error_input\n";
                next CSV_RECORD;
            }

            # TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
            my ($tri, undef, $chem_name, undef, $year, $pounds, $grams)
                = $csv->fields();

            # [Do what you need to do with the data here]
        }

        close $release_fh;

        exit 0;

        I moved the Text::CSV->new() constructor outside the while loop. I moved the lexical variables inside the while loop. I added a space to the beginning of the regular expression pattern matching the string being globally removed from the CSV records.

        I won't inflict the entire script on you, just the pertinent code snippet. The snippet I posted before now looks like this, with the old code commented out:

        while( <RLS> ) {
        #   $_ = undiacritic($_);
        #   s/\(\d{4} and after \"acid aerosols\" only\)//g;
        #   $csv = Text::CSV->new();
            $csv = Text::CSV->new( { allow_loose_quotes => 1,
                                     escape_char        => "\\",
                                     binary             => 1, } )
                or die "" . Text::CSV->error_diag ();
            next if ($. == 1);
            if ($csv->parse($_)) {
                @cols = $csv->fields();
                $tri  = $cols[0];
                $chem = $cols[2];
                $year = $cols[4];
                $lbs  = $cols[5];
                $gms  = $cols[6];
            }
            else {
                $err = $csv->error_input;
                print "Failed to parse line: $err";
            }
        }
        close(RLS);

        allow_loose_quotes did the trick on the embedded quotes - I found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. binary=>1 eliminated the need for undiacritic().

        Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.

        Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:

        if    (!$cols[5])         { $lbs = 0 }
        elsif ($cols[5] eq "0.0") { $lbs = 0 }
        else                      { $lbs = $cols[5] }

        if    (!$cols[6])         { $gms = 0 }
        elsif ($cols[6] eq "0.0") { $gms = 0 }
        else                      { $gms = $cols[6] }

        and then this test

          if( !$lbs && !$gms )

        gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.
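
        A tiny illustration of the surprise, using bare values rather than the real data:

        # The string "0.0" is a true value in Perl; only "", "0", 0, and undef are false.
        my $field = "0.0";
        print "true as a string\n"  if $field;        # prints
        print "zero numerically\n"  if $field == 0;   # prints (numeric comparison)
        print "plain 0.0 is true\n" if 0.0;           # never prints: the number 0.0 is false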

        Thanks so much for your help.

      How's this? And no, I haven't tested it exhaustively nor compared it to Text::CSV.

      #!/usr/bin/perl
      
      use strict;
      use warnings;
      
      local $/;
      my $data = <DATA>;
      
      while( $data =~ m{ (?: \" ( .*? (?: \"\" .*? )*? ) \" | ( [^,]*? ) ) ( \, | \n ) }msxg ){
        my $item = $1;
        if( not defined $item ){
          $item = $2;
        }
        my $nl = $3;
      
        print "\t«$item»\n";
        if( $nl eq "\n" ){
          print '-' x 40, "\n";
        }
      }
      
      __DATA__
      stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
      a,b,c,
      x,y
      "a","foo""bar",test
      a,"foo"bar",test

      My ignorance here is exceeded only by my humility. I'm afraid I don't understand the question.

      If you mean you want to see the code I wrote using Text::CSV and the attributes I set for it, that's in a code snippet I posted a little while ago.

      If that's not what you mean and you'll explain it to me, I'll post whatever it is, if I can figure it out.

Re: problems parsing CSV
by dorward (Curate) on Oct 09, 2010 at 20:00 UTC
Re: problems parsing CSV
by snape (Pilgrim) on Oct 10, 2010 at 00:32 UTC

    I think if you have the records in a consistent format, as you described earlier, then you could use the split function to remove the commas and double quotes and store the fields in an array. Please look at the documentation for split.

    You can then use the array elements accordingly.

    #!/usr/bin/perl
    use strict;
    use warnings;

    open IN, "csv.txt" or die $!;
    while (<IN>) {
        chomp($_);
        my @arr = split(/,|"|\s/, $_);
        print $arr[0], "\t", $arr[1], "\n", $arr[2],  "\n", $arr[3],  "\n", $arr[4],  "\n",
              $arr[5], "\n", $arr[6], "\n", $arr[7],  "\n", $arr[8],  "\n", $arr[9],  "\n",
              $arr[10], "\n", $arr[11], "\n", $arr[12], "\n";
        print @arr;
    }

    Run the program and you will see what split does and how it breaks up each line.

Re: problems parsing CSV
by Marshall (Canon) on Oct 10, 2010 at 09:55 UTC
    The CSV (comma-separated values) format is not as easy as it seems. I did some testing with Excel and also found this article on Wikipedia: CSV format.

    To "escape" a "(double quote) in the CSV format you put another " (double quote) in front of it (or behind it) - who is to say which one is which - but two means one.

    I am not a regex guru, but there are folks here who are. The below shows what needs to be done in this particular case. More general solutions are possible.

    #!/usr/bin/perl -w
    use strict;

    =doc
    stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
    should be:
    stuff,"more","foo (1994 ""bar"" only)",1234,1988,3.0,""
    or could also be:
    stuff,more,foo (1994 "bar" only),1234,1988,3.0,""
    =cut

    my $x = 'stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""';
    $x =~ s/ "/ ""/g;
    $x =~ s/" /"" /g;
    print $x;
    # prints:
    # stuff,"more","foo (1994 ""bar"" only)",1234,1988,3.0,""
Re: problems parsing CSV
by sundialsvc4 (Abbot) on Oct 10, 2010 at 14:18 UTC

    As you well know, there’s a “Catch-22” here.   With over a million records to process, there is no practical way to hand-inspect every one.   Your efforts to make the data “able to be processed” can also, if you are not cautious, enable incorrect data to be processed ... or for data to be processed incorrectly such that “you would never know.”

    (It’s the quintessential “Type-1 / Type-2 Errors” principle from Stats class.   In situations such as these, “incorrectly accepting ‘false’ data” is by far the more damaging, because if the computer itself does not catch the error, no one will.)

    Consider writing a defensive, suspicious process that vets the data first, line-by-line, resolving valid issues and throwing erroneous records out (into a separate “garbage bucket” file).   If any records get thrown-out in this way, stop and scream.   Otherwise, process the records ... using CPAN modules to your best advantage as you are able.
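
    Something along these lines, perhaps (a rough, untested sketch; the file names are only placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;

    # Vetting pass: structurally sound records go on for processing,
    # anything unparseable lands in a separate "garbage bucket" file.
    my $csv = Text::CSV->new({ binary => 1 })
        or die "" . Text::CSV->error_diag();

    open my $in,     '<', 'input.csv'   or die "input.csv: $!";
    open my $bucket, '>', 'garbage.csv' or die "garbage.csv: $!";

    my $rejected = 0;
    while (my $line = <$in>) {
        if ($csv->parse($line)) {
            # record passed the vetting; process it (or pass it along) here
        }
        else {
            print {$bucket} $line;
            $rejected++;
        }
    }
    close $in;
    close $bucket;

    # If any records were thrown out, stop and scream.
    die "$rejected record(s) rejected -- see garbage.csv\n" if $rejected;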

    Like all contributed code, CPAN modules are (to some degree) “designed for the general case,” and sometimes that is a good thing but sometimes it does not work out so well.   As a last resort (so to speak), sometimes CPAN modules work well as “a source of inspiration.”   And that is a legitimate purpose, too.

Re: problems parsing CSV
by snape (Pilgrim) on Oct 10, 2010 at 00:30 UTC

    Update: I apologize for wasting the space. Please refer to the write up below.

    Thanks.
