helenwoodson has asked for the wisdom of the Perl Monks concerning the following question:
I'm rather new to Perl, having only had an introductory course - and now I'm doing some scripts for a friend. It's been, ah, educational.
The input was a couple of files with comma-delimited records, only some of the fields had embedded commas and were enclosed in double quotes. I'd never heard of CSV, so I reinvented the wheel and got it working. I later discovered Text::CSV, figured out how to install it - and that worked on my test data - but it choked on the diacritics when I ran it on the million-plus records for the real deal. I poked around and found Text::Undiacritic and attempted to install it, then poked around some more and figured out how to force the install. Whew.
That took care of the diacritics. Now it's choking on records that look something like this:
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
I think the problem is in the third field - it's enclosed in double quotes and contains a parenthesized portion which in turn contains a double-quoted string.
Worst case, I could fall back on my reinvention of the wheel, but that would be cheating.
I'd appreciate very much any help.
Re: problems parsing CSV
by Corion (Patriarch) on Oct 09, 2010 at 14:10 UTC
Personally, I would try to clean up the problem at the source first, but I'm well aware that this is not always possible.
You can identify your stray double quotes, as they are (likely) not preceded or followed by a comma. Also, the number of elements per row should remain constant. This will make it fairly easy to write a cleanup program, or at least a program that flags the invalid lines for manual correction:
#!perl -w
use strict;

my $columncount;
while (<>) {
    print "Line $. has invalid quotes: $_"
        if (/[^,]"[^,]/);
    # Let's try to fix that line automatically:
    s/([^,])"([^,])/$1""$2/g;
    my @columns = /([^",]+)|"([^"]+|"")"/g;
    $columncount ||= @columns;
    if ($columncount != 0+@columns) {
        print "Line $. has weird commas and needs even more manual intervention.";
    };
};
Re: problems parsing CSV
by Khen1950fx (Canon) on Oct 09, 2010 at 21:18 UTC
#!perl
# This script can be used as a base to parse unreliable CSV streams
# Modify to your own needs
#
# (m)'08 [23 Apr 2008] Copyright H.M.Brand 2008-2010
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary         => 1,
                               blank_is_undef => 1,
                               eol            => $/,
                               });
my $csa = Text::CSV_XS->new ({ binary             => 1,
                               allow_loose_quotes => 1,
                               blank_is_undef     => 1,
                               escape_char        => undef,
                               });

my $file = @ARGV ? shift : "/root/Desktop/check_this";
open my $fh, "<", $file or die "$file: $!\n";

my %err_eol = map { $_ => 1 } 2010, 2027, 2031, 2032;

print STDERR "Reading $file with Text::CSV_XS $Text::CSV_XS::VERSION\n";

while (1) {
    my $row = $csv->getline ($fh);
    unless ($row) {                  # Parsing failed
        # Could be end of file
        $csv->eof and last;

        # Diagnose and show what was wrong
        my @diag = $csv->error_diag;
        print STDERR "$file line $./$diag[2] - $diag[0] - $diag[1]\n";
        my $ep  = $diag[2] - 1;      # diag[2] is 1-based
        my $ein = $csv->error_input; # The line scanned so far
        my $err = $ein . " ";
        substr $err, $ep + 1, 0, "*"; # Bad character marked between **
        substr $err, $ep,     0, "*";
        ($err = substr $err, $ep - 5, 12) =~ s/ +$//;
        print STDERR "    |$err|\n";

        REPARSE: {                   # Now retry with allowed options
            if ($csa->parse ($ein)) {
                print STDERR "Accepted in allow mode ...\n";
                $row = [ $csa->fields ];
            }
            else {                   # Still fails
                my @diag = $csa->error_diag;
                if (exists $err_eol{$diag[0]}) { # \r or \n inside field
                    print STDERR "  Extending line with next chunk\n";
                    $ein .= scalar <$fh>;
                    goto REPARSE;
                }
                print STDERR "  Also could not parse it in allow mode\n";
                print STDERR "  $./$diag[2] - $diag[0] - $diag[1]\n";
                print STDERR "  Line skipped\n";
                next;
            }
        }
    }
    # Data was fine, print data properly quoted
    $csv->print (*STDOUT, $row);
}
Re: problems parsing CSV
by Jenda (Abbot) on Oct 09, 2010 at 23:30 UTC
Others have already commented on the stray-quotes problem, so I'll get back to the diacritics ... you probably want to enable "binary" data in the fields. See the "binary" attribute in the Text::CSV docs.
Jenda
Enoch was right!
Enjoy the last years of Rome.
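A minimal sketch of that suggestion, with a made-up Latin-1 sample line (the binary attribute is the essential part; the data here is just for illustration):

```perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 lets fields carry bytes outside plain ASCII
# (accented characters etc.) without parse() failing.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

# Hypothetical Latin-1 sample; \xe9 is e-acute.
my $line = qq{stuff,"caf\xe9","r\xe9sum\xe9",1234};
$csv->parse($line) or die "parse failed";

my @fields = $csv->fields;
print scalar(@fields), " fields; second is $fields[1]\n";
```

Without binary => 1, the same parse() call fails with a "binary character" error as soon as it hits the accented byte.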
Re: problems parsing CSV
by Anonymous Monk on Oct 09, 2010 at 13:56 UTC
Looks like the GIGO principle at work
#!/usr/bin/perl --
use strict;
use warnings;
use Text::CSV;
use Data::Dump::Streamer;

Main(@ARGV);
exit(0);

sub Main {
    my $csv = Text::CSV->new;
    warn my $str = q!stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""!;
    $csv->parse($str) and Dump( [ $csv->fields ] );
    $csv->combine(
        qw' stuff more',
        'foo (1994 "bar" only)',
        1234, 1988, 3.0, ""
    ) and warn $csv->string;
}
__END__
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,"" at temp.pl line 14.
stuff,more,"foo (1994 ""bar"" only)",1234,1988,3, at temp.pl line 16.
allow_loose_quotes / allow_loose_escapes seems to fix it
#!/usr/bin/perl --
use strict;
use warnings;
use Text::CSV;
use Data::Dump::Streamer;

Main(@ARGV);
exit(0);

sub Main {
    #~ my $csv = Text::CSV->new ( { auto_diag => 1, } );
    my $csv = Text::CSV->new ( { auto_diag => 1, allow_loose_quotes => 1, allow_loose_escapes => 1, } );
    warn "012345678901234567890123456789\n",
        my $str = q!stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""!;
    $csv->parse($str) and Dump([ $csv->fields ]);
    $csv->combine(qw' stuff more', 'foo (1994 "bar" only)', 1234, 1988, 3.0, "") and warn $csv->string;
}
__END__
012345678901234567890123456789
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,"" at temp.pl line 15.
$ARRAY1 = [
    'stuff',
    'more',
    'foo (1994 "bar" only)',
    1234,
    1988,
    '3.0',
    ''
];
stuff,more,"foo (1994 ""bar"" only)",1234,1988,3, at temp.pl line 18.
Re: problems parsing CSV
by ww (Archbishop) on Oct 09, 2010 at 14:26 UTC
May we see your version of CSV?
Writing one is a fairly complex challenge; since you wrote one that apparently worked (within limits), comparing your work to Text::CSV might be enlightening.
Thanks for all the suggestions. Some of it is over my head, so I will need to meditate on it.
I'm afraid I wasn't clear. The example I gave was not a real example, but an example of how the data looks that's not getting processed correctly.
The input file is over a million records, and new data will be added from time to time. The problem input is not all the same but is formatted like my previous example. It is similar enough that I was able to get rid of it with a substitution, as I'm not even using that portion of the data. But it is ugly and slow.
I've simplified the code to show the parsing attempt.
#!/usr/bin/perl -w
use strict;
use Text::Undiacritic qw(undiacritic);
use Text::CSV;

my ( $tri, $chem, $year, $lbs, $gms,
     $rls, $csv, $err, @cols );

# open the release report file for input
#$rls = "rls.tst";
$rls = "../ecodata/releases.txt";
open( RLS, $rls ) || die "bad open $rls";

# RLS: TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
while( <RLS> )
{
    $_ = undiacritic($_);
    s/\(\d{4} and after \"acid aerosols\" only\)//g;
    $csv = Text::CSV->new();
    next if ($. == 1);
    if ($csv->parse($_)) {
        @cols = $csv->fields();
        $tri  = $cols[0];
        $chem = $cols[2];
        $year = $cols[4];
        $lbs  = $cols[5];
        $gms  = $cols[6];
    }
    else {
        $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close(RLS);
Here is a tiny bit of the output before I put in the substitution - I'm sure there is a better way to do this:
Failed to parse line: 00617BRSTLSTATE,"1394080382029","Sulfuric acid (1994 and after "acid aerosols" only)",7664-93-9,1994,500.0,""
Failed to parse line: 00617BRSTLSTATE,"1394080382031","Hydrochloric acid (1995 and after "acid aerosols" only)",7647-01-0,1994,2842.0,""
I hope this is more clear. Thanks so much for your help; now I must go meditate over what you've suggested.
Removing diacritics from characters alters the data you're parsing. Are you supposed to do that? Probably not. You certainly shouldn't have to modify the data for any reason.
What character encoding is the text in? ISO 8859-1 (Latin 1)? Windows-1252? UTF-8? If you don't know, I encourage you to find out. You really ought to know.
I suspect your two problems — malformed CSV records and text that isn't in the ASCII character encoding — can be solved by using Text::CSV::Encoded and also enabling allow_loose_quotes as others have recommended.
UPDATE: Using Text::CSV::Encoded may be overkill. Jenda's recommendation to set the binary attribute to true (1) may be all you really need. But I still believe you ought to know what character encoding the text is in.
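For example, if the file turned out to be Latin-1, you could decode it at the filehandle and keep binary enabled. This sketch builds its own one-line sample file so it is self-contained; in real use you would open the actual data file, and the Latin-1 assumption should be replaced with whatever encoding the data really is:

```perl
use strict;
use warnings;
use Text::CSV;
use File::Temp qw(tempfile);

# Build a one-line sample file so the sketch is self-contained.
my ($tmp, $file) = tempfile();
print {$tmp} qq{00617BRSTL,"Sulfuric acid",7664-93-9,1994,500.0,""\n};
close $tmp;

# Assumption: the data is Latin-1 (ISO 8859-1); swap the
# :encoding() layer once the real encoding is known.
open my $fh, '<:encoding(ISO-8859-1)', $file
    or die "Can't open $file: $!";

my $csv = Text::CSV->new({ binary => 1, allow_loose_quotes => 1 });
my $row = $csv->getline($fh);
close $fh;

print scalar(@$row), " fields, year $row->[3]\n";
```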
#!/usr/bin/perl
use strict;
use warnings;
use English qw( -no_match_vars );
use Text::CSV;

my $release_file = '../ecodata/releases.txt';
open my $release_fh, '<', $release_file
    or die "Can't open release file $release_file: $OS_ERROR\n";

my $csv = Text::CSV->new({
    allow_loose_quotes => 1,
    binary             => 1,
});

CSV_RECORD:
while (my $csv_record = <$release_fh>) {
    # Skip header...
    next CSV_RECORD if $INPUT_LINE_NUMBER == 1;

    # SPECIAL CASE: Remove unwanted text...
    $csv_record =~ s/ \(\d{4} and after "acid aerosols" only\)//g;

    if (not $csv->parse($csv_record)) {
        my $csv_error_input = $csv->error_input();
        warn "Can't parse CSV record at line $INPUT_LINE_NUMBER"
           . " of release file $release_file:\n$csv_error_input\n";
        next CSV_RECORD;
    }

    # TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
    my ($tri, undef, $chem_name, undef, $year, $pounds, $grams)
        = $csv->fields();

    # [Do what you need to do with the data here]
}

close $release_fh;
exit 0;
I moved the Text::CSV->new() constructor outside the while loop. I moved the lexical variables inside the while loop. I added a space to the beginning of the regular expression pattern matching the string being globally removed from the CSV records.
while( <RLS> )
{
    # $_ = undiacritic($_);
    # s/\(\d{4} and after \"acid aerosols\" only\)//g;
    # $csv = Text::CSV->new();
    $csv = Text::CSV->new (
        { allow_loose_quotes => 1,
          escape_char        => "\\",
          binary             => 1,
        }
    ) or die "" . Text::CSV->error_diag ();
    next if ($. == 1);
    if ($csv->parse($_)) {
        @cols = $csv->fields();
        $tri  = $cols[0];
        $chem = $cols[2];
        $year = $cols[4];
        $lbs  = $cols[5];
        $gms  = $cols[6];
    }
    else {
        $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close(RLS);
allow_loose_quotes did the trick on the embedded quotes - I found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. binary=>1 eliminated the need for undiacritic().
Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.
Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:
if    (!$cols[5])         { $lbs = 0 }
elsif ($cols[5] eq "0.0") { $lbs = 0 }
else                      { $lbs = $cols[5] }

if    (!$cols[6])         { $gms = 0 }
elsif ($cols[6] eq "0.0") { $gms = 0 }
else                      { $gms = $cols[6] }
and then this test
if( !$lbs && !$gms )
gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.
Thanks so much for your help.
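One tidier possibility, sketched below: force numeric context by adding 0, which turns the string "0.0" into a plain 0, and guard the empty string first so use warnings stays quiet. The to_number helper is just an illustrative name, not part of any module:

```perl
use strict;
use warnings;

# Fields parsed from the file may be "", "0.0", or a real
# number like "123.4". Adding 0 forces numeric context, so
# "0.0" becomes 0; the length check avoids an "isn't numeric"
# warning on the empty string.
sub to_number {
    my ($field) = @_;
    return 0 unless defined $field && length $field;
    return $field + 0;
}

my @samples = ('', '0.0', '123.4');
print to_number($_), "\n" for @samples;
```

After that, a plain numeric test like `if (!$lbs && !$gms)` behaves as expected, because 0 is false regardless of whether it started life as "", "0", or "0.0".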
#!/usr/bin/perl
use strict;
use warnings;

local $/;
my $data = <DATA>;

while( $data =~ m{ (?: \" ( .*? (?: \"\" .*? )*? ) \" | ( [^,]*? ) ) ( \, | \n ) }msxg ){
    my $item = $1;
    if( not defined $item ){
        $item = $2;
    }
    my $nl = $3;
    print "\t«$item»\n";
    if( $nl eq "\n" ){
        print '-' x 40, "\n";
    }
}
__DATA__
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
a,b,c,
x,y
"a","foo""bar",test
a,"foo"bar",test
My ignorance here is exceeded only by my humility. I'm afraid I don't understand the question.
If you mean you want to see the code I wrote using Text::CSV and the attributes I set for it, that's in a code snippet I posted a little while ago.
If that's not what you mean and you will explain it to me, I'll post whatever it is, if I can figure it out.
Re: problems parsing CSV
by dorward (Curate) on Oct 09, 2010 at 20:00 UTC
Re: problems parsing CSV
by snape (Pilgrim) on Oct 10, 2010 at 00:32 UTC
I think if you have the records in a constant format as you described earlier, then you can use split to strip out the commas and double quotes and store the fields in an array. Please see the documentation for split.
You then can use the array elements accordingly.
#!/usr/bin/perl
use strict;
use warnings;

open IN, "csv.txt" or die $!;
while (<IN>) {
    chomp($_);
    my @arr = split(/,|"|\s/, $_);
    print $arr[0], "\t", $arr[1], "\n", $arr[2], "\n", $arr[3], "\n",
          $arr[4], "\n", $arr[5], "\n", $arr[6], "\n", $arr[7], "\n",
          $arr[8], "\n", $arr[9], "\n", $arr[10], "\n", $arr[11], "\n",
          $arr[12], "\n";
    print @arr;
}
Run the program and you will understand what happens with the use of the split command and how it splits the entire line.
Re: problems parsing CSV
by Marshall (Canon) on Oct 10, 2010 at 09:55 UTC
This CSV (Comma Separated Value) format is not as easy as it seems. I did some testing with Excel and also found this article on Wikipedia: CSV format.
To "escape" a " (double quote) in the CSV format, you put another " (double quote) in front of it (or behind it; who is to say which one is which, but two means one).
I am not a regex guru, but there are folks here who are. The code below shows what needs to be done in this particular case. More general solutions are possible.
#!/usr/bin/perl -w
use strict;
=doc
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
should be:
stuff,"more","foo (1994 ""bar"" only)",1234,1988,3.0,""
or could also be:
stuff,more,foo (1994 "bar" only),1234,1988,3.0,""
=cut
my $x = 'stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""';
$x =~ s/ "/ ""/g;
$x =~ s/" /"" /g;
print $x;
# prints:
# stuff,"more","foo (1994 ""bar"" only)",1234,1988,3.0,""
Re: problems parsing CSV
by sundialsvc4 (Abbot) on Oct 10, 2010 at 14:18 UTC
As you well know, there’s a “Catch-22” here. With over a million records to process, there is no practical way to hand-inspect every one. Your efforts to make the data “able to be processed” can also, if you are not cautious, enable incorrect data to be processed ... or for data to be processed incorrectly such that “you would never know.”
(It’s the quintessential “Type-1 / Type-2 Errors” principle from Stats class. In situations such as these, “incorrectly accepting ‘false’ data” is by far the more damaging, because if the computer itself does not catch the error, no one will.)
Consider writing a defensive, suspicious process that vets the data first, line by line, resolving valid issues and throwing erroneous records out (into a separate “garbage bucket” file). If any records get thrown out in this way, stop and scream. Otherwise, process the records ... using CPAN modules to your best advantage as you are able.
Like all contributed code, CPAN modules are (to some degree) “designed for the general case,” and sometimes that is a good thing but sometimes it does not work out so well. As a last resort (so to speak), sometimes CPAN modules work well as “a source of inspiration.” And that is a legitimate purpose, too.
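A rough sketch of such a vetting pass. The sample lines are made up for illustration; a real run would read the data file line by line, write the rejects to a “garbage bucket” file, and refuse to continue if any were rejected:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });

# Two hypothetical input lines: one clean, one with the
# stray-embedded-quote problem from the original post.
my @lines = (
    q{a,"b",c},
    q{x,"broken "quote" here",z},
);

my (@good, @bad);
for my $line (@lines) {
    if   ($csv->parse($line)) { push @good, $line }
    else                      { push @bad,  $line }  # quarantine these
}

print scalar(@good), " clean, ", scalar(@bad), " rejected\n";
# A real vetting pass would write @bad to a garbage-bucket file
# and die here if @bad is non-empty, so nothing bad slips through.
```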
Re: problems parsing CSV
by snape (Pilgrim) on Oct 10, 2010 at 00:30 UTC