helenwoodson has asked for the wisdom of the Perl Monks concerning the following question:
I'm rather new to Perl, having only had an introductory course - and now I'm doing some scripts for a friend. It's been, ah, educational.
The input was a couple of files with comma-delimited records, only some of the fields had embedded commas and were enclosed in double quotes. I'd never heard of CSV, so I reinvented the wheel and got it working. I later discovered Text::CSV, figured out how to install it - and that worked on my test data - but it choked on the diacritics when I ran it on the million-plus records for the real deal. I poked around and found Text::Undiacritic and attempted to install it, then poked around some more and figured out how to force the install. Whew.
That took care of the diacritics. Now it's choking on records that look something like this:
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
I think the problem is in the third field - it's enclosed in double quotes and contains a parenthesized portion which in turn contains a double-quoted string.
Worst case, I could fall back on my reinvention of the wheel, but that would be cheating.
I'd appreciate very much any help.
Re: problems parsing CSV
by Corion (Patriarch) on Oct 09, 2010 at 14:10 UTC
Personally, I would try to clean up the problem at the source first, but I'm well aware that this is not always possible.
You can identify your stray double quotes, as they are (likely) not preceded or followed by a comma. Also, the number of elements per row should remain constant. This will make it fairly easy to write a cleanup program, or at least a program that flags the invalid lines for manual correction:
#!perl -w
use strict;

my $columncount;
while (<>) {
    print "Line $. has invalid quotes: $_"
        if (/[^,]"[^,]/);
    # Let's try to fix that line automatically:
    s/([^,])"([^,])/$1""$2/g;
    my @columns = /([^",]+)|"([^"]+|"")"/g;
    $columncount ||= @columns;
    if ($columncount != 0+@columns) {
        print "Line $. has weird commas and needs even more manual intervention.";
    };
};
Re: problems parsing CSV
by Khen1950fx (Canon) on Oct 09, 2010 at 21:18 UTC
#!perl
# This script can be used as a base to parse unreliable CSV streams
# Modify to your own needs
#
# (m)'08 [23 Apr 2008] Copyright H.M.Brand 2008-2010
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary         => 1,
                               blank_is_undef => 1,
                               eol            => $/,
                               });
my $csa = Text::CSV_XS->new ({ binary             => 1,
                               allow_loose_quotes => 1,
                               blank_is_undef     => 1,
                               escape_char        => undef,
                               });

my $file = @ARGV ? shift : "/root/Desktop/check_this";
open my $fh, "<", $file or die "$file: $!\n";

my %err_eol = map { $_ => 1 } 2010, 2027, 2031, 2032;

print STDERR "Reading $file with Text::CSV_XS $Text::CSV_XS::VERSION\n";

while (1) {
    my $row = $csv->getline ($fh);
    unless ($row) {                  # Parsing failed
        # Could be end of file
        $csv->eof and last;

        # Diagnose and show what was wrong
        my @diag = $csv->error_diag;
        print STDERR "$file line $./$diag[2] - $diag[0] - $diag[1]\n";
        my $ep  = $diag[2] - 1;      # diag[2] is 1-based
        my $ein = $csv->error_input; # The line scanned so far
        my $err = $ein . " ";
        substr $err, $ep + 1, 0, "*"; # Bad character marked between **
        substr $err, $ep,     0, "*";
        ($err = substr $err, $ep - 5, 12) =~ s/ +$//;
        print STDERR "    |$err|\n";

        REPARSE: {                   # Now retry with allowed options
            if ($csa->parse ($ein)) {
                print STDERR "Accepted in allow mode ...\n";
                $row = [ $csa->fields ];
            }
            else {                   # Still fails
                my @diag = $csa->error_diag;
                if (exists $err_eol{$diag[0]}) { # \r or \n inside field
                    print STDERR "  Extending line with next chunk\n";
                    $ein .= scalar <$fh>;
                    goto REPARSE;
                }
                print STDERR "  Also could not parse it in allow mode\n";
                print STDERR "  $./$diag[2] - $diag[0] - $diag[1]\n";
                print STDERR "  Line skipped\n";
                next;
            }
        }
    }
    # Data was fine, print data properly quoted
    $csv->print (*STDOUT, $row);
}
Re: problems parsing CSV
by Jenda (Abbot) on Oct 09, 2010 at 23:30 UTC
Others have already commented on the stray-quotes problem, so I'll get back to the diacritics ... you probably want to enable "binary" data in the fields. See the "binary" attribute in the Text::CSV docs.
Jenda
Enoch was right!
Enjoy the last years of Rome.
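A minimal sketch of that suggestion, with a made-up Latin-1 sample line (the binary attribute is the essential part; the data here is just for illustration):

```perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 lets fields carry bytes outside plain ASCII
# (accented characters etc.) without parse() failing.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

# Hypothetical Latin-1 sample; \xe9 is e-acute.
my $line = qq{stuff,"caf\xe9","r\xe9sum\xe9",1234};
$csv->parse($line) or die "parse failed";

my @fields = $csv->fields;
print scalar(@fields), " fields; second is $fields[1]\n";
```

Without binary => 1, the same parse() call fails with a "binary character" error as soon as it hits the accented byte.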
Re: problems parsing CSV
by Anonymous Monk on Oct 09, 2010 at 13:56 UTC
Looks like the GIGO principle at work
#!/usr/bin/perl --
use strict;
use warnings;
use Text::CSV;
use Data::Dump::Streamer;

Main(@ARGV);
exit(0);

sub Main {
    my $csv = Text::CSV->new;
    warn my $str = q!stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""!;
    $csv->parse($str) and Dump( [ $csv->fields ] );
    $csv->combine(
        qw' stuff more',
        'foo (1994 "bar" only)',
        1234, 1988, 3.0, ""
    ) and warn $csv->string;
}
__END__
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,"" at temp.pl line 14.
stuff,more,"foo (1994 ""bar"" only)",1234,1988,3, at temp.pl line 16.
allow_loose_quotes / allow_loose_escapes seems to fix it
#!/usr/bin/perl --
use strict;
use warnings;
use Text::CSV;
use Data::Dump::Streamer;

Main(@ARGV);
exit(0);

sub Main {
    #~ my $csv = Text::CSV->new ( { auto_diag => 1, } );
    my $csv = Text::CSV->new ( { auto_diag => 1, allow_loose_quotes => 1, allow_loose_escapes => 1, } );
    warn "012345678901234567890123456789\n",
        my $str = q!stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""!;
    $csv->parse($str) and Dump([ $csv->fields ]);
    $csv->combine(qw' stuff more', 'foo (1994 "bar" only)', 1234, 1988, 3.0, "") and warn $csv->string;
}
__END__
012345678901234567890123456789
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,"" at temp.pl line 15.
$ARRAY1 = [
    'stuff',
    'more',
    'foo (1994 "bar" only)',
    1234,
    1988,
    '3.0',
    ''
];
stuff,more,"foo (1994 ""bar"" only)",1234,1988,3, at temp.pl line 18.
Re: problems parsing CSV
by ww (Archbishop) on Oct 09, 2010 at 14:26 UTC
May we see your version of CSV?
Writing one is a fairly complex challenge; since you wrote one that apparently worked (within limits), comparing your work to Text::CSV might be enlightening.
Thanks for all the suggestions. Some of it is over my head, so I will need to meditate on it.
I'm afraid I wasn't clear. The example I gave was not a real example, but an example of how the data looks that's not getting processed correctly.
The input file is over a million records, and new data will be added from time to time. The problem input is not all the same but is formatted like my previous example. It is similar enough that I was able to get rid of it with a substitution, as I'm not even using that portion of the data. But it is ugly and slow.
I've simplified the code to show the parsing attempt.
#!/usr/bin/perl -w
use strict;
use Text::Undiacritic qw(undiacritic);
use Text::CSV;

my ( $tri, $chem, $year, $lbs, $gms,
     $rls, $csv, $err, @cols );

# open the release report file for input
#$rls = "rls.tst";
$rls = "../ecodata/releases.txt";
open( RLS, $rls ) || die "bad open $rls";

# RLS: TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
while( <RLS> )
{
    $_ = undiacritic($_);
    s/\(\d{4} and after \"acid aerosols\" only\)//g;
    $csv = Text::CSV->new();
    next if ($. == 1);
    if ($csv->parse($_)) {
        @cols = $csv->fields();
        $tri  = $cols[0];
        $chem = $cols[2];
        $year = $cols[4];
        $lbs  = $cols[5];
        $gms  = $cols[6];
    }
    else {
        $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close(RLS);
Here is a tiny bit of the output before I put in the substitution - I'm sure there is a better way to do this:
Failed to parse line: 00617BRSTLSTATE,"1394080382029","Sulfuric acid (1994 and after "acid aerosols" only)",7664-93-9,1994,500.0,""
Failed to parse line: 00617BRSTLSTATE,"1394080382031","Hydrochloric acid (1995 and after "acid aerosols" only)",7647-01-0,1994,2842.0,""
I hope this is more clear. Thanks so much for your help; now I must go meditate over what you've suggested.
Removing diacritics from characters alters the data you're parsing. Are you supposed to do that? Probably not. You certainly shouldn't have to modify the data for any reason.
What character encoding is the text in? ISO 8859-1 (Latin 1)? Windows-1252? UTF-8? If you don't know, I encourage you to find out. You really ought to know.
I suspect your two problems — malformed CSV records and text that isn't in the ASCII character encoding — can be solved by using Text::CSV::Encoded and also enabling allow_loose_quotes as others have recommended.
UPDATE: Using Text::CSV::Encoded may be overkill. Jenda's recommendation to set the binary attribute to true (1) may be all you really need. But I still believe you ought to know what character encoding the text is in.
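For example, if the file turned out to be Latin-1, you could decode it at the filehandle and keep binary enabled. This sketch builds its own one-line sample file so it is self-contained; in real use you would open the actual data file, and the Latin-1 assumption should be replaced with whatever encoding the data really is:

```perl
use strict;
use warnings;
use Text::CSV;
use File::Temp qw(tempfile);

# Build a one-line sample file so the sketch is self-contained.
my ($tmp, $file) = tempfile();
print {$tmp} qq{00617BRSTL,"Sulfuric acid",7664-93-9,1994,500.0,""\n};
close $tmp;

# Assumption: the data is Latin-1 (ISO 8859-1); swap the
# :encoding() layer once the real encoding is known.
open my $fh, '<:encoding(ISO-8859-1)', $file
    or die "Can't open $file: $!";

my $csv = Text::CSV->new({ binary => 1, allow_loose_quotes => 1 });
my $row = $csv->getline($fh);
close $fh;

print scalar(@$row), " fields, year $row->[3]\n";
```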
#!/usr/bin/perl
use strict;
use warnings;
use English qw( -no_match_vars );
use Text::CSV;

my $release_file = '../ecodata/releases.txt';
open my $release_fh, '<', $release_file
    or die "Can't open release file $release_file: $OS_ERROR\n";

my $csv = Text::CSV->new({
    allow_loose_quotes => 1,
    binary             => 1,
});

CSV_RECORD:
while (my $csv_record = <$release_fh>) {
    # Skip header...
    next CSV_RECORD if $INPUT_LINE_NUMBER == 1;

    # SPECIAL CASE: Remove unwanted text...
    $csv_record =~ s/ \(\d{4} and after "acid aerosols" only\)//g;

    if (not $csv->parse($csv_record)) {
        my $csv_error_input = $csv->error_input();
        warn "Can't parse CSV record at line $INPUT_LINE_NUMBER"
           . " of release file $release_file:\n$csv_error_input\n";
        next CSV_RECORD;
    }

    # TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
    my ($tri, undef, $chem_name, undef, $year, $pounds, $grams)
        = $csv->fields();

    # [Do what you need to do with the data here]
}

close $release_fh;
exit 0;
I moved the Text::CSV->new() constructor outside the while loop. I moved the lexical variables inside the while loop. I added a space to the beginning of the regular expression pattern matching the string being globally removed from the CSV records.
while( <RLS> )
{
    # $_ = undiacritic($_);
    # s/\(\d{4} and after \"acid aerosols\" only\)//g;
    # $csv = Text::CSV->new();
    $csv = Text::CSV->new (
        { allow_loose_quotes => 1,
          escape_char        => "\\",
          binary             => 1,
        }
    ) or die "" . Text::CSV->error_diag ();
    next if ($. == 1);
    if ($csv->parse($_)) {
        @cols = $csv->fields();
        $tri  = $cols[0];
        $chem = $cols[2];
        $year = $cols[4];
        $lbs  = $cols[5];
        $gms  = $cols[6];
    }
    else {
        $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close(RLS);
allow_loose_quotes did the trick on the embedded quotes - I found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. binary=>1 eliminated the need for undiacritic().
Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.
Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:
if    (!$cols[5])         { $lbs = 0 }
elsif ($cols[5] eq "0.0") { $lbs = 0 }
else                      { $lbs = $cols[5] }

if    (!$cols[6])         { $gms = 0 }
elsif ($cols[6] eq "0.0") { $gms = 0 }
else                      { $gms = $cols[6] }
and then this test
if( !$lbs && !$gms )
gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.
Thanks so much for your help.
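One tidier possibility, sketched below: force numeric context by adding 0, which turns the string "0.0" into a plain 0, and guard the empty string first so use warnings stays quiet. The to_number helper is just an illustrative name, not part of any module:

```perl
use strict;
use warnings;

# Fields parsed from the file may be "", "0.0", or a real
# number like "123.4". Adding 0 forces numeric context, so
# "0.0" becomes 0; the length check avoids an "isn't numeric"
# warning on the empty string.
sub to_number {
    my ($field) = @_;
    return 0 unless defined $field && length $field;
    return $field + 0;
}

my @samples = ('', '0.0', '123.4');
print to_number($_), "\n" for @samples;
```

After that, a plain numeric test like `if (!$lbs && !$gms)` behaves as expected, because 0 is false regardless of whether it started life as "", "0", or "0.0".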
#!/usr/bin/perl
use strict;
use warnings;

local $/;
my $data = <DATA>;

while( $data =~ m{ (?: \" ( .*? (?: \"\" .*? )*? ) \" | ( [^,]*? ) ) ( \, | \n ) }msxg ){
    my $item = $1;
    if( not defined $item ){
        $item = $2;
    }
    my $nl = $3;
    print "\t«$item»\n";
    if( $nl eq "\n" ){
        print '-' x 40, "\n";
    }
}
__DATA__
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
a,b,c,
x,y
"a","foo""bar",test
a,"foo"bar",test
My ignorance here is exceeded only by my humility. I'm afraid I don't understand the question.
If you mean you want to see the code I wrote using Text::CSV and the attributes I set for it, that's in a code snippet I posted a little while ago.
If that's not what you mean and you will explain it to me, I'll post whatever it is, if I can figure it out.
Re: problems parsing CSV
by dorward (Curate) on Oct 09, 2010 at 20:00 UTC
Re: problems parsing CSV
by snape (Pilgrim) on Oct 10, 2010 at 00:32 UTC
I think if you have the records in a constant format as you described earlier, then you can use split to strip out the commas and double quotes and store the fields in an array. Please see the documentation for split.
You then can use the array elements accordingly.
#!/usr/bin/perl
use strict;
use warnings;

open IN, "csv.txt" or die $!;
while (<IN>) {
    chomp($_);
    my @arr = split(/,|"|\s/, $_);
    print $arr[0], "\t", $arr[1], "\n", $arr[2], "\n", $arr[3], "\n",
          $arr[4], "\n", $arr[5], "\n", $arr[6], "\n", $arr[7], "\n",
          $arr[8], "\n", $arr[9], "\n", $arr[10], "\n", $arr[11], "\n",
          $arr[12], "\n";
    print @arr;
}
Run the program and you will understand what happens with the use of the split command and how it splits the entire line.
Re: problems parsing CSV
by Marshall (Canon) on Oct 10, 2010 at 09:55 UTC
This CSV (Comma Separated Value) format is not as easy as it seems. I did some testing with Excel and also found this article on Wikipedia: CSV format.
To "escape" a " (double quote) in the CSV format, you put another " (double quote) in front of it (or behind it; who is to say which one is which, but two means one).
I am not a regex guru, but there are folks here who are. The code below shows what needs to be done in this particular case. More general solutions are possible.
#!/usr/bin/perl -w
use strict;
=doc
stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
should be:
stuff,"more","foo (1994 ""bar"" only)",1234,1988,3.0,""
or could also be:
stuff,more,foo (1994 "bar" only),1234,1988,3.0,""
=cut
my $x = 'stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""';
$x =~ s/ "/ ""/g;
$x =~ s/" /"" /g;
print $x;
# prints:
# stuff,"more","foo (1994 ""bar"" only)",1234,1988,3.0,""
Re: problems parsing CSV
by sundialsvc4 (Abbot) on Oct 10, 2010 at 14:18 UTC
As you well know, there’s a “Catch-22” here. With over a million records to process, there is no practical way to hand-inspect every one. Your efforts to make the data “able to be processed” can also, if you are not cautious, enable incorrect data to be processed ... or for data to be processed incorrectly such that “you would never know.”
(It’s the quintessential “Type-1 / Type-2 Errors” principle from Stats class. In situations such as these, “incorrectly accepting ‘false’ data” is by far the more damaging, because if the computer itself does not catch the error, no one will.)
Consider writing a defensive, suspicious process that vets the data first, line by line, resolving valid issues and throwing erroneous records out (into a separate “garbage bucket” file). If any records get thrown out in this way, stop and scream. Otherwise, process the records ... using CPAN modules to your best advantage as you are able.
Like all contributed code, CPAN modules are (to some degree) “designed for the general case,” and sometimes that is a good thing but sometimes it does not work out so well. As a last resort (so to speak), sometimes CPAN modules work well as “a source of inspiration.” And that is a legitimate purpose, too.
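A rough sketch of such a vetting pass. The sample lines are made up for illustration; a real run would read the data file line by line, write the rejects to a “garbage bucket” file, and refuse to continue if any were rejected:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });

# Two hypothetical input lines: one clean, one with the
# stray-embedded-quote problem from the original post.
my @lines = (
    q{a,"b",c},
    q{x,"broken "quote" here",z},
);

my (@good, @bad);
for my $line (@lines) {
    if   ($csv->parse($line)) { push @good, $line }
    else                      { push @bad,  $line }  # quarantine these
}

print scalar(@good), " clean, ", scalar(@bad), " rejected\n";
# A real vetting pass would write @bad to a garbage-bucket file
# and die here if @bad is non-empty, so nothing bad slips through.
```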
Re: problems parsing CSV
by snape (Pilgrim) on Oct 10, 2010 at 00:30 UTC