mhooper has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to use Text::CSV to parse a csv file and one of the fields occasionally contains strings within the field that are quoted, when encountered the parser errors out.
What I'm trying to parse that errors looks like the following
Field1,""Quoted String" rest of the data,
I'm using sep_char of , but nothing else. Is there a way to get this module to parse this line correctly?
thanks
Re: Multiple double quotes within csv
by Marshall (Canon) on May 12, 2017 at 23:26 UTC
|
I agree with Laurent_R , expanding the thought more...
This: Field1, ""Quoted String" rest of the data, presumably Field3 is here, Field4 looks odd to me. If there are unbalanced quotes in Field2 like you are showing, that could definitely cause problems. What do you think that the two sequential " characters at the beginning of Field2 mean?
The best would be if you showed a short piece of code that demo'es the exactly your problem and in this case, please specify exactly what the expected output should be. As a suggestion, I would also recommend that you import your CSV into Excel and see what it does with it. You can also use Excel as a "CSV Reference Implementation", enter in a row of data and then see what Excel generates. I have never worked for MS and I am not an Excel "fan". But I have never seen CSV-Text fail to parse something that Excel generated.
At this point, I am not sure whether we are dealing with an improper CSV format or a CSV-Text error or whatever. I have used this CSV module and gotten good results with it. The CSV format is devilishly complicated when weirdo cases are considered. It is best if we can work with a verbatim example, I would put it within <code>..</code> blocks to be sure that everybody is talking about exactly the same thing.
Update: See below post with test case from the OP and others from me. This looks like invalid CSV. With more examples of these invalid lines, I suppose an ad-hoc algorithm can be designed to "fix" the CSV before feeding it into Text-CSV. | [reply] [d/l] |
|
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ',' });
while (my $line = <DATA>)
if ($csv->parse($line)) {
my @fields = $csv->fields();
print
print "$fields[0],";
print "$fields[1],";
print "$fields[2]\n";
} else {
warn "Line could not be parsed: $line\n";
}
}
__DATA__
0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
I was hoping to get the output
0,"Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
but also acceptable as
0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
| [reply] [d/l] |
|
use Text::CSV;
my $csv = Text::CSV->new ({
auto_diag => 1,
allow_loose_quotes => 1, # optional, also works without this attr
+ibute
allow_loose_escapes => 1,
});
while (my $row = $csv->getline (*DATA)) {
say for @$row;
}
__END__
0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
will produce
0
"Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052
9
| [reply] [d/l] [select] |
|
|
|
|
|
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ',', quote_char => undef , esc
+ape_char=>undef });
while (my $line = <DATA>){
if ($csv->parse($line)) {
my @fields = $csv->fields();
print "$fields[0],";
print "$fields[1],";
print "$fields[2]\n";
} else {
warn "Line could not be parsed: '$line'\n";
my ($cde, $str, $pos, $rec, $fld) = $csv->error_diag ();
print "DIAG:(CDE=$cde, STR=$str, POS=$pos, REC=$rec, FLD=$fld)\n
+"
}
}
__DATA__
0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
>perl test2.pl
0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
The quotes at the start of Rat Control are problematic, and produce this error on default settings:
DIAG:(CDE=2023, STR=EIQ - QUO character not allowed, POS=4, REC=1, FLD=2)
...Disinformation is not as good as datinformation.
Don't document the program; program the document.
| [reply] [d/l] [select] |
|
|
|
|
Update: I think I'm closer:
RFC-4180, paragraph "If double-quotes are used to enclose fields,
then a double-quote appearing inside a field must be escaped by
preceding it with another double quote."
So:
2,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
is malformed, incorrect CSV, this should be:
2,"""Rat Control"" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
I made some experiments. Here are my (updated) results:
#!/usr/bin/perl
use strict;
use warnings;
$|=1; ## turn off buffering for STDOUT
use Text::CSV_XS qw( csv );
my $csv = Text::CSV_XS->new(); #using the defaults
while (my $line = <DATA>)
{
if ($csv->parse($line)) {
my @fields = $csv->fields();
print join ("|",@fields),"\n";
} else {
warn "Line could not be parsed: $line\n";
}
}
=Prints:
1|Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052|9
2|"Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052|9
3|Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052|9
Line could not be parsed: 4,"Rat Control" <sip:+15559999999@192.168 .5
+.233>;tag=gK004bb052,9
5|123,456|abc
6|Rat|xyz
7|Rat Control|xyz
Line could not be parsed: 8,""Rat Control" <sip:+15559999999@192.168 .
+5.233>;tag=gK004bb052",9
=cut
__DATA__
1,Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052,9
2,"""Rat Control"" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
3,Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052,9
4,"Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052,9
5,"123,456",abc
6,"Rat",xyz
7,"Rat Control",xyz
8,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
I do not understand why Line 4 which starts with unnecessary quotes is not parsed? update: But could be that the double quotes must apply to the whole field and therefore the syntax in line 2 must be used.See Line 5 which has an embedded comma and requires the quotes and is parsed correctly. See Lines 6, 7. I don't think the starting quotes are the issue, it appears that other "special" characters in Field2 are causing the problem. | [reply] [d/l] [select] |
|
binary => 1,
allow_loose_quotes => 1,
blank_is_undef => 1,
escape_char => undef,
| [reply] [d/l] |
Re: Multiple double quotes within csv
by Laurent_R (Canon) on May 12, 2017 at 21:55 UTC
|
This is not very clear to me. Can you please show an example (or several) of the string(s) that fails to be recognized?
| [reply] |
|
|