Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Multiple double quotes within csv

by mhooper (Novice)
on May 12, 2017 at 21:33 UTC ( [id://1190163]=perlquestion: print w/replies, xml ) Need Help??

mhooper has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use Text::CSV to parse a csv file and one of the fields occasionally contains strings within the field that are quoted, when encountered the parser errors out. What I'm trying to parse that errors looks like the following Field1,""Quoted String" rest of the data, I'm using sep_char of , but nothing else. Is there a way to get this module to parse this line correctly? thanks

Replies are listed 'Best First'.
Re: Multiple double quotes within csv
by Marshall (Canon) on May 12, 2017 at 23:26 UTC
    I agree with Laurent_R , expanding the thought more...

    This: Field1, ""Quoted String" rest of the data, presumably Field3 is here, Field4 looks odd to me. If there are unbalanced quotes in Field2 like you are showing, that could definitely cause problems. What do you think that the two sequential " characters at the beginning of Field2 mean?

    The best would be if you showed a short piece of code that demo'es the exactly your problem and in this case, please specify exactly what the expected output should be. As a suggestion, I would also recommend that you import your CSV into Excel and see what it does with it. You can also use Excel as a "CSV Reference Implementation", enter in a row of data and then see what Excel generates. I have never worked for MS and I am not an Excel "fan". But I have never seen CSV-Text fail to parse something that Excel generated.

    At this point, I am not sure whether we are dealing with an improper CSV format or a CSV-Text error or whatever. I have used this CSV module and gotten good results with it. The CSV format is devilishly complicated when weirdo cases are considered. It is best if we can work with a verbatim example, I would put it within <code>..</code> blocks to be sure that everybody is talking about exactly the same thing.

    Update: See below post with test case from the OP and others from me. This looks like invalid CSV. With more examples of these invalid lines, I suppose an ad-hoc algorithm can be designed to "fix" the CSV before feeding it into Text-CSV.

      #!/usr/bin/perl use strict; use warnings; use Text::CSV; my $csv = Text::CSV->new({ sep_char => ',' }); while (my $line = <DATA>) if ($csv->parse($line)) { my @fields = $csv->fields(); print print "$fields[0],"; print "$fields[1],"; print "$fields[2]\n"; } else { warn "Line could not be parsed: $line\n"; } } __DATA__ 0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
      I was hoping to get the output 0,"Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9 but also acceptable as 0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
        use Text::CSV; my $csv = Text::CSV->new ({ auto_diag => 1, allow_loose_quotes => 1, # optional, also works without this attr +ibute allow_loose_escapes => 1, }); while (my $row = $csv->getline (*DATA)) { say for @$row; } __END__ 0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9

        will produce

        0 "Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052 9
        Your code does not compile.

        Here is some Working code, and the output it produces:

        #!/usr/bin/perl use strict; use warnings; use Text::CSV; my $csv = Text::CSV->new({ sep_char => ',', quote_char => undef , esc +ape_char=>undef }); while (my $line = <DATA>){ if ($csv->parse($line)) { my @fields = $csv->fields(); print "$fields[0],"; print "$fields[1],"; print "$fields[2]\n"; } else { warn "Line could not be parsed: '$line'\n"; my ($cde, $str, $pos, $rec, $fld) = $csv->error_diag (); print "DIAG:(CDE=$cde, STR=$str, POS=$pos, REC=$rec, FLD=$fld)\n +" } } __DATA__ 0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
        >perl test2.pl 0,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
        The quotes at the start of Rat Control are problematic, and produce this error on default settings:
        DIAG:(CDE=2023, STR=EIQ - QUO character not allowed, POS=4, REC=1, FLD=2)

                ...Disinformation is not as good as datinformation.               Don't document the program; program the document.

        Update: I think I'm closer:
        RFC-4180, paragraph "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."

        So:

        2,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
        is malformed, incorrect CSV, this should be:
        2,"""Rat Control"" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
        I made some experiments. Here are my (updated) results:
        #!/usr/bin/perl use strict; use warnings; $|=1; ## turn off buffering for STDOUT use Text::CSV_XS qw( csv ); my $csv = Text::CSV_XS->new(); #using the defaults while (my $line = <DATA>) { if ($csv->parse($line)) { my @fields = $csv->fields(); print join ("|",@fields),"\n"; } else { warn "Line could not be parsed: $line\n"; } } =Prints: 1|Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052|9 2|"Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052|9 3|Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052|9 Line could not be parsed: 4,"Rat Control" <sip:+15559999999@192.168 .5 +.233>;tag=gK004bb052,9 5|123,456|abc 6|Rat|xyz 7|Rat Control|xyz Line could not be parsed: 8,""Rat Control" <sip:+15559999999@192.168 . +5.233>;tag=gK004bb052",9 =cut __DATA__ 1,Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052,9 2,"""Rat Control"" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9 3,Rat Control <sip:+15559999999@192.168 .5.233>;tag=gK004bb052,9 4,"Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052,9 5,"123,456",abc 6,"Rat",xyz 7,"Rat Control",xyz 8,""Rat Control" <sip:+15559999999@192.168 .5.233>;tag=gK004bb052",9
        I do not understand why Line 4 which starts with unnecessary quotes is not parsed? update: But could be that the double quotes must apply to the whole field and therefore the syntax in line 2 must be used.See Line 5 which has an embedded comma and requires the quotes and is parsed correctly. See Lines 6, 7. I don't think the starting quotes are the issue, it appears that other "special" characters in Field2 are causing the problem.
        Instead of sep_char try
        binary => 1, allow_loose_quotes => 1, blank_is_undef => 1, escape_char => undef,
Re: Multiple double quotes within csv
by Laurent_R (Canon) on May 12, 2017 at 21:55 UTC
    This is not very clear to me. Can you please show an example (or several) of the string(s) that fails to be recognized?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1190163]
Approved by marto
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2024-04-26 07:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found