http://qs321.pair.com?node_id=270972

cidaris has asked for the wisdom of the Perl Monks concerning the following question:

I hate to post regex questions, but I'm second-guessing myself every time I think I make progress.

The problem: I have a CSV file that Text::CSV is handling very nicely.

Unfortunately, it will balk on data like this:
"crosby","stills","nash","and sometimes "young""
The input can include any number of characters, and unfortunately, I don't have control over the input to tell people "hey, don't use anything but letters or numbers!" Believe you me, I'd love to put some constraints on their input, but it's a proprietary tool, and, well, I could lecture until I was blue in the face and some snot-nosed kid would immediately enter every non-alphanumeric character he could find.
I had tried this, but it's not right:
if ($line =~ m/".?".?".?"/g)

because it will match the "," that I'm trying to delimit with. My next thought was something like this:
if ($line =~ m/"[^\,]?"[^\,]?"[^\,]?"/g)

but that would only work if a comma were a class of character... And probably not even then ;) I guess in pseudo-code, I'm after something like this:
if ($line =~ m/"(ANY # OF NON-COMMAS)"(ANY # OF NON-COMMAS)"(ANY # OF NON-COMMAS)"/g)

Can anyone who has been through this kind of nightmare help me?

Thanks,
cidaris

Replies are listed 'Best First'.
Re: CSV and regex mixups
by Zaxo (Archbishop) on Jul 02, 2003 at 22:26 UTC

    CSV says that embedded quotes should be doubled in the CSV field. To see Text::CSV's notion of that,

    $ perl -MText::CSV -e'$c = Text::CSV->new(); $c->combine(qw/Crosby Stills Nash/, q/and sometimes "Young"/); print $c->string, $/'
    "Crosby","Stills","Nash","and sometimes ""Young"""
    $
    It sounds as if your application is not producing valid CSV to that standard. See the CAVEATS section of the Text::CSV perldoc for the CSV convention the module is written to.
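
To make the doubling convention concrete, here is a minimal sketch that builds a line by hand using only core Perl. The helper name `csv_field` is made up for illustration and is not part of Text::CSV; it assumes every field is wrapped in double quotes and that an embedded quote is escaped by doubling it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::CSV): quote one field by
# doubling every embedded double quote and wrapping the result in quotes.
sub csv_field {
    my ($value) = @_;
    (my $escaped = $value) =~ s/"/""/g;
    return qq{"$escaped"};
}

my @fields = ('Crosby', 'Stills', 'Nash', 'and sometimes "Young"');
my $line   = join ',', map { csv_field($_) } @fields;

print "$line\n";
```

This should print the same line as the one-liner above. Data that arrives with single, undoubled embedded quotes does not follow this convention, which is why the parser rejects it.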

    Check AnyData::Format::CSV if you cannot get your data into Text::CSV's preferred format. It allows you to construct a parser with your choices for 'field_sep', 'quote', 'escape', and 'record_sep'. That may not fix all your problems if the app has a plainly inadequate notion of CSV, but it might work.

    After Compline,
    Zaxo

      Building on what Zaxo said about correct CSV quoting of double-quotes, you might as well filter the CSV-file before you process it with Text::CSV.
      The following regex, using zero-width look-ahead and look-behind assertions, works quite well (with one flaw):

      (?<![,])"(?![,])
      A little test:
      #!/usr/bin/env perl
      my $test = '""crosby"","stills","nash","and sometimes "young""';
      $test =~ s/(?<![,])"(?![,])/""/g;
      print $test, "\n";
      $test = '"som"ething","sil"ly","quo"ted"';
      $test =~ s/(?<![,])"(?![,])/""/g;
      print $test, "\n";
      __END__
      """"crosby""","stills","nash","and sometimes ""young""""
      ""som""ething","sil""ly","quo""ted""
      The flaw, as you might've already seen, is that the look-ahead/behind assertions treat the start and end of the line as 'not comma', and therefore substitute the leading and the trailing double quote too. If your data is otherwise well-formed, all you have to do is add a second filter:
      $test =~ s/^"(.*)"$/$1/g;
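
The two filters can be wrapped together as a rough sketch. The sub name `double_embedded_quotes` is invented for this example, and it assumes one record per line with every field quoted. It shares the limitations of the regex above: a quote that sits directly next to a comma is left alone, and a quote that is already doubled gets doubled again.

```perl
use strict;
use warnings;

# Hypothetical helper combining the two filters from this post.
sub double_embedded_quotes {
    my ($line) = @_;

    # Step 1: double every quote that does not touch a comma.
    $line =~ s/(?<![,])"(?![,])/""/g;

    # Step 2: undo the unwanted doubling of the first and last quote.
    $line =~ s/^"(.*)"$/$1/;

    return $line;
}

my $fixed = double_embedded_quotes('"crosby","stills","nash","and sometimes "young""');
print "$fixed\n";    # "crosby","stills","nash","and sometimes ""young"""
```

The result is in the doubled-quote form that Text::CSV expects, so the line can be handed to the parser afterwards.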

      regards,
      tomte


      Hlade's Law:

      If you have a difficult task, give it to a lazy person --
      they will find an easier way to do it.

Re: CSV and regex mixups
by BrowserUk (Patriarch) on Jul 02, 2003 at 23:35 UTC

    If you're trying to escape the embedded quotes before passing the line to Text::CSV, then this might do the trick.

    Note: I've added some extra embedded quotes to check that it doesn't escape already-escaped quotes and that it handles the edge cases at either end of the string. You'll need to throw a few more tests at this before using it in anger.

    $s = '"crosby","stills","nash",""and" ""sometimes"" "young""';

    # capture everything between a quote and a quote
    # followed by a comma or the end of string
    $s =~ s[ " (.*?) " ( , | $ ) ]{
        # Save the captures first: a successful inner match resets $1 and $2
        my ($t, $sep) = ($1, $2);
        # Look for unescaped embedded quotes and escape them
        $t =~ s[ (?<!") " (?!") ][""]gx;
        # put back the outer quotes and the comma if there was one
        '"' . $t . '"' . $sep;
    }gex;

    print $s;

    # Output:
    # "crosby","stills","nash","""and"" ""sometimes"" ""young"""

    This assumes that $s would otherwise be parsed correctly by Text::CSV.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: CSV and regex mixups
by flounder99 (Friar) on Jul 03, 2003 at 11:57 UTC
    It sounds to me like using Text::CSV might be a waste of time. If you get a working regex why not just use it for everything? Modules are for convenience, if you have to parse a line before handing it to a parser why bother with the parser?
    Just a thought.

    --

    flounder

      I felt kind of bad just giving an opinion and not a useful suggestion so here is my suggestion:
      use strict;
      my $val = '"crosby","stills","nash","and sometimes "young""';
      print join "\n", split /(?<="),(?=")/, $val;
      __OUTPUT__
      "crosby"
      "stills"
      "nash"
      "and sometimes "young""
      This works if you know your fields will always be surrounded by double quotes and there will be no spaces around the comma. The spaces could be taken care of by using the regex /(?<=")\s*,\s*(?=")/.
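
A small sketch of the whitespace-tolerant variant, with the outer quotes stripped afterwards. The sample line with extra spaces is made up for illustration, and the approach still assumes that no field contains the sequence quote, comma, quote.

```perl
use strict;
use warnings;

# Sample input with stray spaces around some of the commas (invented data).
my $val = '"crosby" , "stills","nash", "and sometimes "young""';

# Split on commas that sit between a closing and an opening quote,
# allowing optional whitespace on either side of the comma.
my @fields = split /(?<=")\s*,\s*(?=")/, $val;

# Strip the single outer pair of quotes from each field.
s/^"(.*)"$/$1/ for @fields;

print "$_\n" for @fields;
```

This leaves four fields, the last one being `and sometimes "young"` with its embedded quotes intact.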

      --

      flounder

Re: CSV and regex mixups
by aquarium (Curate) on Jul 02, 2003 at 22:10 UTC
    you do need to strip/convert any commas and quotes from your input or Text::CSV will get it wrong. you should also strip any non-printing chars and semicolons etc., i.e. define a strict set of characters to allow rather than disallowing some. this should be a no-brainer for you if you just consider the input one field at a time, not the entire assembled string.
    btw - in your regex, .? stands for zero or one occurrence of any character...not any number of characters, as i believe you intended. i think you were after .+ to match one or more times.
      Nope, by any number, I meant 0 or more, as several of the fields come back like "bono","the edge","","larry mullen, jr."

      No, I'm not making a music database, just spitting out example values as they occur to me.
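
For reference, a quick sketch of how the three quantifiers discussed here behave on a single quoted field. The sample fields and the `%pattern` and `%result` hashes are just scaffolding for the demonstration.

```perl
use strict;
use warnings;

# One anchored pattern per quantifier, applied to a whole quoted field.
my %pattern = (
    '.?' => qr/^".?"$/,    # zero or one character between the quotes
    '.*' => qr/^".*"$/,    # zero or more characters
    '.+' => qr/^".+"$/,    # one or more characters
);

my %result;
for my $field ('""', '"a"', '"the edge"') {
    for my $q (sort keys %pattern) {
        $result{$q}{$field} = $field =~ $pattern{$q} ? 1 : 0;
        printf "%-12s %-3s %s\n", $field, $q,
            $result{$q}{$field} ? 'match' : 'no match';
    }
}
```

Only `.*` accepts both the empty field and the longer ones, which is the "0 or more" behaviour asked for. The same quantifier works on a character class, so "any number of non-commas" can be written as `[^,]*`.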
Re: CSV and regex mixups
by clscott (Friar) on Jul 03, 2003 at 17:29 UTC