http://qs321.pair.com?node_id=598431

bart has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to use a plain perl regex s/// to fix up the formatting of fields in a CSV file, so that the real parser will no longer choke on it. The fields, separated by semicolons, are formatted like this:

What I'm trying to do is to leave the quoted fields alone, replace the comma in numeric fields with ".", and drop the unquoted question mark.

The basis of what I've been using looks like this (I've added extensive regex comments describing what it does):

s(
    ("[^"]*")        # a quoted field, or standalone part of a field
  |
    (?<![^;])        # start of line or preceded by semicolon = start of field
    ( [\-\d,]+       # characters most likely forming a number
    | ([?]) )        # or a "?"
    (?![^;])         # end of line or followed by semicolon = end of field
)                    # end of regex, start of substitution
{
    $1 or            # replace quoted string by itself = skip
    $3 ? ''          # a bare unquoted '?', delete
       : do {
             (my $number = $2)   # must be a number
                 =~ tr/,/./;     # replace ',' with '.'
             $number             # return value
         }
}xge;
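To see what the substitution above does in practice, here is a minimal sketch that wraps it in a helper and runs it on one record. The helper name fix_fields and the sample record are made up for illustration; they are not taken from the original data.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the post.
sub fix_fields {
    my ($line) = @_;
    $line =~ s(
        ("[^"]*")        # a quoted field: keep as is
      |
        (?<![^;])        # start of field
        ( [\-\d,]+       # number-like characters
        | ([?]) )        # or a bare question mark
        (?![^;])         # end of field
    )
    {
        $1 or $3 ? '' : do { (my $number = $2) =~ tr/,/./; $number }
    }xge;
    return $line;
}

# Made-up sample record, not from the original file.
my $fixed = fix_fields('"a,b";12,5;?;-3,25;"?"');
print "$fixed\n";
```

Quoted fields pass through untouched, including the quoted "?", while the bare ? is dropped and the decimal commas in unquoted numeric fields become dots.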

Now the part that I'm having some trouble with: I'm trying to add support for multiline records, thus containing newlines within quoted strings, but without reading in the whole data file at once. Now I can detect if a quoted string is still open by making the closing quote optional, and checking for its presence. The problem is: how do you continue parsing the same open string, until you find the first semicolon, on the next line?

My idea was that, if the previous line was closed, the pattern should work as above, but if we were in a quoted field at the end, it should behave like:

m(
    ( (?:^|") [^"]* ("?) )
  |
    (?<![^;])
    ( [\-\d,]+ | ([?]) )
    (?![^;])
)x
instead. Now how do you do that? I've tried experimenting with the (?{CODE}) feature, still marked as "highly experimental" after more than 5 years, but I don't quite get it, and I couldn't get it to work properly. Because of its experimental nature (it may be here to stay, but that doesn't mean it has been properly debugged), I'd like to avoid it anyway.

I've also thought about using /"/g to skip any leading remainder of a quoted string, but s///g simply ignores \G.

So... What would you do?

Re: Conditional continued matching with regexes
by dah (Sexton) on Feb 06, 2007 at 01:26 UTC
    I'd keep adding lines until I had an even number of double quotes. Only when I had an entire record would I try to munge it with your regex.

    Something like

    $_ .= <FILE> while s/"/"/ % 2;
    (untested), probably checking for errors too.
      Wah, this is clever... I'd never have thought of that! Well, I will, in the future! ;-) (n.b. you need to add the /g modifier)

      BTW, there's a syntax for tr that just counts occurrences without changing anything. Leave the RHS empty: that behaves like replacing characters by themselves while counting, but is optimized to skip the replacing.

      tr/"//
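A minimal sketch of the counting idiom, assuming a made-up input line; count_quotes is a hypothetical helper name used only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: count double quotes without modifying the string.
# With an empty replacement list, tr/// only counts the matching characters.
sub count_quotes {
    my ($text) = @_;
    return ($text =~ tr/"//);
}

# Made-up line whose quoted field is still open at the end of the line.
my $line = qq{1;"open field\n};
my $quotes = count_quotes($line);
print "quotes: $quotes\n";
print "record is incomplete\n" if $quotes % 2;
```

An odd count means a quoted field is still open, so the next physical line belongs to the same record.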

      You still need to cater for the abnormal case where the file contains an odd number of quotes, thus, check for eof.

      Update: Judging by the tests I've done, the next snippet works well for reading and processing data from <>, and it also processes the final record of each file, even if that record is incomplete:

      while (<>) {
          while (!eof and tr/"// % 2) {
              $_ .= <>;
          }
          # ... do stuff with each record in $_
      }
        Hurrah!
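The record-joining loop from this sub-thread can be tried without a data file. The sketch below is one way to do that, assuming an in-memory filehandle and made-up sample data; read_records is a hypothetical helper, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return complete records from a filehandle, joining
# physical lines for as long as the number of double quotes is odd.
sub read_records {
    my ($fh) = @_;
    my @records;
    while (defined(my $record = <$fh>)) {
        while (!eof($fh) and ($record =~ tr/"//) % 2) {
            $record .= <$fh>;
        }
        push @records, $record;
    }
    return @records;
}

# Made-up data: the first record has a newline inside a quoted field.
my $data = qq{1;"first line\nsecond line";2,5\n?;"plain";7\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;

print scalar(@records), " records\n";
```

Each element of @records is then a complete record that can be handed to the field-fixing substitution. The eof check keeps the loop from running forever if the file ends inside an open quote.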
Re: Conditional continued matching with regexes
by diotalevi (Canon) on Feb 05, 2007 at 22:43 UTC

    Perhaps you meant to look at (??{...}). That takes a bit of perl code, returns a string or qr// object and uses that as the next regexp fragment. (?{...}) just runs some perl code and only (?(...)...|...) uses its return value for anything.
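A small illustration of (??{...}), assuming a reasonably recent Perl (5.18 or later) and made-up test strings: the code block returns a string, and that string is matched as the next piece of the pattern. This only shows the mechanism; it is not a solution to the CSV problem.

```perl
use strict;
use warnings;

# The leading digits decide how many "a" characters must follow:
# the block returns e.g. "a{3}", which is then matched as a regex fragment.
my $counted = qr/^(\d+)(??{ "a{$1}" })$/;

# Made-up sample strings.
for my $sample ('3aaa', '2aaa') {
    my $verdict = $sample =~ $counted ? 'matches' : 'does not match';
    print "$sample $verdict\n";
}
```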

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Conditional continued matching with regexes
by almut (Canon) on Feb 05, 2007 at 23:56 UTC

    Just another idea, which might or might not be applicable in your case.

    I too once had the requirement to allow multiline entries (within one field). The CSV tables were written by Excel, so it was possible to set things up such that newlines within-field were \n Unix-style, while the row records were being separated by \r\n Windows-style newlines. That way, I could easily distinguish them, in order to read in one row at a time from the CSV file.
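If the files really do separate rows with CRLF and use a bare LF inside fields, reading one row at a time comes down to setting the input record separator. A minimal sketch, assuming made-up data in an in-memory filehandle:

```perl
use strict;
use warnings;

# Made-up sample: rows end in CRLF, newlines inside a field are bare LF.
my $data = qq{1;"first line\nsecond line";2,5\r\n?;"plain";7\r\n};

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @rows;
{
    local $/ = "\r\n";      # only CRLF ends a record
    while (my $row = <$fh>) {
        chomp $row;         # chomp strips the current $/, here the CRLF
        push @rows, $row;
    }
}
close $fh;

print scalar(@rows), " rows\n";
```

With a real file on Windows, the handle would probably need binmode (or the :raw layer) so that the CRLF pairs are not translated before $/ sees them.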

    Not sure if you have control over how the CSV files are being generated, but if so, maybe you can arrange for something similar...

    Also, personally, I'd use Text::CSV_XS (or some such) to properly split up fields, and then do my manipulations on them...   OTOH, I don't want to spoil your fun finding a nice regex solution :)

      Not sure if you have control over how the CSV files are being generated, ...
      Just think: if I did have that kind of control, wouldn't you expect that I could take care of these kinds of tweaks over there too, so there wouldn't be any need to postprocess the CSV files at all?
Re: Conditional continued matching with regexes
by exussum0 (Vicar) on Feb 05, 2007 at 22:43 UTC
    What would *I* do? Write a parser. ^^

    No... really.

    Edit: Technically, a regexp is a parser where the language is the regexp, but I'd use something à la Parse::RecDescent. :)

      Wouldn't a Parse::RecDescent solution suffer exactly the same problems?

      Wouldn't you either have to load the entire file and parse it as a single lump; or parse it line by line until the parser failed, append the next line to the failed line and then re-attempt from the beginning?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Yup. But doing all the logic in a regexp feels less comfortable to me. I'm just more versed in parsers/lexers for logic that is more complex. Putting Perl code within regexps is fine, just not my way.

        After all, bart asked me what would *I* do. :)

      What use is a parser, when a lexer will do?

      "When all you have is a hammer, every problem looks like a nail." And a parser is an excellent hammer.

      What I mean is: there's nothing remotely recursive in the definition of the syntax of CSV, and recursion is what calls for a parser.
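To make the lexer point concrete, here is a rough sketch of a \G-anchored tokenizer for one complete record. It assumes semicolon separators and that a doubled quote ("") inside a quoted field stands for a literal quote; lex_record is a hypothetical name, and none of this comes from the original post.

```perl
use strict;
use warnings;

# Hypothetical lexer: split one complete record into its fields.
sub lex_record {
    my ($record) = @_;
    my @fields;
    pos($record) = 0;
    while (1) {
        if ($record =~ /\G"((?:[^"]|"")*)"/gc) {    # quoted field
            (my $text = $1) =~ s/""/"/g;            # assumed escape: "" means "
            push @fields, $text;
        }
        elsif ($record =~ /\G([^;"\n]*)/gc) {       # unquoted field, may be empty
            push @fields, $1;
        }
        last unless $record =~ /\G;/gc;             # consume separator or stop
    }
    return @fields;
}

# Made-up sample record.
my @fields = lex_record('1;"a;b";;"say ""hi"""');
print scalar(@fields), " fields\n";
```

The /gc flags keep pos() where it was after a failed match, so each alternative is tried at the same position. No recursion or backtracking over earlier fields is needed.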

        Hey, you asked what I'd use. I wouldn't use a regexp. Not because it's wrong, but because what you posted, while it may work, is not the easiest thing for me to just mind dump into code.

        Was it your intention to ask what *I* would do and then critique me on it? If that were to happen, I never would have answered at all.