
I am trying to use a plain Perl s/// to fix up the formatting of fields in a CSV file, so that the real parser no longer chokes on it. The fields are separated by semicolons and formatted like this:

Text fields are between double quotes. An internal double quote is doubled.
Numeric fields are unquoted and use a comma as a decimal separator.
Empty fields contain only a question mark, unquoted.

What I'm trying to do is to leave the quoted fields alone, replace the comma in numeric fields with ".", and drop the unquoted question mark.

The basis of what I've been using looks like this. I've added extensive regex comments describing what it does:

    s( ("[^"]*")     # a quoted field, or standalone part of a field
      | (?<![^;])    # start of line or preceded by semicolon = start of field
        ( [\-\d,]+   # characters most likely forming a number
          | ([?]) )  # or a "?"
        (?![^;])     # end of line or followed by semicolon = end of field
    )                # end of regex, start of substitution
    {
        $1 or        # replace quoted string by itself = skip
        $3 ? ''      # a bare unquoted '?', delete
        : do { (my $number = $2)   # must be a number
              =~ tr/,/./;       # replace ',' with '.'
            $number }           # return value
    }xge;
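
For illustration, here is the same substitution wrapped in a sub and run on a made-up record. The name fix_line and the sample data are assumptions for this sketch only, and it assumes the line has already been chomped (a trailing newline would stop the last field from matching):

```perl
use strict;
use warnings;

# fix_line is a hypothetical helper name; it applies the substitution
# shown above to one chomped line and returns the result.
sub fix_line {
    my ($line) = @_;
    $line =~ s( ("[^"]*")     # a quoted field, or standalone part of one
              | (?<![^;])     # start of field
                ( [\-\d,]+    # characters most likely forming a number
                  | ([?]) )   # or a bare "?"
                (?![^;])      # end of field
    )
    {
        $1 or                 # quoted part: keep as is
        $3 ? ''               # bare '?': delete
        : do { (my $number = $2) =~ tr/,/./; $number }
    }xge;
    return $line;
}

# Made-up sample record: quoted text with doubled quotes, two numbers,
# an empty field, and a quoted field containing ';' and ','.
my $sample = q{"He said ""hi""";3,14;?;-2,5;"a;b,c"};
print fix_line($sample), "\n";
# prints: "He said ""hi""";3.14;;-2.5;"a;b,c"
```

Note that the doubled quotes survive only because each half is matched as its own "standalone part" of the field.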

Now the part I'm having trouble with: I'm trying to add support for multiline records, that is, records with newlines inside quoted strings, but without reading in the whole data file at once. I can detect whether a quoted string is still open at the end of a line by making the closing quote optional and checking for its presence. The problem is: how do you continue parsing the same open string, until you find the first semicolon, on the next line?
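
The detection part can be sketched roughly like this. The helper name ends_in_open_quote is hypothetical, and the sketch assumes the line itself starts outside any quoted string; it only reports whether the last quoted string on the line lacks its closing quote, it does not rewrite anything:

```perl
use strict;
use warnings;

# ends_in_open_quote is a hypothetical helper name.
# Assumption: $line begins outside a quoted string.
sub ends_in_open_quote {
    my ($line) = @_;
    my $open = 0;
    # Walk over every quoted part; the closing quote is optional.
    while ($line =~ / " [^"]* ("?) /xg) {
        # An empty $1 means [^"]* ran to the end of the line
        # without finding a closing quote.
        $open = 1 if $1 eq '';
    }
    return $open;
}

print ends_in_open_quote(qq{"abc";1,5;"def\n}) ? "open\n" : "closed\n";
print ends_in_open_quote(qq{"abc";"d""e"\n})   ? "open\n" : "closed\n";
```

This only answers "is a string still open?"; it says nothing about how to resume inside that string on the next line, which is the actual question.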

My idea was that, if the previous line was closed, the pattern should work as above, but if we were in a quoted field at the end, it should behave like:

    m( ( (?:^|") [^"]* ("?) ) | (?<![^;]) ( [\-\d,]+ | ([?]) ) (?![^;]) )x

instead. Now how do you do that? I've tried experimenting with (?{CODE}), which is still marked as "highly experimental" after more than 5 years, but I don't quite get it and couldn't get it to work properly. Because of its experimental nature (it may be here to stay, but that doesn't mean it has been properly debugged), I'd like to avoid it anyway.

I've also thought about using /"/g to skip any leading remainder of a quoted string, but s///g simply ignores \G.

So... What would you do?

In reply to Conditional continued matching with regexes by bart
