PerlMonks  

Regex help for CSV Multiline handling

by emgrasso (Novice)
on Feb 24, 2005 at 21:19 UTC ( [id://434263] : perlquestion )

emgrasso has asked for the wisdom of the Perl Monks concerning the following question:

I have some CSV data to process that includes multiline fields, some of which begin with CRs. In general, I need to be able to identify two or three kinds of lines in the data from my input file:
lines ending in " without a preceding ,
(possibly) lines ending in ","
lines ending in anything else.
My regex skills for dealing with punctuation at ends of lines seem to be a bit rusty. I'd appreciate any suggestions.

Replies are listed 'Best First'.
Re: Regex help for CSV Multiline handling
by dragonchild (Archbishop) on Feb 25, 2005 at 00:20 UTC
    Use Text::xSV - it was designed for dealing with this kind of situation.

Re: Regex help for CSV Multiline handling
by jZed (Prior) on Feb 25, 2005 at 00:46 UTC
    Use Text::CSV_XS with binary => 1; it will handle your CSV, including newlines embedded in quoted fields, much faster and more cleanly than any regex you can come up with.
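
    A minimal sketch of that suggestion, assuming Text::CSV_XS is installed from CPAN. The sample data and the in-memory filehandle are invented for illustration; a real script would open the actual input file instead.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Sample data invented for this sketch: the second field of the first
# record contains an embedded newline inside its quotes.
my $data = qq{1,"first line\nsecond line",done\n2,"plain",ok\n};

# binary => 1 lets quoted fields contain embedded newlines.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create parser: " . Text::CSV_XS->error_diag;

# Read from an in-memory filehandle (an assumption made to keep the
# example self-contained).
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;    # each $row is an array ref of fields
}
$csv->eof or $csv->error_diag;    # report a parse error if we stopped early
close $fh;

printf "%d records; record 1 has %d fields\n",
    scalar @rows, scalar @{ $rows[0] };
```

    Because the parser tracks quoting itself, the multiline field comes back as a single value, and no per-line classification is needed.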

Re: Regex help for CSV Multiline handling
by perlfan (Vicar) on Feb 24, 2005 at 23:00 UTC
    My regex skills are not so polished either, but let me know how I do:
    1: m/^(.*)[^,]"\s*$/
    2: m/^(.*),\s*$/
    3: m/^(.*)$/
    Note: for #1, I am not sure whether by "without a preceding ," you mean no commas anywhere in the line or just no comma immediately before the closing ".

    I suggest not creating one regex to "rule them all"; instead, test each line against the patterns in the order of precedence that you want. For example, #3 matches any line, so it is your catch-all and should be tested last.
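
    Testing in order of precedence could be sketched roughly as follows. The sub name classify_line and the category labels are made up for this example, and the patterns are one reading of the three line types in the question, not a definitive answer. Note that a line-ending test alone cannot tell an escaped quote ("") at the end of a line from a closing quote, so treat this as a rough classifier only.

```perl
use strict;
use warnings;

# Classify one chomped line by how it ends. Tests run in order of
# precedence, and the final return is the catch-all.
# The category names are hypothetical labels for this sketch.
sub classify_line {
    my ($line) = @_;
    # Ends in "," (quote, comma, quote), allowing trailing whitespace or a CR.
    return 'quote_comma_quote' if $line =~ /","\s*$/;
    # Ends in " that is not immediately preceded by a comma.
    return 'closing_quote'     if $line =~ /(?<!,)"\s*$/;
    # Anything else.
    return 'other';
}

# Sample lines invented for illustration.
my @samples = (
    'id,"first line of a note',
    'continues here"',
    '1,"a","',
    'x,"',
);
printf "%-20s %s\n", classify_line($_), $_ for @samples;
```

    The negative lookbehind (?<!,) checks only the character directly before the final quote; if "without a preceding ," means no comma anywhere in the line, that test would need to change.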

    Your question is actually very general, so if you are looking for more specific help, you will need to give more detail.