
Re^2: split string by comma

by Tux (Abbot)
on Jan 11, 2012 at 07:13 UTC (#947304)

in reply to Re: split string by comma
in thread split string by comma

That regular expression is way too possessive. Think about how it would parse:

1,"foo",2,"bar, joy",3,3.14,pi,π
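To make the problem concrete, here is a minimal sketch. The expression from the parent post is not reproduced here, so this assumes the simplest possible approach instead, a bare split on every comma:

```perl
use strict;
use warnings;
use utf8;                                  # the sample line contains a literal π
binmode STDOUT, ':encoding(UTF-8)';

# The sample line from the post.
my $line = '1,"foo",2,"bar, joy",3,3.14,pi,π';

# Naive approach (an assumption for illustration): split on every comma.
my @pieces = split /,/, $line;

# The quoted field "bar, joy" is cut in two and the quotes stay attached,
# so the line yields more pieces than it has fields.
printf "%d pieces\n", scalar @pieces;
print "[$_]\n" for @pieces;
```

The line holds eight fields, but the naive split cannot tell a separator from a comma inside quotes.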

Correct regular expressions have been posted in this thread, but real CSV data has more traps (what about embedded newlines?), and sticking to split or regular expressions will most likely fail eventually. Please seriously consider using Text::CSV_XS or Text::CSV (which will use Text::CSV_XS when installed) and be done with it.
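As a rough sketch of what that looks like in practice, assuming Text::CSV is installed from CPAN, the sample line and a field with an embedded newline can both be parsed with the module's parse and fields methods. The variable names are illustrative only:

```perl
use strict;
use warnings;
use utf8;                                  # the sample line contains a literal π
use Text::CSV;                             # uses Text::CSV_XS when it is installed
binmode STDOUT, ':encoding(UTF-8)';

# binary => 1 allows embedded newlines and non-ASCII characters in fields;
# auto_diag => 1 reports parse errors instead of failing silently.
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

my $line = '1,"foo",2,"bar, joy",3,3.14,pi,π';
$csv->parse($line) or die "Parse failed";
my @fields = $csv->fields;
print scalar(@fields), " fields\n";
print "[$_]\n" for @fields;

# A quoted field may also contain the record separator itself.
my $multiline = qq{a,"line one\nline two",b};
$csv->parse($multiline) or die "Parse failed";
my @multi = $csv->fields;
print scalar(@multi), " fields in the multi-line record\n";
```

When reading from a file, the module's getline method is usually the better fit, since it lets the parser decide where a record ends.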

Another thing seldom considered by US users is that the "." in those "values" is locale dependent. Consider what happens if 3623494.92 is printed as 3,623,494.92, or is printed/exported in a Dutch locale using both the radix separator and the triad separator from that locale: it will export as "3.623.494,92". Oh, the horror in "fixing" all those regular expressions :)
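The effect can be sketched without depending on which locales are installed. The helper format_number below is hypothetical and takes the separators as explicit arguments rather than reading them from the locale; it only handles non-negative numbers with two decimals:

```perl
use strict;
use warnings;

# Hypothetical helper: format a non-negative number with two decimals,
# using the given triad (thousands) and radix (decimal) separators.
sub format_number {
    my ($number, $triad_sep, $radix_sep) = @_;
    my ($int, $frac) = split /\./, sprintf('%.2f', $number);
    # Insert the triad separator before each trailing group of three digits.
    1 while $int =~ s/^(\d+)(\d{3})/$1$triad_sep$2/;
    return $int . $radix_sep . $frac;
}

my $us = format_number(3623494.92, ',', '.');   # US-style separators
my $nl = format_number(3623494.92, '.', ',');   # Dutch-style separators

# One value, but a comma split sees a different number of "fields" each time.
for my $value ($us, $nl) {
    my @pieces = split /,/, $value;
    printf "%-14s -> %d pieces\n", $value, scalar @pieces;
}
```

Neither result survives a comma split as a single value, which is why a formatted number has to be quoted on export and parsed by something that honors the quotes.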

Enjoy, Have FUN! H.Merijn

Replies are listed 'Best First'.
Re^3: split string by comma
by Neighbour (Friar) on Jan 11, 2012 at 08:36 UTC
    In order to avoid failure with embedded newlines (or your other record-separator of choice), I use this:
        my $old_INPUT_RECORD_SEPARATOR = $/;
        $/ = $self->record_delimiter;
        open (DELIMFILE, '<', $filename) or (Carp::confess("Cannot open file [$filename]: $!"));
        my $record;
        while (<DELIMFILE>) {
            chomp;
            $record = $_;
            # If a line contains an odd amount of doublequotes ("), then we'll need to continue reading until we find another line that contains an odd amount of doublequotes.
            # This is in order to catch fields that contain recordseparators (but are encased in ""'s).
            if (grep ($_ eq '"', split ('', $_)) % 2 == 1) {
                # Keep reading data and appending to $record until we find another line with an odd number of doublequotes.
                while (<DELIMFILE>) {
                    $record .= $_;
                    if (grep ($_ eq '"', split ('', $_)) % 2 == 1) { last; }
                }
            } ## end if (grep ($_ eq '"', split...))
            push (@{$ar_returnvalue}, ReadRecord($self, $record));
        } ## end while (<DELIMFILE>)
        close (DELIMFILE);
        $/ = $old_INPUT_RECORD_SEPARATOR;
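The odd-quote heuristic can also be tried in isolation. The following is a minimal self-contained sketch, not the poster's module: read_records is a hypothetical helper, the record separator is assumed to be "\n", and quotes are counted with tr instead of grep and split. It chomps only after joining, on the assumption that the separator embedded in a quoted field should be kept:

```perl
use strict;
use warnings;

# Hypothetical helper: read logical records from a filehandle, appending
# physical lines for as long as the record holds an odd number of quotes.
sub read_records {
    my ($fh) = @_;
    my @records;
    while (defined(my $record = <$fh>)) {
        # tr/"// counts the double quotes without modifying the string.
        while (($record =~ tr/"//) % 2 == 1) {
            my $more = <$fh>;
            last unless defined $more;    # unterminated quote at end of input
            $record .= $more;
        }
        chomp $record;                    # strip only the final separator
        push @records, $record;
    }
    return @records;
}

# Three logical records; the second spans two physical lines.
my $data = qq{1,"foo",2\n3,"bar\njoy",4\n5,6,7\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;

printf "%d records\n", scalar @records;
```

Counting quotes over the whole accumulated record, rather than per line, also covers quoted fields that span more than two lines.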
    And ReadRecord uses a regex to consume the string field by field:
        my $field_value;
        my $delimiter = $self->field_delimiter;
        while ($inputstring) {
            undef $field_value;
            if ($inputstring =~ /^"/) {
                $field_value = $inputstring;
                if ($inputstring =~ /^"(([^"]|"")+)"(?:[$delimiter]|$)/p) {
                    ($field_value, $inputstring) = ($1, ${^POSTMATCH});
                    # Unescape escaped quotes
                    $field_value =~ s/""/"/g;
                } else {
                    Carp::confess("Parsing error with remaining data [$inputstring]");
                }
            } else {
                $field_value = $inputstring;
                if ($inputstring =~ /^([^$delimiter"]*)(?:[$delimiter]|$)/p) {
                    ($field_value, $inputstring) = ($1, ${^POSTMATCH});
                }
            } ## end else [ if ($inputstring =~ /^"/)]
        }
    This conforms to RFC 4180 :)
