PerlMonks  

split string by comma

by Anonymous Monk
on Jan 10, 2012 at 23:05 UTC (#947251=perlquestion)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Suppose I am reading a file line by line, and I print the current line:

print $_fileLine;

and it contains this:

1945,"4,399.00",938,1/10/2012

If I split it like this:

my @_fileparts = split(/\,/, $_fileLine);
then the dollar amount gets split too...

So how do I make it not split on commas that are inside the double quotes?

Thanks,
Richard

Replies are listed 'Best First'.
Re: split string by comma
by GrandFather (Sage) on Jan 10, 2012 at 23:16 UTC

    Looks like a CSV file so use an appropriate module for the task: Text::CSV.

    True laziness is hard work
Re: split string by comma
by davido (Cardinal) on Jan 11, 2012 at 00:04 UTC

    GrandFather is probably correct in assuming that you're dealing with run-of-the-mill CSV, and his recommendation for Text::CSV is canonical in such cases. Text::CSV_XS is another alternative if throughput is an issue.

    Sometimes a picture is worth a thousand words, so I wanted to provide an example of how easy Text::CSV makes it to achieve a robust solution.

    use strict;
    use warnings;
    use Text::CSV;
    use Data::Dumper;

    my @rows = (
        q{1945,"4,399.00",938,1/10/2012},   # Original test case.
        q{1945,4,399.00",938,1/10/2012},    # Missing quote intentional to test
                                            # behavior with malformed CSV.
                                            # Warning expected.
        q{1945,4,399.00,938,1/10/2012},     # A simple case (nothing quoted).
        q{"abc","de,f","ghi",jkl},          # Alpha with mixed quoting/commas.
    );

    my $csv = Text::CSV->new ( { binary => 1 } )
        or die "Cannot use CSV: " . Text::CSV->error_diag;

    my @parsed;

    foreach my $row ( @rows ) {
        $csv->parse( $row ) or do{
        # ^---- Warning results from line above when parsing bad CSV.
            warn "Couldn't parse [$row]: Possibly malformed CSV";
            next;
        };
        push @parsed, [ $csv->fields ];
    }

    print Dumper \@parsed if @parsed;

    Be sure to read the docs for Text::CSV prior to just dropping code from my example into place in your script. It's possible that your specific data set may require additional work such as Text::CSV configuration, data pre-processing, or result restructuring.

    Update: Of course your first step is probably going to be to execute the shell command: "cpan -i Text::CSV". This will pull the module in from CPAN and install it so that it's available for use. This approach works for most Perl installations on Unix/Linux as well as Strawberry Perl on Windows. For ActivePerl you could use the ppm tool to manage your module installation.
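
    When the data lives in a file, a sketch along these lines lets the module read the records itself through getline, which also copes with quoted fields that span more than one line. The helper name parse_csv_handle and the in-memory sample data are assumptions made for illustration; binary and auto_diag are documented Text::CSV options.

    ```perl
    use strict;
    use warnings;
    use Text::CSV;

    # Hypothetical helper: parse every record readable from a filehandle.
    sub parse_csv_handle {
        my ($fh) = @_;
        my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } )
            or die "Cannot use CSV: " . Text::CSV->error_diag;
        my @rows;
        while ( my $fields = $csv->getline($fh) ) {
            push @rows, $fields;    # each row is an array reference of fields
        }
        return \@rows;
    }

    # An in-memory filehandle stands in for a real file here (assumption).
    my $data = qq{1945,"4,399.00",938,1/10/2012\n1946,"two\nlines",7,1/11/2012\n};
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $rows = parse_csv_handle($fh);
    close $fh;

    print scalar(@$rows), " records read\n";
    ```

    With a real file you would open it by name instead of opening a reference to a string; the loop stays the same.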


    Dave

Re: split string by comma
by BrowserUk (Pope) on Jan 10, 2012 at 23:35 UTC

    If your sample text reflects a true indication of your needs -- it often doesn't -- then you could use:

    $s = '1945,"4,399.00",938,1/10/2012';;
    print for $s =~ m[("[^"]+"|[^,]+),]g;;
    1945
    "4,399.00"
    938

    For data that conforms to the original formulation of 'csv' data, rather than the bastardized corruption of that once de-facto standard that is now foisted upon us, this is all you need, and it usually runs several times faster than Text::CSV* modules.
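
    The output above stops at 938 because the pattern insists on a comma after every field, so the last field is never matched. A minimal variant, assuming the same simple data (no empty fields, no escaped quotes, line already chomped), lets a field end at either a comma or the end of the string. The sub name fields_simple is made up for this sketch.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: a field is either a quoted run or a run of
    # non-commas, and it must be followed by a comma or the end of the string.
    sub fields_simple {
        my ($line) = @_;
        return $line =~ m[("[^"]+"|[^,]+)(?:,|$)]g;
    }

    print "$_\n" for fields_simple('1945,"4,399.00",938,1/10/2012');
    ```

    The surrounding quotes stay on the quoted field, as they do in the original one-liner.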


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: split string by comma
by johngg (Canon) on Jan 11, 2012 at 00:16 UTC

    Using a module as GrandFather suggests is probably the safest option. However, just to show another way with this particular data, you could use the third argument to split in order to work in from either end.

    knoppix@Microknoppix:~$ perl -E '
    > $_fileline = q{1945,"4,399.00",938,1/10/2012};
    > ( $_fileparts[ 0 ], $remainder ) = split m{,}, $_fileline, 2;
    > push @_fileparts,
    >     reverse
    >     map scalar reverse,
    >     split m{,}, reverse( $remainder ), 3;
    > say for @_fileparts;'
    1945
    "4,399.00"
    938
    1/10/2012
    knoppix@Microknoppix:~$

    I hope this is of interest.

    Update: Assigning to an array slice avoids the final reverse.

    knoppix@Microknoppix:~$ perl -E '
    > $_fileline = q{1945,"4,399.00",938,1/10/2012};
    > ( $_fileparts[ 0 ], $remainder ) = split m{,}, $_fileline, 2;
    > @_fileparts[ 3, 2, 1 ] =
    >     map scalar reverse,
    >     split m{,}, reverse( $remainder ), 3;
    > say for @_fileparts;'
    1945
    "4,399.00"
    938
    1/10/2012
    knoppix@Microknoppix:~$
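
    The trick rests on split's third argument, LIMIT, which caps the number of fields returned and leaves everything after the last permitted split in the final field. A minimal sketch with the same sample line (the variable names are illustrative only):

    ```perl
    use strict;
    use warnings;

    my $line = '1945,"4,399.00",938,1/10/2012';

    # LIMIT 2: split at the first comma only; the rest stays in one piece.
    my ( $first, $rest ) = split /,/, $line, 2;

    print "$first\n";
    print "$rest\n";
    ```

    Reversing the string first makes the same cap count from the right-hand end, which is how the one-liners above peel off the last two fields.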

    Cheers,

    JohnGG

Re: split string by comma
by dd-b (Monk) on Jan 10, 2012 at 23:40 UTC

    The problem is, it's not a string. That is, there's additional structure. The quotes apparently mean something special, including that you shouldn't split on commas within them. (This quickly raises the question of what it means if there are quotes within quotes, and how you escape quoting, and...soon you're having to work far too hard.) (This is a common situation of course; I sort of wrote about it as if I hadn't seen it before, but it's actually a staple of programming.)

    You don't tell us what the actual format is; GrandFather's guess that it's a CSV seems reasonable, but it's just a guess. So, if you're lucky, there's a clear description of what the rules for this CSV format are, and the input you get will reliably follow them, and you can write something to parse that. But the rule here is that parsing is hard -- if you do anything more than the most brute-force simple kind.

    If you're really lucky, your input conforms to the standard that the CSV module that was suggested supports, and you don't have to write the parsing yourself.
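
    To make the "additional structure" point concrete, here is a small sketch of what a plain split does with the sample line from the question. Nothing here is new API; the variable names are illustrative.

    ```perl
    use strict;
    use warnings;

    my $line  = '1945,"4,399.00",938,1/10/2012';
    my @naive = split /,/, $line;    # treats every comma as a separator

    print scalar(@naive), " pieces\n";
    print "[$_]\n" for @naive;
    ```

    The quoted amount comes back as two pieces, "4 and 399.00", which is the symptom the original question describes.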

Re: split string by comma
by ww (Archbishop) on Jan 11, 2012 at 01:16 UTC

    TIMTOWTDI, albeit a less than entirely satisfactory way, requiring another step beyond what's here:

    #!/usr/bin/perl
    use Modern::Perl;
    use Data::Dumper;

    # 947251

    my $string = '1945,"4,399.00",938,1/10/2012';
    my @string = split /(,".*?(?=")?")/, $string;
    print Dumper @string;

    =head output

    $VAR1 = '1945';
    $VAR2 = ',"4,399.00"';       # note the leading commas retained and must be reprocessed
    $VAR3 = ',938,1/10/2012';    # see

    =cut

    See the examples in split beginning at "If the PATTERN contains parentheses, additional list elements ...."

      how to avoid the leading comma generated as per your output.
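
    One possible shape for that extra step, as a rough sketch only: strip the retained leading comma from each piece, keep quoted pieces whole, and split the unquoted runs normally. The pattern below is a simplified stand-in for the one posted above, and the approach assumes the first field is never quoted and that quotes are never escaped.

    ```perl
    use strict;
    use warnings;

    my $string = '1945,"4,399.00",938,1/10/2012';

    # Simplified pattern (an assumption, not the one posted above): capture a
    # comma followed by a quoted run so split returns it as its own element.
    my @pieces = split /(,"[^"]*")/, $string;

    my @fields;
    for my $piece (@pieces) {
        next if $piece eq '';      # empty piece between adjacent quoted fields
        $piece =~ s/^,//;          # drop the retained leading comma
        if ( $piece =~ /^"/ ) {
            push @fields, $piece;  # quoted field: keep it whole
        }
        else {
            push @fields, split /,/, $piece;    # unquoted run: ordinary split
        }
    }

    print "$_\n" for @fields;
    ```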
Re: split string by comma
by perlfan (Vicar) on Jan 11, 2012 at 03:50 UTC
    I wouldn't use a split. If you're expecting four values separated by commas in the format, then you could do something like:
    $line =~ m/^(.+),"(.+)",(.+),(.+)$/;
    As others have suggested, however, you may want to check out the Perl module for CSV.

      That regular expression is way too greedy. Think about how that would parse

      1,"foo",2,"bar, joy",3,3.14,pi,π

      Correct regular expressions have been posted in this thread, but when dealing with real CSV data (what about embedded newlines?), you will most likely end up with failure eventually when sticking to split or regular expressions. Please seriously consider using Text::CSV_XS or Text::CSV (which will use Text::CSV_XS when installed) and be done with it.

      Another thing seldom considered by US users is that the "." in those "values" is locale dependent. Consider what will happen if 3623494.92 is printed as 3,623,494.92, or printed/exported in a Dutch locale using both the radix separator and the triad separator from the locale. It will export as "3.623.494,92". Oh, the horror in "fixing" all those regular expressions :)
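
      For anyone who wants to try the greedy-match case, a small sketch applying the four-group pattern from the parent post to the eight-field line given earlier in this reply (variable names are illustrative):

      ```perl
      use strict;
      use warnings;
      use utf8;

      binmode STDOUT, ':encoding(UTF-8)';

      my $line = '1,"foo",2,"bar, joy",3,3.14,pi,π';

      # The pattern from the parent post, applied to an eight-field line.
      my @got = $line =~ m/^(.+),"(.+)",(.+),(.+)$/;

      print scalar(@got), " captures\n";
      print "[$_]\n" for @got;
      ```

      Each greedy .+ swallows as much as it can, so the eight fields collapse into four captures, the first of which is 1,"foo",2.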


      Enjoy, Have FUN! H.Merijn
        In order to avoid failure with embedded newlines (or your other record-separator of choice), I use this:
        my $old_INPUT_RECORD_SEPARATOR = $/;
        $/ = $self->record_delimiter;
        open (DELIMFILE, '<', $filename) or (Carp::confess("Cannot open file [$filename]: $!"));
        my $record;
        while (<DELIMFILE>) {
            chomp;
            $record = $_;

            # If a line contains an odd amount of doublequotes ("), then we'll need to continue reading
            # until we find another line that contains an odd amount of doublequotes.
            # This is in order to catch fields that contain recordseparators (but are encased in ""'s).
            if (grep ($_ eq '"', split ('', $_)) % 2 == 1) {

                # Keep reading data and appending to $record until we find another line with an odd number of doublequotes.
                while (<DELIMFILE>) {
                    $record .= $_;
                    if (grep ($_ eq '"', split ('', $_)) % 2 == 1) { last; }
                }
            } ## end if (grep ($_ eq '"', split...))
            push (@{$ar_returnvalue}, ReadRecord($self, $record));
        } ## end while (<DELIMFILE>)
        close (DELIMFILE);
        $/ = $old_INPUT_RECORD_SEPARATOR;
        And ReadRecord uses a regex to consume the string field by field:
        my $field_value;
        my $delimiter = $self->field_delimiter;
        while ($inputstring) {
            undef $field_value;
            if ($inputstring =~ /^"/) {
                $field_value = $inputstring;
                if ($inputstring =~ /^"(([^"]|"")+)"(?:[$delimiter]|$)/p) {
                    ($field_value, $inputstring) = ($1, ${^POSTMATCH});

                    # Unescape escaped quotes
                    $field_value =~ s/""/"/g;
                } else {
                    Carp::confess("Parsing error with remaining data [$inputstring]");
                }
            } else {
                $field_value = $inputstring;
                if ($inputstring =~ /^([^$delimiter"]*)(?:[$delimiter]|$)/p) {
                    ($field_value, $inputstring) = ($1, ${^POSTMATCH});
                }
            } ## end else [ if ($inputstring =~ /^"/)]
        }
        This conforms to RFC 4180 :)
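
        The quote-counting idea can be tried in isolation. The following is a stand-alone sketch, not the code from this post: the helper name read_logical_record and the in-memory sample data are assumptions, and it counts quotes with tr instead of grep/split. It chomps only once the record is complete, so a record separator inside a quoted field is kept.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: read one logical record, joining physical lines
        # until the running count of double quotes is even.
        sub read_logical_record {
            my ($fh) = @_;
            my $record = <$fh>;
            return unless defined $record;
            while ( ( $record =~ tr/"// ) % 2 ) {    # tr/"// counts without modifying
                my $more = <$fh>;
                last unless defined $more;           # unbalanced quotes at end of input
                $record .= $more;
            }
            chomp $record;
            return $record;
        }

        # An in-memory filehandle stands in for a real file here (assumption).
        my $data = qq{1,"two\nlines",3\n4,5,6\n};
        open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
        my @records;
        while ( defined( my $rec = read_logical_record($fh) ) ) {
            push @records, $rec;
        }
        close $fh;

        print scalar(@records), " records\n";
        ```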
      Thank you. That worked like a charm!!!
      Thank you much.
      Richard

        Just to close things up ... what was “that?”   What approach worked for you?
