Splitting a comma-delimited string where a substring could contain commas

by Eisbar (Novice)
on May 03, 2002 at 15:56 UTC

Eisbar has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,

I need to split a string on commas, but excluding any commas between parentheses, for example:

this, that, those, these (not enough, nope, never), there

and get:

  • this
  • that
  • those
  • these (not enough, nope, never)
  • there

I think I need to use lookaround assertions, but I don't understand them. Can you shed some light?


Replies are listed 'Best First'.
Re: Splitting a comma-delimited string where a substring could contain commas
by dws (Chancellor) on May 03, 2002 at 16:19 UTC
    I need to split a string on commas, but excluding any commas between parentheses

    Here's a start, which doesn't use lookahead assertions. It works on your test case, but I would throw more test cases at it before putting it into production.

    local $_ = "this, that, those, these (not enough, nope, never), there";
    while ( /(?:^|, )([^,]+\(.*?\)|[^,]+)/g ) {
        print $1, "\n";
    }

    You have to understand a bit about backtracking to get how this works. It proceeds by trying to match, in this order:
    1. at the beginning of a string, a word followed by a parenthetical
    2. at the beginning of a string, a word
    3. following ", ", a word followed by a parenthetical
    4. following ", ", a word
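To follow up on "throw more test cases at it", here is a minimal sketch that wraps the pattern above in a helper and runs it on the original string plus one nested case. The helper name `split_with_dws_regex` and the nested sample string are illustrative assumptions, not part of the original reply.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern above; the name is illustrative.
sub split_with_dws_regex {
    my ($text) = @_;
    # In list context, a //g match returns every capture of group 1.
    return $text =~ /(?:^|, )([^,]+\(.*?\)|[^,]+)/g;
}

my @flat   = split_with_dws_regex("this, that, those, these (not enough, nope, never), there");
my @nested = split_with_dws_regex("a, b (c, (d, e), f), g");

print join(" | ", @flat),   "\n";
print join(" | ", @nested), "\n";
```

The flat case yields the five fields the question asks for. The nested case is cut at the first closing paren, because `.*?` stops as early as it can, so the pattern is only suitable when parentheses never nest. It also assumes every separator is exactly a comma followed by one space.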

      That looks pretty good, but it doesn't deal with multiple levels of parens. I think Text::Balanced is really the better solution.

      -sam

        ... but it doesn't deal with multiple levels of parens.

        Coding now to deal with nested parens would be solving a problem that hasn't been presented. There might or might not be nested parens in the data. I'd wait for the "customer" to clarify their requirements before hitting this with a larger hammer. YMMV.

        local $_ = "this, (that, those), these ((not enough, (nope)), never), there";
        (my $re=$_)=~s/((\()|(\))|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;
        $re= join'|',map{quotemeta}eval{/$re/};
        die $@ if $@ =~ /unmatched/;
        while( /((?:$re|[^,])*)/g ){
            print "$1\n";
        }
Re: regex problems
by grep (Monsignor) on May 03, 2002 at 16:21 UTC
    You're going to want to treat this as CSV and use the module Text::CSV_XS. A regex is not as well suited to parsing data as a real parser is (e.g. what if your data has quotes; how do you want it to act?).

    grep
    Unix - where you can throw the manual on the keyboard and get a command
      Can you show an example that works? I don't think Text::CSV_XS will work with embedded, unescaped, commas in a CSV.

      -sam

Re: Splitting a comma-delimited string where a substring could contain commas
by mrbbking (Hermit) on May 03, 2002 at 17:27 UTC
    My first thought was Text::CSV as well, but I'm not sure it'll help you here. You don't have true 'comma separated values' format. CSV does not use parens to group items - it uses a single character. Parens work in pairs.

    If you have any control over the format, you might consider changing it to match the CSV spec - something standard. Then Text::CSV will help you. The example below is only slightly modified from the examples in the POD:

    #!/usr/bin/perl -w
    use strict;
    use Text::CSV_XS;

    while( <DATA> ){
        my $line = $_;
        my @input;
        my $csv = Text::CSV_XS->new({
            # defaults are: ["]["][,][0]
            quote_char  => '"',
            escape_char => '"',
            sep_char    => ',',
            binary      => 0
        });
        if( $csv->parse( $line ) ){
            @input = $csv->fields;
        } else {
            my $err = $csv->error_input;
            warn "Text::CSV_XS->parse() failed in line $. on argument '" , $err, "'\n";
        }
        foreach my $item (@input){
            print "$item\n";
        }
        print "\n";
    }
    # first line parses 'correctly' - second does not.
    __DATA__
    this,that,those,"these (not enough, nope, never)",there
    this, that, those, these (not enough, nope, never), there

      Excluding embedded parentheses (which the poster did not mention), tr%()%""% would solve the grouping problem for CSV.

      --
      perl -pew "s/\b;([mnst])/'$1/g"

        If the whole field were in parentheses, you're right, that would work.

        But the value is this:
        , these (not enough, nope, never),
        ...not this...
        , (these not enough, nope, never),

        Text::CSV_XS chokes if you replace the parens with your tr/// suggestion. CSV requires that either the whole field or none of the field be quoted - you can't quote part of a field.

Re: Splitting a comma-delimited string where a substring could contain commas
by erikharrison (Deacon) on May 03, 2002 at 17:42 UTC
    use perl6 (:regexes); # :-)

    Several people have mentioned CSV, but I think your real solution is probably to use Text::Balanced to take out the parens properly, and then use a regex to split the data up. Text::Balanced is wildly useful, so learning it for this should pay off for other parsing needs. Regexes alone often aren't enough for parsing (at least, not if you want maintainable code).

    Cheers,
    Erik
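One way the Text::Balanced suggestion could look is sketched below. It assumes the module's `extract_bracketed` function called in list context, where the first two return values are the extracted group and the remaining text. The helper name `split_balanced` and the overall loop are illustrative assumptions, not code from the thread.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper (not from the thread): split on top-level commas,
# letting Text::Balanced consume each parenthesized group whole.
sub split_balanced {
    my ($text) = @_;
    my @fields = ('');
    while (length $text) {
        if ($text =~ /^\(/) {
            # List context: (extracted, remainder, ...). The extracted
            # part is undefined when the parentheses do not balance.
            my ($group, $rest) = extract_bracketed($text, '()');
            die "Unbalanced parentheses near '$text'\n"
                unless defined $group && length $group;
            $fields[-1] .= $group;
            $text = $rest;
        }
        elsif ($text =~ s/^,//) {
            push @fields, '';       # a top-level comma starts a new field
        }
        elsif ($text =~ s/^([^(,]+)//) {
            $fields[-1] .= $1;      # ordinary text up to the next "(" or ","
        }
    }
    s/^\s+|\s+$//g for @fields;     # trim surrounding whitespace in place
    return @fields;
}

print "$_\n" for split_balanced(
    "this, (that, those), these ((not enough, (nope)), never), there");
```

Because the bracket matching is delegated to the module, nested parentheses are handled and unbalanced input is reported instead of being split silently.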
Re: Splitting a comma-delimited string where a substring could contain commas
by mothra (Hermit) on May 03, 2002 at 18:28 UTC
    I'm curious as to why you ended up in this situation to begin with.

    1. Why do you need to have them split that way? (What is the ultimate goal you're trying to achieve using that data?)

    2. Do you have any control over how the initial data is structured? Smarter data structures make for easier maintenance.

    I know this isn't the "answer" you were looking for, but if you can change the format of the data to something easier to work with, or if you can solve your problem without even having to parse it the way you think you need to, your maintenance programmer will thank you.

      1) because I want to store them in a database, each column represents a field.

      2) Nope it was an excel sheet, I exported it to CSV.

      I can change the format, but that is what I want to avoid, because I would have to do it manually.

Re: Splitting a comma-delimited string where a substring could contain commas
by arunhorne (Pilgrim) on May 03, 2002 at 18:43 UTC

    For what it's worth, one way of doing this is to keep a count of how many brackets are open and process the string character by character, splitting the string when a comma is encountered iff the open-bracket count is zero.

    However, it strikes me that this is not a particularly Perl-ish way to solve the problem, although if you are interested I have written such code as part of a Java compiler I wrote a while back.

    Abh
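The counting approach described above can be sketched in Perl as follows. The helper name `split_top_level` is hypothetical, and two details are assumptions not stated in the reply: a stray closing paren is ignored rather than driving the count negative, and whitespace around each field is trimmed.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas only while no parenthesis is open.
sub split_top_level {
    my ($text) = @_;
    my @fields;
    my $current = '';
    my $depth   = 0;                 # number of currently open parens

    for my $ch (split //, $text) {
        if ($ch eq '(') {
            $depth++;
            $current .= $ch;
        }
        elsif ($ch eq ')') {
            $depth-- if $depth > 0;  # assumption: ignore a stray ")"
            $current .= $ch;
        }
        elsif ($ch eq ',' && $depth == 0) {
            push @fields, $current;  # top-level comma ends the field
            $current = '';
        }
        else {
            $current .= $ch;
        }
    }
    push @fields, $current;          # whatever follows the last comma

    s/^\s+|\s+$//g for @fields;      # trim surrounding whitespace in place
    return @fields;
}

print "$_\n" for split_top_level(
    "this, that, those, these (not enough, nope, never), there");
```

It is more verbose than a regex, but it handles any nesting depth and is easy to extend, for example to other bracket types.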

Re: Splitting a comma-delimited string where a substring could contain commas
by mephit (Scribe) on May 03, 2002 at 18:43 UTC
    I had a similar problem a while back, except I was concerned with quotes, not parens. Text::ParseWords helped me out quite a bit. I haven't really looked at the code for that module, but maybe it can give you an idea or two?
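For the quoted case mentioned here, a minimal sketch with Text::ParseWords might look like this. The input line is an illustrative assumption: it is the thread's sample with the parenthetical field wrapped in double quotes, since the module groups by quotes and knows nothing about parentheses.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Illustrative input: the parenthetical field is wrapped in double quotes.
my $line = 'this, that, those, "these (not enough, nope, never)", there';

# Split on a comma plus optional whitespace; the 0 strips the quotes.
my @words = quotewords(',\s*', 0, $line);

print "$_\n" for @words;
```

Note that single quotes are also treated as quoting characters, so data containing apostrophes would need extra care.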
Re: Splitting a comma-delimited string where a substring could contain commas
by Eisbar (Novice) on May 06, 2002 at 16:25 UTC

    well, thanks for all your answers guys, I fixed it myself this way:

    my @temp;
    if (/(.*)(\([\d\w\s,\.]+?\))(.*)/) {
        @temp = split /,/, $1;
        $temp[@temp-1] .= $2;
        my $last = $3;
        $last =~ s/^,+//;
        my @temp2 = split /,/, $last;
        push (@temp, @temp2);
    } else {
        @temp = split /,/, $_;
    }

    I know it's not generic, but it is something I only had to do once

Node Type: perlquestion [id://163826]
Approved by DaWolf
Front-paged by rbc