Eisbar has asked for the wisdom of the Perl Monks concerning the following question:
Hi guys,
I need to split a string by commas, but excluding any commas between parentheses, for example:
this, that, those, these (not enough, nope, never), there
and get:
- this
- that
- those
- these (not enough, nope, never)
- there
I think I need to use lookaround assertions, but I don't understand them. Can you shed some light?
Re: Splitting a comma-delimited string where a substring could contain commas
by dws (Chancellor) on May 03, 2002 at 16:19 UTC
|
I need to split a string by commas, but excluding any commas between parentheses
Here's a start, which doesn't use lookahead assertions. It works on your test case, but I would throw more test cases at it before putting it into production.
local $_ = "this, that, those, these (not enough, nope, never), there";
while ( /(?:^|, )([^,]+\(.*?\)|[^,]+)/g ) {
    print $1, "\n";
}
You have to understand a bit about backtracking to see how this works. It proceeds by trying to match, in this order:
- at the beginning of a string, a word followed by a parenthetical
- at the beginning of a string, a word
- following ", ", a word followed by a parenthetical
- following ", ", a word
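The order of attempts listed above is easier to see with the same pattern spread out under /x. This is only a sketch: the helper name fields_of is made up for illustration, and the pattern itself is unchanged.

```perl
use strict;
use warnings;

# Same pattern as above, laid out with /x so the order of attempts is visible.
my $field = qr{
    (?: ^ | ,[ ] )        # a field starts at the beginning of the string or after ", "
    (
        [^,]+ \( .*? \)   # tried first: a word followed by a parenthetical
      |
        [^,]+             # fallback: a plain word containing no comma
    )
}x;

# Hypothetical helper: collect every captured field from one line.
sub fields_of {
    my ($line) = @_;
    my @fields;
    push @fields, $1 while $line =~ /$field/g;
    return @fields;
}

print "$_\n" for fields_of("this, that, those, these (not enough, nope, never), there");
```

Because the parenthetical is matched with a non-greedy .*? that stops at the first ")", nested parentheses are cut short, so this only suits single-level input.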
|
That looks pretty good, but it doesn't deal with multiple levels of parens. I think Text::Balanced is really the better solution.
-sam
|
local $_ = "this, (that, those), these ((not enough, (nope)), never), there";
(my $re=$_)=~s/((\()|(\))|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;
$re= join'|',map{quotemeta}eval{/$re/};
die $@ if $@ =~ /unmatched/;
while( /((?:$re|[^,])*)/g ){
    print "$1\n";
}
Re: regex problems
by grep (Monsignor) on May 03, 2002 at 16:21 UTC
|
You're going to want to treat this as CSV and use the module Text::CSV_XS. A regex is not as well suited to parsing data as a real parser is (e.g., what if your data has quotes? How do you want it to act?).
grep
Unix - where you can throw the manual on the keyboard and get a command
|
Can you show an example that works? I don't think Text::CSV_XS will work with embedded, unescaped, commas in a CSV.
-sam
Re: Splitting a comma-delimited string where a substring could contain commas
by mrbbking (Hermit) on May 03, 2002 at 17:27 UTC
|
My first thought was Text::CSV as well, but I'm not sure it'll help you here. You don't have true 'comma separated values' format. CSV does not use parens to group items - it uses a single character. Parens work in pairs.
If you have any control over the format, you might consider changing it to match the CSV spec - something standard. Then Text::CSV will help you. The example below is only slightly modified from the examples in the POD:
#!/usr/bin/perl -w
use strict;
use Text::CSV_XS;
while( <DATA> ){
    my $line = $_;
    my @input;
    my $csv = Text::CSV_XS->new({ # defaults are: ["]["][,][0]
        quote_char  => '"',
        escape_char => '"',
        sep_char    => ',',
        binary      => 0
    });
    if( $csv->parse( $line ) ){
        @input = $csv->fields;
    } else {
        my $err = $csv->error_input;
        warn "Text::CSV_XS->parse() failed in line $. on argument '"
            , $err, "'\n";
    }
    foreach my $item (@input){
        print "$item\n";
    }
    print "\n";
}
# first line parses 'correctly' - second does not.
__DATA__
this,that,those,"these (not enough, nope, never)",there
this, that, those, these (not enough, nope, never), there
|
If the whole field were in parentheses, you're right, that would work.
But the value is this:
, these (not enough, nope, never),
...not this...
, (these not enough, nope, never),
Text::CSV_XS chokes if you replace the parens with your tr/// suggestion. CSV requires that either the whole field or none of the field be quoted - you can't quote part of a field.
|
Re: Splitting a comma-delimited string where a substring could contain commas
by erikharrison (Deacon) on May 03, 2002 at 17:42 UTC
|
use perl6 (:regexes); # :-)
Several people have mentioned CSV, but I think your real solution is probably to use Text::Balanced to take out the parens properly, and then use a regex to split the data up. Text::Balanced is wildly useful, so learning it for this should pay off for other parsing needs. Regexes alone often aren't enough for parsing (at least, not if you want maintainable code).
Cheers,
Erik
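One way the Text::Balanced idea might look, as a rough sketch rather than a tested recipe: hide each balanced (...) group behind a placeholder, split on commas, then put the groups back. The helper name split_outside_parens and the NUL-byte placeholder are assumptions made for illustration; extract_bracketed is the only Text::Balanced call used.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: mask balanced (...) groups, split on commas, unmask.
sub split_outside_parens {
    my ($text) = @_;
    my @groups;
    my $masked = '';

    while (length $text) {
        # Skip everything that is not "(", then extract one balanced group.
        my ($group, $rest, $prefix) = extract_bracketed($text, '()', '[^(]*');
        last unless defined $group and length $group;   # no further balanced group
        push @groups, $group;
        $masked .= $prefix . "\0" . $#groups . "\0";    # assumes the data has no NUL bytes
        $text = $rest;
    }
    $masked .= $text;                                   # whatever is left has no balanced group

    my @fields = split /\s*,\s*/, $masked;
    s/\0(\d+)\0/$groups[$1]/g for @fields;              # restore the hidden groups
    return @fields;
}

print "$_\n" for split_outside_parens(
    "this, (that, those), these ((not enough, (nope)), never), there");
```

Since extract_bracketed keeps track of nesting, this should cope with several levels of parens. If the parens are unbalanced, the rest of the string is split on every comma.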
Re: Splitting a comma-delimited string where a substring could contain commas
by mothra (Hermit) on May 03, 2002 at 18:28 UTC
|
I'm curious as to why you ended up in this situation to begin with.
- Why do you need to have them split that way? (What is the ultimate goal you're trying to achieve using that data?)
- Do you have any control over how the initial data is structured? Smarter data structures make for easier maintenance.
I know this isn't the "answer" you were looking for, but if you can change the format of the data to something easier to work with, or if you can solve your problem without even having to parse it the way you think you need to, your maintenance programmer will thank you.
|
1) Because I want to store them in a database, where each column represents a field. 2) Nope, it was an Excel sheet that I exported to CSV. I can change the format, but that is what I want to avoid, because I would have to do it manually.
Re: Splitting a comma-delimited string where a substring could contain commas
by arunhorne (Pilgrim) on May 03, 2002 at 18:43 UTC
|
For what it's worth, one way of doing this is to process the string character by character while keeping a count of how many brackets are open, splitting the string when a comma is encountered iff the open-bracket count is zero.
However, it strikes me that this is not a particularly Perl-ish way to solve the problem. If you are interested, I have written such code as part of a Java compiler I wrote a while back.
Abh
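The bracket-counting approach described above could be sketched in Perl roughly like this. The helper name split_top_level is invented for the example, and trimming the whitespace around each field is an added assumption about what the caller wants.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas that sit outside all parentheses.
sub split_top_level {
    my ($text) = @_;
    my @fields = ('');
    my $depth  = 0;                       # number of currently open brackets

    for my $char (split //, $text) {
        if ($char eq ',' && $depth == 0) {
            push @fields, '';             # top-level comma: start a new field
            next;
        }
        $depth++ if $char eq '(';
        $depth-- if $char eq ')' && $depth > 0;   # ignore a stray ")"
        $fields[-1] .= $char;
    }

    s/^\s+|\s+$//g for @fields;           # trim surrounding whitespace
    return @fields;
}

print "$_\n" for split_top_level(
    "this, (that, those), these ((not enough, (nope)), never), there");
```

It is more verbose than a regex, but the open-bracket count handles any nesting depth and is easy to extend to other bracket types.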
Re: Splitting a comma-delimited string where a substring could contain commas
by mephit (Scribe) on May 03, 2002 at 18:43 UTC
|
I had a similar problem a while back, except I was concerned with quotes, not parens. Text::ParseWords helped me out quite a bit. I haven't really looked at the code for that module, but maybe it can give you an idea or two?
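For the quoted case, a minimal sketch with quotewords from Text::ParseWords might look like the following. The sample line is made up, with the parenthetical wrapped in double quotes, since the module treats quotes specially but not parens.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Assumed input: the field containing commas is wrapped in double quotes.
my $line = 'this, that, those, "these (not enough, nope, never)", there';

# Arguments: delimiter pattern, keep-quotes flag (0 = strip them), input line(s).
my @words = quotewords('\s*,\s*', 0, $line);

print "$_\n" for @words;
```

This does not help with the original unquoted data, because quotewords knows nothing about parentheses.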
Re: Splitting a comma-delimited string where a substring could contain commas
by Eisbar (Novice) on May 06, 2002 at 16:25 UTC
|
Well, thanks for all your answers, guys. I fixed it myself this way:
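my @temp;
# Split the line around a single parenthesized group: before, group, after.
if (/(.*)(\([\d\w\s,\.]+?\))(.*)/) {
    @temp = split /,/, $1;
    $temp[-1] .= $2;                 # re-attach the group to the field it belongs to
    my $last = $3;
    $last =~ s/^,+//;                # drop the comma right after the group
    my @temp2 = split /,/, $last;
    push (@temp, @temp2);
} else {
    @temp = split /,/, $_;           # no parentheses: a plain split is enough
}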
I know it's not generic, but it is something I only had to do once.
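For the record, the lookaround idea from the original question can be sketched for single-level parentheses with a negative lookahead. The helper name split_flat is hypothetical, and the pattern assumes parens are never nested.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas that are not inside a (...) group.
sub split_flat {
    my ($text) = @_;
    # The lookahead rejects a comma when a ")" follows before any other
    # parenthesis, which means the comma sits inside an open group.
    return split /\s*,\s*(?![^()]*\))/, $text;
}

print "$_\n" for split_flat("this, that, those, these (not enough, nope, never), there");
```

With nested parens the lookahead runs into an inner "(" first and the split goes wrong, so a counting or Text::Balanced approach is safer for that kind of data.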