http://qs321.pair.com?node_id=1102573

GuiPerl has asked for the wisdom of the Perl Monks concerning the following question:

I have put together some code to parse a colon-delimited data file. The regular expression I have built captures only some of the colon-delimited values, so I would like some pointers on the pattern. Ideally it would match anything between the colons and would also cope with lines that contain fewer values, as can be seen below:

foreach (<DATA>) {
    if($_ =~ m/(\d{1,2})?\:?(\w\d)?\:?(\b\w..*\b)?\:?(.*|N\/A)?\:?(\d{1,2}.+)?\:?(\d{2}\s?GREEN|RED|XX)?\:?(.*)?\:?(.*)?\:?(\bsquare\b)?/) {
    #if($_ =~ $_ =~ m/(\d{1,2})\:?(\w\d)?\:?(\b\w..*\b)?\:?(.*|N\/A)?\|\|?(\d{1,2}.+)?\:?(\d{2}\s?GREEN|RED|XX)?\:?(.*)?\:?(.*)?\:?(\bYELLOW\b)?/) {
        if (defined $1) {
            $count=$1;
        } else {
            $count="nothing";
        }
        if (defined $2) {
            #code
            $grade=$2;
        } else {
            $grade="nothing";
        }
        if (defined $3) {
            #code
            $pos=$3;
        } else {
            $pos="nothing";
        }
        if (defined $4) {
            #code
            $name=$4;
        } else {
            $name="nothing";
        }
        if (defined $5) {
            #code
            $country=$5;
        } else {
            $country="nothing";
        }
        if (defined $6) {
            #code
            $date=$6;
        } else {
            $date="nothing";
        }
        if (defined $7) {
            #code
            $age=$7;
        } else {
            $age="nothing";
        }
        if (defined $8) {
            #code
            $vacant=$8;
        } else {
            $vacant="nothing";
        }
        if (defined $9) {
            #code
            $square=$9;
        } else {
            $count="nothing";
        }
        #print "We have a match!\n";
        print join " ",$count,$grade,$pos,$name,$date,$country,$age,$vacant,"\n";
    }
}
__DATA__
1:D2:DIRECTOR:D. Green:4/15/1953:61 XX:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::
1:D1:DEPUTY DIRECTOR:D. Green::6/20/1964:50:TUNISIA REPUBLIC OF::::
1:P5:SENIOR POLICY OFFICER:D. Green::7/7/1954:60 GREEN:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::
9:P5:SENIOR ECONOMIST:D. Green::7/23/1958:56:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::
D. Green::10/29/1953:60 GREEN:PERU REPUBLIC OF:*:::
D. Green::10/26/1955:58:SPAIN KINGDOM OF:*:::
D. Green::5/15/1967:47:FRENCH REPUBLIC::::
D. Green:g:12/6/1954:59:FIJI REPUBLIC OF::::
D. Green::6/8/1967:47:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::
D. Green::9/16/1960:54:UNITED STATES OF AMERICA::::
N/A::Vacant:UNASSIGNED::YELLOW::

Output from above:

nothing D2 DIRECTOR:D. Green:4/15/1953:61 XX:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND ::: nothing nothing
nothing D1 DEPUTY DIRECTOR:D. Green::6/20/1964:50:TUNISIA REPUBLIC OF ::: nothing nothing
nothing P5 SENIOR POLICY OFFICER:D. Green::7/7/1954:60 GREEN:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND ::: nothing nothing
nothing P5 SENIOR ECONOMIST:D. Green::7/23/1958:56:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND ::: nothing nothing
nothing nothing D. Green::10/29/1953:60 GREEN:PERU REPUBLIC OF *::: nothing nothing
nothing nothing D. Green::10/26/1955:58:SPAIN KINGDOM OF *::: nothing nothing
nothing nothing D. Green::5/15/1967:47:FRENCH REPUBLIC ::: nothing nothing
nothing nothing D. Green:g:12/6/1954:59:FIJI REPUBLIC OF ::: nothing nothing
nothing nothing D. Green::6/8/1967:47:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND ::: nothing nothing
nothing nothing D. Green::9/16/1960:54:UNITED STATES OF AMERICA ::: nothing nothing
nothing nothing N/A::Vacant:UNASSIGNED::YELLOW : nothing nothing
nothing nothing nothing nothing nothing

Many thanks

Replies are listed 'Best First'.
Re: Regular Expression to Extract Anything from Colon Delimited String
by AnomalousMonk (Archbishop) on Oct 01, 2014 at 19:14 UTC

    I guess my knee-jerk responses would be Text::CSV or Text::CSV_XS. Alternatively, would there be any objection to making use of split? (You seem well on the way to creating a massive headache for yourself.)

Re: Regular Expression to Extract Anything from Colon Delimited String
by McA (Priest) on Oct 01, 2014 at 19:12 UTC

    Hi,

    I'm not sure whether I have understood your question correctly, but have you looked at the split function for this purpose?

    Regards
    McA

      This did the trick:
      my ($a,$b,$c,$d,$e,$f,$g,$h,$i) =split(/:/,$line);
      Thanks.
        *Mapping $a, $b, $c, $d, $e, $f, $g, $h, $i to $count, $grade, $pos, $name, $date, $country, $age, $vacant is left as an exercise for the reader.
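
        For readers who want that exercise spelled out, here is a minimal sketch of splitting straight into named variables. The field names and the $spare column are assumptions guessed from one sample record (the first sample record has no empty column after the name, so it would need a different list). Two details are worth knowing: a LIMIT of -1 makes split keep trailing empty fields, and $a and $b are best avoided as ordinary variables because sort uses them.

```perl
use strict;
use warnings;

# One record copied from the sample data in the question.
my $line = "1:D1:DEPUTY DIRECTOR:D. Green::6/20/1964:50:TUNISIA REPUBLIC OF::::\n";
chomp $line;

# Hypothetical field names; $spare is the empty column after the name.
# A limit of -1 keeps the trailing empty fields that split drops by default.
my ($count, $grade, $pos, $name, $spare, $date, $age, $country, @rest)
    = split /:/, $line, -1;

# Fall back to "nothing" for empty or missing values, as the original loop did.
# The loop variable aliases each scalar, so assigning to $_ updates it in place.
for ($count, $grade, $pos, $name, $spare, $date, $age, $country) {
    $_ = 'nothing' unless defined $_ && length $_;
}

print join(' | ', $count, $grade, $pos, $name, $date, $age, $country), "\n";
```

        Anything beyond the named columns lands in @rest, which makes it easy to notice when a line carries more fields than expected.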
Re: Regular Expression to Extract Anything from Colon Delimited String
by davido (Cardinal) on Oct 01, 2014 at 19:50 UTC

    If the data is as trivial as it seems, split /:/, ... will probably be fine. It seems that's what you've decided to go with.

    If the data format spec allows for things like quoted fields, and embedded colons or newlines within quoted fields, then a proper CSV parser would be preferable. Text::CSV or Text::CSV_XS can be configured to use a colon as the delimiter, and to permit quoted fields, escaped delimiters, and embedded newlines.

    It's probably irrelevant; your data may never become "tricky." If it does, be aware of the potential pitfalls, and of the options available to you. It might be wise to design your application's input parsing with a very thin abstraction layer so that it would be easy to plug in a real CSV parser if it becomes necessary in the future.
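
    A minimal sketch of such a thin layer, assuming a single function (parse_record is a made-up name) is the only place that knows how a line becomes fields. It uses Text::CSV with sep_char set to a colon when the module happens to be installed and falls back to plain split otherwise; for data as simple as the sample, both paths should return the same fields.

```perl
use strict;
use warnings;

# Use Text::CSV only if it is installed; otherwise $csv stays undefined.
my $csv;
if ( eval { require Text::CSV; 1 } ) {
    $csv = Text::CSV->new({ sep_char => ':', binary => 1 });
}

# parse_record() is a hypothetical name: the one place that turns a line
# into a list of fields, so the parser can be swapped without touching callers.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    if ($csv) {
        $csv->parse($line) or die "Cannot parse line: $line\n";
        return $csv->fields;
    }
    # Plain fallback; -1 keeps trailing empty fields.
    return split /:/, $line, -1;
}

my @fields = parse_record("1:D1:DEPUTY DIRECTOR:D. Green::6/20/1964:50:TUNISIA REPUBLIC OF::::\n");
print scalar(@fields), " fields, name is '$fields[3]'\n";
```

    The rest of the program only ever calls parse_record, so quoted fields or embedded colons can be handled later by changing this one function.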


    Dave

Re: Regular Expression to Extract Anything from Colon Delimited String
by Your Mother (Archbishop) on Oct 01, 2014 at 19:50 UTC

    A snippet/approach for your bag of tricks (in this case it can also help you spot whether your record structure drifts; if it does, this approach won't be enough):

    use strictures;
    use Data::Dump "dump";

    my @col = qw( count grade position name date country age vacant );
    my @records;
    for ( <DATA> )
    {
        my %tmp;
        @tmp{@col} = split ":";
        push @records, \%tmp;
    }
    dump \@records;

Re: Regular Expression to Extract Anything from Colon Delimited String
by dorko (Prior) on Oct 01, 2014 at 19:31 UTC
    What about Text::xSV? The documentation even mentions Colon Separated Value files.

    Cheers,

    Brent

    -- Yeah, I'm a Delt.
Re: Regular Expression to Extract Anything from Colon Delimited String
by Solo (Deacon) on Oct 01, 2014 at 19:47 UTC

    The sample data you provide seems to fall into 3 different structures. It makes sense to me to first try to identify which structure a given line is, then parse the specific structure (by splitting into the correct slice of a hash, for example.)
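
    A rough sketch of that idea, with loudly stated assumptions: the three layout names, their column names, and the rules in classify are all guesses from the sample data, not a specification. The first sample record, for instance, has no empty column after the name and would need its own layout.

```perl
use strict;
use warnings;

# Hypothetical column layouts, guessed from the sample data.
my %layout = (
    full   => [qw(count grade position name spare date age country)],
    person => [qw(name spare date age country flag)],
    vacant => [qw(name spare status assignment spare2 colour)],
);

# Decide which layout a line follows by looking at how it starts.
sub classify {
    my ($line) = @_;
    return 'full'   if $line =~ /^\d{1,2}:\w\d:/;
    return 'vacant' if $line =~ m{^N/A:};
    return 'person';
}

sub parse_line {
    my ($line) = @_;
    chomp $line;
    my $type = classify($line);
    my %rec  = ( type => $type );
    # Hash slice: assign the split fields to that layout's column names.
    # Fields beyond the named columns are silently dropped.
    @rec{ @{ $layout{$type} } } = split /:/, $line, -1;
    return \%rec;
}

my $rec = parse_line("D. Green::10/29/1953:60 GREEN:PERU REPUBLIC OF:*:::\n");
print "$rec->{type}: $rec->{name}, $rec->{country}\n";
```

    Keeping the layouts in one hash means a newly discovered structure only needs a new entry and a new rule in classify.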

    If the sample data here is unrepresentative, another approach would be to define separately what each field should look like, then build full regexps out of the pieces. For that approach, look at Regexp::Assemble or dig into the sources for Regexp::Common.
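
    The build-from-pieces idea can be sketched in core Perl without either module. The per-field patterns and the build helper below are hypothetical, guessed from the first sample record only; real data would need looser or stricter pieces.

```perl
use strict;
use warnings;

# Hypothetical per-field patterns, guessed from the sample data.
my %field = (
    count    => qr/\d{1,2}/,
    grade    => qr/[A-Z]\d/,
    position => qr/[^:]+/,
    name     => qr/[^:]+/,
    date     => qr!\d{1,2}/\d{1,2}/\d{4}!,
    age      => qr/\d{1,3}(?: \w+)?/,
    country  => qr/[^:]+/,
);

# Wrap each piece in a named capture and join the pieces with colons.
sub build {
    my @names = @_;
    my $body  = join ':', map { "(?<$_>$field{$_})" } @names;
    return qr/^$body/;
}

my $full_re = build(qw(count grade position name date age country));

my $line = "1:D2:DIRECTOR:D. Green:4/15/1953:61 XX:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::";

# Copy the named captures out of %+ so they survive later matches.
my %rec;
%rec = %+ if $line =~ $full_re;
print "$rec{name} ($rec{grade}), date $rec{date}, $rec{country}\n" if %rec;
```

    Each structure in the data gets its own call to build with a different list of field names, so the individual patterns are written and tested only once.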