Regular Expression to Extract Anything from Colon Delimited String

GuiPerl has asked for the wisdom of the Perl Monks concerning the following question:

I have put together some code to parse a colon-delimited data file. The regular expression I have built captures only some of the colon-delimited values, so I would like some pointers on the regexp pattern. Ideally, the pattern would match whatever appears between the colons, and would also handle lines that contain fewer values, as in the sample data below:

foreach (<DATA>) {
    

if($_ =~ m/(\d{1,2})?\:?(\w\d)?\:?(\b\w..*\b)?\:?(.*|N\/A)?\:?(\d{1,2}.+)?\:?(\d{2}\s?GREEN|RED|XX)?\:?(.*)?\:?(.*)?\:?(\bsquare\b)?/) {


#if($_ =~ $_ =~ m/(\d{1,2})\:?(\w\d)?\:?(\b\w..*\b)?\:?(.*|N\/A)?\|\|?(\d{1,2}.+)?\:?(\d{2}\s?GREEN|RED|XX)?\:?(.*)?\:?(.*)?\:?(\bYELLOW\b)?/) {


if (defined $1) {
    
    $count=$1;
    
}

else {
    
    $count="nothing";
}

if (defined $2) {
    #code
    $grade=$2;
}

else {
    
    $grade="nothing";
}


if (defined $3) {
    #code
    $pos=$3;
}

else {
    
    $pos="nothing";
}




if (defined $4) {
    #code
    $name=$4;
}

else {
    
    $name="nothing";
}


if (defined $5) {
    #code
    $country=$5;
}

else {
    
    $country="nothing";
}


if (defined $6) {
    #code
    $date=$6;
}

else {
    
    $date="nothing";
}


if (defined $7) {
    #code
    $age=$7;
}

else {
    
    $age="nothing";
}


if (defined $8) {
    #code
    $vacant=$8;
}

else {
    
    $vacant="nothing";
}

if (defined $9) {
    #code
    $square=$9;
}

else {
    
    $count="nothing";
}








#print "We have a match!\n";
print join " ",$count,$grade,$pos,$name,$date,$country,$age,$vacant,"\n";

}

}

__DATA__
1:D2:DIRECTOR:D. Green:4/15/1953:61 XX:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::
1:D1:DEPUTY DIRECTOR:D. Green::6/20/1964:50:TUNISIA REPUBLIC OF::::
1:P5:SENIOR POLICY OFFICER:D. Green::7/7/1954:60 GREEN:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::
9:P5:SENIOR ECONOMIST:D. Green::7/23/1958:56:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::
D. Green::10/29/1953:60 GREEN:PERU REPUBLIC OF:*:::
D. Green::10/26/1955:58:SPAIN KINGDOM OF:*:::
D. Green::5/15/1967:47:FRENCH REPUBLIC::::
D. Green:g:12/6/1954:59:FIJI REPUBLIC OF::::
D. Green::6/8/1967:47:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::
D. Green::9/16/1960:54:UNITED STATES OF AMERICA::::
N/A::Vacant:UNASSIGNED::YELLOW::

Output from above:

nothing D2 DIRECTOR:D. Green:4/15/1953:61 XX:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND ::: nothing nothing
nothing D1 DEPUTY DIRECTOR:D. Green::6/20/1964:50:TUNISIA REPUBLIC OF ::: nothing nothing
nothing P5 SENIOR POLICY OFFICER:D. Green::7/7/1954:60 GREEN:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND ::: nothing nothing
nothing P5 SENIOR ECONOMIST:D. Green::7/23/1958:56:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND ::: nothing nothing
nothing nothing D. Green::10/29/1953:60 GREEN:PERU REPUBLIC OF *::: nothing nothing
nothing nothing D. Green::10/26/1955:58:SPAIN KINGDOM OF *::: nothing nothing
nothing nothing D. Green::5/15/1967:47:FRENCH REPUBLIC ::: nothing nothing
nothing nothing D. Green:g:12/6/1954:59:FIJI REPUBLIC OF ::: nothing nothing
nothing nothing D. Green::6/8/1967:47:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND ::: nothing nothing
nothing nothing D. Green::9/16/1960:54:UNITED STATES OF AMERICA ::: nothing nothing
nothing nothing N/A::Vacant:UNASSIGNED::YELLOW : nothing nothing
nothing nothing nothing  nothing nothing

Many thanks


Re: Regular Expression to Extract Anything from Colon Delimited String by AnomalousMonk (Archbishop) on Oct 01, 2014 at 19:14 UTC
I guess my knee-jerk responses would be Text::CSV or Text::CSV_XS. Alternatively, would there be any objection to making use of split? (You seem well on the way to creating a massive headache for yourself.)
Re: Regular Expression to Extract Anything from Colon Delimited String by McA (Priest) on Oct 01, 2014 at 19:12 UTC
Hi, I'm not sure whether I have misunderstood your question, but have you looked at the `split` function for your purpose? Regards McA
Re^2: Regular Expression to Extract Anything from Colon Delimited String by GuiPerl (Acolyte) on Oct 01, 2014 at 19:35 UTC
This did the trick: `my ($a,$b,$c,$d,$e,$f,$g,$h,$i) =split(/:/,$line);` Thanks.
Re^3: Regular Expression to Extract Anything from Colon Delimited String by Solo (Deacon) on Oct 01, 2014 at 19:55 UTC
*Mapping `$a, $b, $c, $d, $e, $f, $g, $h, $i` to `$count, $grade, $pos, $name, $date, $country, $age, $vacant` is left as an exercise for the reader.
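One way that exercise might go, as a minimal sketch: split straight into named variables instead of `$a` through `$i`. The variable names and the position each one takes are guesses based on the first few DATA lines, not a confirmed layout. `$a` and `$b` are avoided because `sort` uses those names for its comparison variables, and the `-1` limit is added so that `split` keeps trailing empty fields.

```perl
use strict;
use warnings;

# Sample line copied from the DATA section of the question.
my $line = "1:D1:DEPUTY DIRECTOR:D. Green::6/20/1964:50:TUNISIA REPUBLIC OF::::\n";
chomp $line;

# A limit of -1 keeps trailing empty fields, which split drops by default.
my @fields = split /:/, $line, -1;

# Assumed positions; the undef skips the empty field after the name.
my ($count, $grade, $pos, $name, undef, $date, $age, $country) = @fields;

# Replace missing or empty values with the placeholder used in the question.
for ($count, $grade, $pos, $name, $date, $age, $country) {
    $_ = "nothing" unless defined $_ && length $_;
}

print join(" | ", $count, $grade, $pos, $name, $date, $age, $country), "\n";
```

The single `for` loop does the work of the chain of `if (defined ...)` blocks in the original post, because the loop variable aliases each named variable in turn.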
Re: Regular Expression to Extract Anything from Colon Delimited String by davido (Cardinal) on Oct 01, 2014 at 19:50 UTC
If the data is as trivial as it seems, `split /:/, ...` will probably be fine. It seems that's what you've decided to go with. If the data format spec allows for things like quoted fields, and embedded colons or newlines within quoted fields, then a proper CSV parser would be preferable. Text::CSV or Text::CSV_XS can be configured to use a colon as the delimiter, and to permit quoted fields, escaped delimiters, and embedded newlines. It's probably irrelevant; your data may never become "tricky." If it does, be aware of the potential pitfalls, and of the options available to you. It might be wise to design your application's input parsing with a very thin abstraction layer so that it would be easy to plug in a real CSV parser if it becomes necessary in the future. Dave
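A rough sketch of what such a thin layer could look like, under some assumptions: `parse_record` is a hypothetical name, and the code uses Text::CSV only when it happens to be installed, falling back to `split` otherwise. The `sep_char` and `binary` options are standard Text::CSV constructor attributes; everything else is illustrative.

```perl
use strict;
use warnings;

# Load Text::CSV if it is available; otherwise $csv stays undefined.
my $csv;
if (eval { require Text::CSV; 1 }) {
    # sep_char selects the delimiter; binary permits embedded newlines
    # and non-ASCII bytes inside quoted fields.
    $csv = Text::CSV->new({ sep_char => ':', binary => 1 });
}

# parse_record() (hypothetical name) is the only place where a line is
# turned into fields, so the parser can be swapped without touching callers.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    if ($csv) {
        $csv->parse($line) or die "Cannot parse line: $line\n";
        return [ $csv->fields ];
    }
    # Plain fallback: -1 keeps trailing empty fields.
    return [ split /:/, $line, -1 ];
}

my $rec = parse_record("D. Green::5/15/1967:47:FRENCH REPUBLIC::::\n");
print scalar(@$rec), " fields, name=$rec->[0]\n";
```

Callers only ever see an array reference of fields, so replacing the body of `parse_record` later should not ripple through the rest of the program.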
Re: Regular Expression to Extract Anything from Colon Delimited String by Your Mother (Archbishop) on Oct 01, 2014 at 19:50 UTC
A snippet/approach for your bag of tricks (in this case it may also help you sort out whether your data structures drift; if they do, this approach won't be enough):

    use strictures;
    use Data::Dump "dump";

    my @col = qw( count grade position name date country age vacant );
    my @records;
    for ( <DATA> ) {
        my %tmp;
        @tmp{@col} = split ":";
        push @records, \%tmp;
    }
    dump \@records;
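To make the "drift" check concrete, here is a small sketch that compares each line's field count with the number of column names. `drift_warning` is a hypothetical helper, and the two sample lines are copied from the question's DATA section; the column list is the one from the snippet above.

```perl
use strict;
use warnings;

my @col = qw( count grade position name date country age vacant );

# Hypothetical helper: returns a message when a line's field count differs
# from the number of column names, and nothing when the two agree.
sub drift_warning {
    my ($line, $lineno) = @_;
    chomp $line;
    my @fields = split /:/, $line, -1;   # -1 keeps trailing empty fields
    return if @fields == @col;
    return sprintf "line %d: expected %d fields, got %d",
        $lineno, scalar @col, scalar @fields;
}

my @sample = (
    "D. Green::5/15/1967:47:FRENCH REPUBLIC::::\n",
    "N/A::Vacant:UNASSIGNED::YELLOW::\n",
);

# The bare return yields an empty list inside map, so only mismatches remain.
my @warnings = map { drift_warning($sample[$_], $_ + 1) } 0 .. $#sample;
print "$_\n" for @warnings;
```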
Re: Regular Expression to Extract Anything from Colon Delimited String by dorko (Prior) on Oct 01, 2014 at 19:31 UTC
What about Text::xSV? The documentation even mentions Colon Separated Value files. Cheers, Brent -- Yeah, I'm a Delt.
Re: Regular Expression to Extract Anything from Colon Delimited String by Solo (Deacon) on Oct 01, 2014 at 19:47 UTC
The sample data you provide seems to fall into 3 different structures. It makes sense to me to first identify which structure a given line has, then parse that specific structure (by splitting into the correct slice of a hash, for example). If the sample data here is unrepresentative, another approach would be to define separately what each field should look like, then build full regexps out of the pieces. For that approach look at Regexp::Assemble or dig into the sources for Regexp::Common.
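The classify-then-slice idea might be sketched like this. The three layout names, the column names, and the rule of deciding by the first field are all guesses made from the sample lines, not a specification of the real file.

```perl
use strict;
use warnings;

# Assumed layouts; every column name here is hypothetical.
my %layout = (
    full   => [qw( count grade position name blank date age country )],
    person => [qw( name blank date age country flag )],
    vacant => [qw( name blank status assignment blank2 colour )],
);

# Decide which layout a line follows by inspecting its first field.
sub classify {
    my ($first) = @_;
    return 'full'   if $first =~ /^\d{1,2}$/;
    return 'vacant' if $first eq 'N/A';
    return 'person';
}

sub parse_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /:/, $line, -1;      # -1 keeps trailing empty fields
    my $kind   = classify($fields[0] // '');
    my @names  = @{ $layout{$kind} };
    my %rec    = (kind => $kind);
    @rec{@names} = @fields[0 .. $#names];   # hash slice; extra fields are ignored
    return \%rec;
}

my $rec = parse_line("9:P5:SENIOR ECONOMIST:D. Green::7/23/1958:56:UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND::::\n");
print "$rec->{kind}: $rec->{name}, $rec->{country}\n";
```

Note that the first DATA line has its date directly after the name, with no empty field in between, so it would not line up with the assumed `full` layout; a real implementation would need a rule for that case as well.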
