Re: Regular expression

But being a 'newbee' to perl I cant find a redundant way to do get the first line parsed. Could any one give me some suggestions/directions how to get them parsed.

Am I right in guessing that your problem is that "one line is not like the others" (well, actually two)? Your input has three types of lines, each with a distinct beginning, matching the following regexes:

use strict;
use warnings;

#Experiment line: a word followed by a colon
my $EXPERIMENT_REGEX = qr(^\w+:);

#Labels line: the literal header words "Include" and "Color"
my $LABELS_REGEX = qr(^Include\s+Color);

#Data line: a word followed by a number
my $DATA_REGEX = qr(^\w+\s+\d+);

You can use these distinctions to choose which function should parse each line:

while (<DATA>) {
  chomp;            #strip the trailing newline (assumes $/ is "\n")
  my $line = $_;    #document what $_ means in this context

  #choose what to do with a line based on how it begins

  if ($line =~ $EXPERIMENT_REGEX) {
     parse_experiment($line);
  } elsif ($line =~ $LABELS_REGEX) {
     parse_labels($line);
  } elsif ($line =~ $DATA_REGEX) {
     parse_data($line);
  }
}

#define parse routines here...

__DATA__
Experiment: rs5443-61902_923_922_921_291008 Active filters: FAM (483-533), VIC /HEX / Yellow555 (523-568)
Include Color Pos Name 483-533 523-568 Call Score Status
True 3200768 A1 EPS316120 8.535 17.575 GG 0.90
True 3200768 A2 EPS318077 8.820 17.126 GG 0.95
True 255 A3 EPS316121 17.084 13.650 GA 0.97
True 3200768 A4 EPS318078 8.541 16.653 GG 0.94
True 16744448 A5 EPS316122 18.267 3.880 AA 1.00
True 255 A6 EPS318079 13.130 11.004 GA 0.91
True 3200768 A7 EPS316123 9.150 16.868 GG 1.00
True 3200768 A8 EPS318080 9.346 17.771 GG 0.97
True 3200768 A9 EPS316124 9.205 17.201 GG 0.98
True 3200768 A10 EPS318081 9.729 17.934 GG 1.00

In your parse_experiment(...) subroutine you would put further code to break up the experiment line and extract the data you need. One way is to call split(...) once with a regex describing the delimiters between the fields in your experiment record. Then, for the particular field containing "rs...", call it again with '-' or '_' as the delimiter:

   my @aFields = split(/[\s:;]+/, $line);
   my $rscode = $aFields[1];   #assignment to document field use
   my @aRsFields = split(/[-_]/, $rscode);

   my $sRsNo = $aRsFields[0];         # 1st array element
   my $s619Thingy = $aRsFields[1];    # 2nd array element
   my $s291Thingy = $aRsFields[-1];   # last array element
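To make that concrete, here is one way the two split calls might be wrapped into a parse_experiment routine. This is only a sketch: the hash keys (rs_no, second_part, last_part) and the choice to return a hash reference are my own assumptions, since I don't know what those pieces mean in your data.

```perl
use strict;
use warnings;

# Hypothetical parse routine: returns a hash ref with the pieces of the
# "rs..." code, or undef if the line has no second field.
sub parse_experiment {
    my ($line) = @_;

    # First pass: split on runs of whitespace, colons, or semicolons.
    my @fields = split /[\s:;]+/, $line;
    my $rscode = $fields[1];
    return unless defined $rscode;

    # Second pass: break the rs code apart on '-' or '_'.
    my @rs_fields = split /[-_]/, $rscode;

    return {
        rs_no       => $rs_fields[0],     # 1st piece
        second_part => $rs_fields[1],     # 2nd piece
        last_part   => $rs_fields[-1],    # last piece
    };
}

my $exp = parse_experiment(
    'Experiment: rs5443-61902_923_922_921_291008 Active filters: FAM (483-533)'
);
print "$exp->{rs_no} $exp->{second_part} $exp->{last_part}\n";
# prints: rs5443 61902 291008
```

Returning a hash reference keeps the caller from having to remember which array index holds which piece.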

Note 1: Since you say you are new to Perl I've put in extra comments. In your production code you should probably leave them out.

Note 2: Although this looks like code, it is really meant as pseudo code. This code uses <DATA> as the input stream (that's the stuff below the __DATA__ token), but of course you will want to replace it with your real stream.

Also, I have no idea what you need to do with the parse results, so I haven't defined those subroutines. Most likely you will need to pass additional data to your parsing subroutines or capture their return values.
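As an illustration of passing extra data in and capturing return values, the sketch below has parse_labels return the column names and parse_data use them to build a record. Both bodies are guesses on my part: they assume the fields are separated by whitespace and never contain embedded spaces, which may not hold for your real file.

```perl
use strict;
use warnings;

# Hypothetical: the labels line supplies the column names.
sub parse_labels {
    my ($line) = @_;
    return [ split ' ', $line ];
}

# Hypothetical: pair each value on a data line with its column name.
# Columns with no value on the line (e.g. Status) end up undef.
sub parse_data {
    my ($line, $labels) = @_;
    my @values = split ' ', $line;
    my %record;
    @record{@$labels} = @values;    # hash slice: keys from labels
    return \%record;
}

my $labels = parse_labels('Include Color Pos Name 483-533 523-568 Call Score Status');
my $row    = parse_data('True 255 A3 EPS316121 17.084 13.650 GA 0.97', $labels);
print "$row->{Pos} $row->{Name} $row->{Call} $row->{Score}\n";
# prints: A3 EPS316121 GA 0.97
```

In the main loop you would then keep the result of parse_labels in a variable declared outside the while block and hand it to parse_data for each data line.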

Note 3: This code makes the unrealistic assumption that your data has no junk in it. Most data does, so you would also need to supplement it with a lot of error-checking code and possibly more elaborate regexes to handle weird field delimiters, missing fields, and the like. But hopefully it will give you a general idea of how to break down the problem to suit your particular situation.
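For a rough idea of what such error checking might look like, here is a sketch of a validator for data lines. The routine name, the minimum of 8 fields, and the patterns for the Color and Score columns are all guesses based on the sample rows, so adjust them to whatever your real data allows.

```perl
use strict;
use warnings;

# Hypothetical validator: returns a description of the first problem
# found, or undef if the line looks fine.
sub check_data_line {
    my ($line) = @_;
    my @f = split ' ', $line;

    return "expected at least 8 fields, got " . scalar(@f) if @f < 8;
    return "color '$f[1]' is not an integer" unless $f[1] =~ /^\d+$/;
    return "score '$f[7]' is not a number"   unless $f[7] =~ /^\d+(?:\.\d+)?$/;
    return;
}

for my $line ('True 255 A3 EPS316121 17.084 13.650 GA 0.97', 'True 255 A3') {
    my $problem = check_data_line($line);
    warn "skipping line: $problem\n" if defined $problem;
}
```

Whether you skip, warn, or die on a bad line depends on how much you trust the source of the file.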

Best, beth
