Regular expression

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regular expression by CountZero (Bishop) on Feb 09, 2009 at 11:24 UTC
The first thing to ask yourself is whether the fields in the line to be split always sit at the same positions ("fixed-length fields") or whether they are delimited by some marker (perhaps a space, an underscore, or a minus sign).

If the fields are fixed length, `unpack` is the first place to look. If the record has delimited fields (of possibly varying lengths), `split` or a regular expression is more likely to help you.

A tentative solution for delimited fields is:

    use strict;
    use warnings;

    my $line = 'Experiment: rs5443-61902_923_922_921_291008 Active filters: FAM (483-533), VIC /HEX / Yellow555 (523-568)';

    # Split on hyphen, space, or underscore; undef discards the fields we don't need.
    my ( undef, $first, $second, undef, undef, undef, $third, undef )
        = split /[- _]/, $line;

    print "First: $first\nSecond: $second\nThird: $third\n";

Which gives:

    First: rs5443
    Second: 61902
    Third: 291008

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
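CountZero's reply points to `unpack` for fixed-length records but only demonstrates the `split` case. Below is a minimal sketch of the fixed-length alternative. The column widths (10, 6, and 6 characters), the `$template` variable, and the `parse_fixed` helper are all assumptions made up for illustration; the data in the original question is delimited, not fixed width.

```perl
use strict;
use warnings;

# Hypothetical fixed-width layout (an assumption, not taken from the thread):
# columns 1-10 hold the first value, 11-16 the second, 17-22 the third.
my $template = 'A10 A6 A6';

sub parse_fixed {
    my ($record) = @_;

    # In unpack, 'A' extracts a text field of the given width and strips
    # trailing spaces; whitespace inside the template is ignored.
    my ( $first, $second, $third ) = unpack $template, $record;
    return ( $first, $second, $third );
}

# Build a sample record padded to the assumed widths.
my $record = sprintf '%-10s%-6s%-6s', 'rs5443', '61902', '291008';

my ( $first, $second, $third ) = parse_fixed($record);
print "First: $first\nSecond: $second\nThird: $third\n";
```

If the real file turns out to be delimited after all, `unpack` will silently cut values at the wrong column, which is why the fixed-versus-delimited question has to be answered first.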
Re: Regular expression by ELISHEVA (Prior) on Feb 09, 2009 at 13:00 UTC
"But being a 'newbee' to perl I cant find a redundant way to do get the first line parsed. Could any one give me some suggestions/directions how to get them parsed."

Am I right in guessing your problem is that "one line is not like the others?" (well, actually 2). I note that your input has three types of lines, each with a distinct beginning, matching the following regexes:

    use strict;
    use warnings;

    # Word followed by ':'
    my $EXPERIMENT_REGEX = qr(^\w+:);
    # 'Include' followed by 'Color'
    my $LABELS_REGEX = qr(^Include\s+Color);
    # Word followed by a number
    my $DATA_REGEX = qr(^\w+\s+\d+);

You can use these distinctions to choose which function should parse each line:

    while (<DATA>) {
        chomp;           # strip the newline (assuming $/ is the default "\n")
        my $line = $_;   # document what $_ means in this context

        # choose what to do with a line based on how it begins
        if ($line =~ $EXPERIMENT_REGEX) {
            parse_experiment($line);
        } elsif ($line =~ $LABELS_REGEX) {
            parse_labels($line);
        } elsif ($line =~ $DATA_REGEX) {
            parse_data($line);
        }
    }

    # define parse routines here...

    __DATA__
    Experiment: rs5443-61902_923_922_921_291008 Active filters: FAM (483-533), VIC /HEX / Yellow555 (523-568)
    Include Color Pos Name 483-533 523-568 Call Score Status
    True 3200768 A1 EPS316120 8.535 17.575 GG 0.90
    True 3200768 A2 EPS318077 8.820 17.126 GG 0.95
    True 255 A3 EPS316121 17.084 13.650 GA 0.97
    True 3200768 A4 EPS318078 8.541 16.653 GG 0.94
    True 16744448 A5 EPS316122 18.267 3.880 AA 1.00
    True 255 A6 EPS318079 13.130 11.004 GA 0.91
    True 3200768 A7 EPS316123 9.150 16.868 GG 1.00
    True 3200768 A8 EPS318080 9.346 17.771 GG 0.97
    True 3200768 A9 EPS316124 9.205 17.201 GG 0.98
    True 3200768 A10 EPS318081 9.729 17.934 GG 1.00

In your `parse_experiment(...)` subroutine you would put further regexes to break up the experiment line and extract the data you need. To get the data you want, you will need to call `split(...)` once with a regex describing the delimiter between the fields in your experiment record.
Then for the particular field containing "rs...", call it again with '-' and '_' as the delimiters:

    my @aFields = split(/[\s:;]+/, $line);
    my $rscode = $aFields[1];            # assignment to document field use
    my @aRsFields = split(/[-_]/, $rscode);
    my $sRsNo = $aRsFields[0];           # 1st array element
    my $s619Thingy = $aRsFields[1];      # 2nd array element
    my $s291Thingy = $aRsFields[-1];     # last array element

Note 1: Since you say you are new to Perl, I've put in extra comments. In your production code you should probably leave them out.

Note 2: Although this looks like code, it is really meant as pseudocode. This code uses <DATA> as the input stream (that's the stuff below the `__DATA__` token), but of course you will want to replace it with your real stream. Also, I have no idea what you need to do with the parse results, so I haven't defined those subroutines. Most likely you will need to pass additional data to your parsing subroutines or capture return values.

Note 3: This code makes the unrealistic assumption that your data has no junk in it. Most data does, so you would also need to supplement it with a lot of error-checking code and possibly more elaborate regexes to handle weird field delimiters, missing fields, and the like. But hopefully it will give you a general idea of how to break down the problem to suit your particular situation.

Best, beth
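The parse routines are left undefined above, and Note 3 asks for error checking. The sketch below shows one way they might look, built on the same two `split` calls. The return values (hash references, or nothing on failure), the hash keys `rs_code`, `second`, and `final`, the `@DATA_COLUMNS` list, and the assumption that the first value always looks like `rs` followed by digits are all illustrative guesses rather than anything stated in the thread.

```perl
use strict;
use warnings;

# Hypothetical parser for the "Experiment:" line. Returns a hash reference,
# or nothing (undef in scalar context) if the line lacks the expected shape.
sub parse_experiment {
    my ($line) = @_;

    # First split: break the line on runs of whitespace, colons, semicolons.
    my @fields = split /[\s:;]+/, $line;
    return unless @fields >= 2 && $fields[0] eq 'Experiment';

    # Second split: break the "rs..." field on hyphens and underscores.
    my @parts = split /[-_]/, $fields[1];
    return unless @parts >= 3;               # need first, second, and last
    return unless $parts[0] =~ /^rs\d+$/;    # assumed shape of the first value

    return {
        rs_code => $parts[0],
        second  => $parts[1],
        final   => $parts[-1],               # last element, however many precede it
    };
}

# Column names taken from the "Include Color ..." label line. The trailing
# Status column is empty in the sample rows, so it is not required here.
my @DATA_COLUMNS = qw(Include Color Pos Name 483-533 523-568 Call Score);

# Hypothetical parser for a data row; rejects rows with missing fields.
sub parse_data {
    my ($line) = @_;

    my @values = split ' ', $line;
    return unless @values >= @DATA_COLUMNS;

    my %row;
    @row{@DATA_COLUMNS} = @values[ 0 .. $#DATA_COLUMNS ];
    return \%row;
}

my $exp = parse_experiment(
    'Experiment: rs5443-61902_923_922_921_291008 Active filters: FAM (483-533)'
) or die "Unrecognized experiment line\n";
print "$exp->{rs_code}, $exp->{second}, $exp->{final}\n";

my $row = parse_data('True 3200768 A1 EPS316120 8.535 17.575 GG 0.90')
    or die "Unrecognized data row\n";
print "$row->{Pos}: $row->{Call} ($row->{Score})\n";
```

Returning nothing on failure lets the calling loop decide whether a bad line is fatal or just worth a warning, which is usually a decision that depends on the file rather than on the parser.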
Re: Regular expression by JavaFan (Canon) on Feb 09, 2009 at 11:11 UTC
"I would like to parseout rs5443, 61902 and the 291008."

Easy. Millions of ways to do so.

"This could vary in each file."

Yes, well, there's the catch, isn't it? If you can tell how it can vary from file to file, you accomplish two things:

- You're more than halfway to finding the answer yourself.
- You increase the chance that a suggestion offered here will actually work for a different file than the example you provide.

A win-win situation!
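To illustrate why the kind of variation matters, here is a sketch of a single regular expression that writes one possible answer into the pattern itself. It assumes (a guess, not confirmed by the original poster) that the first value is a word, the second is a number after a hyphen, and that any number of underscore-separated numeric groups may sit between the second value and the last. The names `$EXPERIMENT_LINE` and `extract_ids` are hypothetical.

```perl
use strict;
use warnings;

# Assumed line shape: Experiment: <word>-<digits>_<digits>..._<digits>
my $EXPERIMENT_LINE = qr{
    ^Experiment: \s+
    (\w+)           # first value, for example rs5443
    -
    (\d+)           # second value, for example 61902
    (?: _ \d+ )*    # zero or more middle groups that are not needed
    _ (\d+)         # last group, for example 291008
}x;

# Returns the three captured values, or an empty list if the line
# does not fit the assumed shape.
sub extract_ids {
    my ($line) = @_;
    if ( $line =~ $EXPERIMENT_LINE ) {
        return ( $1, $2, $3 );
    }
    return;
}

my @ids = extract_ids(
    'Experiment: rs5443-61902_923_922_921_291008 Active filters: FAM (483-533)'
);
print join( ', ', @ids ), "\n";    # prints "rs5443, 61902, 291008"
```

Unlike a fixed list of `undef` placeholders, this pattern still finds the last group when a file has more or fewer middle groups. It fails outright if the variation is of some other kind, which is the reason for asking how the files differ.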
Re: Regular expression by Bloodnok (Vicar) on Feb 09, 2009 at 18:14 UTC
Does the line you wish to parse always follow the same format, i.e. does it begin with the string Experiment: and do you always want the 2nd, underscore-delimited string? If it's only the values in this 2nd string that '...could vary in each file', then...

    while (<DATA>) {
        next unless /^Experiment:/;
        my ($first, $second, undef, undef, undef, $last) = split /[-_]/, (split)[1];
        warn "$first, $second, $last";
        last;
    }
    __DATA__
    Experiment: rs5443-61902_923_922_921_291008 Active filters: FAM (483-533), VIC /HEX / Yellow555 (523-568)
    Include Color Pos Name 483-533 523-568 Call Score Status
    True 3200768 A1 EPS316120 8.535 17.575 GG 0.90
    True 3200768 A2 EPS318077 8.820 17.126 GG 0.95
    True 255 A3 EPS316121 17.084 13.650 GA 0.97
    True 3200768 A4 EPS318078 8.541 16.653 GG 0.94
    True 16744448 A5 EPS316122 18.267 3.880 AA 1.00
    True 255 A6 EPS318079 13.130 11.004 GA 0.91
    True 3200768 A7 EPS316123 9.150 16.868 GG 1.00
    True 3200768 A8 EPS318080 9.346 17.771 GG 0.97
    True 3200768 A9 EPS316124 9.205 17.201 GG 0.98
    True 3200768 A10 EPS318081 9.729 17.934 GG 1.00

returns

    rs5443, 61902, 291008 at tst.pl line 8, <DATA> line 1.

as required?

A user level that continues to overstate my experience :-))

