Re: Regex Extraction Help

in reply to Regex Extraction Help

Your main issue seems to be that you fetch too much or the wrong things. Reminder: you need to put the thing you want to get in round brackets ... which is the \w\w\d\d\d\d\d in this case. So your regex could look like

 /DR\s+Pfam;\s+(\w\w\d\d\d\d\d);\s+/

(I don't see the need for look-aheads or look-behinds here.)
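As a rough illustration of the capture-group approach, here is a minimal sketch. The helper name `extract_pfam_id` and the sample line are assumptions made up for this example; they are not taken from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured ID, or undef if the line
# does not match.
sub extract_pfam_id {
    my ($line) = @_;
    # The parentheses capture two word characters followed by five digits.
    return $line =~ /DR\s+Pfam;\s+(\w\w\d\d\d\d\d);\s+/ ? $1 : undef;
}

# Assumed sample input, for illustration only.
my $sample = 'DR   Pfam; PF00069; Pkinase; 1.';
my $id     = extract_pfam_id($sample);
print defined $id ? "$id\n" : "no match\n";    # prints "PF00069"
```

The same capture could be written more compactly as `(\w{2}\d{5})`, which matches exactly the same strings.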

HTH, Rata

update Flexx: why do you assume that the second field of a semicolon-separated file is meant? I agree that the specification is very vague. However, from the examples given by invaderzard, it seems that the text DR Pfam as well as the format (2 letters, 5 digits) are the important parts ...


Replies are listed 'Best First'.
Re^2: Regex Extraction Help by Flexx (Pilgrim) on Aug 09, 2012 at 17:02 UTC
Well, I guess all the fields are variable, and what invaderzard meant was to get that second field. So I'd suggest this:

 # assuming the raw data is in $line.
 $line =~ m/^[^;]*;\s*([^;]*?)\s*;/;
 # $1 now holds whatever is between the first and second
 # semicolon (the second field), leading and trailing spaces trimmed.

Now, what am I doing here? First I say: let's start at the beginning (`^`). This is important, since we can't exclude the possibility that the pattern repeats in one instance of `$line`. Next, I say: give me zero or more non-semicolon characters (`[^;]*`), followed by exactly one semicolon (`;`). Now our "cursor" would be in the second field, so to speak.

We say, well, there might or might not be some leading space (`\s*`). Then comes the data we want, which is why we use parentheses to capture it. What do we want to capture? Well, again, anything that is not a semicolon (`[^;]*?`), but this time non-greedily (by putting `?` after the quantifier). That's because we want any trailing space to go into the `\s*` that follows, instead of it being captured. Lastly, we require that the field is terminated by exactly one semicolon (`;`).

If you want to capture other fields as well, then a solution using split, as has been suggested below, is a more efficient way of doing it. If you want just a few fields of a long CSV record (which this seems to be, only delimited by semicolons instead of commas), then you could also expand on the regexp above, which might be a bit more performant than split. But I didn't really check that with benchmarks. It's just a hunch, and very dependent on the length of the input and the number of fields in it.

Cheers, Flexx
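To make the comparison concrete, here is a small sketch that puts the second-field regex next to a split-based version. The sub names and the sample line are assumptions for illustration, and the split variant is only one possible way to write it, not the one posted elsewhere in the thread.

```perl
use strict;
use warnings;

# Regex variant: skip the first field, then capture the second one
# without its surrounding whitespace.
sub second_field_regex {
    my ($line) = @_;
    return $line =~ m/^[^;]*;\s*([^;]*?)\s*;/ ? $1 : undef;
}

# Split variant (hypothetical): split on semicolons, swallowing the
# whitespace around each separator, and take the element at index 1.
sub second_field_split {
    my ($line) = @_;
    my @fields = split /\s*;\s*/, $line;
    return $fields[1];
}

# Assumed sample input, for illustration only.
my $line = 'DR   Pfam; PF00069; Pkinase; 1.';
print second_field_regex($line), "\n";    # prints "PF00069"
print second_field_split($line), "\n";    # prints "PF00069"
```

One difference to keep in mind: the regex insists on a semicolon after the second field, so a line with only two fields and no trailing semicolon does not match, while the split version still returns the second field.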
Re^2: Regex Extraction Help by Flexx (Pilgrim) on Aug 15, 2012 at 21:52 UTC
« Flexx: why do you assume that the second field of a semicolon-separated file is meant? »

Umm.. Well, because it looks like a CSV format? Experience in seeing part of a problem and working out what the requirement is (a.k.a. "getting 'all' the information from the customer" ;)? And it appeared that putting that first field in the regexp came more from confusion about how to "get to" the second field, something I often see when someone is learning to use regular expressions. Along with too much use of `.` to pull in fields, BTW, when "not the separator" (`[^;]`) is often more correct, or even needed. Things get worse once quoting has to be considered, of course. But yeah, it was just an educated guess.

So long, Flexx
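The point about `.` versus "not the separator" can be shown with a short sketch. Both sub names and the sample line are assumptions made up for this example.

```perl
use strict;
use warnings;

# Greedy dot: ".*" runs past separators and only gives back as much
# as it must, so the capture lands on a later field than intended.
sub second_field_dot {
    my ($line) = @_;
    return $line =~ m/^.*;\s*(.*)\s*;/ ? $1 : undef;
}

# Negated class: "[^;]*" can never cross a semicolon, so the capture
# stays inside the second field.
sub second_field_class {
    my ($line) = @_;
    return $line =~ m/^[^;]*;\s*([^;]*?)\s*;/ ? $1 : undef;
}

# Assumed sample input, for illustration only.
my $line = 'a; b; c; d';
print second_field_dot($line),   "\n";    # prints "c"
print second_field_class($line), "\n";    # prints "b"
```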
