use Modern::Perl;
my $dat = 'DR Pfam; PF00070; Pyr_redox; 2.';
my $info = (split '; ', $dat)[1];
say $info;
Output:
PF00070
Hope this helps!
invaderzard, just to make this clear: the solution by Kenosis is by far the quicker and easier version, and I'd of course use it any time I just need a quick split on a field separator.
But there is one caveat to keep in mind: split, of course, does not test the format of the input. So if you wanted the second field of a record that goes like this:
$record = 'A;B;C;D';
then
$second_field = (split ';', $record)[1];
does work. However, so does it for inputs like:
A;B
#foo;B;ar
;B
All of the above inputs would leave a B in $second_field. That might be correct in a particular case, but in general we don't want to silently accept malformed records. So if we, say, iterate over records, we should test and capture using a regexp in an if:
if($record =~ m/^.;(.);.;.$/) {
$second_field = $1;
}
Now this will only set $second_field if the record matches the format "four single-character fields, each delimited by one semicolon". Mind that . matches a semicolon, too, so even the input ';;;;;;;' passes this test. ;)
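To see the difference side by side, here is a minimal sketch that runs both approaches over the four sample inputs from above. The hash names %via_split and %via_regex are just made up for this illustration.

```perl
use strict;
use warnings;
use feature 'say';

# One well-formed record and the three malformed ones from above.
my @records = ('A;B;C;D', 'A;B', '#foo;B;ar', ';B');

my (%via_split, %via_regex);
for my $record (@records) {
    # split returns the second field no matter what the record looks like
    $via_split{$record} = (split ';', $record)[1];

    # the anchored regexp captures only if the whole record has the expected shape
    $via_regex{$record} = $record =~ m/^.;(.);.;.$/ ? $1 : undef;

    say sprintf '%-10s split: %s  regex: %s',
        $record, $via_split{$record}, $via_regex{$record} // '(no match)';
}
```

split yields B for all four records, while the regexp only captures B for 'A;B;C;D' and reports no match for the other three.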
Have fun with regexen. They're cool. ;)
So long,
Flexx
You make a good point about splitting on a record separator within possibly malformed records. Based upon the OP's regex, it appears that the pattern's stable--with one space after the semi-colon. However, we can ask split to 'test' the format of the input, like this:
my $info = (split /\s*;\s*/, $dat)[1];
This will return the info the OP wants, whether there are spaces before or after the semi-colon, or not.
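A small sketch of that, assuming a few spacing variations of the OP's line. The second and third lines are made up for illustration; only the first one is from the thread.

```perl
use strict;
use warnings;
use feature 'say';

my @lines = (
    'DR Pfam; PF00070; Pyr_redox; 2.',        # as posted by the OP
    'DR Pfam;PF00070;Pyr_redox;2.',           # hypothetical: no spaces at all
    'DR Pfam ;  PF00070  ; Pyr_redox ; 2.',   # hypothetical: spaces on both sides
);

my @ids;
for my $line (@lines) {
    # whitespace around the semicolon is consumed as part of the separator
    my $id = (split /\s*;\s*/, $line)[1];
    push @ids, $id;
    say $id;
}
```

All three lines give PF00070. Note that this only normalizes the whitespace around the separator; it still does not check how many fields there are.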
And within a regex on the OP's data:
use Modern::Perl;
my $dat = 'DR Pfam; PF00070; Pyr_redox; 2.';
$dat =~ /;\s*(\w+)\s*;.+;/ and say $1; #prints PF00070
It was a good call to address this issue...
Hi
Your main issue seems to be that you fetch too much or the wrong things. Reminder: you need to put the thing you want to get in round brackets ... which is the \w\w\d\d\d\d\d in this case. So your regex could look like
/DR\s+Pfam;\s+(\w\w\d\d\d\d\d);\s+/
(I don't see the need for look-aheads or look-behinds here.)
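A quick sketch of how that regex behaves. Only the first input line is from the thread; the other two are hypothetical lines added to show what gets rejected.

```perl
use strict;
use warnings;
use feature 'say';

my @lines = (
    'DR Pfam; PF00070; Pyr_redox; 2.',    # from the thread
    'DR Other; XX12345; Something; 1.',   # hypothetical: not a Pfam line
    'DR Pfam; PF0007; Too_short; 1.',     # hypothetical: only four digits
);

my @found;
for my $line (@lines) {
    # capture two word characters followed by five digits after "DR Pfam;"
    if ($line =~ /DR\s+Pfam;\s+(\w\w\d\d\d\d\d);\s+/) {
        push @found, $1;
        say $1;
    }
}
```

Only the first line matches, so the loop prints PF00070 once.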
HTH, Rata
Update: Flexx, why do you assume that the second field of a semicolon-separated file is meant?
I agree that the specification is very vague. However from the examples given by invaderzard, it seems
that the text DR Pfam as well as the format (2 letters, 5 digits) are the important parts ...
Well, I guess all the fields are variable, and what invaderzard meant was to get that second field.
So I'd suggest this:
# assuming the raw data is in $line.
$line =~ m/^[^;]*;\s*([^;]*?)\s*;/;
# On a successful match, $1 now holds whatever is between the first
# and second semicolon, leading and trailing spaces trimmed.
Now, what am I doing here?
First I say: Let's start at the beginning (^). This is important, since we can't exclude the possibility that the pattern repeats in one instance of $line.
Next, I say: give me zero or more non-semicolon characters ([^;]*), followed by exactly one semicolon (;).
Now our "cursor" is, so to speak, in the second field. We say: there might or might not be some leading space (\s*). Then comes the data we want, which is why we use parentheses to capture it. What do we want to capture? Again, anything that is not a semicolon ([^;]*?), but this time non-greedily (using the *? quantifier). That's because we want any trailing space to go into the \s* that follows, instead of it being captured. Lastly, we require that the field is terminated by exactly one semicolon (;).
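The walkthrough above can also be written directly into the pattern using the /x modifier, which lets you add whitespace and comments. This is just a sketch; the sample line with extra spaces and the variable $second are made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical input with extra spaces around the second field.
my $line = 'DR Pfam;   PF00070  ; Pyr_redox; 2.';

my $second;
if ($line =~ m{
        ^            # start at the beginning of the line
        [^;]* ;      # first field: zero or more non-semicolons, then the separator
        \s*          # optional leading whitespace of the second field
        ( [^;]*? )   # capture the second field, non-greedily
        \s* ;        # optional trailing whitespace, then the terminating semicolon
    }x) {
    $second = $1;
    say "[$second]";
}
```

This prints [PF00070], with the surrounding spaces trimmed off.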
If you want to capture other fields as well, then a solution using split, as suggested elsewhere in this thread, is a more efficient way of doing it. If you want just a few fields of a long CSV record (which this seems to be, only delimited by semicolons instead of commas), then you could also expand on the regexp above, which might be a bit more performant than split. But I didn't check that with benchmarks; it's just a hunch, and very dependent on the length of the input and the number of fields in it.
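If you'd like to check that yourself, the core Benchmark module can compare the two. A minimal sketch, assuming a made-up record of 50 fields; the numbers will depend on your perl and your data, so no results are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical long record: 50 semicolon-separated fields.
my $line = join '; ', map("field$_", 1 .. 50);

my %extract = (
    # split the whole record, then pick the second field
    split => sub { (split /\s*;\s*/, $line)[1] },
    # stop matching as soon as the second field has been captured
    regex => sub { $line =~ m/^[^;]*;\s*([^;]*?)\s*;/ ? $1 : undef },
);

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, \%extract);
```

Vary the number of fields and the position of the field you want before drawing any conclusions.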
Cheers,
Flexx
« Flexx: why do you assume that the second field of a semicolon-separated file is meant? »
Umm... well, because it looks like a CSV format? Call it experience in seeing a bit of a problem and guessing what the requirement is (a/k/a "getting 'all' the information from the customer" ;)?
And it appeared that putting that first field in the regexp was more out of confusion as to how to "get to" the second field, something I often see when someone learns how to use regular expressions. Along with too much use of .* to pull in fields, BTW, when "not the separator" ([^;]*) is often more correct, or even needed. Things get worse once quoting has to be considered, of course.
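A tiny sketch of why .* can bite you here, using the 'A;B;C;D' record. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';

my $record = 'A;B;C;D';

# .* is greedy and also matches semicolons, so it runs past the first separator
my ($greedy)  = $record =~ m/^.*;(.*);/;

# [^;]* cannot cross a separator, so the capture is really the second field
my ($negated) = $record =~ m/^[^;]*;([^;]*);/;

say "greedy:  $greedy";    # prints: greedy:  C
say "negated: $negated";   # prints: negated: B
```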
But yeah, it was just an educated guess.
So long,
Flexx