Re^2: Regex Extraction Help

in reply to Re: Regex Extraction Help
in thread Regex Extraction Help

invaderzard, I just wanted to make clear that this solution by Kenosis is the far quicker and easier version, which I would, of course, use any time I just need a quick split by a field separator on an input.

But: There is one caveat here to keep in mind. Split, of course, does not test the format of the input. So if you wanted the second field of a record that goes like this:

$record = 'A;B;C;D';

then

$second_field = (split ';', $record)[1];

does work. However, so it does for inputs like:

A;B
#foo;B;ar
;B

All of the above inputs would leave a B in $second_field. Which, you know, might be correct in a particular case, but in general we don't want to just ignore malformed records. So if we, say, iterate over records, make sure to test and capture using a regexp in an if:

if($record =~ m/^.;(.);.;.$/) {
  $second_field = $1;
}

Now this will only set $second_field if the record matches the format of four single-character fields, each delimited by one semicolon. Even if the input is ';;;;;;;' (a . matches a semicolon, too). ;)
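To make the difference concrete, here is a minimal sketch that runs the sample records through both approaches. The sub names are made up for illustration, and the third variant (using [^;] instead of .) is an assumption about what a stricter check could look like, not something from the code above.

```perl
use strict;
use warnings;

# Lenient: split never checks the shape of the record.
sub second_by_split {
    my ($record) = @_;
    # Assign to a scalar first so an empty split result yields undef.
    my $field = (split ';', $record)[1];
    return $field;
}

# Strict: the pattern from this post; undef when the record does not match.
sub second_by_regex {
    my ($record) = @_;
    return $record =~ m/^.;(.);.;.$/ ? $1 : undef;
}

# Stricter still (illustrative assumption): a field may not be a semicolon.
sub second_by_strict_regex {
    my ($record) = @_;
    return $record =~ m/\A[^;];([^;]);[^;];[^;]\z/ ? $1 : undef;
}

for my $record ('A;B;C;D', 'A;B', '#foo;B;ar', ';B', ';;;;;;;') {
    my @results = map { defined $_ ? "'$_'" : 'undef' }
        second_by_split($record),
        second_by_regex($record),
        second_by_strict_regex($record);
    printf "%-10s split=%s regex=%s strict=%s\n", $record, @results;
}
```

Only 'A;B;C;D' yields a B from all three. The split version also hands back a B for the three malformed records, and the original pattern accepts ';;;;;;;' with a semicolon as the "field", which the [^;] variant rejects.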

Have fun with regexen. They're cool. ;)

So long,
Flexx

Replies are listed 'Best First'.

Re^3: Regex Extraction Help
by Kenosis (Priest) on Aug 09, 2012 at 19:05 UTC

You make a good point about splitting on a record separator within possibly malformed records. Based upon the OP's regex, it appears that the pattern's stable--with one space after the semi-colon. However, we can ask split to 'test' the format of the input, like this:

my $info = (split /\s*;\s*/, $dat)[1];

This will return the info the OP wants, whether there are spaces before or after the semi-colon, or not.
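A quick way to see that tolerance is a small sketch like the following. The sub name second_field and the two extra spacing variants are made up for illustration; only the first record is the OP's data.

```perl
use strict;
use warnings;

# Second field of a semicolon-delimited record, ignoring spaces around ';'.
sub second_field {
    my ($dat) = @_;
    my $info = (split /\s*;\s*/, $dat)[1];
    return $info;
}

# The OP's record, plus two hypothetical spacing variants of it.
print second_field($_), "\n"
    for 'DR Pfam; PF00070; Pyr_redox; 2.',
        'DR Pfam;PF00070;Pyr_redox;2.',
        'DR Pfam ; PF00070 ; Pyr_redox ; 2.';
```

All three records print PF00070, since the separator pattern swallows any whitespace on either side of the semicolon.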

And within a regex on the OP's data:

use Modern::Perl;

my $dat = 'DR Pfam; PF00070; Pyr_redox; 2.';
$dat =~ /;\s*(\w+)\s*;.+;/ and say $1;    #prints PF00070

It was a good call to address this issue...


Re^4: Regex Extraction Help

by Flexx (Pilgrim) on Aug 09, 2012 at 22:08 UTC

« Based upon the OP's regex, it appears that the pattern's stable--with one space after the semi-colon »

Oh indeed, my "warning" was meant more as a general tip; I didn't just mean this particular example. I just meant to say that split versus if(m//) with some rather "strict" regexp typically results in a different level of defensiveness in the code. Again, I mean just typically. I mean hey, "just use split" would've been my first answer, too. But you wrote that already, so I had to come up with something nitpicking. ;)

« However, we can ask split to 'test' the format of the input »

Umm... ok, you wrote 'test' in quotes, so alright... ;)

Sure, you can combine the split and trim operation, but still, this split would happily work on any input you throw at it (including undef, with a warning, though). It won't tell you (by not even matching) that your input looks a bit strange there.
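If one wanted to stay with split and still get told about strange input, one option is to count the fields afterwards. This is only a sketch under assumptions: the sub name checked_second_field and the expected-count parameter are made up here, not something from this thread.

```perl
use strict;
use warnings;

# Returns the second field, or undef unless the record has exactly
# $expected fields. Name and interface are illustrative assumptions.
sub checked_second_field {
    my ($record, $expected) = @_;
    return undef unless defined $record;

    # A limit of -1 keeps trailing empty fields, so they get counted too.
    my @fields = split /\s*;\s*/, $record, -1;
    return undef unless @fields == $expected;

    return $fields[1];
}

my $dat  = 'DR Pfam; PF00070; Pyr_redox; 2.';
my $info = checked_second_field($dat, 4);
print defined $info ? "$info\n" : "malformed record\n";
```

This still says nothing about what the fields contain; it only catches records with the wrong number of them, so a strict regexp remains the more defensive check.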

Now, again, I am not so much talking about the OP's concrete problem, but was trying to educate a bit on which method to use when, since his usage of \d\d\d\d\d instead of \d{5} suggested that regexen aren't something he has worked with for years. (No offence meant.)

So long,
Flexx


Re^5: Regex Extraction Help

by Kenosis (Priest) on Aug 09, 2012 at 22:16 UTC

You make more good points, and I'm glad you offered the "general tip," as it helps with developing good programming practices. Anticipating and coding for exceptions can (and does) save many headaches...


Re^4: Regex Extraction Help

by invaderzard (Acolyte) on Aug 10, 2012 at 14:45 UTC

Cheers, Kenosis, Flexx and Ratazong for your help! Kenosis, your method really worked like a charm for mine, but kudos to Flexx and Ratazong for giving me better insight into how to handle regexes in Perl.

Thanks again!


Re^5: Regex Extraction Help

by Kenosis (Priest) on Aug 10, 2012 at 16:09 UTC

Glad it worked for you, invaderzard!
