http://qs321.pair.com?node_id=986559


in reply to Re: Regex Extraction Help
in thread Regex Extraction Help

invaderzard, I just wanted to make clear that Kenosis's solution is by far the quicker and easier one, and it is what I'd use any time I just need a quick split on a field separator.

But there is one caveat to keep in mind: split does not test the format of the input. So if you want the second field of a record that looks like this:

$record = 'A;B;C;D';
then
$second_field = (split ';', $record)[1];

works. However, it works just as well for inputs like:

A;B
#foo;B;ar
;B

All of the above inputs would leave a B in $second_field. That might be correct in a particular case, but in general we don't want to silently accept malformed records. So if we, say, iterate over records, it is better to test and capture with a regexp in an if:

if($record =~ m/^.;(.);.;.$/) { $second_field = $1; }

Now this will only set $second_field if the record matches the format of four single-character fields, each separated by one semicolon. That includes the input ';;;;;;;', because . matches a semicolon too. ;)
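
To make the difference concrete, here is a minimal sketch that runs both approaches over the records mentioned above. The helper names second_by_split and second_by_regex are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helpers, named here only for illustration.

# Takes whatever sits after the first semicolon, with no format check.
sub second_by_split {
    my ($record) = @_;
    return (split /;/, $record)[1];
}

# Returns the second field only if the whole record has the expected shape:
# four single-character fields separated by single semicolons.
sub second_by_regex {
    my ($record) = @_;
    return $record =~ m/^.;(.);.;.$/ ? $1 : undef;
}

# One well-formed record followed by the three malformed ones.
for my $record ('A;B;C;D', 'A;B', '#foo;B;ar', ';B') {
    my $by_split = second_by_split($record);
    my $by_regex = second_by_regex($record);
    printf "%-10s split: %-8s regex: %s\n",
        $record,
        defined $by_split ? $by_split : '(undef)',
        defined $by_regex ? $by_regex : 'no match';
}
```

All four records give B via split, but only 'A;B;C;D' passes the regex, so the three malformed ones can be skipped or reported instead of being silently accepted.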

Have fun with regexen. They're cool. ;)

So long,
Flexx

Re^3: Regex Extraction Help
by Kenosis (Priest) on Aug 09, 2012 at 19:05 UTC

    You make a good point about splitting on a record separator within possibly malformed records. Based upon the OP's regex, it appears that the pattern's stable--with one space after the semi-colon. However, we can ask split to 'test' the format of the input, like this:

    my $info = (split /\s*;\s*/, $dat)[1];

    This will return the info the OP wants, whether there are spaces before or after the semi-colon, or not.
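
    As a quick check of that claim, here is a small sketch. Only the first string is the OP's data; the two spacing variants are invented for illustration.

    ```perl
    use strict;
    use warnings;

    # First string is the OP's data; the others are invented spacing variants.
    my @lines = (
        'DR Pfam; PF00070; Pyr_redox; 2.',
        'DR Pfam;PF00070;Pyr_redox;2.',
        'DR Pfam ;  PF00070 ; Pyr_redox ; 2.',
    );

    my @infos;
    for my $dat (@lines) {
        # The separator pattern swallows any whitespace around each semicolon.
        my $info = (split /\s*;\s*/, $dat)[1];
        push @infos, $info;
        print "$info\n";
    }
    ```

    Each of the three lines yields PF00070, with no separate trimming step.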

    And within a regex on the OP's data:

    use Modern::Perl;

    my $dat = 'DR Pfam; PF00070; Pyr_redox; 2.';
    $dat =~ /;\s*(\w+)\s*;.+;/ and say $1;    # prints PF00070

    It was a good call to address this issue...

      « Based upon the OP's regex, it appears that the pattern's stable--with one space after the semi-colon »

      Oh indeed, my "warning" was meant more as a general tip, not just for this particular example. I just meant to say that split and if(m//) with a rather "strict" regexp typically give you different levels of defensiveness in the code. Again, I mean just typically. I mean hey, "just use split" would've been my first answer, too. But you wrote that already, so I had to come up with something nitpicky. ;)

      « However, we can ask split to 'test' the format of the input »

      Umm... ok, you wrote 'test' in quotes, so alright... ;)

      Sure, you can combine the split and trim operation, but still, this split would happily work on any input you throw at it (including undef, with a warning, though). It won't tell you (by not even matching) that your input looks a bit strange there.
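
      A rough sketch of what I mean, using three invented inputs that look nothing like the OP's line. The two patterns are the ones from the posts above; the @report array exists only for this illustration.

      ```perl
      use strict;
      use warnings;

      # Invented inputs: none resembles 'DR Pfam; PF00070; Pyr_redox; 2.'
      my @report;
      for my $dat ('', 'no separators here', 'x;;y') {
          # split never refuses its input; it just returns what it finds.
          my @fields = split /\s*;\s*/, $dat;

          # The match, in contrast, fails on all three inputs.
          my $matched = $dat =~ /;\s*(\w+)\s*;.+;/ ? 1 : 0;

          push @report, [ scalar @fields, $matched ];
          printf "%-22s split gave %d field(s), regex %s\n",
              "'$dat'", scalar @fields,
              $matched ? 'matched' : 'did not match';
      }
      ```

      split hands back zero, one and three fields without complaint, while the match fails every time, and that failure is the signal you can act on.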

      Now, again, I am not so much talking about the OP's concrete problem; I was trying to educate a bit on which method to use when, since his use of \d\d\d\d\d instead of \d{5} suggested that regexen aren't something he has worked with for years. (No offence meant.)
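
      For the record, the two spellings match exactly the same thing. A tiny sketch with a made-up sample string:

      ```perl
      use strict;
      use warnings;

      # Made-up sample text; only the two patterns come from the discussion.
      my $text = 'ID 12345 and 987';

      my ($long)  = $text =~ /(\d\d\d\d\d)/;   # five digits, spelled out
      my ($short) = $text =~ /(\d{5})/;        # five digits, with a quantifier

      print "both patterns captured $long\n" if $long eq $short;
      ```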

      So long,
      Flexx

        You make more good points, and I am glad you offered the "general tip," as it helps with developing good programming practices. Anticipating and coding for exceptions can (and does) save many headaches...

      Cheers, Kenosis, Flexx and Ratazong for your help!

      Kenosis, your method really worked like a charm for me, but kudos to Flexx and Ratazong for giving me better insight into how to handle regexes in Perl.

      Thanks again!

        Glad it worked for you, invaderzard!