http://qs321.pair.com?node_id=986545


in reply to Regex Extraction Help

Hi

Your main issue seems to be that you capture too much, or the wrong things. Reminder: whatever you want to extract has to go inside round brackets (a capture group), which is the \w\w\d\d\d\d\d in this case. So your regex could look like

/DR\s+Pfam;\s+(\w\w\d\d\d\d\d);\s+/
(I don't see the need for look-aheads or look-behinds here.)
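
As a minimal sketch of how that regex might be used: the sample line below is hypothetical (the original data is not shown in this thread), and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample record; the real input format is an assumption.
my $line = 'DR   Pfam; PF00069; Pkinase; 1.';

my $id;
if ( $line =~ /DR\s+Pfam;\s+(\w\w\d\d\d\d\d);\s+/ ) {
    $id = $1;    # the text matched inside the round brackets
    print "Found: $id\n";
}
else {
    print "No Pfam ID on this line\n";
}
```

Note that \w also matches digits and the underscore, so if the ID really is two letters followed by five digits, something like [A-Za-z]{2}\d{5} would be a stricter choice.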

HTH, Rata

update Flexx: why do you assume that the second field of a semicolon-separated file is meant? I agree that the specification is very vague. However, from the examples given by invaderzard, it seems that the text DR   Pfam as well as the format (2 letters, 5 digits) are the important parts ...

Re^2: Regex Extraction Help
by Flexx (Pilgrim) on Aug 09, 2012 at 17:02 UTC

    Well, I guess all the fields are variable, and what invaderzard meant was to get that second field.

    So I'd suggest this:

        # Assuming the raw data is in $line.
        $line =~ m/^[^;]*;\s*([^;]*?)\s*;/;
        # $1 now holds whatever is between the first and second
        # semicolon (i.e. the second field), leading and trailing spaces trimmed.

    Now, what am I doing here?

    First I say: Let's start at the beginning (^). This is important, since we can't exclude the possibility that the pattern repeats in one instance of $line.

    Next, I say: give me zero or more non-semicolon characters ([^;]*), followed by exactly one semicolon (;).

    Now our "cursor" is, so to speak, at the start of the second field. There might or might not be some leading space (\s*). Then comes the data we want, which is why we use parentheses to capture it. What do we want to capture? Again, anything that is not a semicolon ([^;]*?), but this time non-greedily (using the *? quantifier). That's because we want any trailing space to go into the \s* that follows, instead of being captured. Lastly, we require that the field is terminated by exactly one semicolon (;).
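
A small sketch of the walkthrough above, run against a hypothetical input line (the padding around the second field is made up to show the trimming):

```perl
use strict;
use warnings;

# Hypothetical input line; note the extra spaces around the second field.
my $line = 'DR   Pfam;   PF00069  ; Pkinase; 1.';

my $second;
if ( $line =~ m/^[^;]*;\s*([^;]*?)\s*;/ ) {
    $second = $1;           # second field, without surrounding spaces
    print "[$second]\n";    # brackets make stray whitespace visible
}
```

Because the capture is non-greedy, the two spaces before the closing semicolon are consumed by the trailing \s* rather than ending up in $1.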

    If you want to capture other fields as well, then a solution using split, as has been suggested below, is a more efficient way of doing it. If you want just a few fields of a long CSV record (which this seems to be, only delimited by semicolons instead of commas), then you could also expand on the regexp above, which might be a bit faster than split. But I didn't check that with benchmarks. It's just a hunch, and it depends heavily on the length of the input and the number of fields in it.
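
Both options might look roughly like this. It is only a sketch: the sample line is hypothetical, and no claim is made here about which variant is faster.

```perl
use strict;
use warnings;

# Hypothetical sample record.
my $line = 'DR   Pfam; PF00069; Pkinase; 1.';

# Option 1: split on semicolons, swallowing whitespace around each one.
my @fields = split /\s*;\s*/, $line;
print "split: $fields[1], $fields[2]\n";

# Option 2: extend the regexp with one more capture group per wanted field.
my ( $second, $third ) =
    $line =~ m/^[^;]*;\s*([^;]*?)\s*;\s*([^;]*?)\s*;/;
print "regex: $second, $third\n";
```

With split, every field is available at once; the regexp only does the work for the fields that are actually captured.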

    Cheers,
    Flexx

Re^2: Regex Extraction Help
by Flexx (Pilgrim) on Aug 15, 2012 at 21:52 UTC
    « Flexx: why do you assume that the second field of a semicolon-separated file is meant? »

    Umm... well, because it looks like a CSV format? Experience in seeing a bit of a problem and guessing what the requirement is (a/k/a "getting 'all' the information from the customer" ;)?

    And it appeared that putting that first field in the regexp was more out of confusion as to how to "get to" the second field, something I often see when someone learns how to use regular expressions. Along with too much use of .* to pull in fields, BTW, when "not the separator" ([^;]*) is often more correct, or even needed. Things get worse once quoting has to be considered, of course.
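
To illustrate the .* point, here is a small comparison on a hypothetical line; both patterns are made up for this sketch and only differ in how they skip over a field.

```perl
use strict;
use warnings;

# Hypothetical sample record.
my $line = 'DR   Pfam; PF00069; Pkinase; 1.';

# Greedy .* runs as far right as it can and only backtracks as
# much as needed, so the capture ends up on a later field.
my ($greedy) = $line =~ /^.*;\s*(.*);/;

# "Not the separator" cannot cross a semicolon, so the capture
# stays on the second field.
my ($field) = $line =~ /^[^;]*;\s*([^;]*);/;

print "with .*    : $greedy\n";
print "with [^;]* : $field\n";
```

On this line the .* version captures the third field, while the [^;]* version captures the second one.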

    But yeah, it was just an educated guess.

    So long,
    Flexx