http://qs321.pair.com?node_id=986558


in reply to Re: Regex Extraction Help
in thread Regex Extraction Help

Well, I guess all the fields are variable, and what invaderzard meant, was to get that second field.

So I'd suggest this:

# assuming the raw data is in $line. $line =~ m/^[^;]*;\s*([^;]*?)\s*;/ # $1 now holds whatever is between the second and third # semicolon, leading and trailing spaces trimmed.

Now, what am I doing here?

First I say: Let's start at the beginning (^). This is important, since we can't exclude the possibility that the pattern repeats in one instance of $line.

Next, I say: give me zero or more non-semicolon characters ([^;]*), followed by exactly one semicolon (;).

Now our "cursor" would be in the second field, quasi. We say, well, there might or might not be some leading space (\s*). Then comes the data we want, that's why we use parentheses to capture it. What do we wanna capture? Well, again, anything not a semicolon ([^;]*?), but this time, non-greedily (using the *? quantifier.). Well, that's because we want any trailing space to go into the \s* that follows, instead of it being captured. Lastly, we need to require that the field is terminated by exactly one semicolon (;).

If you want to capture other fields as well, then a solution using split, like it's been suggested below is a more efficient way of doing it. If you want just a few fields of a long CSV record (which this seems to be, only demimited by semicola instead of kommas, then you also could expand on the regexp above, which might be a bit more performant than split. But I didn't really check that with benchmarks. Just an inkling I'd have, and very dependent on the length of the input, and the number of fields in it.

Cheers,
Flexx