Re^2: Regex Extraction Help

Well, I guess all the fields are variable, and what invaderzard meant, was to get that second field.

So I'd suggest this:

# assuming the raw data is in $line.
$line =~ m/^[^;]*;\s*([^;]*?)\s*;/

# $1 now holds whatever is between the second and third 
# semicolon, leading and trailing spaces trimmed.
[download]

Now, what am I doing here?

First I say: Let's start at the beginning (^). This is important, since we can't exclude the possibility that the pattern repeats in one instance of $line.

Next, I say: give me zero or more non-semicolon characters ([^;]*), followed by exactly one semicolon (;).

Now our "cursor" would be in the second field, quasi. We say, well, there might or might not be some leading space (\s*). Then comes the data we want, that's why we use parentheses to capture it. What do we wanna capture? Well, again, anything not a semicolon ([^;]*?), but this time, non-greedily (using the *? quantifier.). Well, that's because we want any trailing space to go into the \s* that follows, instead of it being captured. Lastly, we need to require that the field is terminated by exactly one semicolon (;).

If you want to capture other fields as well, then a solution using split, like it's been suggested below is a more efficient way of doing it. If you want just a few fields of a long CSV record (which this seems to be, only demimited by semicola instead of kommas, then you also could expand on the regexp above, which might be a bit more performant than split. But I didn't really check that with benchmarks. Just an inkling I'd have, and very dependent on the length of the input, and the number of fields in it.

Cheers,
Flexx

Comment on Re^2: Regex Extraction Help Select or Download Code


No such thing as a small change
	PerlMonks