Complex regex with negated group
by december (Pilgrim) on Mar 27, 2011 at 00:47 UTC
december has asked for the wisdom of the Perl Monks concerning the following question:
Hello fellow monks,
I've spent more than a day trying to come up with a regex, but I can't seem to get it together. I'm hoping someone with better knowledge of exotic regex verbs or other tricks can show me the way.
This is the regex:
Now let me explain what I want to do:
If a line starts with a space, it's a continuation of the previous line, so split only at a newline that is followed by a non-space character. That's the first commented-out regex, quite straightforward.
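A minimal sketch of this first rule on its own, assuming made-up sample data (the `$text` below is hypothetical, not the records from this post) and a plain `split` with a look-ahead:

```perl
use strict;
use warnings;

# Hypothetical sample data, not the real records.
my $text = "1. First track\n   continued title\n2. Second track\n"
         . "3. Third track\n     more\n     and more\n";

# Split only at a newline that is followed by a non-space character,
# so indented continuation lines stay attached to the line before them.
my @records = split /\n(?=\S)/, $text;

for my $record (@records) {
    $record =~ s/\n[ \t]+/ /g;   # fold each continuation into one space
    $record =~ s/\s+\z//;        # drop the trailing newline
}

print "$_\n" for @records;
```

With this input the sketch yields three records, the first and third with their continuation lines merged.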
But now there's an exception. If a line starts with spaces but ends with a colon, it's not a continuation line but the start of a new item, so it must not be merged into the previous line. This is the second commented-out regex, and it works too.
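A rough sketch of the single-line version of this exception, again on hypothetical data. It assumes that an indented line ending in a colon should become a record of its own, and it does not yet handle act headers that span several lines:

```perl
use strict;
use warnings;

# Hypothetical sample data: one indented act header between two tracks.
my $text = "1. Overture\n   (instrumental)\n  Act One:\n2. First aria\n";

my @records = split /
    \n (?= \S )                   # rule 1: the next line is not indented
  | \n (?= [ \t]+ [^\n]* : \n )   # exception: indented line ending in a colon
/x, $text;

for my $record (@records) {
    $record =~ s/\n[ \t]+/ /g;    # merge continuation lines
    $record =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
}

print "$_\n" for @records;
```

Here the indented "(instrumental)" line stays with its track, while "Act One:" is split off as a separate record.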
Of course, the line with the colon can contain continuation lines itself. The colon could be several lines down. So, eat everything non-greedily until we've found a colon-newline-wordcharacter sequence and PASS, but fail if at any point there's a newline-wordcharacter indicating a new item. In pseudo code:
Here's some data. The first part is some extra introductory text. The lines starting with spaces and ending in colons indicate opera acts. The lines starting with numbers are CD tracks. Both acts and song titles can continue on the next line, indented with spaces. The regex splits the lines into an array, keeping continuation lines together. The problem is "act" lines continuing over multiple lines, hence I'm looking for a regex that can either have a negating group (^(\n\S)) or some other way to fail the look-ahead part if there's a newline that isn't followed by a continuation line. I'm sure it can be done, but I guess I don't know enough about the fancy regex features.
This should be the result, with all continuation lines merged into one (line numbers are not part of the data):
I feel I'm close, if I can only find a way to make the look-ahead assertion fail if it sees a non-continuation line \n\S before a :\n\S – in other words, if the continued line doesn't end in a colon, it's not an opera act, the look-ahead should fail and we should not split the data on that newline.
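One common way to express "fail if `\n\S` shows up first" is to guard every consumed character with a negative look-ahead. The sketch below only tests that look-ahead body against two hypothetical snippets; the name `$act_ahead` and the sample strings are assumptions for illustration, not taken from the data in this post.

```perl
use strict;
use warnings;

# Look-ahead body: indentation, then any characters as long as no
# "newline + non-space" starts at them, then the colon that closes the block.
my $act_ahead = qr/[ \t]+ (?: (?! \n \S ) . )* : \n \S/xs;

# Hypothetical snippets; each begins right after a candidate newline.
my $act   = "  Act Two, in which the\n    plot thickens:\n5. Duet\n";
my $track = "   continued title\n6. Finale\n  Act Three:\n7. Chorus\n";

my $act_ok   = $act   =~ /\A$act_ahead/ ? 1 : 0;
my $track_ok = $track =~ /\A$act_ahead/ ? 1 : 0;

print "act: $act_ok, track: $track_ok\n";   # prints "act: 1, track: 0"
```

The second snippet fails because the guarded dot cannot step over the `\n6` boundary to reach the later "Act Three:". Note that when used as `\n(?=$act_ahead)` in a `split`, this pattern would also match the newlines inside a multi-line act header, so the pieces of such a header would still need rejoining afterwards, or a line-by-line pass might be the simpler route.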
Any clues? Pretty please?
PS: don't make any easy assumptions based on the data. The records are in a pretty rotten free-form format in which almost anything is possible... *cry*