PerlMonks  

Complex regex with negated group

by december (Pilgrim)
on Mar 27, 2011 at 00:47 UTC (perlquestion, id://895718)

december has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks,

I've spent more than a day trying to come up with a regex, but I can't seem to get it right. I'm hoping someone with better knowledge of exotic verbs or other tricks can show me the way.

This is the regex:

    #foreach ($strMess =~ /.*?\n(?=\S)/smg) {
    #foreach ($strMess =~ /.*?\n(?=\S|\s+[^\n]+?:\n\S)/smg) {
    foreach ($strMess =~ /.*?\n(?=\S|\s+.+?(?:\n\S(*FAIL)):\n\S)/smg) {
        chomp;
        s/\n\s*/ /mg;
        push(@arr, $_);
    }

Now let me explain what I want to do:

If a line starts with a space, it's a continuation of the previous line, so only split on a newline when the next line starts with a non-space character. That's the first commented-out regex, quite straightforward.
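
A minimal sketch of that first rule, assuming a made-up three-record sample (the contents of `$text` and the name `@records` are illustrative only, not taken from the real data). Splitting on a newline only when a non-space character follows keeps the indented continuation line attached to its record:

```perl
use strict;
use warnings;

# Hypothetical sample: the second record continues on an indented line.
my $text = "first record\nsecond record\n   continued\nthird record\n";

# Split on a newline only when the next line starts with a non-space character.
my @records = split /\n(?=\S)/, $text;

for (@records) {
    s/\n\z//;       # drop the trailing newline left on the last record
    s/\n\s*/ /g;    # fold continuation lines into a single line
}

print "$_\n" for @records;
```

Using `split` with a look-ahead instead of a global match is only one way to write this; the look-ahead itself is the same `(?=\S)` as in the first commented-out regex.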

But now there's an exception. If a line starts with spaces but ends with a colon, it's not a continuation line but a record of its own, so the split has to happen before it as well. This is the second commented-out regex, and it works too.

Of course, the line with the colon can contain continuation lines itself. The colon could be several lines down. So, eat everything non-greedily until we've found a colon-newline-nonspace sequence and PASS, but fail if at any point there's a newline followed by a non-space character, indicating a new item. In pseudo code:

[^\n]+?(\n\S?FAIL):\n\S

Here's some data. The first part is some extra introductory text. The lines starting with spaces and ending in colons indicate opera acts. The lines starting with numbers are CD tracks. Both acts and song titles can continue on the next line, indented with spaces. The regex splits the lines into an array, keeping continuation lines together. The problem is "acts" lines that continue over multiple lines, hence I'm looking for a regex that can either have a negated group (^(\n\S)) or some other way to fail the look-ahead part if there's a newline that isn't followed by a continuation line. I'm sure it can be done, but I guess I don't know enough about the fancy regex features.

    Into the little hill
    2-osainen lyyrinen tarina sopraanolle, altolle seka viidelletoista
    soittajalle
        1. OSA (01-05) /20:28:
    01 1. The crowd (Kill them they bite, kill them they steal -) /0:50.
    02 2. The minister and the crowd (The minister greets the crowd -)
        /2:50.
    03 3. The crowd (Kill them they bite, kill them they steal -) /1:42.
    04 4. The minister and the stranger (Night comes but not sleep -)
        /8:33.
    05 5. Interlude - Mother and child (Why must the rats die, Mummy? -)
        /6:33.
        2. OSA (06-08) /16:34:
    06 6. Inside the minister's head (Under a clear sky, the minister
        steps from the limousine -) /3:43.
    07 7. The minister and the stranger (His head lies on his desk,
        between the family photograph -) /5:52.
    08 8. Interlude - Mother(s) and child(ren) (Each cradle rocks empty
        -) /6:59.
        3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH
        BLAH BLAH BLAH (09-10) /14:14:
    09 9. Another very long stupid song title to be used as yet another
        dumb example /6:66.
    10 10. Last fictive song title /6:66.

This should be the result, with all continuation lines merged into one (line numbers are not part of the data):

     1| Into the little hill
     2| 2-osainen lyyrinen tarina sopraanolle, altolle seka viidelletoista
     3| soittajalle
     4|     1. OSA (01-05) /20:28:
     5| 01 1. The crowd (Kill them they bite, kill them they steal -) /0:50.
     6| 02 2. The minister and the crowd (The minister greets the crowd -) /2:50.
     7| 03 3. The crowd (Kill them they bite, kill them they steal -) /1:42.
     8| 04 4. The minister and the stranger (Night comes but not sleep -) /8:33.
     9| 05 5. Interlude - Mother and child (Why must the rats die, Mummy? -) /6:33.
    10|     2. OSA (06-08) /16:34:
    11| 06 6. Inside the minister's head (Under a clear sky, the minister steps from the limousine -) /3:43.
    12| 07 7. The minister and the stranger (His head lies on his desk, between the family photograph -) /5:52.
    13| 08 8. Interlude - Mother(s) and child(ren) (Each cradle rocks empty -) /6:59.
    14|     3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH (09-10) /14:14:
    15| 09 9. Another very long stupid song title to be used as yet another dumb example /6:66.
    16| 10 10. Last fictive song title /6:66.

I feel I'm close, if I can only find a way to make the look-ahead assertion fail if it sees a non-continuation line \n\S before a :\n\S – in other words, if the continued line doesn't end in a colon, it's not an opera act, the look-ahead should fail and we should not split the data on that newline.
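
One common way to write "anything, as long as no \n\S is crossed" is to apply a negative look-ahead at every character: `(?:(?!\n\S).)*`. The sketch below only tests that building block on two made-up strings; `$indented_block_ends_in_colon`, `$act` and `$track` are hypothetical names and samples, not part of the real data or of any posted solution.

```perl
use strict;
use warnings;

# Matches an indented block only if a colon-newline-nonspace sequence
# occurs before any newline that is followed by a non-space character.
my $indented_block_ends_in_colon = qr/\A[^\S\n]+(?:(?!\n\S).)*:\n\S/s;

# Hypothetical samples: a two-line act heading, and a track continuation
# that is followed by a new track before any colon-terminated line.
my $act   = "    3. ACT long title\n    continued (09-10):\n09 first track\n";
my $track = "    continued title.\n10 next track\n    4. ACT:\n11 track\n";

my $act_matches   = $act   =~ $indented_block_ends_in_colon ? 1 : 0;
my $track_matches = $track =~ $indented_block_ends_in_colon ? 1 : 0;

print "act: $act_matches, track: $track_matches\n";
```

Because the pattern only looks forward, it would also match when started at the second line of a multi-line act, so on its own it probably cannot decide where such an act begins.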

Any clues? Pretty please?

Thanks!



PS: don't make any easy assumptions based on the data. The records are in a pretty rotten free-form format in which almost anything is possible... *cry*

Replies are listed 'Best First'.
Re: Complex regex with negated group
by GrandFather (Saint) on Mar 27, 2011 at 01:28 UTC

    The trick with this sort of thing is to defer output of the current record until the decision can be made about the next record:

        use strict;
        use warnings;

        my $record = '';

        while (defined (my $line = <DATA>)) {
            chomp $line;

            if ($line =~ /^\S|:$/ && length $record) {
                print "$record\n";
                $record = '';
            }

            $record .= ' ' if length $record;
            $record .= $line;
        }

        print "$record\n" if length $record;

        __DATA__
        Into the little hill
        2-osainen lyyrinen tarina sopraanolle, altolle seka viidelletoista
        soittajalle
            1. OSA (01-05) /20:28:
        01 1. The crowd (Kill them they bite, kill them they steal -) /0:50.
        02 2. The minister and the crowd (The minister greets the crowd -)
            /2:50.
        03 3. The crowd (Kill them they bite, kill them they steal -) /1:42.
        04 4. The minister and the stranger (Night comes but not sleep -)
            /8:33.
        05 5. Interlude - Mother and child (Why must the rats die, Mummy? -)
            /6:33.
            2. OSA (06-08) /16:34:
        06 6. Inside the minister's head (Under a clear sky, the minister
            steps from the limousine -) /3:43.
        07 7. The minister and the stranger (His head lies on his desk,
            between the family photograph -) /5:52.
        08 8. Interlude - Mother(s) and child(ren) (Each cradle rocks empty
            -) /6:59.
            3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH
            BLAH BLAH BLAH (09-10) /14:14:
        09 9. Another very long stupid song title to be used as yet another
            dumb example /6:66.
        10 10. Last fictive song title /6:66.

    Prints:

        Into the little hill
        2-osainen lyyrinen tarina sopraanolle, altolle seka viidelletoista
        soittajalle
            1. OSA (01-05) /20:28:
        01 1. The crowd (Kill them they bite, kill them they steal -) /0:50.
        02 2. The minister and the crowd (The minister greets the crowd -)     /2:50.
        03 3. The crowd (Kill them they bite, kill them they steal -) /1:42.
        04 4. The minister and the stranger (Night comes but not sleep -)     /8:33.
        05 5. Interlude - Mother and child (Why must the rats die, Mummy? -)     /6:33.
            2. OSA (06-08) /16:34:
        06 6. Inside the minister's head (Under a clear sky, the minister     steps from the limousine -) /3:43.
        07 7. The minister and the stranger (His head lies on his desk,     between the family photograph -) /5:52.
        08 8. Interlude - Mother(s) and child(ren) (Each cradle rocks empty     -) /6:59.     3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH
            BLAH BLAH BLAH (09-10) /14:14:
        09 9. Another very long stupid song title to be used as yet another     dumb example /6:66.
        10 10. Last fictive song title /6:66.

    True laziness is hard work
Re: Complex regex with negated group
by wind (Priest) on Mar 27, 2011 at 01:40 UTC

    Instead of doing a split, doing line by line processing makes more sense. Just create a single regex to detect what looks like a continued line to you.

        while (<DATA>) {
            chomp;

            # Continued Line
            if (/^\s+(.*)(?<!:)$/) {
                print " $1";

            # New Line
            } else {
                print "\n" if $. > 1;
                print $_;
            }
        }
        print "\n";

        __DATA__
        Into the little hill
        2-osainen lyyrinen tarina sopraanolle, altolle seka viidelletoista
        soittajalle
            1. OSA (01-05) /20:28:
        01 1. The crowd (Kill them they bite, kill them they steal -) /0:50.
        02 2. The minister and the crowd (The minister greets the crowd -)
            /2:50.
        03 3. The crowd (Kill them they bite, kill them they steal -) /1:42.
        04 4. The minister and the stranger (Night comes but not sleep -)
            /8:33.
        05 5. Interlude - Mother and child (Why must the rats die, Mummy? -)
            /6:33.
            2. OSA (06-08) /16:34:
        06 6. Inside the minister's head (Under a clear sky, the minister
            steps from the limousine -) /3:43.
        07 7. The minister and the stranger (His head lies on his desk,
            between the family photograph -) /5:52.
        08 8. Interlude - Mother(s) and child(ren) (Each cradle rocks empty
            -) /6:59.
            3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH
            BLAH BLAH BLAH (09-10) /14:14:
        09 9. Another very long stupid song title to be used as yet another
            dumb example /6:66.
        10 10. Last fictive song title /6:66.

    Outputs

        Into the little hill
        2-osainen lyyrinen tarina sopraanolle, altolle seka viidelletoista
        soittajalle
            1. OSA (01-05) /20:28:
        01 1. The crowd (Kill them they bite, kill them they steal -) /0:50.
        02 2. The minister and the crowd (The minister greets the crowd -) /2:50.
        03 3. The crowd (Kill them they bite, kill them they steal -) /1:42.
        04 4. The minister and the stranger (Night comes but not sleep -) /8:33.
        05 5. Interlude - Mother and child (Why must the rats die, Mummy? -) /6:33.
            2. OSA (06-08) /16:34:
        06 6. Inside the minister's head (Under a clear sky, the minister steps from the limousine -) /3:43.
        07 7. The minister and the stranger (His head lies on his desk, between the family photograph -) /5:52.
        08 8. Interlude - Mother(s) and child(ren) (Each cradle rocks empty -) /6:59. 3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH
            BLAH BLAH BLAH (09-10) /14:14:
        09 9. Another very long stupid song title to be used as yet another dumb example /6:66.
        10 10. Last fictive song title /6:66.

    However, I do see one problem with your stated logic after '08' in your sample data, as I suspect you want the following to appear on its own line:

        3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH (09-10) /14:14:

    The following looks closer to what you want.

        while (<DATA>) {
            chomp;

            # Continued Line
            if (/^(?>\s+)(?!\d+\.)(.*)$/) {
                print " $1";

            # New Line
            } else {
                print "\n" if $. > 1;
                print $_;
            }
        }
        print "\n";

    Outputs

        Into the little hill
        2-osainen lyyrinen tarina sopraanolle, altolle seka viidelletoista
        soittajalle
            1. OSA (01-05) /20:28:
        01 1. The crowd (Kill them they bite, kill them they steal -) /0:50.
        02 2. The minister and the crowd (The minister greets the crowd -) /2:50.
        03 3. The crowd (Kill them they bite, kill them they steal -) /1:42.
        04 4. The minister and the stranger (Night comes but not sleep -) /8:33.
        05 5. Interlude - Mother and child (Why must the rats die, Mummy? -) /6:33.
            2. OSA (06-08) /16:34:
        06 6. Inside the minister's head (Under a clear sky, the minister steps from the limousine -) /3:43.
        07 7. The minister and the stranger (His head lies on his desk, between the family photograph -) /5:52.
        08 8. Interlude - Mother(s) and child(ren) (Each cradle rocks empty -) /6:59.
            3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH (09-10) /14:14:
        09 9. Another very long stupid song title to be used as yet another dumb example /6:66.
        10 10. Last fictive song title /6:66.

Re: Complex regex with negated group
by repellent (Priest) on Mar 27, 2011 at 08:54 UTC

        my $line = "";
        while (local $_ = <DATA>) {
            print and next if $. <= 3;
            chomp;
            ($line = $_) =~ s/\S.*// if $line eq "";  # keep indent
            s/^\s*//;
            $line .= $_ . " ";
            if (m{/\d+:\d+ \W* $}x) {
                print "$line\n";
                $line = "";
            }
        }

Node Type: perlquestion [id://895718]
Approved by GrandFather