PerlMonks  

Hello fellow monks,

I've spent more than a day trying to come up with a regex, but I can't seem to get it together. I'm hoping someone with better knowledge of exotic verbs or other tricks can show me the way.

This is the regex:

#foreach ($strMess =~ /.*?\n(?=\S)/smg) {
#foreach ($strMess =~ /.*?\n(?=\S|\s+[^\n]+?:\n\S)/smg) {
foreach ($strMess =~ /.*?\n(?=\S|\s+.+?(?:\n\S(*FAIL)):\n\S)/smg) {
    chomp;
    s/\n\s*/ /mg;
    push(@arr, $_);
}

Now let me explain what I want to do:

If a line starts with a space, it's a continuation of the previous line, so only split on a newline when the next line starts with a non-space character. That's the first commented-out regex, quite straightforward.
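To make that first rule concrete, here's a small self-contained sketch. The sample string and its trailing END line are made up for this example (the look-ahead needs something non-blank after the last record, otherwise that record never matches); only the regex and the loop body come from my script.

```perl
use strict;
use warnings;

# Invented sample: the second line is indented, so it continues the first.
# "END" is a dummy line so the last real record has a non-space follower.
my $strMess = "01 first title\n   continued\n02 second title\nEND\n";

my @arr;
# Split only on newlines that are followed by a non-space character.
foreach ($strMess =~ /.*?\n(?=\S)/smg) {
    chomp;              # drop the newline that ended the record
    s/\n\s*/ /mg;       # fold continuation lines into one line
    push(@arr, $_);
}

print "$_\n" for @arr;
# prints:
# 01 first title continued
# 02 second title
```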

But now there's an exception. If a line starts with spaces but ends with a colon, it's not a continuation line but a new item, so the data has to be split before it as well. This is the second commented-out regex, and it works too.
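Here's the same kind of sketch for the second regex, with an indented line that ends in a colon. Again, the sample text (including the dummy END line) is invented for illustration and is much tidier than my real data.

```perl
use strict;
use warnings;

# Invented sample: " 1. ACT (01-02):" is indented but ends in a colon,
# so it should become a record of its own instead of a continuation.
my $strMess = "intro\n 1. ACT (01-02):\n01 song one\n   continued\n"
            . "02 song two\nEND\n";

my @arr;
# Split before a non-space character, or before an indented line that
# ends in a colon and is itself followed by a non-space character.
foreach ($strMess =~ /.*?\n(?=\S|\s+[^\n]+?:\n\S)/smg) {
    chomp;
    s/\n\s*/ /mg;
    push(@arr, $_);
}

print "[$_]\n" for @arr;
# prints:
# [intro]
# [ 1. ACT (01-02):]
# [01 song one continued]
# [02 song two]
```

Note that the act line keeps its leading space, because the substitution only touches whitespace that follows a newline inside a record.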

Of course, the line with the colon can contain continuation lines itself. The colon could be several lines down. So, eat everything non-greedily until we've found a colon, newline, non-space sequence (:\n\S) and pass, but fail if at any point before that there's a newline followed by a non-space character (\n\S), which indicates a new item. In pseudo code:

[^\n]+?(\n\S?FAIL):\n\S
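To show why I need this, here's what the second regex does once an act wraps onto a second line. This is a sketch on invented sample data (the dummy END line is only there so the last record matches), not my real records.

```perl
use strict;
use warnings;

# Invented sample: the act " 2. ACT BLAH" continues on the next line,
# and the colon only appears at the end of that continuation line.
my $strMess = "01 song one\n 2. ACT BLAH\n   BLAH (02-03):\n"
            . "02 song two\nEND\n";

my @arr;
# The second regex: [^\n]+? cannot cross a newline, so it only sees a
# colon that sits on the very next line.
foreach ($strMess =~ /.*?\n(?=\S|\s+[^\n]+?:\n\S)/smg) {
    chomp;
    s/\n\s*/ /mg;
    push(@arr, $_);
}

print "[$_]\n" for @arr;
# prints:
# [01 song one 2. ACT BLAH]
# [   BLAH (02-03):]
# [02 song two]
```

The first half of the act gets glued onto the preceding track, and only the line that actually carries the colon is split off. What I want instead is "01 song one" followed by the whole act as one item.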

Here's some data. The first part is some extra introductory text. The lines starting with spaces and ending in colons indicate opera acts. The lines starting with numbers are CD tracks. Both acts and song titles can continue on the next line, indented with spaces. The regex splits the lines into an array, keeping continuation lines together. The problem is act lines that continue over multiple lines, hence I'm looking for a regex that can either have a negating group (^(\n\S)) or some other way to fail the look-ahead part if there's a newline that isn't followed by a continuation line. I'm sure it can be done, but I guess I don't know enough about the fancy regex features.

Into the little hill 2-osainen lyyrinen tarina sopraanolle, altolle seka viidelletoista soittajalle 1. OSA (01-05) /20:28: 01 1. The crowd (Kill them they bite, kill them they steal -) /0:50. 02 2. The minister and the crowd (The minister greets the crowd -) /2:50. 03 3. The crowd (Kill them they bite, kill them they steal -) /1:42. 04 4. The minister and the stranger (Night comes but not sleep -) /8:33. 05 5. Interlude - Mother and child (Why must the rats die, Mummy? -) /6:33. 2. OSA (06-08) /16:34: 06 6. Inside the minister's head (Under a clear sky, the minister steps from the limousine -) /3:43. 07 7. The minister and the stranger (His head lies on his desk, between the family photograph -) /5:52. 08 8. Interlude - Mother(s) and child(ren) (Each cradle rocks empty -) /6:59. 3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH (09-10) /14:14: 09 9. Another very long stupid song title to be used as yet another dumb example /6:66. 10 10. Last fictive song title /6:66.

This should be the result, with all continuation lines merged into one (line numbers are not part of the data):

 1| Into the little hill
 2| 2-osainen lyyrinen tarina sopraanolle, altolle seka viidelletoista
 3| soittajalle
 4| 1. OSA (01-05) /20:28:
 5| 01 1. The crowd (Kill them they bite, kill them they steal -) /0:50.
 6| 02 2. The minister and the crowd (The minister greets the crowd -) /2:50.
 7| 03 3. The crowd (Kill them they bite, kill them they steal -) /1:42.
 8| 04 4. The minister and the stranger (Night comes but not sleep -) /8:33.
 9| 05 5. Interlude - Mother and child (Why must the rats die, Mummy? -) /6:33.
10| 2. OSA (06-08) /16:34:
11| 06 6. Inside the minister's head (Under a clear sky, the minister steps from the limousine -) /3:43.
12| 07 7. The minister and the stranger (His head lies on his desk, between the family photograph -) /5:52.
13| 08 8. Interlude - Mother(s) and child(ren) (Each cradle rocks empty -) /6:59.
14| 3. OSA BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH BLAH (09-10) /14:14:
15| 09 9. Another very long stupid song title to be used as yet another dumb example /6:66.
16| 10 10. Last fictive song title /6:66.

I feel I'm close, if I can only find a way to make the look-ahead assertion fail if it sees a non-continuation line \n\S before a :\n\S – in other words, if the continued line doesn't end in a colon, it's not an opera act, the look-ahead should fail and we should not split the data on that newline.
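To pin down the behaviour I'm after, here it is written out line by line rather than as one regex. This is only a sketch: split_records is a hypothetical helper name, the sample is invented, and it assumes that a whole run of consecutive indented lines belongs together (an act if the run ends in a colon, a continuation of the previous item otherwise). That assumption is mine for this illustration and may well not hold on the real data.

```perl
use strict;
use warnings;

# Hypothetical helper: group lines into records, one string per record.
sub split_records {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;   # ignore blank lines
    my @records;
    my $i = 0;
    while ($i < @lines) {
        # An unindented line always starts a new record.
        if ($lines[$i] !~ /^\s/) {
            push @records, $lines[$i++];
            next;
        }
        # Gather the run of consecutive indented lines starting at $i.
        my $j = $i;
        $j++ while $j + 1 < @lines && $lines[$j + 1] =~ /^\s/;
        my @run = @lines[$i .. $j];
        s/^\s+// for @run;               # this sketch drops the indentation
        my $joined = join ' ', @run;
        if ($joined =~ /:\s*$/ || !@records) {
            push @records, $joined;      # run ends in a colon: an act
        } else {
            $records[-1] .= " $joined";  # otherwise: a continuation
        }
        $i = $j + 1;
    }
    return @records;
}

my @arr = split_records(
    "01 song one\n   continued\n02 song two\n"
  . " 2. ACT BLAH\n   BLAH (03-04):\n03 song three\n"
);
print "[$_]\n" for @arr;
# prints:
# [01 song one continued]
# [02 song two]
# [2. ACT BLAH BLAH (03-04):]
# [03 song three]
```

It takes the rule literally, so a track's continuation line that is directly followed by an act would be swallowed into that act. That is one of the cases where the free-form data may bite.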

Any clues? Pretty please?

Thanks!



PS: don't make any easy assumptions based on the data. The records are in a pretty rotten free-form format in which almost anything is possible... *cry*


In reply to Complex regex with negated group by december
