PerlMonks  

lookahead, lookbehind, ... I'm lost

by ExReg (Priest)
on May 06, 2009 at 17:57 UTC ( [id://762340]=perlquestion )

ExReg has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to get the meat off the bones of some .bas, .frm, and .cls VB files. Using a filet_o_bas script, I have slurped the file into a scalar. What I want to do now is strip off the declarations at the beginning and keep just the Subs and Functions so that I can digest them. Each may have a descriptive header before the Sub/Function declaration, set off by dashed lines.

I have the following RegEx so far. The problem is that I do not want it to start capturing with a Sub or Function line that has "Lib" in it, since these are not functions that are written out. They are just aliases to other already existing library functions.

/
(                                  #Start capture
  (                                #Start Sub descript header
    '\*{20,}                       #start '------ delimiter
    (\n'[^\n]*?)+\n                #middle of header
    '\*{20,}\n                     #end '------ delimiter
  )?                               #End Sub header (optional)
  (Private\s|Public\s|Friend\s)?   #Scope (optional)
  (Static\s)?                      #Static (optional)
  (Sub\s|Function\s)               #Sub or Function (mandatory)
  [^\n]+                           #more stuff on same line
  (?!\sLib\s)                      #but not if Lib on line
  .*                               #and the rest of the file
)                                  #End capture
/sx

For example, if I try it on a file having the following portion, I want it to start capturing on line 27. Instead, it starts capturing on line 24.

File header stuff...
22 Private Const strLen = 80
23
24 Private Declare Function DeleteFile Lib "kernel32" Alias "DeleteFileA" _
25     (ByVal lpFileName as String) as Long
26
27 '--------------------------------------------------------
28 ' Purpose: Donuts
29 ' Author: not me
30 ' Date: yesterday
31 '--------------------------------------------------------
32 Private Sub DrainBattery(Byval sPercentage as Single)
Code stuff...

Replies are listed 'Best First'.
Re: lookahead, lookbehind, ... I'm lost
by Corion (Patriarch) on May 06, 2009 at 18:11 UTC

    A general rule of thumb is that a negative lookaround can never work if there is a .* or any other variable-length "general" pattern next to it. In your case, you have two such things, .* to the right and [^\n]+ to the left of (?!\sLib\s). You can easily check how Perl matched your strings against the regular expression by printing out the match variables:

    while (<DATA>) {
        /
        (                                  #Start capture
          (                                #Start Sub descript header
            '\*{20,}                       #start '------ delimiter
            (\n'[^\n]*?)+\n                #middle of header
            '\*{20,}\n                     #end '------ delimiter
          )?                               #End Sub header (optional)
          (Private\s|Public\s|Friend\s)?   #Scope (optional)
          (Static\s)?                      #Static (optional)
          (Sub\s|Function\s)               #Sub or Function (mandatory)
          [^\n]+                           #more stuff on same line
          (?!\sLib\s)                      #but not if Lib on line
          .*                               #and the rest of the file
        )                                  #End capture
        /sx and print "[$6/$7/$8]\n";
    };
    __DATA__
    Private Declare Function DeleteFile Lib "kernel32" Alias "DeleteFileA" _
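    To see why the lookahead has no effect in that position, here is a minimal sketch that keeps only the greedy run and the lookahead. The one-line sample string and the variable names are assumptions made for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical one-line sample, modelled on the Declare line in the question.
my $line = 'Private Declare Function DeleteFile Lib "kernel32"';

# Same shape as the original pattern: a greedy run, then a negative lookahead.
my $matched = $line =~ /(Function\s)([^\n]+)(?!\sLib\s)/;
my $run     = $2;

# The greedy run swallows " Lib " itself, so the lookahead is only tested at
# the end of the line, where nothing follows and it trivially succeeds.
print "matched: ", ($matched ? "yes" : "no"), "\n";
print "run:     [$run]\n";
```

    The run ends up containing the very word the lookahead was meant to reject, which is why the pattern still starts capturing on the Declare line.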

    One approach to make your parser more robust is to make it more specific, like explicitly parsing out the function/sub name and expecting (or rather, denying) the Lib keyword immediately after the function name:

    while (<DATA>) {
        print;
        /
        (                                  #Start capture
          (                                #Start Sub descript header
            '\*{20,}                       #start '------ delimiter
            (\n'[^\n]*?)+\n                #middle of header
            '\*{20,}\n                     #end '------ delimiter
          )?                               #End Sub header (optional)
          (Private\s|Public\s|Friend\s)?   #Scope (optional)
          (Static\s)?                      #Static (optional)
          (Sub\s|Function\s)               #Sub or Function (mandatory)
          (\w+)\s+                         # sub name
          ((?!Lib\s))                      #no Lib on line
          ([^\n]+)                         #more stuff on same line unless "Lib"
          (.*)                             #and the rest of the file
        )                                  #End capture
        /sx and print "[$6/$7/$8]";
    };
    __DATA__
    Private Declare Function DeleteFile Lib "kernel32" Alias "DeleteFileA"

    To be a bit more specific about the word "general" above, a negative lookahead will never work if there is a variable length quantifier next to it with a pattern that will also match (parts of) the phrase you want to avoid.

      Thanks so much for your quick reply. Sorry it took me so long to get back. I had made a few mistakes, and it doesn't take much to screw up a RegEx. I put in a bit more detail, like capturing the function name and parameters. A bit more screaming at it until a slash turned into a backslash, and it finally worked.

      use strict;
      undef $/;
      while (<DATA>) {
          print;
          /
          (                                  #Start capture
            (                                #Start Sub descript header
              '-{20,}                        #start '------ delimiter
              (\n'[^\n]*?)+\n                #middle of header
              '-{20,}\n                      #end '------ delimiter
            )?                               #End Sub header (optional)
            (Private\s|Public\s|Friend\s)?   #Scope (optional)
            (Static\s)?                      #Static (optional)
            (Sub\s|Function\s)               #Sub or Function (mandatory)
            (\w+)                            #Sub name
            (                                #Start Params
              \(                             #Left paren
              ([^\)])*                       #Optional params inside
              \)                             #Right paren
            )?                               #End Params (optional)
            \s+                              #Space
            ((?!Lib\s))                      #no Lib on line
            ([^\n]+)                         #more stuff on same line unless "Lib"
            (.*)                             #and the rest of the file
          )                                  #End capture
          /sx;
      };
      __DATA__
      Private thingy as String
      Private Declare Function DeleteFile Lib "kernel32" Alias "DeleteFileA"
      '-----------------------------------------------------------
      ' This is a header
      '-----------------------------------------------------------
      Private Function foo(byVal x as Long, byVal y as String) as Integer

      It correctly ignores the first three lines and starts capturing at the '-----------. Thanks!
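      For anyone who wants to check the result rather than eyeball it, the following is a condensed sketch of the same idea that keeps the captured text in a variable. It is not the exact script above: the non-capturing groups, the variable names `$src` and `$code`, and the sample source are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical slurped VB source, modelled on the samples in this thread.
my $src = join "\n",
    'Private Const strLen = 80',
    '',
    'Private Declare Function DeleteFile Lib "kernel32" Alias "DeleteFileA" _',
    '    (ByVal lpFileName As String) As Long',
    '',
    q{'--------------------------------------------------------},
    q{' Purpose: Donuts},
    q{'--------------------------------------------------------},
    'Private Sub DrainBattery(ByVal sPercentage As Single)',
    'End Sub',
    '';

my ($code) = $src =~ /
    (                                      # capture from here to end of file
      (?: '-{20,}                          # opening dashed delimiter
          (?: \n ' [^\n]*? )+ \n           # header comment lines
          '-{20,} \n                       # closing dashed delimiter
      )?                                   # header is optional
      (?: (?:Private|Public|Friend) \s )?  # scope (optional)
      (?: Static \s )?                     # Static (optional)
      (?: Sub | Function ) \s              # Sub or Function (mandatory)
      \w+                                  # routine name
      (?: \( [^)]* \) )?                   # parameter list (optional)
      \s+                                  # whitespace after name or params
      (?! Lib \s )                         # reject library aliases
      .*                                   # the rest of the file
    )
/sx;

print $code;
```

      The lookahead works here because the `\s+` in front of it has to consume the whitespace first. If the name is shortened by backtracking, `\s+` has nothing to match, so the engine cannot slide the lookahead to a position where it would pass.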

Re: lookahead, lookbehind, ... I'm lost
by Roy Johnson (Monsignor) on May 06, 2009 at 18:13 UTC
    If you don't mind my whoring my own tutorial (you may have read it), the idiom you're looking for is:

    Matching a pattern that doesn't include another pattern

    You might want to capture everything between foo and bar that doesn't include baz. The technique is to have the regex engine look-ahead at every character to ensure that it isn't the beginning of the undesired pattern:
    /foo            # Match starting at foo
     (              # Capture
      (?:           # Complex expression:
        (?!baz)     #   make sure we're not at the beginning of baz
        .           #   accept any character
      )*            # any number of times
     )              # End capture
     bar            # and ending at bar
    /x;
    Note that you have to have the lookahead checked for every character you're accepting. In your case, the [^\n]+ sub-pattern needs to be adjusted to make sure every character is not the beginning of a Lib:
    (?:             # Complex expression:
      (?!Lib)       #   make sure we're not at the beginning of Lib
      [^\n]         #   accept any character
    )+              # any positive number of times
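    As a rough illustration of that sub-pattern in use, the sketch below filters individual lines rather than a slurped file. The sample lines, the names `@lines`, `$decl` and `@kept`, the `\b` word boundaries around Lib, and the `\z` anchor are assumptions added for the example; they are not part of the tutorial text.

```perl
use strict;
use warnings;

# Hypothetical sample lines, modelled on the ones in the question.
my @lines = (
    'Private Declare Function DeleteFile Lib "kernel32" Alias "DeleteFileA"',
    'Private Sub DrainBattery(ByVal sPercentage As Single)',
);

# Declaration keyword, then a run in which no character may begin "Lib".
# The \z anchor forces the run to reach the end of the string; without it the
# run could stop just before "Lib" and the match would still succeed.
my $decl = qr/
    \b(?:Sub|Function)\s     # Sub or Function
    (?:                      # Complex expression:
        (?!\bLib\b)          #   make sure we're not at the beginning of Lib
        [^\n]                #   accept any character except newline
    )+                       # any positive number of times
    \z                       # the run must reach the end of the string
/x;

my @kept = grep { $_ =~ $decl } @lines;
print "$_\n" for @kept;
```

    The anchor matters as much as the lookahead: the guarded run only rejects a line if something after it insists that the run cover the whole line. In the slurped-file regex, a trailing `.*` under `/s` would happily consume the Lib that the run refused.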

    Caution: Contents may have been coded under pressure.
      Great tutorial! Thanks!
