http://qs321.pair.com?node_id=856204

bronto has asked for the wisdom of the Perl Monks concerning the following question:

Hello there

I need some help with Parse::RecDescent. I successfully created a grammar to parse a pseudo-ini file (more on the format in a second), but I would like to expand it to parse inline comments, as well as single-line comments.

Disclaimer: I didn't choose the format, and I can't change it. The only thing I can do with it is to parse it.

The file has section declarations like this:

[section_name]

and key/value associations like this:

parameter=value
values="may also be quoted"

It can have comment lines

    ; like this

and it allows for blank lines.

It allows multiple assignments to the same parameter, resulting in an array of values for that parameter:

parameter=value1
parameter=value2
parameter=value3
; that would result in parameter = ( value1, value2, value3 )

A parameter can also be assigned a full Pike multi-value data structure, that is, an array:

    parameter=({ value1, value2, value3 })

or a mapping (key/value pairs):

    parameter=([ "key1" : "value1", "key2" : "value2" ])

These structures can span multiple lines and be nested, as in:

parameter=({
    ([
        "key1" : "value1",
        "key2" : ({ "array", "value" }),
    ]),
    "second element of this array",
    ({
        "and here is another array",
        ({
            "with another one nested",
        }),
        ([
            "that" : "contains",
            "one" : "more",
            "hash" : "value",
        ]),
    }),
    "Hooray!",
})

As I said, I wrote a grammar that correctly parses a 350 KB file (slowly, but it works well):

# $Id: g3.txt,v 1.14 2010/07/23 12:41:24 bronto Exp bronto $

AsIni: Line(s?) /\Z/

Line:   CommentLine
      | BlankLine
      | SectionDeclaration
      | AssignmentLine
      | <error>

CommentLine: <skip: q{}> /^\s*/m ';' /.*$/m
{
    print STDERR qq{\tSkipping comment: $item[4]\n} ;
}

BlankLine: <skip: q{}> /^\s+$/m
{
    print STDERR qq{\tSkipping blank line\n}
}

SectionDeclaration: '[' /[^\]]+/ ']'
{
    print STDERR qq{In section "$item[2]"\n} ;
    my $sectionname = $item[2] ;
    $AsIni::section = $sectionname ;
}

AssignmentLine: Parameter '=' Value(?)
{
    my $distvalue = $item[3] ;
    my $parmname = $item[1] ;
    my $paramvalue ;
    ( $paramvalue ) = @$distvalue ;

    if ( not exists $AsIni::node{$AsIni::section}{$parmname} ) {
        $AsIni::node{$AsIni::section}{$parmname} = [] ;
    }

    # Get a reference to the current array of values for this parameter
    # in $current
    my $current = $AsIni::node{$AsIni::section}{$parmname} ;

    # We can update this safely, since we are using the reference
    push @$current,$paramvalue ;
}

Parameter: /\w[\w\s-]*/
{
    $return = $item[1] ;
}

Value: PikeStructure | ValueString
{
    $return = $item[1] ;
}

ValueString: QuotedString | UnquotedString
{
    $return = $item[1] ;
}

QuotedString: '"' /[^"]+/ '"'
{
    $return = $item[2] ;
}

UnquotedString: /.+/
{
    $return = $item[1] ;
}

# This rule matches a number, but rejects null-length results
Number: /[+-]?\d*(\.\d+)?/ <reject: $item[1] eq ''>
{
    $return = $item[1] ;
}

PikeStructure: PikeArray | PikeMapping
{
    # $item[1] is a reference to an array (PikeArray) or hash
    # (PikeMapping). We bubble it up as is
    $return = $item[1] ;
}

PikeArray: '({' PikeArrayContent '})'
{
    # $item[2] is a PikeArrayContent, and since PikeArrayContent
    # bubbles up a reference to an array of PikeValues, this should be
    # an array reference that we can safely bubble up as is.
    $return = $item[2] ;
}

PikeMapping: '([' PikeMappingContent '])'
{
    # $item[2] is a PikeMappingContent, and since PikeMappingContent
    # bubbles up an hash reference, this should be an hash reference that we
    # can safely bubble up as is.
    $return = $item[2] ;
}

PikeStructureSeparator: ','

PikeArrayContent: PikeArraySequence(?) PikeStructureSeparator(?)
{
    # $item[1] comes from a repetition of PikeArraySequence,
    # so it is a reference to an array of 0 or 1 PikeArraySequence.
    # In turn, PikeArraySequence is a reference to an array of
    # PikeValue's. We don't want to change the PikeValue's but we need
    # to unroll $item[1] before bubbling it up.
    ( $return ) = @{ $item[1] } ;
}

PikeArraySequence: PikeValue PikeArrayFurtherValue(s?)
{
    # $item[1] is a PikeValue, hence:
    # - a reference to an array or hash (if PikeStructure)
    # - a scalar (if QuotedString or Number)
    #
    # $item[2] comes from a repetition of PikeArrayFurtherValue,
    # so it is a reference to an array of 0 or 1 PikeArrayFurtherValue.
    # In turn, PikeArrayFurtherValue just returns a PikeValue. So,
    # we actually don't want to change $item[1], but we need to
    # unroll $item[2] before returning it. Actually, we return an
    # array reference with the whole thing.
    $return = [ $item[1], @{ $item[2] } ] ;
}

PikeArrayFurtherValue: PikeStructureSeparator PikeValue
{
    # $item[2] is a PikeValue, hence:
    # - a reference to an array or hash (if PikeStructure)
    # - a scalar (if QuotedString or Number)
    # We bubble it up as is.
    $return = $item[2] ;
}

PikeMappingContent: PikeMappingSequence(?) PikeStructureSeparator(?)
{
    # Since we have a repetition here, $item[1] is a reference to an
    # array which may contain 0 or 1 PikeMappingSequence's.
    # In turn, PikeMappingSequence returns an hash reference.
    # So, if we want the hash reference to bubble up, we have to
    # unwrap it and return it as is.
    ( $return ) = @{ $item[1] } ;
}

PikeMappingSequence: PikeMappingPair PikeMappingFurtherPair(s?)
{
    # $item[1] is a PikeMappingPair, hence a reference to an array
    # of two elements: a string and a PikeValue, that is:
    # - a reference to an array or hash (if Pikevalue ~ PikeStructure)
    # - a scalar (if PikeValue ~ QuotedString or Number)
    #
    # $item[2] has a repetition, so it is a reference to an array of
    # PikeMappingFurtherPair's. Since PikeMappingFurtherPair just
    # returns a PikeMappingPair (see $item[1]), then $item[2] is
    # a reference to an array where each element is, in turn, a
    # reference to an array of two elements.
    #
    # Since we are going to return an hash here, we create a reference
    # to an hash; to correctly unroll the values of $item[1] and
    # $item[2] we:
    # - simply dereference $item[1], hence unrolling the only hash
    #   pair the array contained
    # - we dereference $item[2], getting an array of arrays, and
    #   then we use map to further unroll the key/value pairs
    #
    # We then bubble up the outcome
    $return = { @{ $item[1] } , map( @$_ , @{ $item[2] } ) } ;
}

PikeMappingPair: QuotedString ':' PikeValue
{
    # $item[1] is a scalar (QuotedString)
    # $item[3] is a PikeValue, hence:
    # - a reference to an array or hash (if Pikevalue ~ PikeStructure)
    # - a scalar (if PikeValue ~ QuotedString or Number)
    # We throw them up together as a single entity: a reference to an array
    $return = [ $item[1], $item[3] ] ;
}

PikeMappingFurtherPair: PikeStructureSeparator PikeMappingPair
{
    # $item[2] is a PikeMappingPair, hence a reference to an array
    # containing a QuotedString (the first) and a PikeValue, hence:
    # - a reference to an array or hash (if Pikevalue ~ PikeStructure)
    # - a scalar (if PikeValue ~ QuotedString or Number)
    $return = $item[2] ;
}

PikeValue: PikeStructure | QuotedString | Number
{
    # $item[1] is:
    # - a reference to an array or hash (if PikeStructure)
    # - a scalar (if QuotedString or Number)
    $return = $item[1] ;
}

I would like to extend it so that I could use inline comments, e.g.:

parameter="value" ; like this
parameter=({
    "value1", ; but
    "value2", ; also
    "value3", ; like
    "value4",
}) ; this

I tried a few solutions, but the only result was that the parser failed, or that one inline comment swallowed far more than it should... Any suggestions?

Thanks in advance!

Ciao!
--bronto


In theory, there is no difference between theory and practice. In practice, there is.

Replies are listed 'Best First'.
Re: Extending a Parse::RecDescent grammar for inline comments
by JavaFan (Canon) on Aug 20, 2010 at 09:16 UTC
    I would use the SKIP directive - the pattern used to skip text between tokens (whitespace by default). I don't remember the syntax for setting the directive, but setting it to
    /\s*(?:;[^\n]*\n)?/
    should go a long way.
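A minimal sketch of how the skip pattern might be set. It assumes a comment runs from ';' to the end of the line, and it uses a hypothetical toy grammar (the `Array`, `More` and `Value` rules are made up for illustration) rather than the full grammar from the question. The pattern is a variation on the one suggested above: it repeats, so it can also consume the indentation that follows a comment. `$Parse::RecDescent::skip` is the module's universal prefix and has to be set before the parser is built; a `<skip: ...>` directive inside a production is the other documented way to change it.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Assumption: a comment runs from ';' to the end of the line.
# The universal prefix is applied when the grammar is built, so set it
# before calling new(). It skips any mix of whitespace and ';' comments.
$Parse::RecDescent::skip = '(?:\s+|;[^\n]*)*';

# Hypothetical toy grammar (not the AsIni grammar above): a flat Pike array
# of quoted strings. Each string is one regex token, so a ';' inside
# quotes is never seen by the skip pattern.
my $grammar = <<'GRAMMAR';
Array: '({' Value More(s?) '})' /\Z/
    { $return = [ $item[2], @{ $item[3] } ] }
More:  ',' Value
    { $return = $item[2] }
Value: /"[^"]*"/
    { $return = substr( $item[1], 1, -1 ) }
GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

my $input = <<'INPUT';
({
    "value1", ; but
    "value2", ; also
    "a;b"
}) ; this
INPUT

my $result = $parser->Array($input);
die "Parse failed\n" unless defined $result;
print join( ' | ', @$result ), "\n";
```

Note that rules such as `UnquotedString: /.+/` in the original grammar would still swallow a trailing comment on the same line, and productions that set `<skip: q{}>` locally are not affected by the universal prefix, so this is a starting point rather than a drop-in fix.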

      Thanks for the suggestion! Is it possible to do it in a way that I can capture the content of the comment?

      Ciao!
      --bronto


        Then it's not a comment at all (i.e. it's meaningful), and you'd match it like anything else (i.e. not use <skip>).
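One way this could look, as a rough sketch: the comment becomes an ordinary optional subrule whose text is returned with the rest of the line. The rule names (`Assignment`, `Name`, `Quoted`, `Comment`) and the returned hash layout are hypothetical, chosen only for this illustration; the default whitespace skip is left untouched.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical toy grammar: one assignment with an optional trailing comment.
# The comment is a single regex token, so it cannot run past the newline.
my $grammar = <<'GRAMMAR';
Assignment: Name '=' Quoted Comment(?) /\Z/
    { $return = { name => $item[1], value => $item[3], comment => $item[4][0] } }
Name:    /\w+/
Quoted:  /"[^"]*"/
    { $return = substr( $item[1], 1, -1 ) }
Comment: /;[^\n]*/
GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

# Comment(?) yields an array reference with zero or one element,
# so $item[4][0] is undef when no comment is present.
my $with    = $parser->Assignment('parameter="value" ; like this');
my $without = $parser->Assignment('parameter="a;b"');

die "Parse failed\n" unless defined $with && defined $without;

print "comment: $with->{comment}\n";
print "value without comment: $without->{value}\n";
```

Matching the whole comment with one pattern, rather than `';'` followed by a second pattern, avoids the case where the whitespace prefix eats a newline after an empty comment and the second pattern then consumes the following line.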
Re: Extending a Parse::RecDescent grammar for inline comments
by sundialsvc4 (Abbot) on Aug 20, 2010 at 19:49 UTC

    There’s only so far that you can go with a recursive-descent parsing solution, and, unfortunately, it isn’t very far at all. It is not your only alternative, nor is it necessarily the best one if the language that you need to process becomes complicated. Don't be “in a hurry” to change your approach, of course, but don't be afraid to, either. If you find your program becoming complicated in structure, just because the language is, that ought to be a warning flag to you. Just keep your eyes open and continue to use your best judgment.

Re: Extending a Parse::RecDescent grammar for inline comments
by CountZero (Bishop) on Aug 20, 2010 at 17:38 UTC
    I know nothing about Parse::RecDescent, but my (naive) solution would be to pre-process the file and delete the offending comments. Provided you will have no more than one ';' per line that could be easily done with a split.

    That would also allow you to capture the comments and save them somewhere.
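The pre-processing idea could be sketched roughly as follows, using only core Perl. `strip_comments` is a hypothetical helper name, and the sketch relies on the same assumption as above: a ';' always starts a comment, so it must never appear inside a quoted value.

```perl
use strict;
use warnings;

# Hypothetical helper: split every line at the first ';' into payload and
# comment. Returns the cleaned text and a list of [ line number, comment ].
# Assumes ';' never appears inside a quoted value.
sub strip_comments {
    my ($text) = @_;
    my ( @clean, @comments );
    my $lineno = 0;

    for my $line ( split /\n/, $text ) {
        $lineno++;
        my ( $payload, $comment ) = split /;/, $line, 2;
        $payload = '' unless defined $payload;    # empty line
        $payload =~ s/\s+$//;                     # drop trailing blanks
        push @clean, $payload;

        if ( defined $comment ) {
            $comment =~ s/^\s+|\s+$//g;
            push @comments, [ $lineno, $comment ];
        }
    }

    return ( join( "\n", @clean ) . "\n", \@comments );
}

my $input = <<'END';
parameter="value" ; like this
parameter=({
    "value1", ; but
    "value2",
}) ; this
END

my ( $clean, $comments ) = strip_comments($input);

print $clean;
print "line $_->[0]: $_->[1]\n" for @$comments;
```

The cleaned text keeps one output line per input line, so line numbers reported by the parser still match the original file, and the saved comments can be mapped back to their lines.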

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James