http://qs321.pair.com?node_id=840678

back-n-black has asked for the wisdom of the Perl Monks concerning the following question:

I've never been that good at regular expressions. What I want to do is parse many log entries for words, ultimately, in SQL like expressions.

For example.

$line = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = 'brother' ) or ( is-mother.to code = 'dog' ) )";

What ultimately I need out of these strings are:

my.my id is-relative.to code is-relative.to code is-mother.to code

but something like this would be great!

( my.my id >= 1 ) ( is-relative.to code = 'sister' ) ( is-relative.to code = 'brother' ) ( is-mother.to code = 'dog' )

or

my.my id >= 1 ) is-relative.to code = 'sister' is-relative.to code = 'brother' is-mother.to code = 'dog'

I have been looking a while for hints to an elegant resolution for this problem. There is much dialogue about the use of Text::Balanced but not enough examples in the documentation for my little brain, to help me solve the riddle.

I have an example here that just pulls the expressions, I know what to do from there. I would like some ideas or code examples on a more elegant solution using one of the CPAN modules if that is possible.

What it basically does is:

  1. Split the text at the first close parens
  2. Parse the expression out of this "before" text
    • Remove everything up to and including the last open paren
    • Remove any beginning or trailing spaces
  3. Split the "after" text this time, and repeat the above operations

Here is a snippet of code that pulls the expressions

$text = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) an +d ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = 'br +other' ) or ( is-mother.to code = 'dog' ) )"; my $new = $text; while ( 1 ) { $ind = index($new, ')'); # Split the text at the first close parens $before = substr($new,0,$ind); $after = substr($new,$ind); last if ( $before eq "" ); # Clean up the before string # Remove everything up to and including the last open paren # Remove any beginning or trailing spaces $before = substr($before,rindex($before,'(')+1); $before =~ s/^\s+//; $before =~ s/\s+$//; push(@list,$before); if ( $after =~ /\)/ ) { # Disgard chars up to the first open paren $after = substr($after,index($after,'(')+1); $new = $after; print "\n"; } else { last; } } foreach my $i (@list) { print "--".$i."--\n"; }

Replies are listed 'Best First'.
Re: Elegant examples to parse parenthesised strings
by jettero (Monsignor) on May 19, 2010 at 12:11 UTC
    I've never been that good at regular expressions.

    There's a problem here. Traditionally, regular expressions can't match things that count (anbn). Perl RE have things that help with this (things that aren't really DFA-y); but you still probably want to look at things like Text::Balanced.

    -Paul

Re: Elegant examples to parse parenthesised strings
by cdarke (Prior) on May 19, 2010 at 13:06 UTC
    When parsing strings I usually go for the 'brute-force' approach (I'm just that kinda guy), splitting around each char and testing one-at-a-time:
    use feature ":5.10"; use strict; use warnings; my $line = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) + and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = +'brother' ) or ( is-mother.to code = 'dog' ) )"; my $open = 0; my @result = (''); for my $char (split ('',$line)) { given ($char) { when ('(') { $open++ } when (')') { $open--; push @result,'' } default { $result[-1] .= $char if $open } } }
    That gives me:
    my.my id >= 1 is-relative.to code = 'sister' or is-relative.to code = 'brother' or is-mother.to code = 'dog'
    Which is not good enough, so I need some tidy-up:
    for my $line (@result) { $line =~ s/^\s+(?:or\s+)?(.*)[ ><!=]=.*/$1\n/; } @result = grep !/^\s+$/,@result; print @result,"\n";
    Which gives:
    my.my id is-relative.to code is-relative.to code is-mother.to code
    which I think is what you are looking for. Not sure that it is all that elegant though.
Re: Elegant examples to parse parenthesised strings
by Krambambuli (Curate) on May 19, 2010 at 12:48 UTC
    If, as it seems, you need only the innermost parenthesises, something like
    my @exps = $text =~ / \( ( [^\(]*? ) \) /xmg;
    might be good enough for this particular situation.


    Krambambuli
    ---
Re: Elegant examples to parse parenthesised strings
by toolic (Bishop) on May 19, 2010 at 13:02 UTC
Re: Elegant examples to parse parenthesised strings
by Marshall (Canon) on May 19, 2010 at 18:04 UTC
    I guess another fairly brute force approach..

    #!/usr/bin/perl -w use strict; my $line = "05/04/2010 13:09:45 - A - somebody - ( (( my.my id >= 1 )) + ) and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code += 'brother' ) or ( is-mother.to code = 'dog' ) )"; my @terms = $line =~ /\(.*?\)/g; foreach (@terms) { s/.*(\(.*\)).*/$1/; s/^\(\s*//; s/\s*\)$//; print "$_\n"; } __END__ Prints: my.my id >= 1 is-relative.to code = 'sister' is-relative.to code = 'brother' is-mother.to code = 'dog'
Re: Elegant examples to parse parenthesised strings (Parse::Balanced)
by repellent (Priest) on May 24, 2010 at 05:29 UTC
    A rather obscure module I wrote some time ago experiments with using complex data structures to define how a string is to be parsed in a balanced way.

    A list of tokens describes the parsing behavior. Each token could either be a string, a regexp, or an arrayref of tokens. The arrayref is used to define a "balance branch" where its first token is searched for in order to enter the branch, and its last token is searched for to leave the branch. Any string/regexp token in between is skipped over as an "escaped" literal.
    use Data::Dumper; use Parse::Balanced qw(parse_balanced is_balanced); my $str = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) +and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = ' +brother' ) or ( is-mother.to code = 'dog' ) )"; # recursively parse opening and closing parentheses my @tokens; @tokens = ( "(", \@tokens, ")" ); # parse with a single branch my @p = parse_balanced($str, \@tokens); print Dumper \@p; print "String is ", (is_balanced(@p) ? "" : "not "), "balanced.\n"; # break circular reference @tokens = (); __END__ $VAR1 = [ '05/04/2010 13:09:45 - A - somebody - ', [ '(', ' ', [ '(', ' my.my id >= 1 ', ')' ], ' ', ')' ], ' and ', [ '(', ' ', [ '(', ' is-relative.to code = \'sister\' ', ')' ], ' or ', [ '(', ' is-relative.to code = \'brother\' ', ')' ], ' or ', [ '(', ' is-mother.to code = \'dog\' ', ')' ], ' ', ')' ] ]; String is balanced.

    And here's the module:

      Thanks to all for your responses

      As it turns out I had forgotten a case which all methods are having problems solving.

      For example:

      my $str = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) an +d ( is-mother.to code intersects afunc(val1,val2,val3,val4) and ( is- +father.to code intersects bfunc(val1,val2,val3,val4) )";

      Krambambuli - Nice little one liner

      cdarke - I do not have access to feature in this environment and in a very quick look at cpan I could not find what module had feature in it. Sorry, I did not pursue this example.

      Marshell - I liked this, good small brute force example

      repellent - This is really what I was looking for but like the others it is having problems with ( is-mother.to code intersects afunc(val1,val2,val3,val4).

      my $str = "05/04/2010 13:09:45 - A - somebody - ( is-mother.to code in +tersects afunc(val1,val2,val3,val4) )"; $VAR1 = [ '05/04/2010 13:09:45 - A - somebody - ', [ '(', ' is-mother.to code intersects afunc', [ '(', 'val1,val2,val3,val4', ')' ], ' ', ')' ] ]; String is balanced.

      Since it seems that all are having issues with this one scenario I am thinking I will have to combine Parse::Balanced with some brute force operations.

      One idea I have is to just
      • Grab the portions of the line containing this unique pattern(s).
      • Remove it from the line and deal with it seperately.
      • Finish normally with whats left of the resulting line.

      Unless someone can come up with a way to handle the line in it's entirety.

      Result should be my $str = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) and ( is-mother.to code intersects afunc(val1,val2,val3,val4) and ( i +s-father.to code intersects bfunc(val1,val2,val3,val4) )"; my.my id >= 1 is-mother.to code intersects afunc(val1,val2,val3,val4) is-father.to code intersects bfunc(val1,val2,val3,val4)
        Add an "escape" token for those function calls:
        @tokens = ( "(", qr/(?:[ab]func\([^)]*\))/, \@tokens, ")" );

        Parsing contents in-between the function call parentheses is left as an exercise.

        By the way, your second $str is unbalanced.