http://qs321.pair.com?node_id=618629
Category: Text Processing
Author/Contact Info raptur jpate@ling.ohio-state.edu
Description: Reads in a directory of corpus files in the format used by the Penn Treebank:

((NP
(ADJP (ADV just) (ADJ another))
(NN perl)(NN hacker)))

finds all the grammar rules demonstrated in the corpus files for a given node, contextualized relative to the local left-sister (or mother if -m is passed). Moreover, it counts how many times each rule is observed, and produces a weighted grammar for the given node only. Designed for exploring the sort of information contained in local context in natural language, and whether there are meaningful clusters of context that improve the accuracy of natural language parsing. For earlier work in this area, see:
Johnson, Mark (1998). The effect of alternative tree representations on tree bank grammars. In D.M.W. Powers (ed.) New Methods in Language Processing and Computational Natural Language Learning, ACL, pp. 39-48.

This is a revised title and description for the earlier post entitled CPCFG_count.pl in response to others' suggestions. This post is otherwise unchanged.

#!/usr/bin/perl
#Author: John Pate
#Written for a project that explored Probabilistic Context Free Gramma
+r
#contextualization possibilites (per Johnson 1998).
#
#counts the expansions of a given node in every Penn-Treebank style 
#file of a given directory, sub-categorizing nodes according to 
#local left-sister or mother context depending on the flag
#
#sister context is default
#
#Performs clever string handling rather than actual tree traversals
#to save time and space.


#identifies flags
#true extraCounts precedes each LHS with the number of RHSs and
#total instances
#true relevant separates each LHS section with a blank line
if($ARGV[0] =~/^-[a-z]+/)
{
    $extraCounts = 1 if $ARGV[0] =~ /c/;
    $relevant = 1 if $ARGV[0] =~ /r/;
    $mother=1 if $ARGV[0]=~/m/;
    shift @ARGV;
}

if($#ARGV<2)
{
    print STDERR "\nUsage: ./CPCFG_count.pl [-crm] node collapsings in
+putdir [countmin]\n\n";
    print STDERR "Options:\n";
    print STDERR "\tc: print extra counts\n";
    print STDERR "\tr: print LHS separating newline\n";
    print STDERR "\tm: subcategorise according to mother rather than l
+ocal left-sister\n\n";
    exit;
}

my ($node, $basicColl, $inputDir, $countmin) = @ARGV;


@aunties=();
@mothers=();
$depth=-1;

#Read in file containing nodes to treat the same
#for example, could treat all verb forms as just verb
#Format: rawCategory:preProcessedCategory
open(COLL,"$basicColl") or die ("Error: $!");
while($line=<COLL>)
{
    chomp $line;
    @coll = split /:/,$line;
    $basicColl{$coll[0]}=$coll[1];
}
close(COLL);

opendir(INPUTCORP,"$inputDir") or die ("Error: $!");
@inputCorpora=readdir(INPUTCORP);
closedir(INPUTCORP);



foreach $file (@inputCorpora)
{
    #skip system files
    if($file=~/^\./)
    {next;}
    
    $path="$inputDir/$file";
    
    open(CORP,"$path") or die ("Error: $!");
    
    $char = getc CORP;
    
    while(defined $char)
    {
        #holdover from debugging
        #should never actually happen
        #uncomment if you have problems
        #if($depth<-1)
        #{
        #    print STDERR "|$depth|$curNode|$path|\n";
        #    exit;
        #}
        
        #left parenthesis signals 1) new node coming and
        #2) one constituent deeper
        if($char eq '(')
        {
            $depth++;
            
            #Any information about these daughters has already been re
+corded elsewhere
            delete $daughters[$depth+1];
            #new node coming
            $curNode='';
        
        
            #read new node (ignore grammatical function labels, which 
+are preceded by '-')
            $char = getc CORP;
            while($char =~ /[\w0-9]/)
            {
                $curNode .= $char;
                $char = getc CORP;
            }
            
            #if this node is to be collapsed, collapse it
            if(defined $basicColl{$curNode})
            {
                $trueCurNode=$basicColl{$curNode}
            }else{
                $trueCurNode=$curNode;
            }
            
            #add to the expansion of the node just read in (ignoring p
+unctuation)
            push @{$daughters[$depth]}, $trueCurNode if $depth>-1 && $
+trueCurNode =~ /\w/;
            
            #remember the current dominating node as the left sister o
+f the node to be read in next
            $aunties[$depth]=$mothers[$depth];
            $mothers[$depth]=$trueCurNode;
            
            $curPos=tell CORP;
            $curPos--;
            seek(CORP, $curPos, 0);
        }elsif($char eq ')'){
            
            #any information about these has already been recorded els
+ewhere
            $aunties[$depth+1]='';
            $mothers[$depth+1]='';
            
            #form the components of the rule
            $rhs=join(', ',@{$daughters[$depth+1]});

            if($mother)
            {
                $lhs="$mothers[$depth]"."_$mothers[$depth-1]";
            }else{
                $lhs="$mothers[$depth]"."_$aunties[$depth]";
            }

            #increment the count for the rule seen
            $rules{$lhs}{$rhs}++ if $rhs =~/\w/ && $mothers[$depth] eq
+ $node;
            $depth--;
            
        }
        $char = getc CORP;
    }
    close(CORP);
}


#print out actual counts
foreach $lhs (keys %rules)
{
    $presentation='';
    $instances=0;
    $rules=0;
    foreach $rhs (keys %{$rules{$lhs}})
    {
        #print STDERR "\t$rhs\n";
        $instances+=$rules{$lhs}{$rhs};
        $rules++;
        $presentation.= "$rules{$lhs}{$rhs}: $lhs --> $rhs\n";
    }
    
    $presentation="$instances\n$rules\n".$presentation if ($extraCount
+s && $instances >= $countmin);
    $presentation.="\n" if ($relevant && $instances >= $countmin);
    print $presentation if $instances >= $countmin;
}


if($depth==-1)
{
    print STDERR "\nBrackets match!\n\n";
}else{
    print STDERR "\nWARNING: Brackets do NOT match!\n\n";
}

print STDERR "\a";
Replies are listed 'Best First'.
Re: CPCFG_count.pl
by jdporter (Paladin) on Jun 01, 2007 at 15:27 UTC

    Yeah, I have a few comments, presented here in decreasing order of importance.

    First and foremost, you should be useing strict and warnings.
    When you add those, you will get a whole slew of errors, which tell you the next thing you should do:

    Use lexical variables to the extent possible. And secondarily to that, declare lexical variables in the nearest enclosing scope as possible. In most cases that will be the innermost scope in which the variable is first used, but occasionally you need it to survive that scope, so moving the declaration out the necessary number of levels is appropriate.

    Avoid "useless use of quotes", aka useless string interpolation. Note that "$foo" evaluates to exactly the same thing as $foo (as long as $foo is already a plain string variable).

    I'd recommend using Getopt::Std, rather than hand-rolling a cmdline option parser.

    Your die statements for file open errors should include the name of the file, and, ideally, what operation was being attempted and failed.

    I recommend using $_ whenever possible, as long as doing so doesn't substantially increase the obfuscation factor. For example, I'd write

    while ( <COLL> ) { chomp; @coll = split /:/; $basicColl{$coll[0]} = $coll[1]; }
    These are the situations for which $_ was invented!

    It's good style always to use a file open mode indicator. It's true that open F, $foo opens the file for reading by default, but, stylistically, open F, "< $foo" (or open F, "<", $foo) is better. (It is said that that last form, the "three-argument open", is best of all, as it prevents a certain class of bugs.)

    Some of your parentheses are unnecessary and add to the clutter (at least in some people's opinion). For example,

    open(CORP,"<$path") or die ("Error...: $!");
    could be written as
    open CORP,"<$path" or die "Error...: $!";

    You might consider using the "ternary operator". It would significantly clean up certain situations, such as this one:

    if($mother) { $lhs="$mothers[$depth]"."_$mothers[$depth-1]"; } else { $lhs="$mothers[$depth]"."_$aunties[$depth]"; }
    That could be written as
    $lhs = join '_', $mothers[$depth], $mother ? $mothers[$depth-1] : $aunties[$depth];

    I'm not sure, but it looks like you're just parsing LISP-like structures, with balanced parentheses. If so, you're in serious wheel reinvention territory. It's fine for learning, but for production you might want to consider using a module such as Text::Balanced, Regexp::Common, Text::PromptBalanced, Parse::RecDescent, or perl-lisp.

    Lastly, I would recommend thinking of a better title for your post. CPCFG_count.pl may work for you as the script's filename, but PerlMonks titles should convey some information. It helps to think of the title as the place to put key words. Thanks.

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
Re: CPCFG_count.pl
by zentara (Archbishop) on Jun 01, 2007 at 13:07 UTC
    Counts, in a Penn-Treebank style corpus, syntactic productions of a given node subcategorised according to local left sister context (or mother context if -m is passed). I've never taken a real class devoted specifically programming, so all critiques are very welcome!

    Can you translate this to English? :-) I've never studied Penn-Treebank.


    I'm not really a human, but I play one on earth. Cogito ergo sum a bum