Constructive criticism of a dictionary / text comparison script

allolex has asked for the wisdom of the Perl Monks concerning the following question:

I originally wrote this dictionary comparison tool as part of an ongoing linguistics project. The script compares a text file with a compressed dictionary file (one word per line) and spits out various bits of information. You can use it to list the words in your text that match the dictionary, to list the words that do not match, and to print debugging information if strange tokens show up in your word lists. For the word list options, it also prints the number of matches for each token.

The script is useful because it is not possible for any single dictionary to serve all needs. This script can quickly show how well a dictionary matches the texts it is used on. (For the linguists out there, think about the possibilities of a lexicon that only covers a particular word field or word set and allows you to compare that with any given text.)

Basically, what I am looking for is a critique of my code and style, turning this code (which does work, BTW) into a learning experience for me. So here is the whole thing (including POD) in <readmore> tags. Thanks in advance.

PS: I plan to put this code in the Catacombs once it has undergone sufficient peer review... :)

#!/usr/bin/perl

use strict;
use warnings;
use Compress::Zlib;
use Getopt::Long;
use Pod::Usage;

my $VERSION = 0.7;
my $dictfile = 'dict.gz';

#  Process command-line options

my $help = '';
my $man = '';
my $version = '';
my $token_debug = '';
my $glossary_output = '';
my $dictionary_output = '';

GetOptions( 'help|?' => \$help, 'version' => \$version, 'man' => \$man, 'token-debug' => \$token_debug, 'glossary' => \$glossary_output, 'dictionary' => \$dictionary_output );

print "This is version $VERSION of $0.\n" if $version;
exit(0) if ($version);
pod2usage(1) if $help;
pod2usage(-exitstatus => 0, -verbose => 2) if $man;

my $file = shift;
my %dictionary = readdict(\$dictfile);
my %glossary;

findwords();

printlexicon(\%dictionary) if $dictionary_output;
printlexicon(\%glossary) if $glossary_output;


#  Readdict reads in the dictionary file defined above using
#  the Compress::Zlib CPAN module.  It returns a hash that is
#  used for all further dictionary operations.
#
sub readdict {
    my $dict = shift;
    my %dicthash;

    my $gz = gzopen($$dict, "rb") or die "Cannot open $$dict: $gzerrno\n" ;
    while ($gz->gzreadline($_) > 0) {
        chomp;
        $dicthash{lc($_)} = 0;
    }
    die "Error reading from $$dict: $gzerrno\n" if $gzerrno != Z_STREAM_END ;
    return %dicthash;
}

#  findwords() reads in a file and compares words found in the file
#  with the contents of the dictionary read in by the readdict
#  function.  It assigns counts to the elements of %dictionary and
#  creates %glossary elements and increases its values according to
#  the number of matches.
#
sub findwords {
    #  'or' rather than '||': '||' binds to $file, so die would never run
    open my $if, "<", $file or die "Could not open $file: $!";
    while (<$if>) {
        chomp;
        my @elements = split(/[ ']/,$_);
        foreach my $element (@elements) {
            next if $element =~ /[^A-Za-zÀ-ÿ]/; #  Don't need digits
            $element = lc($element);
            $element =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g;
            next if $element eq '';
            print "[$element]\n" if $token_debug;
            #  If the word matches a word in the dictionary, increase
            #  the match count by one, otherwise assign it to the
            #  glossary of words not found in the dictionary and up
            #  the glossary count.
            if ( exists $dictionary{$element} ) {
                $dictionary{$element}++;
            } else {
                $glossary{$element}++;
            }
        }
    }
}

#  printlexicon reads in a lexicon hash via a reference and prints out all
#  words that have been seen in the findwords() function along with a
#  frequency count.
#
sub  printlexicon {
    my $lexicon = shift;
    my $counter = 0;
    foreach my $key (sort keys %$lexicon) {
        if ( $$lexicon{$key} > 0 ) {
            print $key . " : " . $$lexicon{$key} . "\n";
            $counter++;
        }
    }
    print "\n$counter entries total\n";
}

__END__

=pod

=head1 dict-compare

A generic script for building dictionaries by comparing them to real-world texts.

=head1 SYNOPSIS

C<< dict-compare [--glossary --dictionary] [--token-debug] file > output_file >>

=head1 DESCRIPTION

This program compares the words in a given text file to a list of words from
a dictionary file.  It is capable of outputting lists of words that occur or
do not occur in a given dictionary file, along with their frequency in the
text.  Debugging output using token tag marks is also available.

=head2 Command-Line Options

=over 12

=item C<--help,-h,-?>

Prints a usage help screen.

=item C<--man,-m>

Prints out the manual entry for $0

=item C<--version,-v>

Prints out the program version.

=item C<--glossary>

Prints a glossary of words not found in the dictionary file and the number of
times they occur.

=item C<--dictionary>

Prints out the words from the text that had a dictionary match, along with
their respective frequencies.

=item C<--token-debug>

Prints tags around each token in the text to help sound out strange tokens.

=back

=head1 EXAMPLE

C<dict-compare --glossary myfile.txt>

This command reads in the text contained in myfile.txt and prints out a list
of words not found in the dictionary and their frequencies.

=head1 AUTHOR

Damon "allolex" Davison - <allolex@sdf.freeshell.org>

=head1 LICENSE

This code is released under the same terms as Perl itself.

=cut

NB: If you want to reproduce the dictionary so you can actually run this script as-is, *nix users can take the words file (/usr/share/dict/) and compress it as dict.gz using gzip. Alternatively, you could just write five lines/five words in a text editor and compress it...

--
Allolex

Re: Constructive criticism of a dictionary / text comparison script
by sauoq (Abbot) on Aug 29, 2003 at 23:21 UTC

All in all, it looks fine. And the fact that it works is a big point in its favor. :-) The one thing that immediately stood out to me was your backwhack happiness in the character class in this line:

$element =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g;

[\s,!?._;)("'-]

[]^\\-]

Also, on a different line you used a literal space inside the character class. That's fine but sometimes it is easier to read if you use \x20 instead.
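To make the character class point concrete, here is a small sketch comparing the backslashed class from the original script with the leaner form. The sample string and variable names are made up for illustration, and the lean class includes the closing parenthesis so that it matches the original character for character.

```perl
use strict;
use warnings;

# The backslash-heavy class from the original script and the leaner form.
my $escaped   = qr/[\s\,\!\?\.\-\_\;\)\(\"\']/;
my $unescaped = qr/[\s,!?._;)("'-]/;    # '-' goes last so it cannot form a range

# Characters with special meaning inside a class: ] ^ \ -
# Placement makes most of them literal: ']' first, '^' not first, '-' last.
# Only the backslash itself still needs escaping.
my $specials = qr/[]^\\-]/;

# Hypothetical sample text, chosen to contain every character in the class.
my $sample = q{"Shut-up!" (he_said); it's over.};

(my $old = $sample) =~ s/$escaped//g;
(my $new = $sample) =~ s/$unescaped//g;

print $old eq $new ? "same result: $old\n" : "results differ\n";
```

Both substitutions strip the same characters, so the only difference is readability.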

-sauoq
"My two cents aren't worth a dime.";

Re: Re: Constructive criticism of a dictionary / text comparison script

by allolex (Curate) on Aug 30, 2003 at 06:06 UTC

Yes. Definitely backslash-happy. This is something that I have wondered about, but never really remembered to look up or ask. It's much easier to read your way. No fear, I will not go defining character class ranges with the dash. :) And using \x20 instead of a literal space is something that never occurred to me before, but seems like such an obviously good idea, that I'll now probably write a bunch of scripts that totally overuse it. ;)

--
Allolex

Re: Constructive criticism of a dictionary / text comparison script
by ajdelore (Pilgrim) on Aug 29, 2003 at 22:50 UTC

This is more a suggestion on functionality than a critique of code. One thing that I ran into with my boggle script is that the unix dict file doesn't have variants of words. For example, it has huge but not hugely, fish but not fishes or fishing, etc.

Ideally, you would have some kind of functionality to address this. One possibility is to stem words before you check them. I know that Lingua::Stem implements one popular algorithm to do this. I didn't look into it closely enough to see if it would do the trick for me.
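As a rough illustration of the idea (and explicitly not what Lingua::Stem does), here is a naive sketch that strips a few English suffixes before retrying the dictionary lookup. The tiny dictionary, the suffix list, and the base_form() helper are all hypothetical; a real stemmer handles far more cases.

```perl
use strict;
use warnings;

# Hypothetical mini-dictionary standing in for the real word list.
my %dictionary = map { $_ => 0 } qw(huge fish walk);

# Illustrative suffix list only; tried in this order.
my @suffixes = qw(ing ly es s);

# Return the dictionary entry a token maps to, or undef if none is found.
sub base_form {
    my ($word, $dict) = @_;
    return $word if exists $dict->{$word};
    for my $suffix (@suffixes) {
        # Copy the word, strip the suffix, and skip ahead if nothing was stripped.
        (my $stem = $word) =~ s/\Q$suffix\E$// or next;
        return $stem if exists $dict->{$stem};
    }
    return;
}

for my $token (qw(hugely fishes fishing walks zebra)) {
    my $base = base_form($token, \%dictionary);
    printf "%-8s => %s\n", $token, defined $base ? $base : '(no match)';
}
```

Blind suffix stripping will also produce false matches on real text, which is one reason to prefer a proper stemming module where one exists for the language.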

</ajdelore>

Re: Re: Constructive criticism of a dictionary / text comparison script

by allolex (Curate) on Aug 30, 2003 at 06:35 UTC

I really like your idea and it would work very well if I were dealing with texts in languages that all had a stemming module. I am seriously considering writing one for French. Currently, I am working with Italian, which does have Lingua::Stem::It, but my dictionary has word forms as well. The huge advantage of working with a stemmer is that it is also capable of stemming novel constructions (like stemage), which the dictionary does not account for. It would be a very interesting modification to create a dictionary of stem forms, but it would also be a lot more work checking its accuracy.

What would really be cool is a stemming module that defined all affixes via a hash of some kind, so that tense, mode/mood, plural, person, etc. could be looked up like

my %hash_of_verb_suffixes = (
   #  array refs, so each qw() list stays attached to its own key
   future      => [ qw([ei]rò [ei]rai [ei]rà [ei]remo [ei]rete [ei]ranno) ],
   conditional => [ qw([ei]rei [ei]resti [ei]rebbe [ei]remmo [ei]reste [ei]rebbero) ],
);

and so on.

Oh, wait. That's a POS tagger;)

In any case, I can see we think along similar lines. Thanks!

--
Allolex

Re: Constructive criticism of a dictionary / text comparison script
by Hutta (Scribe) on Aug 29, 2003 at 23:41 UTC

GetOptions(
        'help|?'      => \$help,
        'version'     => \$version,
        'man'         => \$man,
        'token-debug' => \$token_debug, 
        'glossary'    => \$glossary_output,
        'dictionary'  => \$dictionary_output    
);

Re: Re: Constructive criticism of a dictionary / text comparison script

by allolex (Curate) on Aug 30, 2003 at 05:52 UTC

Yes, that is a lot easier to read than the separate option declarations. Actually, I thought I reformatted that before I posted it here... Go figure :) I also like the idea of declaring the option init values in a hash, since it would make the code more legible (not to mention namespace economy). I'll definitely make both of those changes.

--
Allolex

Re: Re: Constructive criticism of a dictionary / text comparison script

by exussum0 (Vicar) on Aug 30, 2003 at 20:11 UTC

add(-number1=>10,-nuber2=>0),

you get a right result, but not the right way. Granted, this is the easiest bug to bring out; try adding 5 and 3 to get 0. But when you do it with more complex scenarios, you can get really weird software bugs. Also, it makes it harder to refactor code when you wish to remove parameters, add them, or make different requirements, 'cuz when they get called, they may not break... unless you check for every old parameter and new one in your functions. Yuck. Just a rant :)
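A minimal sketch of the pitfall, assuming a hypothetical add() that takes named arguments and treats a missing one as 0. The add_checked() variant is likewise made up; it shows one way to make a misspelled key fail loudly instead of being ignored.

```perl
use strict;
use warnings;

# Hypothetical add() taking named arguments; unknown keys are silently ignored.
sub add {
    my %args = @_;
    return ($args{-number1} || 0) + ($args{-number2} || 0);
}

# Stricter variant: reject any key that is not expected.
sub add_checked {
    my %args    = @_;
    my %known   = map { $_ => 1 } qw(-number1 -number2);
    my @unknown = grep { !$known{$_} } sort keys %args;
    die "Unknown argument(s): @unknown\n" if @unknown;
    return ($args{-number1} || 0) + ($args{-number2} || 0);
}

print add(-number1 => 10, -nuber2 => 0), "\n";    # right answer, but only by luck
print add(-number1 => 10, -nuber2 => 5), "\n";    # wrong answer, and no error

eval { add_checked(-number1 => 10, -nuber2 => 5) };
print "caught: $@" if $@;
```

The check costs a few lines per function, which is the "Yuck" part of the rant.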

Re: Constructive criticism of a dictionary / text comparison script
by Not_a_Number (Prior) on Aug 30, 2003 at 08:49 UTC

Hi allolex. There is a problem that nobody has yet mentioned. It concerns this line:

next if $element =~ /[^A-Za-zÀ-ÿ]/;

This is doing a lot more than you want it to, I think. Basically, it means "ignore any $element containing a character not in the set defined between square brackets". It is therefore stripping out, for example, any 'word' with attached punctuation. For example, in a sentence such as:

"Shut up!" he said.

you are throwing away three quarters of your 'words'! And you are also, of course, ignoring hyphenated words.

It also means that the line:

$element =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g;

never actually does anything, with or without surplus backslashes...
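A quick way to see this is to run the same split and the same filter over that sentence. This is only a sketch: the classify() helper is made up, and the character class is the ASCII-only part of the original (the accented range is left out to sidestep source-encoding questions).

```perl
use strict;
use warnings;

# Split a line the way findwords() does and apply the original filter.
sub classify {
    my ($line) = @_;
    my (@kept, @skipped);
    for my $element (split /[ ']/, $line) {
        if ($element =~ /[^A-Za-z]/) { push @skipped, $element }
        else                         { push @kept,    $element }
    }
    return (\@kept, \@skipped);
}

my ($kept, $skipped) = classify(q{"Shut up!" he said.});
print "kept:    @$kept\n";
print "skipped: @$skipped\n";
```

Only "he" survives the filter. Every element that still carries punctuation has already been skipped by the time the substitution would run, which is why that line has nothing left to strip.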

hth

dave

Re: Re: Constructive criticism of a dictionary / text comparison script

by allolex (Curate) on Aug 30, 2003 at 08:56 UTC

Oops!

sub findwords {
    open my $if, "<", $file or die "Could not open $file: $!"; # 'or', not '||'
    while (<$if>) {
        chomp;
        my @elements = split(/[ '-]/,$_); # split on hyphens, too
        foreach my $element (@elements) {
            next if $element =~ /\d/; #  Don't need digits
            $element = lc($element);
            $element =~ s/[\s,!?._;)("'-]//g; # thanks sauoq
            next if $element eq '';
            print "[$element]\n" if $token_debug;
            if ( exists $dictionary{$element} ) {
                $dictionary{$element}++;
            } else {
                $glossary{$element}++;
            }
        }
    }
}

Thanks a lot! I think that was another relic from a previous version. I'm glad you caught it.

--
Allolex

Re: Constructive criticism of a dictionary / text comparison script
by TomDLux (Vicar) on Aug 30, 2003 at 02:20 UTC

Initializing variables to an empty string is generally no advantage. Why not collapse those declarations into one line?

my ( $help, $man, $version, $token_debug, $glossary_output, $dictionary_output );
# or
my ( $help, $man, $version, $token_debug, $glossary_output, $dictionary_output ) = ( '', '', '', '', '', '' );

GetOptions will also take a reference to a hash as its first argument, instead of all the variables. Much more concise and you don't have to worry three pages down what that variable was called.
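Here is a rough sketch of the hash form, using the same option names as the script. The parse_options() wrapper and the sample arguments are only there to make the example self-contained; in the real script you would call GetOptions on @ARGV directly.

```perl
use strict;
use warnings;
use Getopt::Long;

# Parse a list of arguments into a hash of options.
# Returns the options hash ref and a ref to the leftover arguments.
sub parse_options {
    local @ARGV = @_;    # GetOptions works on @ARGV
    my %opt;
    GetOptions( \%opt, 'help|?', 'version', 'man',
                'token-debug', 'glossary', 'dictionary' )
        or die "Invalid options\n";
    return ( \%opt, [@ARGV] );
}

# Hypothetical command line for illustration.
my ($opt, $rest) = parse_options(qw(--glossary --token-debug myfile.txt));

print "glossary output requested\n" if $opt->{glossary};
print "token debugging enabled\n"   if $opt->{'token-debug'};
print "input file: $rest->[0]\n";
```

Each option is stored under its primary name, so 'help|?' ends up in $opt->{help}, and options that were not given simply have no key in the hash.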

--
TTTATCGGTCGTTATATAGATGTTTGCA

Re: Re: Constructive criticism of a dictionary / text comparison script

by allolex (Curate) on Aug 30, 2003 at 05:58 UTC

You put your finger right on a major problem: I didn't actually go through and fix things that were a result of adding features to the script over a few days. Your version is a lot more legible (and looks cooler) than mine. I think I'll take your last bit of advice and stick everything in a hash.

--
Allolex

Re: Re: Constructive criticism of a dictionary / text comparison script

by PodMaster (Abbot) on Oct 13, 2003 at 10:57 UTC

my( $help, $man, $version, $token_debug,
    $glossary_output, $dictionary_output ) = ('') x 6;

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

Re: Constructive criticism of a dictionary / text comparison script
by halley (Prior) on Aug 30, 2003 at 15:44 UTC

Moby Lexicon Project

--
[ e d @ h a l l e y . c c ]

Re: Re: Constructive criticism of a dictionary / text comparison script

by allolex (Curate) on Aug 30, 2003 at 22:36 UTC

Thanks. I didn't know about the project, but I don't work with English much at all. What might be interesting is to compile a list of similar resources for other languages as well. I wish there were a compendium linguisticae for precisely this sort of thing, but I think there are too many people working on very particular projects for this to happen. Maybe me... someday.

Before I completely forget and go off on another tangent (they kind of happen to me a lot), have you seen Kevin's Word List Page? There are some really interesting specialty lists there.

--
Allolex

Re: Constructive criticism of a dictionary / text comparison script
by allolex (Curate) on Sep 03, 2003 at 18:15 UTC

The updated version of this code can be found at "dict-compare: a dictionary evaluation script" in the Code Catacombs. Many thanks to all of you. :)

--
Allolex
