Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I originally wrote this dictionary comparison tool as part of an ongoing linguistics project. The script compares a text file with a compressed dictionary file (one word per line) and spits out various bits of information. You can use it to get a list of the words in your text that match the dictionary, the words that do not match the dictionary, and to print out debugging information if strange tokens are printing out in your word lists. For the word list options, it also prints out the number of matches for a particular token.

the script is useful because it is not possible for any single dictionary to serve all needs. This script can quickly show how a well a dictionary matches the texts it is used on. (For the linguists out there, think about the possibilities of a lexicon that only covers a particular word field or word set and allows you to compare that with any given text.)

Basically, what I am looking for is a critique of my code and style, turning this code (which does work, BTW) into a learning experience for me. So here is the whole thing (including POD) in <readmore> tags. Thanks in advance.

PS: I plan to put this code in the Catacombs once it has undergone sufficient peer review... :)

#!/usr/bin/perl use strict; use warnings; use Compress::Zlib; use Getopt::Long; use Pod::Usage; my $VERSION = 0.7; my $dictfile = 'dict.gz'; # Process command-line options my $help = ''; my $man = ''; my $version = ''; my $token_debug = ''; my $glossary_output = ''; my $dictionary_output = ''; GetOptions( 'help|?' => \$help, 'version' => \$version, 'man' => \$man +, 'token-debug' => \$token_debug, 'glossary' => \$glossary_output, 'd +ictionary' => \$dictionary_output ); print "This is version $VERSION of $0.\n" if $version; exit(0) if ($version); pod2usage(1) if $help; pod2usage(-exitstatus => 0, -verbose => 2) if $man; my $file = shift; my %dictionary = readdict(\$dictfile); my %glossary; findwords(); printlexicon(\%dictionary) if $dictionary_output; printlexicon(\%glossary) if $glossary_output; # Readdict reads in the dictionary file defined above using # the Compress:Zlib CPAN module. It returns a hash that is # used for all further dictionary operations. # sub readdict { my $dict = shift; my %dicthash; my $gz = gzopen($$dict, "rb") or die "Cannot open $$dict: $gzerrno +\n" ; while ($gz->gzreadline($_) > 0) { chomp; $dicthash{lc($_)} = 0; } die "Error reading from $$dict: $gzerrno\n" if $gzerrno != Z_STREA +M_END ; return %dicthash; } # findwords() reads in a file and compares words found in the file # with the contents of the dictionary read in by the readdict # function. It assigns counts to the elements of %dictionary and # creates %glossary elements and increases its values according to # the number of matches. # sub findwords { open my $if, "<", $file || die "Could not open $file: $!"; while (<$if>) { chomp; my @elements = split(/[ ']/,$_); foreach my $element (@elements) { next if $element =~ /[^A-Za-zĄ-’]/; # Don't need digits $element = lc($element); $element =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g; next if $element eq ''; print "[$element]\n" if $token_debug; # If the word matches a word in the dictionary, increase # the match count by one, otherwise assign it to the # glossary of words not found in the dictionary and up # the glossary count. if ( exists $dictionary{$element} ) { $dictionary{$element}++; } else { $glossary{$element}++; } } } } # Showmatches reads in a lexicon hash via a reference and prints all +words out # that have been seen in the findwords() function along with a freque +ncy count. # sub printlexicon { my $lexicon = shift; my $counter = 0; foreach my $key (sort keys %$lexicon) { if ( $$lexicon{$key} > 0 ) { print $key . " : " . $$lexicon{$key} . "\n"; $counter++; } } print "\n$counter entries total\n"; } __END__ =pod =head1 dict-compare A generic script for building dictionaries by comparing them to real-w +orld texts. =head1 SYNOPSIS C<dict-compare [--glossary --dictionary] [--token-debug] file > output +_file> =head1 DESCRIPTION This program compares the words in a given text file to a list of word +s from a dictionary file. It is capable of outputting lists of words that oc +cur or do not occur in a given dictionary file, along with their frequency in + the text. Debugging output using token tag marks is also available. =head2 Command-Line Options =over 12 =item C<--help,-h,-?> Prints a usage help screen. =item C<--man,-m> Prints out the manual entry for $0 =item C<--version,-v> Prints out the program version. =item C<--glossary> Prints a glossary of words not found in the dictionary file and the nu +mber of times they occur. =item C<--dictionary> Prints out the words from the text that had a dictionary match, along +with their respective frequencies. =item C<--token-debug> Prints tags around each token in the text to help sound out strange to +kens. =back =head1 EXAMPLE C<dict-compare --glossary myfile.txt> This command reads in the text contained in myfile.txt and prints out +a list of words not found in the dictionary and their frequencies. =back =head1 AUTHOR Damon "allolex" Davison - <allolex@sdf.freeshell.org> =head1 LICENSE This code is released under the same terms as Perl itself. =cut

NB: If you want to reproduce the dictionary so you can actually run this script as-is, *nix users can take the words file (/usr/share/dict/) and compress it as dict.gz using gzip. Alternatively, you could just write five lines/five words in a text editor and compress it...

--
Allolex


In reply to Constructive criticism of a dictionary / text comparison script by allolex

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (None)
    As of 2024-04-25 01:55 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      No recent polls found