Category: | Text Processing |
Author/Contact Info | /msg allolex |
Description: | As promised in Constructive criticism of a dictionary / text comparison script, here is a cleaned-up version of the dictionary comparison code that the monks helped me with. What follows I just copied out of the POD in the script itself. It might make this code easier to find. A generic script for building dictionaries by comparing them to real-world texts. This program compares the words in a given text file to a list of words from a dictionary file. It is capable of outputting lists of words that occur or do not occur in a given dictionary file, along with their frequency in the text. Debugging output using token tag marks is also available. |
#!/usr/bin/perl # POD can be found at the bottom of this script use strict; use warnings; use Compress::Zlib; use Getopt::Long; use Pod::Usage; my $VERSION = 0.81; my $dictfile = 'dict.gz'; # Process command-line options my %cl_options = ( help => '', version => '', token_debug => '', glossary_output => '', dictionary_output => '' ); GetOptions( 'help|?' => \$cl_options{help}, 'version' => \$cl_options{version}, 'man' => \$cl_options{man}, 'token-debug' => \$cl_options{token_debug}, 'glossary' => \$cl_options{glossary_output}, 'dictionary' => \$cl_options{dictionary_output} ); print "This is version $VERSION of $0.\n" if $cl_options{version}; exit(0) if ($cl_options{version}); pod2usage(-exitstatus => 0, -verbose => 1, -msg => "Help for $0") if $ +cl_options{help}; pod2usage(-exitstatus => 0, -verbose => 2, -msg => "Man page for $0") +if $cl_options{man}; my $file = shift; my %dictionary = readdict(\$dictfile); my %glossary; findwords(); printlexicon(\%dictionary) if $cl_options{dictionary_output}; printlexicon(\%glossary) if $cl_options{glossary_output}; # Readdict reads in the dictionary file defined above using # the Compress:Zlib CPAN module. It returns a hash that is # used for all further dictionary operations. # sub readdict { my $dict = shift; my %dicthash; my $gz = gzopen($$dict, "rb") or die "Cannot open $$dict: $gzerrno +\n" ; while ($gz->gzreadline($_) > 0) { chomp; $dicthash{lc($_)} = 0; } die "Error reading from $$dict: $gzerrno\n" if $gzerrno != Z_STREA +M_END ; return %dicthash; } # findwords() reads in a file and compares words found in the file # with the contents of the dictionary read in by the readdict # function. It assigns counts to the elements of %dictionary and # creates %glossary elements and increases its values according to # the number of matches. sub findwords { open my $if, "<", $file || die "Could not open $file: $!"; while (<$if>) { chomp; my @elements = split(/[ '-]/,$_); # split on hyphens, too foreach my $element (@elements) { next if $element =~ /\d/; # Don't need digits print "[$element]->" if $cl_options{token_debug}; $element = lc($element); $element =~ s/[\s,!?._;«»)("'-]//g; print "[$element]\n" if $cl_options{token_debug}; next if $element eq ''; if ( exists $dictionary{$element} ) { $dictionary{$element}++; } else { $glossary{$element}++; } } } } # Showmatches reads in a lexicon hash via a reference and prints all +words out # that have been seen in the findwords() function along with a freque +ncy count. # sub printlexicon { my $lexicon = shift; my $counter = 0; foreach my $key (sort keys %$lexicon) { if ( $$lexicon{$key} > 0 ) { print $key . " : " . $$lexicon{$key} . "\n"; $counter++; } } print "\n$counter entries total\n"; } __END__ =pod =head1 dict-compare A generic script for building dictionaries by comparing them to real-w +orld texts. =head1 DESCRIPTION This program compares the words in a given text file to a list of word +s from a dictionary file. It is capable of outputting lists of words that oc +cur or do not occur in a given dictionary file, along with their frequency in + the text. Debugging output using token tag marks is also available. =head1 SYNOPSIS C<dict-compare [--glossary --dictionary] [--token-debug] file > output +_file> =head2 OPTIONS =over 12 =item C<--help,-h,-?> Prints a usage help screen. =item C<--man,-m> Prints out the manual entry for $0 =item C<--version,-v> Prints out the program version. =item C<--glossary> Prints a glossary of words not found in the dictionary file and the nu +mber of times they occur. =item C<--dictionary> Prints out the words from the text that had a dictionary match, along +with their respective frequencies. =item C<--token-debug> Prints tags around each token in the text to help sound out strange to +kens. The tokens themselves are printed side-by-side to show how the script +cleans up the results. =back =head1 EXAMPLE C<dict-compare --glossary myfile.txt> This command reads in the text contained in myfile.txt and prints out +a list of words not found in the dictionary and their frequencies. =back =head1 DICTIONARY FORMAT The dictionary is a one-word-per-line file that has been gzipped. You +r dictionary can be anything. Think of the possibilities. =head1 THANKS The following people have reviewed and offered inprovements to this co +de: =over 12 =item B<Sauoq> L<http://www.perlmonks.org/index.pl?node_id=182681> =item B<adjelore> L<http://www.perlmonks.org/index.pl?node_id=131479> =item B<Hutta> L<http://www.perlmonks.org/index.pl?node_id=117788> =item B<TomDLux> L<http://www.perlmonks.org/index.pl?node_id=144696> =item B<Not_A_Number> L<http://www.perlmonks.org/index.pl?node_id=2587 +24> =back And of course all of the others at the Monastery, Cologne.pm whose hel +p can only be seen in its cumulative effect. =head1 AUTHOR Damon "allolex" Davison - <allolex@sdf.freeshell.org> =head1 LICENSE This code is released under the same terms as Perl itself. =cut |
Back to
Code Catacombs