http://qs321.pair.com?node_id=287887


in reply to Constructive criticism of a dictionary / text comparison script

Hi allolex. There is a problem that nobody has yet mentioned. It concerns this line:

next if $element =~ /[^A-Za-zĄ-’]/;

This is doing a lot more than you want it to, I think. Basically, it means "ignore any $element containing a character not in the set defined between square brackets". It is therefore stripping out, for example, any 'word' with attached punctuation. For example, in a sentence such as:

"Shut up!" he said.

you are throwing away three quarters of your 'words'! And you are also, of course, ignoring hyphenated words.
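A quick way to see this for yourself is a minimal sketch along these lines. It assumes the line is split on single spaces and, for simplicity, leaves the accented-character range out of the class; @kept and @skipped are names I made up for illustration, not anything from your script.

```perl
use strict;
use warnings;

my $line = '"Shut up!" he said.';
my (@kept, @skipped);

foreach my $element (split / /, $line) {
    # Same test as the line in question, minus the accented range (assumption)
    if ($element =~ /[^A-Za-z]/) {
        push @skipped, $element;   # contains something other than a plain letter
    }
    else {
        push @kept, $element;
    }
}

print "kept:    @kept\n";      # kept:    he
print "skipped: @skipped\n";   # skipped: "Shut up!" said.
```

Only 'he' survives; the other three tokens each carry a quote mark, an exclamation mark or a full stop.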

It also means that the line:

$element =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g;

never actually does anything, with or without surplus backslashes...
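To convince yourself, here is a small sketch under the same simplifying assumption (only the ASCII part of the character class; the sample words are mine). Anything that gets past the next consists purely of letters, so the substitution has nothing left to remove:

```perl
use strict;
use warnings;

my $changed = 0;
foreach my $element (qw(Shut up! he said. well-known foo_bar word)) {
    next if $element =~ /[^A-Za-z]/;    # the filter discussed above (ASCII part only)
    # s///g returns the number of substitutions made, or false if there were none
    my $count = ($element =~ s/[\s,!?.\-_;)("']//g) || 0;
    $changed += $count;
}
print "substitutions made: $changed\n";   # substitutions made: 0
```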

hth

dave

Re: Re: Constructive criticism of a dictionary / text comparison script
by allolex (Curate) on Aug 30, 2003 at 08:56 UTC

    Oops!

    sub findwords {
        # 'or', not '||': '||' would bind to $file, so die would not fire on a failed open
        open my $if, "<", $file or die "Could not open $file: $!";
        while (<$if>) {
            chomp;
            my @elements = split(/[ '-]/, $_);    # split on hyphens, too
            foreach my $element (@elements) {
                next if $element =~ /\d/;         # Don't need digits
                $element = lc($element);
                $element =~ s/[\s,!?._;)("'-]//g; # thanks sauoq
                next if $element eq '';
                print "[$element]\n" if $token_debug;
                if ( exists $dictionary{$element} ) {
                    $dictionary{$element}++;
                }
                else {
                    $glossary{$element}++;
                }
            }
        }
    }

    Thanks a lot! I think that was another relic from a previous version. I'm glad you caught it.

    --
    Allolex