PerlMonks  

Re: Constructive criticism of a dictionary / text comparison script

by Not_a_Number (Prior)
on Aug 30, 2003 at 08:49 UTC


in reply to Constructive criticism of a dictionary / text comparison script

Hi allolex. There is a problem that nobody has yet mentioned. It concerns this line:

next if $element =~ /[^A-Za-zĄ-’]/;

This is doing a lot more than you want it to, I think. Basically, it means "ignore any $element containing a character not in the set defined between square brackets". It is therefore skipping, for example, any 'word' with attached punctuation. For example, in a sentence such as:

"Shut up!" he said.

you are throwing away three quarters of your 'words'! And you are also, of course, ignoring hyphenated words.
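
A quick way to see the effect is to run the filter over that sentence. This is only a minimal sketch: it assumes the line has been split on whitespace, it uses just the ASCII part of the character class, and kept_words is a hypothetical helper name, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the whitespace-separated tokens that
# survive the filter. Only the ASCII part of the class is used here.
sub kept_words {
    my ($line) = @_;
    my @kept;
    foreach my $element ( split ' ', $line ) {
        # Skip any token containing a character outside A-Z / a-z
        next if $element =~ /[^A-Za-z]/;
        push @kept, $element;
    }
    return @kept;
}

my @kept = kept_words(q{"Shut up!" he said.});
print "kept: @kept\n";    # prints "kept: he"
```

Of the four tokens, only he consists purely of letters, so it is the only one that gets through.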

It also means that the line:

$element =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g;

never actually does anything, with or without surplus backslashes: any $element containing one of those characters has already been skipped by the test above.
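
One way to check this is to count how many characters the substitution removes from the tokens that survive the filter. Again a minimal sketch under assumptions: whitespace splitting, the ASCII-only class, the filter running before the substitution, and a hypothetical helper named count_removed. The character class is written without the surplus backslashes.

```perl
use strict;
use warnings;

# Hypothetical helper: total number of characters that the
# punctuation-stripping substitution removes from tokens which
# have already passed the filter.
sub count_removed {
    my ($line) = @_;
    my $removed = 0;
    foreach my $element ( split ' ', $line ) {
        next if $element =~ /[^A-Za-z]/;    # the filter runs first
        # s///g returns the number of substitutions made (false if none)
        $removed += ( $element =~ s/[\s,!?._;)("'-]//g ) || 0;
    }
    return $removed;
}

print count_removed(q{"Shut up!" he said.}), "\n";    # prints 0
```

Anything that reaches the substitution contains letters only, so there is nothing left for it to strip.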

hth

dave

Replies are listed 'Best First'.
Re: Re: Constructive criticism of a dictionary / text comparison script
by allolex (Curate) on Aug 30, 2003 at 08:56 UTC

    Oops!

    sub findwords {
        # 'or' rather than '||', so that die fires when open fails
        open my $if, "<", $file or die "Could not open $file: $!";
        while (<$if>) {
            chomp;
            my @elements = split( /[ '-]/, $_ );    # split on hyphens, too
            foreach my $element (@elements) {
                next if $element =~ /\d/;           # Don't need digits
                $element = lc($element);
                $element =~ s/[\s,!?._;)("'-]//g;   # thanks sauoq
                next if $element eq '';
                print "[$element]\n" if $token_debug;
                if ( exists $dictionary{$element} ) {
                    $dictionary{$element}++;
                }
                else {
                    $glossary{$element}++;
                }
            }
        }
    }

    Thanks a lot! I think that was another relic from a previous version. I'm glad you caught it.

    --
    Allolex
