hash of unique words

Ninke has asked for the wisdom of the Perl Monks concerning the following question:

Dear perlmonks!

Once again I have a question. Usually, when I write up a problem for the forum, I reduce it to smaller examples and adjust the code, and describing the task in writing is sometimes enough for me to see the solution myself. But this time I just can't do a very simple thing, and I have been stuck on it for two days.

This concerns sentence alignment. I have two files, one in English and one in a foreign language, with the same number of lines; the lines are translations of each other (the number of words per line differs). Like this:

>>>FILE-EN>>>
The cat sees the dog 
The rat is in the cat  
The cat runs

>>>>>FILE-RU>>>>>>
Koshka vidit sobaku
Krisa v koshke
Koshka bezhit

For each sentence pair in English and Russian and for each English word from FILE-EN I need to calculate the number of unique Russian words that this English word could hypothetically be aligned to. In other words, it is the number of unique words on the Russian side, taken over all the sentences in which the English word occurs. For example, the word "The" occurs in each sentence and can be aligned to any Russian word, so $uniform{"The"} should be 7 (the word 'Koshka' occurs twice), but I get $uniform{"The"} = 8, which counts the repeated word.

So far I can only calculate the number of non-unique words. What shall I use: a hash of arrays of unique words? Or some trick with hashes? I commented out the stuff I have tried (collecting only unique foreign words); it does not work:)

#!/usr/bin/perl
use strict;
use utf8;
use warnings;
use Data::Dumper;

open ENGLISH, "corpus.e" or die $!;
open FOREIGN, "corpus.f" or die $!;
my @sents_en; my @sents_f;
while (<ENGLISH>){
 chomp;
 push @sents_en, $_;
}
while (<FOREIGN>){
 chomp;
 push @sents_f, $_;
}

my %uniform;
my $k;#index of english/foreign sentence
for ($k = 0; $k <= $#sents_en; $k++){
   my @words_en; my @words_f;
   @words_en = map { split / / } $sents_en[$k];
   @words_f = map { split / / } $sents_f[$k];
   my $j;
   for ($j = 0; $j <= $#words_en; $j++ ){
    my $i;
    my %seen;
       for ($i = 0; $i <= $#words_f; $i++){
                #$seen{$words_f[$i]}++; #TRY TO COUNT UNIQUE WORDS
                if ( defined( $uniform{ $words_en[$j] } ) ) { # and !$seen{$words_f[$i]}) ) {

                    $uniform{ $words_en[$j] } ++;
                }
                else {
                    $uniform{ $words_en[$j]} = 1;
                }

       }
    }
}  
print Dumper \%uniform;

These are the numbers I get:

$VAR1 = {
          'the' => 6,
          'rat' => 3,
          'is' => 3,
          'cat' => 8,
          'dog' => 3,
          'in' => 3,
          'runs' => 2,
          'sees' => 3,
          'The' => 8
        };

...and I need the counts for unique words. Thank you in advance and sorry for too many letters:)


Re: hash of unique words by choroba (Cardinal) on Apr 22, 2013 at 16:37 UTC
Rather than a hash of arrays, use a hash of hashes. At the end, replace each inner hash with its number of keys:

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

open my $ENGLISH, '<', 'corpus.e' or die $!;
open my $FOREIGN, '<', 'corpus.f' or die $!;
chomp(my @sents_en = <$ENGLISH>);
chomp(my @sents_f  = <$FOREIGN>);

my %uniform;
for my $sentence_index (0 .. $#sents_en) {
    my @words_en = split ' ', $sents_en[$sentence_index];
    my @words_f  = split ' ', $sents_f[$sentence_index];
    for my $word_index (0 .. $#words_en) {
        $uniform{ $words_en[$word_index] }{$_}++ for @words_f;
    }
}

for my $word (keys %uniform) {
    $uniform{$word} = keys %{ $uniform{$word} };
}
print Dumper \%uniform;

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
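For anyone who wants to try the hash-of-hashes idea without creating corpus.e and corpus.f, here is a minimal sketch that inlines the three sentence pairs from the question. The names %candidates and $word_en are illustrative and do not appear in the thread; the sketch also keeps the counts in a separate hash rather than overwriting the inner hashes, which is just one possible variation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample sentence pairs from the question, inlined instead of read from files.
my @sents_en = ('The cat sees the dog', 'The rat is in the cat', 'The cat runs');
my @sents_f  = ('Koshka vidit sobaku',  'Krisa v koshke',        'Koshka bezhit');

# %candidates (hypothetical name): English word => { Russian word => times seen }
my %candidates;
for my $k (0 .. $#sents_en) {
    my @words_f = split ' ', $sents_f[$k];
    for my $word_en (split ' ', $sents_en[$k]) {
        # Hash keys are unique, so a repeated Russian word only bumps a counter.
        $candidates{$word_en}{$_}++ for @words_f;
    }
}

# Number of distinct Russian words per English word.
my %uniform = map { $_ => scalar keys %{ $candidates{$_} } } keys %candidates;

printf "%-5s => %d\n", $_, $uniform{$_} for sort keys %uniform;
```

On this data "The" comes out as 7 and "the" as 6, because the keys are case-sensitive; lowercasing the words with lc before using them as keys would merge the two if that is what is wanted.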
Re^2: hash of unique words by Ninke (Novice) on Apr 22, 2013 at 17:50 UTC
Choroba, thanks very much, that does exactly what I want. A nice trick with {$_} for @words_f, which halved the number of lines:) Though I don't fully understand the magic yet, especially how a hash of hashes ($uniform{ $words_en[$word_index] }{$_}) turns into a one-dimensional hash: $uniform{$word}. I just need to use it in practice and then I'll get it:)
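On the question of how the hash of hashes becomes a flat hash: the last loop in the reply overwrites each inner hash reference with a number. A small sketch of that one step, using a made-up inner hash for "The" (the data and the names @russian and $count are only for illustration):

```perl
use strict;
use warnings;

# One entry as it looks after the counting loop: the value is a hash reference.
my %uniform = ( The => { Koshka => 2, vidit => 1, sobaku => 1 } );

# In list context, keys returns the keys themselves.
my @russian = sort keys %{ $uniform{The} };

# Assigning to a scalar puts keys in scalar context, which yields the key count.
my $count = keys %{ $uniform{The} };

# Storing the count in the same slot discards the inner hash,
# so %uniform now maps a word to a plain number.
$uniform{The} = $count;

print "The => $uniform{The}\n";    # prints "The => 3"
```

So nothing is converted automatically; each slot simply gets a new value, and the old inner hash is freed once nothing refers to it.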

