
Dear perlmonks!

Once again I have a question. Usually, when I write up a problem for the forum, I build smaller examples and adjust the code, and sometimes describing the task in writing is enough for me to see the solution myself. But this time I just can't do a very simple thing, and I have been stuck on it for two days.

This concerns sentence alignment. I have two files, one in English and one in a foreign language, with the same number of lines. Corresponding lines are translations of each other (the number of words per line may differ). Like this:

>>>FILE-EN>>>
The cat sees the dog 
The rat is in the cat  
The cat runs

>>>>>FILE-RU>>>>>>
Koshka vidit sobaku
Krisa v koshke
Koshka bezhit

For each English word in FILE-EN, I need to count the unique Russian words it could hypothetically be aligned to, that is, the distinct Russian words across all sentence pairs in which the English word occurs. For example, the word "The" occurs in every sentence and can be aligned to any Russian word, so $uniform{"The"} should be 7 (the word 'Koshka' occurs twice), but I get $uniform{"The"} = 8, which counts the repeated word twice.

So far I can only calculate the counts that include repeated words. What should I use: a hash of arrays of unique words? Or some trick with hashes? I commented out the stuff I tried (collecting only unique foreign words), which does not work :)

#!/usr/bin/perl
use strict;
use utf8;
use warnings;
use Data::Dumper;

open ENGLISH, "corpus.e" or die $!;
open FOREIGN, "corpus.f" or die $!;
my @sents_en; my @sents_f;
while (<ENGLISH>){
 chomp;
 push @sents_en, $_;
}
while (<FOREIGN>){
 chomp;
 push @sents_f, $_;
}

my %uniform;
my $k;#index of english/foreign sentence
for ($k = 0; $k <= $#sents_en; $k++){
   my @words_en; my @words_f;
   @words_en = map { split / / } $sents_en[$k];
   @words_f = map { split / / } $sents_f[$k];
   my $j;
   for ($j = 0; $j <= $#words_en; $j++ ){
    my $i;
    my %seen;
       for ($i = 0; $i <= $#words_f; $i++){
                #$seen{$words_f[$i]}++; #TRY TO COUNT UNIQUE WORDS
                if ( defined( $uniform{ $words_en[$j] } ) ) { # and !$seen{$words_f[$i]}) ) {

                    $uniform{ $words_en[$j] } ++;
                }
                else {
                    $uniform{ $words_en[$j]} = 1;
                }

       }
    }
}  
print Dumper \%uniform;

These are the numbers I get:

$VAR1 = {
          'the' => 6,
          'rat' => 3,
          'is' => 3,
          'cat' => 8,
          'dog' => 3,
          'in' => 3,
          'runs' => 2,
          'sees' => 3,
          'The' => 8
        };

...and I need the counts of unique words. Thank you in advance, and sorry for the long post :)
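
One possible approach to the "trick with hashes" asked about above is a hash of hashes: the outer key is the English word, the inner keys are the foreign words seen with it, and the unique count is simply the number of inner keys. This is only a minimal sketch, not a tested solution for the real corpus. The helper name count_unique_alignments is made up for illustration, the sentences are inlined instead of being read from corpus.e and corpus.f, and it assumes words are separated by whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): for every English
# word, collect the set of distinct foreign words it co-occurs with.
sub count_unique_alignments {
    my ($sents_en, $sents_f) = @_;
    my %seen;    # $seen{english_word}{foreign_word} = 1

    for my $k (0 .. $#{$sents_en}) {
        # split ' ' skips leading/trailing and repeated whitespace
        my @words_en = split ' ', $sents_en->[$k];
        my @words_f  = split ' ', $sents_f->[$k];

        for my $en (@words_en) {
            # Setting the same inner key twice has no effect,
            # so repeated foreign words are stored only once.
            $seen{$en}{$_} = 1 for @words_f;
        }
    }

    # The unique count is the number of keys in each inner hash.
    my %uniform;
    $uniform{$_} = keys %{ $seen{$_} } for keys %seen;
    return \%uniform;
}

# Sample data from the post, inlined for the sketch.
my @sents_en = ('The cat sees the dog', 'The rat is in the cat', 'The cat runs');
my @sents_f  = ('Koshka vidit sobaku', 'Krisa v koshke', 'Koshka bezhit');

my $uniform = count_unique_alignments(\@sents_en, \@sents_f);
printf "%-5s => %d\n", $_, $uniform->{$_} for sort keys %$uniform;
```

For the three sample pairs this gives 7 for "The", because 'Koshka' is stored once even though it appears in two sentences. The comparison is case-sensitive, as in the original script, so "The" and "the" are still counted as separate words.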

In reply to hash of unique words by Ninke
