PerlMonks  

Dear perlmonks!

Once again I have a question. Usually, when I write up a problem for the forum, I come up with smaller examples and adjust the code, and sometimes I work out the solution myself just by describing the task in writing. But this time I can't manage a very simple thing, and I have been stuck on it for two days.

This concerns sentence alignment. I have two files, one in English and one in another language, with the same number of lines; corresponding lines are translations of each other (the number of words per line may differ). Like this:

    >>>FILE-EN>>>
    The cat sees the dog
    The rat is in the cat
    The cat runs

    >>>>>FILE-RU>>>>>>
    Koshka vidit sobaku
    Krisa v koshke
    Koshka bezhit

For each English word in FILE-EN, across all sentence pairs, I need to count the unique Russian words that this English word could hypothetically be aligned to. In other words, it is the number of distinct words on the Russian side of the sentences in which the English word occurs. For example, the word "The" occurs in every sentence and can be aligned to any Russian word, so $uniform{"The"} should be 7 (the word 'Koshka' occurs twice), but I get $uniform{"The"} = 8, because repeated words are counted again.

So far I can only calculate the count that includes repeated words. What should I use: a hash of arrays of unique words? Or some trick with hashes? I commented out the stuff I have tried (collecting only unique foreign words); it does not work :)

    #!/usr/bin/perl
    use strict;
    use utf8;
    use warnings;
    use Data::Dumper;

    open ENGLISH, "corpus.e" or die $!;
    open FOREIGN, "corpus.f" or die $!;

    my @sents_en;
    my @sents_f;

    while (<ENGLISH>) {
        chomp;
        push @sents_en, $_;
    }
    while (<FOREIGN>) {
        chomp;
        push @sents_f, $_;
    }

    my %uniform;
    my $k;    # index of english/foreign sentence
    for ($k = 0; $k <= $#sents_en; $k++) {
        my @words_en;
        my @words_f;
        @words_en = map { split / / } $sents_en[$k];
        @words_f  = map { split / / } $sents_f[$k];
        my $j;
        for ($j = 0; $j <= $#words_en; $j++) {
            my $i;
            my %seen;
            for ($i = 0; $i <= $#words_f; $i++) {
                #$seen{$words_f[$i]}++; #TRY TO COUNT UNIQUE WORDS
                if ( defined( $uniform{ $words_en[$j] } ) ) { # and !$seen{$words_f[$i]}) ) {
                    $uniform{ $words_en[$j] }++;
                }
                else {
                    $uniform{ $words_en[$j] } = 1;
                }
            }
        }
    }
    print Dumper \%uniform;

These are the numbers I get:

    $VAR1 = {
        'the'  => 6,
        'rat'  => 3,
        'is'   => 3,
        'cat'  => 8,
        'dog'  => 3,
        'in'   => 3,
        'runs' => 2,
        'sees' => 3,
        'The'  => 8
    };

...and I need the counts for unique words. Thank you in advance, and sorry for so many words :)
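
One possible way to get the unique counts is a hash of hashes: record every English/foreign word pair once, then count the inner keys at the end. This is only a sketch. It assumes words are compared as exact, case-sensitive strings (so 'Koshka' and 'koshke' stay distinct, as do 'The' and 'the'). The helper name count_unique_alignments and the in-memory sample arrays are illustrative and not part of the original post, which reads corpus.e and corpus.f.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): for each English word,
# count the distinct foreign words that appear in the aligned lines.
sub count_unique_alignments {
    my ($sents_en, $sents_f) = @_;

    # $seen{$english_word}{$foreign_word} = 1
    # This hash lives outside the sentence loop, so a foreign word that
    # shows up again in a later sentence is not counted a second time.
    my %seen;

    for my $k (0 .. $#{$sents_en}) {
        my @words_en = split ' ', $sents_en->[$k];    # split on whitespace
        my @words_f  = split ' ', $sents_f->[$k];
        for my $en (@words_en) {
            $seen{$en}{$_} = 1 for @words_f;
        }
    }

    # The number of inner keys is the number of distinct foreign words.
    my %uniform;
    $uniform{$_} = scalar keys %{ $seen{$_} } for keys %seen;
    return \%uniform;
}

# Sample data copied from the post, held in memory instead of files.
my @sents_en = ('The cat sees the dog', 'The rat is in the cat', 'The cat runs');
my @sents_f  = ('Koshka vidit sobaku', 'Krisa v koshke', 'Koshka bezhit');

my $uniform = count_unique_alignments(\@sents_en, \@sents_f);
printf "%-5s %d\n", $_, $uniform->{$_} for sort keys %$uniform;
```

On the sample sentences this gives 7 for 'The' and 6 for 'the', because each foreign word is stored at most once per English word, no matter how many sentences it appears in. The posted code resets %seen for every English word in every sentence, which is why the commented-out attempt could not remember 'Koshka' from the first sentence when it reached the third.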


In reply to hash of unique words by Ninke
