Dear perlmonks!

Once again I have a question. Usually, when I write up a problem for the forum, I reduce it to a smaller example and adjust the code, and often I work out the answer myself just by describing the task in writing. But this time I can't get a very simple thing to work, and I have been stuck on it for two days.

This concerns sentence alignment. I have two files, one in English and one in another language, with the same number of lines; each line is a translation of the corresponding line in the other file (the number of words per line differs). Like this:

>>>FILE-EN>>>
The cat sees the dog 
The rat is in the cat  
The cat runs

>>>>>FILE-RU>>>>>>
Koshka vidit sobaku
Krisa v koshke
Koshka bezhit

For each English word in FILE-EN, I need to calculate the number of unique Russian words that it could hypothetically be aligned to, taken over all sentence pairs in which it occurs. In other words, it is the number of unique words on the Russian side of those pairs. For example, the word "The" occurs in every sentence and can be aligned to any Russian word, so $uniform{"The"} should be 7 (the word 'Koshka' occurs twice), but I get $uniform{"The"} = 8, which counts the repeated word twice.

So far I can only calculate the count that includes repeated words. What should I use: a hash of arrays of unique words? Or some trick with hashes? I have commented out what I tried (collecting only unique foreign words); it does not work :)

#!/usr/bin/perl
use strict;
use utf8;
use warnings;
use Data::Dumper;

open ENGLISH, "corpus.e" or die $!;
open FOREIGN, "corpus.f" or die $!;
my @sents_en; my @sents_f;
while (<ENGLISH>){
 chomp;
 push @sents_en, $_;
}
while (<FOREIGN>){
 chomp;
 push @sents_f, $_;
}

my %uniform;
my $k;#index of english/foreign sentence
for ($k = 0; $k <= $#sents_en; $k++){
   my @words_en; my @words_f;
   @words_en = map { split / / } $sents_en[$k];
   @words_f = map { split / / } $sents_f[$k];
   my $j;
   for ($j = 0; $j <= $#words_en; $j++ ){
    my $i;
    my %seen;
       for ($i = 0; $i <= $#words_f; $i++){
                #$seen{$words_f[$i]}++; #TRY TO COUNT UNIQUE WORDS
                if ( defined( $uniform{ $words_en[$j] } ) ) { # and !$seen{$words_f[$i]}) ) {

                    $uniform{ $words_en[$j] } ++;
                }
                else {
                    $uniform{ $words_en[$j]} = 1;
                }

       }
    }
}  
print Dumper \%uniform;

These are the numbers I get:

$VAR1 = {
          'the' => 6,
          'rat' => 3,
          'is' => 3,
          'cat' => 8,
          'dog' => 3,
          'in' => 3,
          'runs' => 2,
          'sees' => 3,
          'The' => 8
        };

...and I need the counts of unique words. Thank you in advance, and sorry for the long post :)
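
Update: one direction that seems to fit, as a minimal sketch rather than a finished solution, is a hash of hashes: the inner keys are the Russian words seen alongside each English word, so repeats collapse on their own. The name %aligned and the inline sample data are my own assumptions for illustration; the real script would read corpus.e and corpus.f as in the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Inline copies of the sample sentences from the post (an assumption for
# this sketch); the real script would read them from corpus.e / corpus.f.
my @sents_en = ('The cat sees the dog', 'The rat is in the cat', 'The cat runs');
my @sents_f  = ('Koshka vidit sobaku', 'Krisa v koshke', 'Koshka bezhit');

# %aligned is a hypothetical helper: English word => { Russian word => 1 }.
my %aligned;
for my $k (0 .. $#sents_en) {
    my @words_en = split ' ', $sents_en[$k];
    my @words_f  = split ' ', $sents_f[$k];
    for my $en (@words_en) {
        # Hash keys are unique, so a Russian word seen twice is stored once.
        $aligned{$en}{$_} = 1 for @words_f;
    }
}

# The number of distinct Russian words is the number of inner keys.
my %uniform;
$uniform{$_} = keys %{ $aligned{$_} } for keys %aligned;

printf "%-5s => %d\n", $_, $uniform{$_} for sort keys %uniform;
```

On the three sample sentences this gives 7 for "The" and 6 for "the". Matching is still case-sensitive, so "The" and "the" (and "Koshka" and "koshke") stay separate, as in the original script.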

In reply to hash of unique words by Ninke
