Re^13: statistics of a large text

the number of elements of the value array of each key!

This: my $prob_es = ( $#{$hash_es{$string_es}} + 1) / 6939873;

is better written as: my $prob_es = ( @{ $hash_es{$string_es} } ) / 6939873;
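
A quick check that the two spellings agree, using a made-up hash. An array dereference used as an operand of division is evaluated in scalar context and so yields its element count, which makes the `$#{...} + 1` arithmetic unnecessary. The hash contents and the divisor 4 are illustrative only, not taken from the real data.

```perl
use strict;
use warnings;

# Hypothetical data: each key maps to an array ref of line numbers.
my %hash_es   = ( casa => [ 3, 17, 42 ] );
my $string_es = 'casa';

# Last index plus one ...
my $count_old = $#{ $hash_es{ $string_es } } + 1;

# ... versus the array itself in scalar context.
my $count_new = scalar @{ $hash_es{ $string_es } };

# Division imposes scalar context on its operands; the parens only group.
my $prob = ( @{ $hash_es{ $string_es } } ) / 4;

print "$count_old $count_new $prob\n";    # prints "3 3 0.75"
```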

The same goes for MI, I need the intersection between the array of both value arrays for the keys to the two hashes.

Okay. Then I think these refactorings should do the trick and save considerable time and space in the process:

sub MI {
    # $hash_es and $hash_en are hash references: call as
    # MI( $string_es, $string_en, \%hash_es, \%hash_en )
    my( $string_es, $string_en, $hash_es, $hash_en ) = @_;

    # An array in scalar context yields its element count
    my $prob_es = ( @{ $hash_es->{ $string_es } } ) / 6939873;
    my $prob_en = ( @{ $hash_en->{ $string_en } } ) / 6939873;

    my $intersection = Intersection(
        $hash_es->{ $string_es }, $hash_en->{ $string_en }
    );

    my $prob_es_en = ( $intersection ) / 6939873;

    $prob_es_en = ( $prob_es_en + ( $prob_es * $prob_en * 0.1 ) ) / 1.1;

    my $mi = $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en ) );

    return $mi;
}

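sub Intersection {
    my( $refA, $refB ) = @_;

    # Tally every value seen in either array
    my %counts;
    ++$counts{ $_ } for @$refA;
    ++$counts{ $_ } for @$refB;

    # A tally above 1 means the value is in both arrays
    # (assumes no value repeats within a single array)
    my $intersects = 0;
    $counts{ $_ } > 1 and ++$intersects for keys %counts;

    return $intersects;
}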
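
A toy run of the same arithmetic, as a sketch only. It assumes each hash maps a word to an array ref of the sentence numbers in which it occurs (my guess at the data layout), and it takes the corpus size as an extra parameter instead of the hard-coded 6939873. `MI_toy`, `$total` and the sample data are all made up for illustration.

```perl
use strict;
use warnings;

# Count values present in both arrays (assumes no duplicates within either).
sub Intersection {
    my( $refA, $refB ) = @_;
    my %counts;
    ++$counts{ $_ } for @$refA, @$refB;
    return scalar grep { $_ > 1 } values %counts;
}

# Same arithmetic as MI above, but with the corpus size passed in.
# $total is an assumption made for this toy example.
sub MI_toy {
    my( $string_es, $string_en, $hash_es, $hash_en, $total ) = @_;

    my $prob_es    = @{ $hash_es->{ $string_es } } / $total;
    my $prob_en    = @{ $hash_en->{ $string_en } } / $total;
    my $prob_es_en = Intersection(
        $hash_es->{ $string_es }, $hash_en->{ $string_en }
    ) / $total;

    # Smooth the joint probability so log() never sees zero
    $prob_es_en = ( $prob_es_en + ( $prob_es * $prob_en * 0.1 ) ) / 1.1;

    return $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en ) );
}

# Hypothetical data: word => sentence numbers in which it occurs.
my %hash_es = ( casa  => [ 1, 2, 5, 8 ], perro => [ 3, 4 ] );
my %hash_en = ( house => [ 1, 2, 5, 9 ], dog   => [ 3, 4, 7 ] );

my $shared  = Intersection( $hash_es{casa}, $hash_en{house} );
my $mi_pair = MI_toy( 'casa', 'house', \%hash_es, \%hash_en, 10 );
my $mi_miss = MI_toy( 'casa', 'dog',   \%hash_es, \%hash_en, 10 );

printf "shared=%d  MI(casa,house)=%.4f  MI(casa,dog)=%.4f\n",
    $shared, $mi_pair, $mi_miss;
```

With this data the co-occurring pair scores positive and the pair that never shares a sentence scores negative, because the smoothing term leaves its joint probability at one eleventh of the product of the marginals.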

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.
