in reply to Re^12: statistics of a large text
in thread statistics of a large text
the number of elements of the value array of each key!
This: my $prob_es = ( $#{$hash_es{$string_es}} + 1) / 6939873;
is better written as: my $prob_es = ( @{ $hash_es{$string_es} } ) / 6939873;
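A quick check, on made-up data, that the two forms give the same count. The hash contents and the key below are hypothetical, and `scalar` is added only to make the context explicit; the division on its own already imposes scalar context on the array.

```perl
use strict;
use warnings;

# Hypothetical data: each key maps to an array of line numbers.
my %hash_es   = ( casa => [ 3, 17, 42 ] );
my $string_es = 'casa';

my $via_last_index = $#{ $hash_es{$string_es} } + 1;     # last index + 1
my $via_scalar     = scalar @{ $hash_es{$string_es} };   # explicit element count
my $via_division   = ( @{ $hash_es{$string_es} } ) / 1;  # division forces scalar context

print "$via_last_index $via_scalar $via_division\n";     # prints "3 3 3"
```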
The same goes for MI, I need the intersection between the array of both value arrays for the keys to the two hashes.
Okay. Then I think these refactorings should do the trick and save considerable time and space in the process:
sub MI {
    my( $string_es, $string_en, $hash_es, $hash_en ) = @_;

    # $hash_es and $hash_en are hash references, so dereference with ->
    my $prob_es = ( @{ $hash_es->{ $string_es } } ) / 6939873;
    my $prob_en = ( @{ $hash_en->{ $string_en } } ) / 6939873;

    my $intersection = Intersection( $hash_es->{ $string_es }, $hash_en->{ $string_en } );
    my $prob_es_en = ( $intersection ) / 6939873;

    # Smooth the joint probability towards the independence estimate
    $prob_es_en = ( $prob_es_en + ( $prob_es * $prob_en * 0.1 ) ) / 1.1;

    my $mi = $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en ) );
    return $mi;
}

sub Intersection {
    my( $refA, $refB ) = @_;

    # Count how often each value occurs across both arrays
    my %counts;
    ++$counts{ $_ } for @$refA;
    ++$counts{ $_ } for @$refB;

    # A value seen more than once is counted as shared
    # (this assumes neither array repeats a value)
    my $intersects = 0;
    $counts{ $_ } > 1 and ++$intersects for keys %counts;
    return $intersects;
}
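For trying the arithmetic on a small case, here is a self-contained sketch on toy data. The names `count_common` and `mi_score`, the sample hashes, and the corpus size of 10 are all hypothetical and only for illustration; the smoothing and the log term follow the same formula as above, but with the corpus size passed in rather than hard-coded. It assumes each value array holds no repeated values.

```perl
use strict;
use warnings;

# Hypothetical helper: count the values present in both array refs.
# Assumes neither array contains repeated values.
sub count_common {
    my ( $refA, $refB ) = @_;
    my %seen = map { $_ => 1 } @$refA;
    return scalar grep { $seen{$_} } @$refB;
}

# Hypothetical helper: same arithmetic as MI, with the corpus size as a parameter.
sub mi_score {
    my ( $n_es, $n_en, $n_both, $total ) = @_;
    my $p_es   = $n_es / $total;
    my $p_en   = $n_en / $total;
    my $p_both = ( $n_both / $total + $p_es * $p_en * 0.1 ) / 1.1;
    return $p_both * log( $p_both / ( $p_es * $p_en ) );
}

# Toy data: line numbers on which each word occurs.
my %hash_es = ( casa  => [ 1, 2, 5, 8 ] );
my %hash_en = ( house => [ 2, 5, 9 ] );

my $common = count_common( $hash_es{casa}, $hash_en{house} );   # lines 2 and 5
my $mi     = mi_score(
    scalar @{ $hash_es{casa} },
    scalar @{ $hash_en{house} },
    $common,
    10,    # hypothetical corpus size
);

printf "common=%d mi=%.4f\n", $common, $mi;
```

Building the lookup hash from one array and grepping the other avoids walking the keys of a combined hash afterwards, which may matter when the value arrays are long.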