in reply to Re^11: statistics of a large text
in thread statistics of a large text

Thanks a lot! Very good points! I agree about "sub to_hash", but for "sub MI" I don't want the count of keys in the hashes. Instead of
my $prob_es = ( keys %$hash_es ) / 6939873; my $prob_en = ( keys %$hash_en ) / 6939873;
I want:
my $prob_es = ( $#{$hash_es{$string_es}} + 1) / 6939873; my $prob_en = ( $#{$hash_en{$string_en}} + 1 ) / 6939873;
which is the number of elements of the value array of each key!

The same goes for MI, I need the intersection between the array of both value arrays for the keys to the two hashes.

Re^13: statistics of a large text
by BrowserUk (Patriarch) on Feb 11, 2011 at 09:47 UTC
    the number of elements of the value array of each key!

    This: my $prob_es = ( $#{$hash_es{$string_es}} + 1) / 6939873;

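    is better written as: my $prob_es = ( @{ $hash_es{$string_es} } ) / 6939873;

    An array used in numeric (scalar) context evaluates to its number of elements, so the "last index + 1" arithmetic is unnecessary.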
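
    To make the difference between the three counts concrete, here is a minimal sketch. The toy hash and its keys are made up for illustration, and it assumes each key maps to a reference to an array (of line numbers, say), accessed through a hash reference as in sub MI:

    ```perl
    use strict;
    use warnings;

    # Hypothetical toy data: each key maps to an array ref of line numbers.
    # This shape is an assumption about the real hashes.
    my $hash_es = {
        casa  => [ 1, 4, 7 ],
        perro => [ 2 ],
    };

    my $string_es = 'casa';

    # Last index of the value array, plus one.
    my $by_last_index = $#{ $hash_es->{ $string_es } } + 1;

    # The same array evaluated in scalar context.
    my $by_scalar = scalar @{ $hash_es->{ $string_es } };

    # Number of keys in the whole hash, which is a different quantity.
    my $key_count = keys %$hash_es;

    print "last index + 1: $by_last_index\n";   # 3
    print "scalar context: $by_scalar\n";       # 3
    print "keys in hash:   $key_count\n";       # 2
    ```

    The first two always agree; the third only matches by coincidence.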

    The same goes for MI, I need the intersection between the array of both value arrays for the keys to the two hashes.

    Okay. Then I think these refactorings should do the trick and save considerable time and space in the process:

    sub MI {
        my( $string_es, $string_en, $hash_es, $hash_en ) = @_;

        # $hash_es and $hash_en are hash references, so dereference with ->
        my $prob_es = scalar( @{ $hash_es->{ $string_es } } ) / 6939873;
        my $prob_en = scalar( @{ $hash_en->{ $string_en } } ) / 6939873;

        my $intersection = Intersection( $hash_es->{ $string_es }, $hash_en->{ $string_en } );
        my $prob_es_en = $intersection / 6939873;

        # Smooth the joint probability with the product of the marginals
        $prob_es_en = ( $prob_es_en + ( $prob_es * $prob_en * 0.1 ) ) / 1.1;
        my $mi = $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en ) );
        return $mi;
    }

    sub Intersection {
        my( $refA, $refB ) = @_;
        my %counts;
        ++$counts{ $_ } for @$refA;
        ++$counts{ $_ } for @$refB;

        # A value seen more than once is taken to be in both arrays
        my $intersects = 0;
        $counts{ $_ } > 1 and ++$intersects for keys %counts;
        return $intersects;
    }
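
    One caveat with the counting approach in Intersection: it relies on no value being repeated within either array, since a value appearing twice in one array would be counted as shared. If your value arrays can contain repeats, something along these lines may be safer. This is only a sketch; intersection_unique is a hypothetical name, not part of the code above:

    ```perl
    use strict;
    use warnings;

    # Hypothetical variant: count the distinct values present in both
    # arrays, even if a value is repeated within one of them.
    sub intersection_unique {
        my( $refA, $refB ) = @_;

        my %in_a = map { $_ => 1 } @$refA;   # lookup of values seen in A
        my %seen;                            # values already counted
        my $intersects = 0;

        for my $value ( @$refB ) {
            next unless $in_a{ $value };     # not present in A
            next if $seen{ $value }++;       # counted on an earlier pass
            ++$intersects;
        }
        return $intersects;
    }

    my $common  = intersection_unique( [ 1, 4, 7 ], [ 4, 7, 9 ] );
    my $nothing = intersection_unique( [ 3, 3 ],    [ 5 ] );

    print "$common\n";    # 2
    print "$nothing\n";   # 0
    ```

    If the arrays hold unique line numbers, as they appear to here, the original Intersection gives the same answer and there is no need to change it.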

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.