while the number of digits corresponding to each term can vary form 1 to n. currently it takes a long time to calculate the $MI_T. (sometimes 20-25 seconds). is there anyway to improve the code in order to calculate it faster?
The size of the strings (values) in your hashes must be quite huge. I knocked up a simple test rig and found that each hash value had to contain ~3 million line numbers for it to take 25 seconds:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
sub MI {
my( $string_es, $string_en, $hash_es, $hash_en ) = @_;
my @array_es = split ' ', $hash_es->{ $string_es };
my @array_en = split ' ', $hash_en->{ $string_en };
my $prob_es = ( @array_es ) / 6939873;
my $prob_en = ( @array_en ) / 6939873;
my $intersection = Intersection( \@array_es, \@array_en );
my $prob_es_en= $intersection / 6939873;
$prob_es_en = ($prob_es_en + ($prob_es * $prob_en * 0.1 ) ) / 1.1;
my $mi = $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en) )
+;
return $mi;
}
sub Intersection {
my( $refA, $refB ) = @_;
my %counts;
++$counts{ $_ } for @$refA;
++$counts{ $_ } for @$refB;
my $intersects = 0;
$counts{ $_ } > 1 and ++$intersects for keys %counts;
return $intersects;
}
our $N //= 1e4;
my $hypo = 'fred';
my $text = 'bill';
my %hash_es; $hash_es{ $hypo } = join ' ', 1 .. $N;
my %hash_en; $hash_en{ $text } = join ' ', 1 .. $N;
my $start = time;
my $MI_T = MI( $hypo, $text, \%hash_es, \%hash_en );
printf "Took: %f seconds\n", time() - $start;
__END__
C:\test>888162
Took: 0.046187 seconds
C:\test>888162 -N=1e5
Took: 0.677000 seconds
C:\test>888162 -N=1e6
Took: 7.680000 seconds
C:\test>888162 -N=2e6
Took: 15.748000 seconds
C:\test>888162 -N=3e6
Took: 25.244000 seconds
What I noticed is that the vast majority of that time is spent allocating memory. For the arrays into which you split the line numbers in order to count them; and the hash that you need to intersect them. Is it possible to avoid some or all of that memory allocation?
You first split the strings into arrays, in order to count the number of line numbers:
my @array_es = split ' ', $hash_es->{ $string_es };
my $prob_es = ( @array_es ) / 6939873;
If you ensure that the numbers in the string are separated by exactly 1 space, then you can obtain the counts of numbers very quickly by counting the number of spaces and adding 1: my $n_array_es = $hash_es->{ $string_es } =~ tr[ ][ ];
++$n_array_es;
Of course, that creates a different problem later when it comes time to intersect those sets of numbers. So rather than creating two arrays (with the associated memory allocations) in order to do the intersection, I tried using m//g to iterate the numbers and so set values in the hash: sub Intersection {
my( $refA, $refB ) = @_;
my %counts;
++$counts{ $1 } while $$refA =~ m[(\S+)]g;
my $intersects = 0;
exists $counts{ $1 } and ++$intersects while $$refB =~ m[(\S+)]g;
return $intersects;
}
Putting that together: #! perl -slw
use strict;
use Time::HiRes qw[ time ];
sub MI {
my( $string_es, $string_en, $hash_es, $hash_en ) = @_;
my $n_array_es = $hash_es->{ $string_es } =~ tr[ ][ ];
++$n_array_es;
my $n_array_en = $hash_en->{ $string_en } =~ tr[ ][ ];
++$n_array_en;
my $prob_es = ( $n_array_es ) / 6939873;
my $prob_en = ( $n_array_en ) / 6939873;
## Notice I am passing references to the hash values here!
my $intersection = Intersection(
\$hash_es->{ $string_es },
\$hash_en->{ $string_en }
);
my $prob_es_en= $intersection / 6939873;
$prob_es_en = ( $prob_es_en + ( $prob_es * $prob_en * 0.1 ) ) / 1.
+1;
my $mi = $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en) )
+;
return $mi;
}
sub Intersection {
my( $refA, $refB ) = @_;
// And dereferencing them here!!
my %counts;
++$counts{ $1 } while $$refA =~ m[(\S+)]g;
my $intersects = 0;
exists $counts{ $1 } and ++$intersects while $$refB =~ m[(\S+)]g;
return $intersects;
}
our $N //= 1e4;
my $hypo = 'fred';
my $text = 'bill';
my %hash_es; $hash_es{ $hypo } = join ' ', 1 .. $N;
my %hash_en; $hash_en{ $text } = join ' ', 1 .. $N;
my $start = time;
my $MI_T = MI( $hypo, $text, \%hash_es, \%hash_en );
printf "Took: %f seconds\n", time() - $start;
__END__
C:\test>888162
Took: 0.103639 seconds
C:\test>888162 -N=1e6
Took: 4.962000 seconds
C:\test>888162 -N=3e6
Took: 15.616000 seconds
And it knocked about 40% off the time. Is that enough?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
|