perl_lover_always has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I have the following code:
$MI_T = MI( $hypo, $text, \%hash_es, \%hash_en );

sub MI {
    my ($string_es, $string_en, $hash_es, $hash_en) = @_;
    my @array_es = my @array_en = ();
    @array_es = split ' ', $hash_es{$string_es};
    @array_en = split ' ', $hash_en{$string_en};
    my $prob_es = ($#array_es + 1) / 6939873;
    my $prob_en = ($#array_en + 1) / 6939873;
    my $intersection = Intersection( \@array_es, \@array_en );
    my $prob_es_en = $intersection / 6939873;
    $prob_es_en = ($prob_es_en + ($prob_es * $prob_en * 0.1)) / 1.1;
    my $mi = $prob_es_en * log( $prob_es_en / ($prob_es * $prob_en) );
    return $mi;
}

sub Intersection {
    my ( $refA, $refB ) = @_;
    my %counts;
    ++$counts{ $_ } for @$refA;
    ++$counts{ $_ } for @$refB;
    my $intersects = 0;
    $counts{ $_ } > 1 and ++$intersects for keys %counts;
    return $intersects;
}
my hash is like this: (term is the key and the numbers are value of each term)
term1 1 3 5 8 9 15 90
term2 23 56 789 23 1 54 89
term3 23 345 677 456 23
....
termn 54 6 768 5678 34 56 78
The number of values corresponding to each term can vary from 1 to n.
Currently it takes a long time to calculate $MI_T
(sometimes 20-25 seconds). Is there any way to improve the code in order to calculate it faster? Thanks!

Replies are listed 'Best First'.
Re: improving the speed
by moritz (Cardinal) on Feb 15, 2011 at 09:27 UTC

    First step: profile your code (for example with Devel::NYTProf) and see where the actual slowness comes from.

    From a quick glance it seems that your code is linear in the number of elements in each array, so only minor improvements can be made.

    In sub Intersection, if you can assume that neither @$refA nor @$refB contains the same item twice, you can rewrite the sub to:

    sub Intersection {
        my ( $refA, $refB ) = @_;
        my %counts;
        ++$counts{ $_ } for @$refA;
        my $intersects = 0;
        $counts{ $_ } and ++$intersects for @$refB;
        return $intersects;
    }
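    A quick sanity check of that version (a minimal, made-up example, not from the post above): with duplicate-free lists, the one-pass rewrite only builds a hash from the first list and still counts the common elements correctly:

    ```perl
    use strict;
    use warnings;

    # One-pass intersection count: assumes neither list repeats an item.
    sub Intersection {
        my ( $refA, $refB ) = @_;
        my %counts;
        ++$counts{ $_ } for @$refA;
        my $intersects = 0;
        $counts{ $_ } and ++$intersects for @$refB;
        return $intersects;
    }

    my @es = ( 1, 3, 5, 8, 9 );
    my @en = ( 3, 8, 42 );
    print Intersection( \@es, \@en ), "\n";   # prints 2 (common: 3 and 8)
    ```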
Re: improving the speed
by cdarke (Prior) on Feb 15, 2011 at 11:02 UTC
    Do you realise you are using the global hashes in your subroutine instead of the references being passed?
    @array_es = split ' ', $hash_es{$string_es};
    @array_en = split ' ', $hash_en{$string_en};
    It is probably a bad idea to use the same name for your reference as your global hashes. The code should be:
    @array_es = split ' ', $hash_es->{$string_es};
    @array_en = split ' ', $hash_en->{$string_en};
    I doubt it would make any difference to the performance however.

    In theory ($#array_es+1) is slightly slower than (@array_es) used in scalar context, but again I doubt it would make much difference.
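    A small illustration of the two counting forms (a hypothetical sketch, not from cdarke's post):

    ```perl
    use strict;
    use warnings;

    my @array_es = split ' ', '1 3 5 8 9 15 90';

    # $#array_es is the index of the last element, so the count is $#array_es + 1 ...
    my $count_by_index = $#array_es + 1;

    # ... but an array evaluated in scalar (numeric) context is already the count.
    my $count_by_context = @array_es;

    print "$count_by_index $count_by_context\n";   # prints "7 7"
    ```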
Re: improving the speed
by BrowserUk (Patriarch) on Feb 15, 2011 at 20:36 UTC

    Switching from using a hash to a bit vector to perform the intersection saves well over half of the remaining time; so ~6 seconds rather than 25:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    sub MI {
        my( $string_es, $string_en, $hash_es, $hash_en ) = @_;
        my $n_array_es = $hash_es->{ $string_es } =~ tr[ ][ ]; ++$n_array_es;
        my $n_array_en = $hash_en->{ $string_en } =~ tr[ ][ ]; ++$n_array_en;
        my $prob_es = ( $n_array_es ) / 6939873;
        my $prob_en = ( $n_array_en ) / 6939873;
        my $intersection = Intersection(
            \$hash_es->{ $string_es }, \$hash_en->{ $string_en } );
        my $prob_es_en = $intersection / 6939873;
        $prob_es_en = ( $prob_es_en + ( $prob_es * $prob_en * 0.1 ) ) / 1.1;
        my $mi = $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en ) );
        return $mi;
    }

    sub Intersection {
        my( $refA, $refB ) = @_;
        my $bits = '';
        vec( $bits, $1, 1 ) = 1 while $$refA =~ m[(\S+)]g;
        my $intersects = 0;
        vec( $bits, $1, 1 ) && ++$intersects while $$refB =~ m[(\S+)]g;
        return $intersects;
    }

    our $N //= 1e4;
    my $hypo = 'fred';
    my $text = 'bill';
    my %hash_es; $hash_es{ $hypo } = join ' ', 1 .. $N;
    my %hash_en; $hash_en{ $text } = join ' ', 1 .. $N;

    my $start = time;
    my $MI_T = MI( $hypo, $text, \%hash_es, \%hash_en );
    printf "Took: %f seconds\n", time() - $start;

    __END__
    C:\test>888162 -N=3e6
    Took: 5.966000 seconds

    Any further savings would probably need to come from where you build the lists of line numbers. I.e. if you built a bit vector for each keyword, where a set bit indicates that the keyword was found in that line, rather than concatenating the line numbers into a string. You can then count the populations using:

    my $n_es = unpack '%32b*', $hash_es->{ $string_es };
    my $n_en = unpack '%32b*', $hash_en->{ $string_en };

    And perform the intersection using:

    my $intersections = unpack '%32b*',
        $hash_es->{ $string_es } & $hash_en->{ $string_en };

    Both of which are extremely fast operations. If you put that all together, then you reduce the 25 seconds to 2 milliseconds, which is roughly 12,500 times faster:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    sub MI {
        my( $string_es, $string_en, $hash_es, $hash_en ) = @_;
        my $n_es = unpack '%32b*', $hash_es->{ $string_es };
        my $n_en = unpack '%32b*', $hash_en->{ $string_en };
        my $prob_es = ( $n_es ) / 6939873;
        my $prob_en = ( $n_en ) / 6939873;
        my $intersection = unpack '%32b*',
            $hash_es->{ $string_es } & $hash_en->{ $string_en };
        my $prob_es_en = $intersection / 6939873;
        $prob_es_en = ( $prob_es_en + ( $prob_es * $prob_en * 0.1 ) ) / 1.1;
        my $mi = $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en ) );
        return $mi;
    }

    our $N //= 1e4;
    my $hypo = 'fred';
    my $text = 'bill';
    my %hash_es; $hash_es{ $hypo } = ''; vec( $hash_es{ $hypo }, $_, 1 ) = 1 for 1 .. $N;
    my %hash_en; $hash_en{ $text } = ''; vec( $hash_en{ $text }, $_, 1 ) = 1 for 1 .. $N;

    my $start = time;
    my $MI_T = MI( $hypo, $text, \%hash_es, \%hash_en );
    printf "Took: %f seconds\n", time() - $start;

    __END__
    C:\test>888162 -N=3e6
    Took: 0.001901 seconds
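    For anyone unfamiliar with vec and unpack's bit-checksum format, here is a tiny standalone sketch (the line numbers are made up, not from the thread) of how the counts and the intersection fall out:

    ```perl
    use strict;
    use warnings;

    # Build one bit vector per keyword: bit N set means "appears on line N".
    my $es = '';
    vec( $es, $_, 1 ) = 1 for 1, 3, 5, 8;

    my $en = '';
    vec( $en, $_, 1 ) = 1 for 3, 8, 9;

    # Population counts via unpack's checksum-of-bits format.
    my $n_es = unpack '%32b*', $es;          # 4 set bits
    my $n_en = unpack '%32b*', $en;          # 3 set bits

    # Set intersection is just bitwise AND, then another population count.
    my $common = unpack '%32b*', $es & $en;  # 2 set bits (lines 3 and 8)

    print "$n_es $n_en $common\n";           # prints "4 3 2"
    ```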

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: improving the speed
by BrowserUk (Patriarch) on Feb 15, 2011 at 18:08 UTC
    The number of values corresponding to each term can vary from 1 to n. Currently it takes a long time to calculate $MI_T (sometimes 20-25 seconds). Is there any way to improve the code in order to calculate it faster?

    The size of the strings (values) in your hashes must be quite huge. I knocked up a simple test rig and found that each hash value had to contain ~3 million line numbers for it to take 25 seconds:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    sub MI {
        my( $string_es, $string_en, $hash_es, $hash_en ) = @_;
        my @array_es = split ' ', $hash_es->{ $string_es };
        my @array_en = split ' ', $hash_en->{ $string_en };
        my $prob_es = ( @array_es ) / 6939873;
        my $prob_en = ( @array_en ) / 6939873;
        my $intersection = Intersection( \@array_es, \@array_en );
        my $prob_es_en = $intersection / 6939873;
        $prob_es_en = ( $prob_es_en + ( $prob_es * $prob_en * 0.1 ) ) / 1.1;
        my $mi = $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en ) );
        return $mi;
    }

    sub Intersection {
        my( $refA, $refB ) = @_;
        my %counts;
        ++$counts{ $_ } for @$refA;
        ++$counts{ $_ } for @$refB;
        my $intersects = 0;
        $counts{ $_ } > 1 and ++$intersects for keys %counts;
        return $intersects;
    }

    our $N //= 1e4;
    my $hypo = 'fred';
    my $text = 'bill';
    my %hash_es; $hash_es{ $hypo } = join ' ', 1 .. $N;
    my %hash_en; $hash_en{ $text } = join ' ', 1 .. $N;

    my $start = time;
    my $MI_T = MI( $hypo, $text, \%hash_es, \%hash_en );
    printf "Took: %f seconds\n", time() - $start;

    __END__
    C:\test>888162
    Took: 0.046187 seconds
    C:\test>888162 -N=1e5
    Took: 0.677000 seconds
    C:\test>888162 -N=1e6
    Took: 7.680000 seconds
    C:\test>888162 -N=2e6
    Took: 15.748000 seconds
    C:\test>888162 -N=3e6
    Took: 25.244000 seconds

    What I noticed is that the vast majority of that time is spent allocating memory: for the arrays into which you split the line numbers in order to count them, and for the hash you use to intersect them. Is it possible to avoid some or all of that memory allocation?

    You first split the strings into arrays, in order to count the number of line numbers:

    my @array_es = split ' ', $hash_es->{ $string_es };
    my $prob_es = ( @array_es ) / 6939873;

    If you ensure that the numbers in the string are separated by exactly 1 space, then you can obtain the counts of numbers very quickly by counting the number of spaces and adding 1:

    my $n_array_es = $hash_es->{ $string_es } =~ tr[ ][ ];
    ++$n_array_es;

    Of course, that creates a different problem later when it comes time to intersect those sets of numbers. So rather than creating two arrays (with the associated memory allocations) in order to do the intersection, I tried using m//g to iterate the numbers and so set values in the hash:

    sub Intersection {
        my( $refA, $refB ) = @_;
        my %counts;
        ++$counts{ $1 } while $$refA =~ m[(\S+)]g;
        my $intersects = 0;
        exists $counts{ $1 } and ++$intersects while $$refB =~ m[(\S+)]g;
        return $intersects;
    }

    Putting that together:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    sub MI {
        my( $string_es, $string_en, $hash_es, $hash_en ) = @_;
        my $n_array_es = $hash_es->{ $string_es } =~ tr[ ][ ]; ++$n_array_es;
        my $n_array_en = $hash_en->{ $string_en } =~ tr[ ][ ]; ++$n_array_en;
        my $prob_es = ( $n_array_es ) / 6939873;
        my $prob_en = ( $n_array_en ) / 6939873;
        ## Notice I am passing references to the hash values here!
        my $intersection = Intersection(
            \$hash_es->{ $string_es }, \$hash_en->{ $string_en } );
        my $prob_es_en = $intersection / 6939873;
        $prob_es_en = ( $prob_es_en + ( $prob_es * $prob_en * 0.1 ) ) / 1.1;
        my $mi = $prob_es_en * log( $prob_es_en / ( $prob_es * $prob_en ) );
        return $mi;
    }

    sub Intersection {
        my( $refA, $refB ) = @_;   ## And dereferencing them here!!
        my %counts;
        ++$counts{ $1 } while $$refA =~ m[(\S+)]g;
        my $intersects = 0;
        exists $counts{ $1 } and ++$intersects while $$refB =~ m[(\S+)]g;
        return $intersects;
    }

    our $N //= 1e4;
    my $hypo = 'fred';
    my $text = 'bill';
    my %hash_es; $hash_es{ $hypo } = join ' ', 1 .. $N;
    my %hash_en; $hash_en{ $text } = join ' ', 1 .. $N;

    my $start = time;
    my $MI_T = MI( $hypo, $text, \%hash_es, \%hash_en );
    printf "Took: %f seconds\n", time() - $start;

    __END__
    C:\test>888162
    Took: 0.103639 seconds
    C:\test>888162 -N=1e6
    Took: 4.962000 seconds
    C:\test>888162 -N=3e6
    Took: 15.616000 seconds

    And it knocked about 40% off the time. Is that enough?


Re: improving the speed
by Anonymous Monk on Feb 15, 2011 at 09:26 UTC
    ...my hash is like this ... currently it takes a long time to calculate the $MI_

    You really should post a complete program which demonstrates exactly that :) How do I post a question effectively?

Re: improving the speed
by hbm (Hermit) on Feb 15, 2011 at 16:48 UTC

    With the same assumption as moritz's (that neither list has duplicate entries), you might try this:

    sub Intersection {
        my ($refA, $refB) = @_;
        my %B;
        @B{@$refB} = (1) x @$refB;   # hash the second list
        my $intersects;
        for (@$refA) {
            ++$intersects if exists $B{$_};   # 'exists' is fast!
        }
        return $intersects;
    }
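    A brief usage sketch (the data is made up, not from the thread; $intersects is initialised to 0 here so an empty intersection returns 0 rather than undef):

    ```perl
    use strict;
    use warnings;

    # hbm-style intersection: hash the second list, then probe it with 'exists'.
    sub Intersection {
        my ($refA, $refB) = @_;
        my %B;
        @B{@$refB} = (1) x @$refB;   # hash the second list
        my $intersects = 0;
        for (@$refA) {
            ++$intersects if exists $B{$_};   # 'exists' is fast!
        }
        return $intersects;
    }

    my @es = ( 1, 3, 5, 8, 9, 15, 90 );
    my @en = ( 3, 8, 90, 1000 );
    print Intersection( \@es, \@en ), "\n";   # prints 3 (common: 3, 8, 90)
    ```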