garbage777 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a 120000 line text file called "chance.txt" containing two columns as below:
k1 0.2 k2 0.9 k3 1 k4 1.2 k5 1.3 k6 2 k7 2.2 k8 2.7 k9 2.76 k10 3.001 k11 3.12 k12 3.2 k14 3.22 k15 3.3 k16 3.34 k17 3.4 k18 3.62 k19 3.8 k20 3.89

I have to print a random hash value and the associated key if it satisfies certain conditions shown below. Below is the code I wrote.
#!/usr/bin/perl use strict; use warnings; my %netHash; my $capval; my ($capVal_1_Hash, $capVal_2_Hash, $capVal_1_Hash, ....., $capVal_9 +9_Hash, $capVal_1_Hash) ; open(IFH, "<", "./chance.txt"); while(<IFH>) { chomp; my($nx,$ny) = split; $netHash{$nx} = $ny if ($ny > 1); } foreach $val (keys %netHash) { if($netHash{$val} > 1.0) { if($netHash{$val} < 1.4) { my ($x1, $y1) = ($val,$netHash{$val}); $capVal_1_Hash{$x1} = $y1 if defined $y1; } } if($netHash{$val} > 1.4) { if($netHash{$val} < 1.9) { my ($x2, $y2) = ($val,$netHash{$val}); $capVal_2_Hash{$x2} = $y2 if defined $y2; } } ... ... if($netHash{$val} > 99.2) { if($netHash{$val} <99.8) { my ($x99, $y99) = ($val,$netHash{$val}); $capVal_99_Hash{$x99} = $y99 if defined $y99; } } if($netHash{$val} > 100.1) { if($netHash{$val} < 100.9) { my ($x100, $y100) = ($val,$netHash{$val}); $capVal_100_Hash{$x100} = $y100 if defined $y100; } } } printHash(\%capVal_1_Hash); printHash(\%capVal_2_Hash); ... ... printHash(\%capVal_99_Hash); printHash(\%capVal_100_Hash); close IFH; } sub printHash { my $href = shift; my $RandKey = (keys %{$href}) [int(rand(scalar (keys %{$href})))]; print "$RandKey $href->{$RandKey}\n"; }

If I try to sample large files with this code, it looks like it is very slow. Therefore allow me to ask if there is any better way to write this code. Any help from fellow monks is appreciated.
best,
garbage777

Replies are listed 'Best First'.
Re: better way to printing random hash key and its value
by merlyn (Sage) on Dec 23, 2006 at 04:54 UTC
    Any time you have repeated copies of similar code, and variable names that contain a sequential integer, it should be a clue that you really have a data structure. It seems that what you want is an array (or perhaps a hash) above your other hashes so that you have one structure over which you can iterate.
Re: better way to printing random hash key and its value
by GrandFather (Saint) on Dec 23, 2006 at 06:02 UTC

    If you find yourself pulling out a rubber stamp then it is time to either use a subroutine or a data structure. In this case the key parts of the solution are to concatenate the ranges used for partitioning into an array and put the result of the partitioning into an array.

    #!/usr/bin/perl use strict; use warnings; my %netHash; my @capVals; my @limits = (1.0, 1.4, 1.9, 2.4, 3.0, 3.5, 4.0); while(<DATA>) { chomp; my($nx,$ny) = split; $netHash{$nx} = $ny if ($ny > 1); } foreach my $key (keys %netHash) { my $value = $netHash{$key}; my $index = BinSearch ($value, sub {$_[0] <=> $_[1]}, \@limits); next if $index < 0; $capVals[$index]{$key} = $value; } printHash ($_) for @capVals; sub printHash { my $href = shift; return if ! defined $href; my $RandKey = (keys %{$href}) [int(rand(scalar (keys %{$href})))]; print "$RandKey $href->{$RandKey}\n"; } sub BinSearch { my ($target, $cmp, $array) = @_; my $posmin = 0; my $posmax = $#$array; return undef if ! @$array; return -0.5 if $cmp->($array->[0], $target) > 0; return $#$array + 0.5 if $cmp->($array->[-1], $target) < 0; while (1) { my $mid = int (($posmin + $posmax) / 2); my $result = $cmp->($array->[$mid], $target); if ($result < 0) { $posmin = $posmax, next if $mid == $posmin && $posmax != $ +posmin; return $mid + 0.5 if $mid == $posmin; $posmin = $mid; } elsif ($result > 0) { $posmax = $posmin, next if $mid == $posmax && $posmax != $ +posmin; return $mid - 0.5 if $mid == $posmax; $posmax = $mid; } else { return $mid; } } } __DATA__ k1 0.2 k2 0.9 k3 1 k4 1.2 k5 1.3 k6 2 k7 2.2 k8 2.7 k9 2.76 k10 3.001 k11 3.12 k12 3.2 k14 3.22 k15 3.3 k16 3.34 k17 3.4 k18 3.62 k19 3.8 k20 3.89

    Prints:

    k4 1.2 k7 2.2 k9 2.76 k11 3.12 k19 3.8

    This code assumes that there are not supposed to be gaps between the partitioned ranges. Upper and lower limit pairs would need to be stored and the search modified if the ranges truely are gappy.


    DWIM is Perl's answer to Gödel
Re: better way to printing random hash key and its value
by kyle (Abbot) on Dec 23, 2006 at 05:28 UTC

    Well, there are a few places that it could be more efficient.

    The very first if in your foreach loop tests for a condition ($netHash{$val} > 1) that's guaranteed by your input loop.

    Inside your conditions, you check that the value from $netHash is defined before assignment, but that's also guaranteed by your input loop.

    That list of if conditions in the body of foreach all look mutually exclusive, and you're essentially doing "and" with nested conditions. You could rewrite this way:

    # you can get rid of the ($netHash{$val} > 1.0) test # I only included it to make the example more like yours if($netHash{$val} > 1.0 && $netHash{$val} < 1.4) { my ($x1, $y1) = ($val,$netHash{$val}); $capVal_1_Hash{$x1} = $y1; } elsif($netHash{$val} > 1.4 && $netHash{$val} < 1.9) { my ($x2, $y2) = ($val,$netHash{$val}); $capVal_2_Hash{$x2} = $y2; } ... ...

    That way, you're not checking that a value is between 1.4 and 1.9 after you've already established that it's between 1.0 and 1.4.

    I notice you seem to be excluding some values (such as 1.4), and I wonder if that's intentional.

    Your printHash might be faster if you don't call keys twice. Like so:

    sub printHash { my $href = shift; my @href_keys = keys %{$href}; my $RandKey = $href_keys[int(rand(scalar @href_keys))]; print "$RandKey $href->{$RandKey}\n"; }

    You seem to have a hundred separate hashes, %capVal_xx_Hash. I'd make those one hash of hashes, but I don't think that will gain you any speed. In that case your $capVal_100_Hash{$x100} = $y100 would be $capVal_Hash{100}{$x100} = $y100, just as an example. Then your repeated calls to printHash at the end could be in a loop:

    foreach my $n ( 1 .. 100 ) { printHash( $capVal_hash{$n} ); }

    In that case, you could even factor out the sub printHash and just put it in the loop body, if you wanted. That might be faster (it saves a function call).

    If you want to make the code shorter still (y'know, if you're not getting paid by the line), you could put your conditions in an array to loop over. (This is instead of the if ... elsif construct I suggested above.)

    my @range_destination = ( [ 1.0, 1.4, 1 ], [ 1.4, 1.9, 2 ], ... [ 99.2, 99.8, 99 ], [100.1, 100.9, 100 ], ); # what you call $val is really a KEY, # but I'll keep your name anyway. while ( my ( $val, $real_val ) = each %netHash ) { CONDITION: foreach my $cond_ref ( @range_destination ) { my ( $min, $max, $dest ) = @$cond_ref; if ( $val > $min && $val < $max ) { $capVal_Hash{$dest}{$val} = $real_val; last CONDITION; } } }

    Using each is more memory efficient. Putting all your ranges in one places makes it easier to see what they are (though I can't tell if that's significant), and it looks like it might be a few hundred lines shorter.

    Anyway, hope these suggestions help.

      You could...

      LINE: while(<IFH>) { chomp; my($nx,$ny) = split; next LINE if ($ny <= 1.0); CONDITION: foreach my $cond_ref ( @range_destination ) { my ( $min, $max, $dest ) = @$cond_ref; if ( $nx > $min && $nx < $max ) { $capVal_Hash{$dest}{$nx} = $ny; last CONDITION; } } }

      One loop instead of two will save some time, I reckon.

      If you know the frequency of values in your input, you can also order the @range_destination list so that the more frequent conditions are first, short circuiting the CONDITION loop faster.

Re: better way to printing random hash key and its value
by Util (Priest) on Dec 23, 2006 at 06:28 UTC

    The biggest time sink I see is that you look up $netHash{$val} 100 times in each loop.

    Untested code, via top-of-my-head refactoring:

    #!/usr/bin/perl use strict; use warnings; # Do you really need named hashes for later code? my @capVal_hashes = ( undef, \( my %capVal_1_Hash ), \( my %capVal_2_Hash ), \( my %capVal_3_Hash ), ... \( my %capVal_99_Hash ), \( my %capVal_100_Hash ), ); open IFH, '<', './chance.txt' or die; my %netHash; while (<IFH>) { my ( $key, $val ) = split; next if not $val > 1; my $hash_num = ( $val <= 1.0 ) ? undef : ( $val < 1.4 ) ? 1 : ( $val == 1.4 ) ? undef # Intentional gaps? : ( $val < 1.9 ) ? 2 : ( $val == 1.9 ) ? undef # Intentional gaps? ... ... : ( $val < 99.2 ) ? 98 # XXX Assumed : ( $val == 99.2 ) ? undef # Intentional gaps? : ( $val < 99.8 ) ? 99 : ( $val <= 100.1 ) ? undef # Intentional gaps? : ( $val < 100.9 ) ? 100 : undef ; if ( defined $hash_num ) { $capVal_hashes[$hash_num]{$key} = $val; } } close IFH or warn; foreach my $href ( @capVal_hashes ) { next if not defined $href; my @keys = keys %{ $href }; my $random_key = @keys[ rand @keys ]; my $random_val = $href->{$random_key}; print "$random_key $random_val\n"; }