ŞuRvīvőr has asked for the wisdom of the Perl Monks concerning the following question:

I'm reading a huge file (~30 million lines), extracting an ID with a corresponding score from each line, and saving them in a hash table: $my_hash{$ID} = $score. How can I extract the highest 20 scores out of the hash table, along with their IDs?

Replies are listed 'Best First'.
Re: Dealing with Hash Table
by Eliya (Vicar) on Aug 07, 2011 at 13:12 UTC

    Here's an implementation of "Keep an array of the current top 20 ..." (as suggested by GrandFather):

    my @top;

    while (<DATA>) {
        my ($id, $score) = split ' ';
        my $e = $top[-1];                  # last/lowest entry in @top
        if ( !defined($e) || $score > $e->[1] ) {
            push @top, [$id, $score];
            @top = sort { $b->[1] <=> $a->[1] } @top;
            pop @top if @top > 20;
        }
    }

    for (@top) {
        say "$_->[0] : $_->[1]";
    }

    __DATA__
    ID1 25
    ID2 8
    ID3 3
    ...
    42

    (where the __DATA__ is just to indicate the format — you'd of course read from your huge file instead)

    Update: actually, this wouldn't work ;( — e.g. if you encounter the largest value first, @top would then hold only that one entry, and the $score > $e->[1] test would reject every subsequent (smaller) score, so @top never fills up. So, here's a corrected version (unfortunately less performant, because @top is now sorted for every iteration):

    my @top;

    while (<DATA>) {
        my ($id, $score) = split ' ';
        push @top, [$id, $score];
        @top = sort { $b->[1] <=> $a->[1] } @top;
        pop @top if @top > 20;
    }

    (A quick test shows that on my somewhat aged hardware, this takes 280s for a 30_000_000-line file with random 8-digit scores, as opposed to 52s for the previous only-sometimes-correct version (40s of which are for mere reading/splitting). This is still significantly faster, though, than reading the entire data into a hash and sorting that hash. I couldn't test the full 3e+07 lines (as I only have 4G of physical RAM), but a 3e+06-entry hash already took 220s (and 805M memory), and sorting doesn't scale linearly with size.)
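
    For reference, the read-it-all-into-a-hash approach being compared against here would look something like the following (a minimal sketch, assuming the same whitespace-separated "ID score" line format as above):

    use feature 'say';

    my %score;
    while (<>) {                      # one "ID score" pair per line
        my ($id, $s) = split ' ';
        $score{$id} = $s;
    }

    # sort all keys by score, descending, and keep the first 20
    my @top20 = (sort { $score{$b} <=> $score{$a} } keys %score)[0 .. 19];
    say "$_ : $score{$_}" for @top20;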

      Eliya:

      I've modified your solution a bit, as I was uncomfortable with sorting the array for every iteration. For some data sets (very large datasets and/or skewed distributions) the sorting shouldn't be required frequently.

      #!/usr/bin/perl

      use strict;
      use warnings;
      use autodie;
      use feature ':5.10';

      my $top_N     = 10;
      my $gen_score = sub { 1000 * rand() * rand() };
      my $noisy     = 0;    # 0=quiet, 1=show repl, 2=show add, 3=all traces

      # Generate temp data
      if (@ARGV or ! -e 'TEMP') {
          my $row_count = shift // 100;
          say "Building temp file with $row_count items" if $noisy;
          open my $FH, '>', 'TEMP';
          printf $FH "ID_%04u %.2f\n", $_, &$gen_score() for 1 .. $row_count;
          close $FH;
      }

      my @top;
      my ($cnt_add, $cnt_replace) = (0) x 2;

      open my $FH, '<', 'TEMP';
      while (<$FH>) {
          my ($id, $score) = split ' ';
          if (@top < $top_N) {
              @top = sort { $a->[1] <=> $b->[1] } @top, [$id, $score];
              say "$.: [$id, $score] ADD => ", scalar(@top),
                  " $top[0][0]:$top[0][1] .. $top[-1][0]:$top[-1][1]"
                  if $noisy > 1;
              ++$cnt_add;
          }
          elsif ($score > $top[0][1]) {
              @top = sort { $a->[1] <=> $b->[1] } [$id, $score], @top[1 .. $#top];
              say "$.: [$id, $score] REPL => ", scalar(@top),
                  " $top[0][0]:$top[0][1] .. $top[-1][0]:$top[-1][1]"
                  if $noisy;
              ++$cnt_replace;
          }
          else {
              say "$.: [$id, $score] < [ $top[0][0], $top[0][1] ]" if $noisy > 2;
          }
      }

      for (@top) {
          say "$_->[0] : $_->[1]";
      }
      say "adds: $cnt_add, replacements: $cnt_replace, lines: $.";

      A sample run:

      $ perl 919076.pl 1500
      ID_0762 : 897.90
      ID_1218 : 898.96
      ID_1468 : 917.94
      ID_0195 : 920.68
      ID_0089 : 921.92
      ID_0071 : 925.55
      ID_0668 : 933.69
      ID_1425 : 940.34
      ID_0374 : 962.16
      ID_0185 : 984.86
      adds: 10, replacements: 45, lines: 1500

      Note: Yes, (a) premature optimization is the root of all evil, (b) I'm not claiming it's faster, as I haven't benchmarked it, and (c) I'm not claiming that yours is "too slow", either. I was just scratching an itch. ;^)

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

      Eliya, I liked your first solution. All it needs is a way to initialize the top-20 array with the first 20 data points. Instead of checking whether $e is defined, check whether @top has fewer than 20 elements:

      if ( @top < 20 || $score > $e->[1] ) {
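
      Applied to your first version, that gives something like this (a sketch, untested):

      my @top;
      while (<DATA>) {
          my ($id, $score) = split ' ';
          my $e = $top[-1];              # last/lowest entry in @top
          if ( @top < 20 || $score > $e->[1] ) {
              push @top, [$id, $score];
              @top = sort { $b->[1] <=> $a->[1] } @top;
              pop @top if @top > 20;
          }
      }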
      roboticus has the right idea, in my eyes: you don't have to sort on every iteration. However, his code looks complex to me, and at first sight I still get the impression that he sorts for every entry.

      What I think you should do is collect a larger number of "top" items, and once you get over a threshold (say, @top contains 100 entries where you need only 20), sort, keep only the top 20, and drop the 80 items below them. That should divide your number of sorts by 80, as it would only sort again once you've added 80 more items; or (possibly a lot) more, if you also skip items that score below the current item #20. Still, a factor of 80 is already significant, even though each sort will itself be slower. (Sorting 100 items should take, ooh, 6 or 7 times longer than sorting 20 items, I guess.)

      Here's a proof of concept implementation (untested) for both ideas:

      my $N = 20;
      my $collect = 5 * $N;
      my @top;

      while (<>) {
          my ($id, $score) = split ' ';
          push @top, [$id, $score];
          if (@top > $collect) {
              @top = sort { $b->[1] <=> $a->[1] } @top;
              splice @top, $N;          # keep only the best $N
          }
      }
      if (@top > $N) {
          @top = sort { $b->[1] <=> $a->[1] } @top;
          splice @top, $N;
      }
      and with threshold:
      my $N = 20;
      my $collect = 5 * $N;
      my $threshold;
      my @top;

      while (<>) {
          my ($id, $score) = split ' ';
          push @top, [$id, $score]
              if !defined $threshold or $score >= $threshold;
          if (@top > $collect) {
              @top = sort { $b->[1] <=> $a->[1] } @top;
              splice @top, $N;
              $threshold = $top[-1][1]; # score of the current item #$N
          }
      }
      if (@top > $N) {
          @top = sort { $b->[1] <=> $a->[1] } @top;
          splice @top, $N;
      }
Re: Dealing with Hash Table
by GrandFather (Saint) on Aug 07, 2011 at 12:08 UTC

    Keep an array of the current top 20 as you read the values in.

    That said, depending on the nature of the data and what else you are doing with it, you may be better off using a database such as DBD::SQLite.
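
    For what it's worth, a minimal sketch of the DBD::SQLite route might look like this (the file, table, and column names are made up for illustration):

    use strict;
    use warnings;
    use DBI;

    # load everything into SQLite, then let SQL find the top 20
    my $dbh = DBI->connect('dbi:SQLite:dbname=scores.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS scores (id TEXT, score REAL)');

    my $ins = $dbh->prepare('INSERT INTO scores (id, score) VALUES (?, ?)');
    $dbh->begin_work;                  # one big transaction, or the inserts crawl
    while (<>) {
        my ($id, $score) = split ' ';
        $ins->execute($id, $score);
    }
    $dbh->commit;

    my $top = $dbh->selectall_arrayref(
        'SELECT id, score FROM scores ORDER BY score DESC LIMIT 20');
    print "$_->[0] : $_->[1]\n" for @$top;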

    True laziness is hard work

      I have done the following ...

      @keys = sort { $Top_Twenty_Categories{$b} <=> $Top_Twenty_Categories{$a} }
              keys %Top_Twenty_Categories;
      foreach $key (@keys) {
          print "$key: $Top_Twenty_Categories{$key}\n";
      }

      And it's working very well on a sample dataset, but I'm trying to find a faster way to do the job, as I'm going to apply it to a very large dataset.

Re: Dealing with Hash Table
by AnomalousMonk (Archbishop) on Aug 07, 2011 at 20:55 UTC

    Assuming the database approach is not appropriate, isn't this what the OS sort is for? Why not just reverse-sort the file on the score field, then extract the 'top 20' scores (which may or may not be the first 20 lines) of the sorted file with a nearly-trivial Perl script? (If 'top 20' really is the first 20 lines, extraction could even be done with head.) If an ID may be present more than once with different scores, a slightly-less-trivial Perl script would be needed to extract the IDs with the 'top 20' scores, however defined.
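
    For example, assuming whitespace-separated "ID score" lines and GNU sort, the whole job could be as small as:

    sort -k2,2rn huge_file | head -20

    (sort -k2,2rn sorts numerically, descending, on the second field; swap the head for a small Perl filter if duplicate IDs need special handling.)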

Re: Dealing with Hash Table
by BrowserUk (Patriarch) on Aug 07, 2011 at 19:40 UTC

    A one-liner:

    perl -nle"@t=(sort{(split' ',$b)[1]<=>(split' ',$a)[1]}@t,$_)[0..19]}{print for @t" junk.dat
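
    (If the }{ looks odd: under -n the code is wrapped in a while (<>) { ... } loop, so the }{ closes that loop and opens a bare block that runs once after the last line has been read, which is where the final print happens.)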

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.