in reply to Dealing with Hash Table

Here's an implementation of "Keep an array of the current top 20 ..." (as suggested by GrandFather):

use feature 'say';   # for say()

my @top;

while (<DATA>) {
    my ($id, $score) = split ' ';
    my $e = $top[-1];   # last/lowest entry in @top
    if ( !defined($e) || $score > $e->[1] ) {
        push @top, [$id, $score];
        @top = sort { $b->[1] <=> $a->[1] } @top;
        pop @top if @top > 20;
    }
}

for (@top) {
    say "$_->[0] : $_->[1]";
}

__DATA__
ID1 25
ID2 8
ID3 3
... 42

(where the __DATA__ is just to indicate the format — you'd of course read from your huge file instead)

Update: actually, this wouldn't work ;( — e.g. if you encounter the largest value first, every later (smaller) score is rejected because it never beats the current minimum, even though @top is still far from holding 20 entries. So, here's a corrected version (unfortunately less performant, because @top is now sorted on every iteration):

my @top;

while (<DATA>) {
    my ($id, $score) = split ' ';
    push @top, [$id, $score];
    @top = sort { $b->[1] <=> $a->[1] } @top;
    pop @top if @top > 20;
}

(A quick test shows that on my somewhat aged hardware, this takes 280s for a 30_000_000 line file with random 8-digit scores — as opposed to 52s for the previous only-sometimes-correct version (40s of which are for mere reading/splitting).   This is still significantly faster, though, than reading the entire data into a hash and sorting that hash. I couldn't test the full 3e+07 lines (as I only have 4G of physical RAM), but a 3e+06 entry hash already took 220s (and 805M memory), and sorting doesn't scale linearly with size.)
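
For comparison, the hash approach those timings argue against would look roughly like this (a minimal sketch, assuming one "ID score" pair per line as in the __DATA__ above, and at least 20 distinct IDs):

use feature 'say';

# Read the entire file into a hash, then sort all keys at once.
# This is the variant that already needed ~805M of RAM for 3e6 entries.
my %score;
while (<>) {
    my ($id, $s) = split ' ';
    $score{$id} = $s;
}
my @top20 = (sort { $score{$b} <=> $score{$a} } keys %score)[0 .. 19];
say "$_ : $score{$_}" for @top20;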

Re^2: Dealing with Hash Table
by roboticus (Chancellor) on Aug 07, 2011 at 14:44 UTC

    Eliya:

    I've modified your solution a bit, as I was uncomfortable with sorting the array on every iteration. For some data sets (very large ones and/or skewed distributions), the sort shouldn't be needed very often.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use autodie;
    use feature ':5.10';

    my $top_N     = 10;
    my $gen_score = sub { 1000 * rand() * rand() };
    my $noisy     = 0;   # 0=quiet, 1=show repl, 2=show add, 3=all traces

    # Generate temp data
    if (@ARGV or ! -e 'TEMP') {
        my $row_count = shift // 100;
        say "Building temp file with $row_count items" if $noisy;
        open my $FH, '>', 'TEMP';
        printf $FH "ID_%04u %.2f\n", $_, &$gen_score() for 1 .. $row_count;
        close $FH;
    }

    my @top;
    my ($cnt_add, $cnt_replace) = (0) x 2;
    open my $FH, '<', 'TEMP';
    while (<$FH>) {
        my ($id, $score) = split ' ';
        if (@top < $top_N) {
            @top = sort { $a->[1] <=> $b->[1] } @top, [$id, $score];
            say "$.: [$id, $score] ADD  => ", scalar(@top),
                " $top[0][0]:$top[0][1] .. $top[-1][0]:$top[-1][1]"
                if $noisy > 1;
            ++$cnt_add;
        }
        elsif ($score > $top[0][1]) {
            @top = sort { $a->[1] <=> $b->[1] } [$id, $score], @top[1 .. $#top];
            say "$.: [$id, $score] REPL => ", scalar(@top),
                " $top[0][0]:$top[0][1] .. $top[-1][0]:$top[-1][1]"
                if $noisy;
            ++$cnt_replace;
        }
        else {
            say "$.: [$id, $score] < [ $top[0][0], $top[0][1] ]" if $noisy > 2;
        }
    }

    for (@top) {
        say "$_->[0] : $_->[1]";
    }
    say "adds: $cnt_add, replacements: $cnt_replace, lines: $.";

    A sample run:

    $ perl 919076.pl 1500
    ID_0762 : 897.90
    ID_1218 : 898.96
    ID_1468 : 917.94
    ID_0195 : 920.68
    ID_0089 : 921.92
    ID_0071 : 925.55
    ID_0668 : 933.69
    ID_1425 : 940.34
    ID_0374 : 962.16
    ID_0185 : 984.86
    adds: 10, replacements: 45, lines: 1500

    Note: Yes, (a) premature optimization is the root of all evil, (b) I'm not claiming it's faster, as I haven't benchmarked it, and (c) I'm not claiming that yours is "too slow", either. I was just scratching an itch. ;^)

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re^2: Dealing with Hash Table
by Gulliver (Monk) on Aug 07, 2011 at 18:56 UTC

    Eliya, I liked your first solution. All it needs is a way to initialize the top-20 array with the first 20 data points: instead of checking whether $e is defined, check whether @top has fewer than 20 elements.

    if ( @top < 20 || $score > $e->[1] ) {
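
    Folded into Eliya's first version, that one-line change would give something like this (a sketch; the sort/pop logic is unchanged):

    use feature 'say';

    my @top;
    while (<DATA>) {
        my ($id, $score) = split ' ';
        my $e = $top[-1];   # lowest entry so far (undef while @top is empty)
        if ( @top < 20 || $score > $e->[1] ) {   # accept unconditionally until @top is full
            push @top, [$id, $score];
            @top = sort { $b->[1] <=> $a->[1] } @top;
            pop @top if @top > 20;
        }
    }
    say "$_->[0] : $_->[1]" for @top;
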
Re^2: Dealing with Hash Table
by bart (Canon) on Aug 08, 2011 at 11:24 UTC
    roboticus has the right idea, in my eyes: you don't have to sort on every iteration. However, his code looks complex to me, and at first sight I get the impression it still sorts for every entry.

    What I think you should do is collect a larger number of "top" items, and once you get over a threshold (say, @top contains 100 entries where you need only 20), you can sort, keep only the top 20, and drop the 80 items below them. That should divide your number of sorts by 80, as it would only sort again once you've added 80 more items; or by (possibly a lot) more, if you only push items whose score is not below that of the current item #20. Still, a factor of 80 is already significant, even though each individual sort will be slower. (Sorting 100 items should take, ooh, 6 or 7 times longer than sorting 20 items, I guess; an n·log(n) back-of-envelope gives 100·log2(100) ≈ 664 vs. 20·log2(20) ≈ 86, i.e. a factor of about 7.7.)

    Here's a proof-of-concept implementation (untested) for both ideas:

    my $N = 20;
    my $collect = 5 * $N;
    my @top;

    while (<>) {
        my ($id, $score) = split ' ';
        push @top, [$id, $score];
        if (@top > $collect) {
            @top = sort { $b->[1] <=> $a->[1] } @top;
            splice @top, $N;   # keep only the best $N
        }
    }

    # final trim for whatever is left over
    if (@top > $N) {
        @top = sort { $b->[1] <=> $a->[1] } @top;
        splice @top, $N;
    }
    and with threshold:
    my $N = 20;
    my $collect = 5 * $N;
    my $threshold;
    my @top;

    while (<>) {
        my ($id, $score) = split ' ';
        push @top, [$id, $score]
            if !defined $threshold or $score >= $threshold;
        if (@top > $collect) {
            @top = sort { $b->[1] <=> $a->[1] } @top;
            splice @top, $N;
            $threshold = $top[-1][1];   # score of the current item #$N
        }
    }

    # final trim for whatever is left over
    if (@top > $N) {
        @top = sort { $b->[1] <=> $a->[1] } @top;
        splice @top, $N;
    }
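
    Neither sketch prints its result; for a quick try-out you'd append Eliya's output loop (plus "use feature 'say';" at the top) and feed the big file in via the diamond operator, e.g. perl topn.pl hugefile (script and file names hypothetical):

    say "$_->[0] : $_->[1]" for @top;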