in reply to Dealing with Hash Table
Here's an implementation of "Keep an array of the current top 20 ..." (as suggested by GrandFather):
my @top; while (<DATA>) { my ($id, $score) = split ' '; my $e = $top[-1]; # last/lowest entry in @top if ( !defined($e) || $score > $e->[1] ) { push @top, [$id, $score]; @top = sort {$b->[1] <=> $a->[1]} @top; pop @top if @top > 20; } } for (@top) { say "$_->[0] : $_->[1]"; } __DATA__ ID1 25 ID2 8 ID3 3 ... 42
(where the __DATA__ is just to indicate the format — you'd of course read from your huge file instead)
Update: actually, this wouldn't work ;( — e.g. if you encounter the largest value first So, here's a corrected version (unfortunately less performant, because @top now is sorted for every iteration):
my @top; while (<DATA>) { my ($id, $score) = split ' '; push @top, [$id, $score]; @top = sort {$b->[1] <=> $a->[1]} @top; pop @top if @top > 20; }
(A quick test shows that on my somewhat aged hardware, this takes 280s for a 30_000_000 line file with random 8-digit scores — as opposed to 52s for the previous only-sometimes-correct version (40s of which are for mere reading/splitting). This is still significantly faster, though, than reading the entire data into a hash and sorting that hash. I couldn't test the full 3e+07 lines (as I only have 4G of physical RAM), but a 3e+06 entry hash already took 220s (and 805M memory), and sorting doesn't scale linearly with size.)
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: Dealing with Hash Table
by roboticus (Chancellor) on Aug 07, 2011 at 14:44 UTC | |
|
Re^2: Dealing with Hash Table
by Gulliver (Monk) on Aug 07, 2011 at 18:56 UTC | |
|
Re^2: Dealing with Hash Table
by bart (Canon) on Aug 08, 2011 at 11:24 UTC |