Here's an implementation of "Keep an array of the current top 20 ..." (as suggested by GrandFather):
my @top; while (<DATA>) { my ($id, $score) = split ' '; my $e = $top[-1]; # last/lowest entry in @top if ( !defined($e) || $score > $e->[1] ) { push @top, [$id, $score]; @top = sort {$b->[1] <=> $a->[1]} @top; pop @top if @top > 20; } } for (@top) { say "$_->[0] : $_->[1]"; } __DATA__ ID1 25 ID2 8 ID3 3 ... 42
(where the __DATA__ is just to indicate the format — you'd of course read from your huge file instead)
Update: actually, this wouldn't work ;( — e.g. if you encounter the largest value first So, here's a corrected version (unfortunately less performant, because @top now is sorted for every iteration):
my @top; while (<DATA>) { my ($id, $score) = split ' '; push @top, [$id, $score]; @top = sort {$b->[1] <=> $a->[1]} @top; pop @top if @top > 20; }
(A quick test shows that on my somewhat aged hardware, this takes 280s for a 30_000_000 line file with random 8-digit scores — as opposed to 52s for the previous only-sometimes-correct version (40s of which are for mere reading/splitting). This is still significantly faster, though, than reading the entire data into a hash and sorting that hash. I couldn't test the full 3e+07 lines (as I only have 4G of physical RAM), but a 3e+06 entry hash already took 220s (and 805M memory), and sorting doesn't scale linearly with size.)
In reply to Re: Dealing with Hash Table
by Eliya
in thread Dealing with Hash Table
by ŞuRvīvőr
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |