Re: Limit the size of a hash
by Corion (Patriarch) on Sep 05, 2013 at 07:02 UTC
How about just keeping the first 10 elements, and whenever you find a new element, just keeping the 10 highest scoring elements? That way you never have more than 11 elements in memory.
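A minimal sketch of that idea, assuming whitespace-separated fields with the score in the last column (the field layout is my assumption, not something Corion specified):

use strict;
use warnings;

my @top;    # [ score, line ] pairs, kept sorted descending by score
while ( my $line = <> ) {
    my $score = ( split ' ', $line )[-1];
    next unless defined $score;
    push @top, [ $score, $line ];              # briefly an 11th element
    @top = sort { $b->[0] <=> $a->[0] } @top;
    pop @top if @top > 10;                     # drop the lowest again
}
print $_->[1] for @top;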
Re: Limit the size of a hash
by BrowserUk (Patriarch) on Sep 05, 2013 at 09:37 UTC
There is an anomaly in what you are seeking to do.
An 8GB file of 12-character lines gives ~715 million lines.
With the selection value being a number between 0.00 and 0.99, there are only 100 possible values; which means that there could be an average of ~7 million lines for each value. Which 10 of the ~7 million 0.99-value lines do you want?
Re: Limit the size of a hash
by space_monk (Chaplain) on Sep 05, 2013 at 08:42 UTC
You could use a package such as Tie::Array::Sorted to maintain a sorted array of your 10 highest scores (mapped to the corresponding code); once the array grows beyond 10 elements, just remove the lowest (last) item as you go through the file.
The following needs a little refinement/debugging, but shows an outline of how I think it should work....
use strict;
use warnings;
use Tie::Array::Sorted;

# sort descending by score so the lowest scorer is always last
tie my @a, "Tie::Array::Sorted", sub { $_[1]->{score} <=> $_[0]->{score} };

# read file line by line
while (<FH>) {
    # use split if you hate regexes... :-)
    if ( /((\d ){4})(\d\.\d+)/ ) {
        # add new element to array ($3 is the score, $1 the code)
        push @a, { score => $3, code => $1 };
        # remove lowest scoring (last) element if more than 10
        pop @a if @a > 10;
    }
}
If you spot any bugs in my solutions, it's because I've deliberately left them in as an exercise for the reader! :-)
Re: Limit the size of a hash
by Random_Walk (Prior) on Sep 05, 2013 at 08:26 UTC
use strict;
use warnings;
use Data::Dumper;

my $keep = 5;
my @keeper;

# populate our store with the first $keep vals
for (1 .. $keep) {
    my @vals  = split /\s+/, <DATA>;
    my $score = pop @vals;
    push @keeper, [ $score, \@vals ];
}

# sort it so the smallest is at the front
@keeper = sort { $a->[0] <=> $b->[0] } @keeper;
print "$_->[0], " for @keeper;

while (<DATA>) {
    my @vals = split;
    next unless @vals;    # could do a more sophisticated test
    my $score = pop @vals;
    if ( $score > $keeper[0]->[0] ) {
        $keeper[0] = [ $score, \@vals ];
        @keeper = sort { $a->[0] <=> $b->[0] } @keeper;
        print "\nAdded $score: now have: ";
        print "$_->[0], " for @keeper;
    }
}
__DATA__
0 5 3 7 0.97
1 5 4 8 0.18
1 5 4 8 0.21
1 5 4 8 0.22
1 5 4 8 0.52
1 5 4 8 0.16
1 5 4 8 0.18
1 5 4 8 0.72
1 5 4 8 0.32
1 5 4 8 0.17
1 5 4 8 0.52
1 5 4 8 0.92
6 8 4 6 0.54
Cheers, R.
Pereant, qui ante nos nostra dixerunt!
Re: Limit the size of a hash
by Tux (Canon) on Sep 05, 2013 at 08:49 UTC
If you need to analyze the data more than once and speed is not the main issue, but size is, put the hash in a database. Use something like select … from hashtable order by … limit 10;. You can fill the database with e.g. Tie::Hash::DBD and have the best of both worlds.
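A minimal sketch of the database route, assuming DBI with DBD::SQLite; the file, table, and column names here are illustrative, not from the post:

use strict;
use warnings;
use DBI;

# hypothetical database file and table; adjust to taste
my $dbh = DBI->connect( "dbi:SQLite:dbname=scores.db", "", "",
                        { RaiseError => 1, AutoCommit => 0 } );
$dbh->do( "CREATE TABLE IF NOT EXISTS hashtable (code TEXT, score REAL)" );

my $ins = $dbh->prepare( "INSERT INTO hashtable (code, score) VALUES (?, ?)" );
while (<>) {
    my @fields = split;
    next unless @fields;
    my $score = pop @fields;
    $ins->execute( "@fields", $score );
}
$dbh->commit;

# let the database do the ordering and the limiting
my $top = $dbh->selectall_arrayref(
    "SELECT code, score FROM hashtable ORDER BY score DESC LIMIT 10" );
print "$_->[0]\t$_->[1]\n" for @$top;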
Enjoy, Have FUN! H.Merijn
Re: Limit the size of a hash
by hdb (Monsignor) on Sep 05, 2013 at 11:20 UTC
And here is my version, based on insertion sort, complete with random data generation and testing:
use strict;
use warnings;

sub nextrecord {
    return undef unless shift;
    # generate random records
    return { code  => join( " ", (int rand 10), (int rand 10), (int rand 10), (int rand 10) ),
             score => rand };
}

my $keep    = 10;
my $counter = 1000;

# initialize with the first 10 records sorted
my @largest = sort { $a->{score} <=> $b->{score} } map nextrecord( $counter-- ), 1..$keep;
my @test = @largest;    # for testing

while( my $next = nextrecord( $counter-- ) ){
    push @test, $next;    # keep all for testing
    my $pos = 0;          # insertion sort
    while( ($pos < $keep) && ($largest[$pos]->{score} < $next->{score}) ) {
        $largest[$pos-1] = $largest[$pos] if $pos;
        $pos++;
    }
    $largest[$pos-1] = $next if $pos;
}

print map { "$_->{code}\t$_->{score}\n" } @largest;

# test
print "\n";
print map { "$_->{code}\t$_->{score}\n" } ( sort { $a->{score} <=> $b->{score} } @test )[-10..-1];
use strict;
use warnings;
use Data::Dumper;

my $keep = 5;
my @keeper;
push @keeper, [0] for 1 .. $keep;

while (<DATA>) {
    my @vals = split;
    next unless @vals;    # possibly more sophisticated test
    my $score = pop @vals;
    addlist( [ $score, \@vals ], \@keeper );
}
print Dumper \@keeper;

# slide up the list knocking down existing records
# until we no longer have the biggest
sub addlist {
    my ($rec, $list) = @_;
    for my $i (0 .. $#$list) {
        if ($rec->[0] > $list->[$i]->[0]) {
            $list->[$i-1] = $list->[$i] if $i;
            $list->[$i] = $rec;
        }
        else {
            last;
        }
    }
}
__DATA__
0 5 3 7 0.97
1 3 4 8 0.18
1 4 4 8 0.21
1 7 4 4 0.22
1 8 4 8 0.52
1 9 4 1 0.16
1 5 1 8 0.18
1 5 2 9 0.72
1 5 5 8 0.32
1 5 6 6 0.17
1 5 7 4 0.52
1 5 9 3 0.92
1 5 4 8 0.99
6 8 4 6 0.54
Cheers, R.
Pereant, qui ante nos nostra dixerunt!
Re: Limit the size of a hash
by brx (Pilgrim) on Sep 05, 2013 at 09:38 UTC
Similar to Random_Walk's version (Re: Limit the size of a hash), but without sorting.
Here, we keep five elements.
When necessary, we recalculate the index of the lowest element (the "min of highest").
If the score of the new element is better than the "min of highest", the "min of highest" is replaced by the new element.
use strict;
use warnings;

my $keep = 5;

# init
my @highest = map scalar <DATA>, 1 .. $keep;   # first lines
my @scores  = map +(split)[-1], @highest;      # score is last element (-1) of each line
my $indexofmin = 0;
my $updated    = 1;

####
while (my $line = <DATA>) {
    my $score = (split / /, $line)[-1];
    if ($updated) {
        for my $i (0 .. $#scores) {
            $indexofmin = $i if $scores[$i] < $scores[$indexofmin];
        }
    }
    if ($score > $scores[$indexofmin]) {
        $highest[$indexofmin] = $line;
        $scores[$indexofmin]  = $score;
        $updated = 1;
    } else {
        $updated = 0;
    }
}
print @highest;
__DATA__
0 5 3 7 0.97
1 5 4 8 0.18
1 5 4 8 0.21
1 5 4 8 0.22
1 5 4 8 0.52
1 5 4 8 0.16
1 5 4 8 0.18
1 5 4 8 0.72
1 5 4 8 0.32
1 5 4 8 0.17
1 5 4 8 0.52
1 5 4 8 0.92
6 8 4 6 0.54
English is not my mother tongue.
My mother's flip-flops are "made in France".
Re: Limit the size of a hash
by BrowserUk (Patriarch) on Sep 05, 2013 at 10:40 UTC
Based on the assumption that you're using a hash because you want all the records containing each of the top 10 values, this uses two passes: one to extract the values and select the top 10, and a second to print the records containing those values:
#! perl -sw
use strict;
open FH, '<', $ARGV[ 0 ] or die $!;
## First pass; find all the values
my %vals; m[(\S+)$] and ++$vals{ $1 } while <FH>;
## Select the top 10 and put them in a hash for fast lookup.
my %top10; undef @top10{ ( sort{ $b <=> $a } keys %vals )[ 0 .. 9 ] };
## rewind for second pass
seek FH, 0, 0; ## rewind
## And print them if they fit the criteria
m[(\S+)$] and exists $top10{ $1 } and print while <FH>;
## close
close FH;
Re: Limit the size of a hash (beware of 'add one and (re)sort and discard' algorithms)
by BrowserUk (Patriarch) on Sep 05, 2013 at 14:25 UTC
If it is just a randomly selected 10 items with the highest value you are after, beware of 'add one and (re)sort and discard' algorithms.
O(N log N) sorting algorithms are pretty efficient for large sets of data, but (relatively) pretty lousy at small sets.
Eg. ln(100e6) = 18.42, a mere 0.000018% of N; but ln(10) = 2.3, which is 23% of N.
So, given your ~715e6 records, the 'add 1, re-sort and discard 1' algorithms do 715 million * O(ln 11) sorts, which is ~= O(1.7e9).
Conversely, if you add 1000 items, sort and discard 1000, you get 715 thousand * O(ln 1010) sorts, which is ~= O(4.9e6).
And (theoretically), if you add 1e6 items, sort, and discard 1e6, then you get 715 * O(ln 1000010) sorts, which is ~= O(9878).
Of course, these CompSci theoretical complexity estimates tend to ignore the cost of memory allocations, deallocations, etc. and thus never work out that way in reality, but they do give an indication of where to look for savings.
This compares Add 1 and Add many algorithms for various values of many, and adding somewhere between 100 and 1000 items at a time seems to work best for an entirely in-memory, cpu-bound process:
#! perl -slw
use strict;
use List::Util qw[ sum min ];
use Time::HiRes qw[ time ];

sub timeit(&) {
    my $soFar = sum( times() );
    my $start = time;
    $_[0]->();
    return sprintf "Wall:%6.3f secs; cpu:%6.3f secs",
        time() - $start, sum( times() ) - $soFar;
}

our $T //= 10e6;
our $N //= 10;
our $B //= 10;

my @data = map{ sprintf "%4.2f", rand() } 1 .. $T;

print 'Add 1 and sort; add 1 and sort ...: ', timeit {
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
    for( my $i = $N; $i < $T; ++$i ) {
        @top = ( sort{ $b <=> $a } @top, $data[ $i ] )[ 0 .. $N-1 ];
    }
    # print "@top";
};

print "Add $B and sort; add $B and sort ...: ", timeit{
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
    for( my $i = $N; $i < $T; $i += $B ) {
        @top = sort{ $b <=> $a } @top, @data[ $i .. min( $i+$B-1, $#data ) ];
        $#top = $N-1;    # keep only the top $N
    }
    # print "@top";
};
__END__
C:\test>1052498-3 -T=10e6 -B=1e1
Add 1 and sort; add 1 and sort ...:     Wall:89.400528 secs; cpu:89.047000 secs
Add 1e1 and sort; add 1e1 and sort ...: Wall:18.483192 secs; cpu:18.391000 secs

C:\test>1052498-3 -T=10e6 -B=1e2
Add 1 and sort; add 1 and sort ...:     Wall:87.353784 secs; cpu:86.438000 secs
Add 1e2 and sort; add 1e2 and sort ...: Wall: 9.385142 secs; cpu: 9.234000 secs

C:\test>1052498-3 -T=10e6 -B=2e2
Add 1 and sort; add 1 and sort ...:     Wall:87.380 secs; cpu:87.046 secs
Add 2e2 and sort; add 2e2 and sort ...: Wall: 9.135 secs; cpu: 9.156 secs

C:\test>1052498-3 -T=10e6 -B=1e3
Add 1 and sort; add 1 and sort ...:     Wall:87.436786 secs; cpu:86.625000 secs
Add 1e3 and sort; add 1e3 and sort ...: Wall: 9.377329 secs; cpu: 9.329000 secs

C:\test>1052498-3 -T=10e6 -B=1e4
Add 1 and sort; add 1 and sort ...:     Wall:87.435 secs; cpu:86.298 secs
Add 1e4 and sort; add 1e4 and sort ...: Wall:10.077 secs; cpu: 9.843 secs
Now, if someone decides to try that where the data is greater than memory, and so coming from an external file, they'll probably find that the IO-limited nature means that 1 or 1000 at a time makes little or no difference to the overall elapsed time.
But if you then look at the CPU used, the 1-at-a-time approach will be an order of magnitude, or more, higher. And cycles cost money. There is little purpose in burning extra cycles to achieve the same outcome in the same time.
Yes, but if you first compare the new item with the smallest of the 10 already stored, you will very rarely add it to the set of 10 and re-sort. This would be a "conditionally discard, add, and re-sort" algorithm.
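A minimal sketch of that guard, assuming @top is held sorted descending with the smallest score last (the variable names and field layout are my assumptions, not from the thread):

use strict;
use warnings;

my @top = map { [ 0, '' ] } 1 .. 10;    # seeded low; kept sorted descending
while ( my $line = <> ) {
    my $score = ( split ' ', $line )[-1];
    next unless defined $score;
    # only the rare record that beats the current minimum triggers a re-sort
    if ( $score > $top[-1][0] ) {
        $top[-1] = [ $score, $line ];                  # replace the smallest
        @top     = sort { $b->[0] <=> $a->[0] } @top;  # restore the order
    }
}
print $_->[1] for @top;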
Re: Limit the size of a hash (use an array)
by Anonymous Monk on Sep 05, 2013 at 07:01 UTC
1) use an array (say presized to 12k)
2) when you've read in 12k elements,
3) sort the array
4) and discard all but the highest 10 (splice)
Then put it in a hash or whatever you need to do.
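A minimal sketch of that outline, assuming whitespace-separated lines with the score last (the batch size and field layout are my assumptions):

use strict;
use warnings;

my $batch = 12_000;
my ( @buf, @top );

sub winnow {    # sort the batch plus the current leaders, keep the top 10
    @top = ( sort { ( split ' ', $b )[-1] <=> ( split ' ', $a )[-1] } @top, @buf )[ 0 .. 9 ];
    @top = grep defined, @top;    # in case fewer than 10 lines seen so far
    @buf = ();
}

while ( my $line = <> ) {
    push @buf, $line;
    winnow() if @buf >= $batch;
}
winnow() if @buf;    # flush the final partial batch
print @top;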