Re: Limit the size of a hash
by Corion (Patriarch) on Sep 05, 2013 at 07:02 UTC
How about just keeping the first 10 elements, and whenever you find a new element, just keeping the 10 highest scoring elements? That way you never have more than 11 elements in memory.
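A minimal sketch of that idea, assuming whitespace-separated fields with the score in the last column (the field layout is my assumption, not something Corion specified):

use strict;
use warnings;

my @top;    # [ score, line ] pairs, kept sorted descending by score
while ( my $line = <> ) {
    my $score = ( split ' ', $line )[-1];
    next unless defined $score;
    push @top, [ $score, $line ];              # briefly an 11th element
    @top = sort { $b->[0] <=> $a->[0] } @top;
    pop @top if @top > 10;                     # drop the lowest again
}
print $_->[1] for @top;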
Re: Limit the size of a hash
by BrowserUk (Patriarch) on Sep 05, 2013 at 09:37 UTC
There is an anomaly in what you are seeking to do.
An 8GB file of 12-character lines gives ~715 million lines.
With the selection value being a number between 0.00 and 0.99, there are only 100 possible values; which means that there could be an average of ~7 million lines for each value. Which 10 of the ~7 million 0.99-value lines do you want?
Re: Limit the size of a hash
by space_monk (Chaplain) on Sep 05, 2013 at 08:42 UTC
You could use a package such as Tie::Array::Sorted to maintain a sorted array of your 10 highest scores (mapped to the corresponding code); once the array grows beyond 10 elements, just remove the lowest (last) item as you go through the file.
The following needs a little refinement/debugging, but shows an outline of how I think it should work....
use strict;
use warnings;
use Tie::Array::Sorted;

# sort descending by score so the lowest scorer is always last
tie my @a, "Tie::Array::Sorted", sub { $_[1]->{score} <=> $_[0]->{score} };

# read file line by line
while (<FH>) {
    # use split if you hate regexes... :-)
    if ( /((\d ){4})(\d\.\d+)/ ) {
        # add new element to array ($3 is the score, $1 the code)
        push @a, { score => $3, code => $1 };
        # remove lowest scoring (last) element if more than 10
        pop @a if @a > 10;
    }
}
If you spot any bugs in my solutions, it's because I've deliberately left them in as an exercise for the reader! :-)
Re: Limit the size of a hash
by Random_Walk (Prior) on Sep 05, 2013 at 08:26 UTC
use strict;
use warnings;
use Data::Dumper;

my $keep = 5;
my @keeper;

# populate our store with the first $keep vals
for (1 .. $keep) {
    my @vals  = split /\s+/, <DATA>;
    my $score = pop @vals;
    push @keeper, [ $score, \@vals ];
}

# sort it so the smallest is at the front
@keeper = sort { $a->[0] <=> $b->[0] } @keeper;
print "$_->[0], " for @keeper;

while (<DATA>) {
    my @vals = split;
    next unless @vals;    # could do a more sophisticated test
    my $score = pop @vals;
    if ( $score > $keeper[0]->[0] ) {
        $keeper[0] = [ $score, \@vals ];
        @keeper = sort { $a->[0] <=> $b->[0] } @keeper;
        print "\nAdded $score: now have: ";
        print "$_->[0], " for @keeper;
    }
}
__DATA__
0 5 3 7 0.97
1 5 4 8 0.18
1 5 4 8 0.21
1 5 4 8 0.22
1 5 4 8 0.52
1 5 4 8 0.16
1 5 4 8 0.18
1 5 4 8 0.72
1 5 4 8 0.32
1 5 4 8 0.17
1 5 4 8 0.52
1 5 4 8 0.92
6 8 4 6 0.54
Cheers, R.
Pereant, qui ante nos nostra dixerunt!
Re: Limit the size of a hash
by Tux (Canon) on Sep 05, 2013 at 08:49 UTC
If you need to analyze the data more than once and speed is not the main issue, but size is, put the hash in a database. Use something like select … from hashtable order by … limit 10;. You can fill the database with e.g. Tie::Hash::DBD and have the best of both worlds.
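A minimal sketch of the database route, assuming DBI with DBD::SQLite; the file, table, and column names here are illustrative, not from the post:

use strict;
use warnings;
use DBI;

# hypothetical database file and table; adjust to taste
my $dbh = DBI->connect( "dbi:SQLite:dbname=scores.db", "", "",
                        { RaiseError => 1, AutoCommit => 0 } );
$dbh->do( "CREATE TABLE IF NOT EXISTS hashtable (code TEXT, score REAL)" );

my $ins = $dbh->prepare( "INSERT INTO hashtable (code, score) VALUES (?, ?)" );
while (<>) {
    my @fields = split;
    next unless @fields;
    my $score = pop @fields;
    $ins->execute( "@fields", $score );
}
$dbh->commit;

# let the database do the ordering and the limiting
my $top = $dbh->selectall_arrayref(
    "SELECT code, score FROM hashtable ORDER BY score DESC LIMIT 10" );
print "$_->[0]\t$_->[1]\n" for @$top;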
Enjoy, Have FUN! H.Merijn
Re: Limit the size of a hash
by hdb (Monsignor) on Sep 05, 2013 at 11:20 UTC
And here is my version, based on insertion sort, complete with random data generation and testing:
use strict;
use warnings;

sub nextrecord {
    return undef unless shift;
    # generate random records
    return { code  => join( " ", (int rand 10), (int rand 10), (int rand 10), (int rand 10) ),
             score => rand };
}

my $keep    = 10;
my $counter = 1000;

# initialize with the first 10 records sorted
my @largest = sort { $a->{score} <=> $b->{score} } map nextrecord( $counter-- ), 1..$keep;
my @test = @largest;    # for testing

while( my $next = nextrecord( $counter-- ) ){
    push @test, $next;    # keep all for testing
    my $pos = 0;          # insertion sort
    while( ($pos < $keep) && ($largest[$pos]->{score} < $next->{score}) ) {
        $largest[$pos-1] = $largest[$pos] if $pos;
        $pos++;
    }
    $largest[$pos-1] = $next if $pos;
}

print map { "$_->{code}\t$_->{score}\n" } @largest;

# test
print "\n";
print map { "$_->{code}\t$_->{score}\n" } ( sort { $a->{score} <=> $b->{score} } @test )[-10..-1];
use strict;
use warnings;
use Data::Dumper;

my $keep = 5;
my @keeper;
push @keeper, [0] for 1 .. $keep;

while (<DATA>) {
    my @vals = split;
    next unless @vals;    # possibly more sophisticated test
    my $score = pop @vals;
    addlist( [ $score, \@vals ], \@keeper );
}
print Dumper \@keeper;

# slide up the list knocking down existing records
# until we no longer have the biggest
sub addlist {
    my ($rec, $list) = @_;
    for my $i (0 .. $#$list) {
        if ($rec->[0] > $list->[$i]->[0]) {
            $list->[$i-1] = $list->[$i] if $i;
            $list->[$i] = $rec;
        }
        else {
            last;
        }
    }
}
__DATA__
0 5 3 7 0.97
1 3 4 8 0.18
1 4 4 8 0.21
1 7 4 4 0.22
1 8 4 8 0.52
1 9 4 1 0.16
1 5 1 8 0.18
1 5 2 9 0.72
1 5 5 8 0.32
1 5 6 6 0.17
1 5 7 4 0.52
1 5 9 3 0.92
1 5 4 8 0.99
6 8 4 6 0.54
Cheers, R.
Pereant, qui ante nos nostra dixerunt!
Re: Limit the size of a hash
by brx (Pilgrim) on Sep 05, 2013 at 09:38 UTC
Similar to Random_Walk's version (Re: Limit the size of a hash), but without sorting.
Here, we keep five elements.
When necessary, we recalculate the index of the lowest element (the "min of highest").
If the score of the new element is better than the "min of highest", the "min of highest" is replaced by the new element.
use strict;
use warnings;

my $keep = 5;

# init
my @highest = map scalar <DATA>, 1 .. $keep;   # first lines
my @scores  = map +(split)[-1], @highest;      # score is last element (-1) of each line
my $indexofmin = 0;
my $updated    = 1;

####
while (my $line = <DATA>) {
    my $score = (split / /, $line)[-1];
    if ($updated) {
        for my $i (0 .. $#scores) {
            $indexofmin = $i if $scores[$i] < $scores[$indexofmin];
        }
    }
    if ($score > $scores[$indexofmin]) {
        $highest[$indexofmin] = $line;
        $scores[$indexofmin]  = $score;
        $updated = 1;
    } else {
        $updated = 0;
    }
}
print @highest;
__DATA__
0 5 3 7 0.97
1 5 4 8 0.18
1 5 4 8 0.21
1 5 4 8 0.22
1 5 4 8 0.52
1 5 4 8 0.16
1 5 4 8 0.18
1 5 4 8 0.72
1 5 4 8 0.32
1 5 4 8 0.17
1 5 4 8 0.52
1 5 4 8 0.92
6 8 4 6 0.54
English is not my mother tongue.
My mother's flip-flops are "made in France".
Re: Limit the size of a hash
by BrowserUk (Patriarch) on Sep 05, 2013 at 10:40 UTC
Based on the assumption that you're using a hash because you want all the records containing each of the top 10 values, this uses two passes: one to extract the values and select the top 10, and a second to print the records containing those values:
#! perl -sw
use strict;
open FH, '<', $ARGV[ 0 ] or die $!;
## First pass; find all the values
my %vals; m[(\S+)$] and ++$vals{ $1 } while <FH>;
## Select the top 10 and put them in a hash for fast lookup.
my %top10; undef @top10{ ( sort{ $b <=> $a } keys %vals )[ 0 .. 9 ] };
## rewind for second pass
seek FH, 0, 0; ## rewind
## And print them if they fit the criteria
m[(\S+)$] and exists $top10{ $1 } and print while <FH>;
## close
close FH;
Re: Limit the size of a hash (beware of 'add one and (re)sort and discard' algorithms)
by BrowserUk (Patriarch) on Sep 05, 2013 at 14:25 UTC
If it is just a randomly selected 10 items with the highest value you are after, beware of 'add one and (re)sort and discard' algorithms.
O(N log N) sorting algorithms are pretty efficient for large sets of data, but (relatively) pretty lousy at small sets.
Eg. ln(100e6) = 18.42, a mere 0.000018% of N; but ln(10) = 2.3, which is 23% of N.
So, given your ~715e6 records, the 'add 1, re-sort and discard 1' algorithms do 715 million * O(ln 11) sorts, which is ~= O(1.7e9).
Conversely, if you add 1000 items, sort and discard 1000, you get 715 thousand * O(ln 1010) sorts, which is ~= O(4.9e6).
And (theoretically), if you add 1e6 items, sort, and discard 1e6, then you get 715 * O(ln 1000010) sorts, which is ~= O(9878).
Of course, these CompSci theoretical complexity estimates tend to ignore the cost of memory allocations, deallocations, etc. and thus never work out that way in reality, but they do give an indication of where to look for savings.
This compares Add 1 and Add many algorithms for various values of many, and adding somewhere between 100 and 1000 items at a time seems to work best for an entirely in-memory, cpu-bound process:
#! perl -slw
use strict;
use List::Util qw[ sum min ];
use Time::HiRes qw[ time ];

sub timeit(&) {
    my $soFar = sum( times() );
    my $start = time;
    $_[0]->();
    return sprintf "Wall:%6.3f secs; cpu:%6.3f secs",
        time() - $start, sum( times() ) - $soFar;
}

our $T //= 10e6;
our $N //= 10;
our $B //= 10;

my @data = map{ sprintf "%4.2f", rand() } 1 .. $T;

print 'Add 1 and sort; add 1 and sort ...: ', timeit {
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
    for( my $i = $N; $i < $T; ++$i ) {
        @top = ( sort{ $b <=> $a } @top, $data[ $i ] )[ 0 .. $N-1 ];
    }
    # print "@top";
};

print "Add $B and sort; add $B and sort ...: ", timeit{
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
    for( my $i = $N; $i < $T; $i += $B ) {
        @top = sort{ $b <=> $a } @top, @data[ $i .. min( $i+$B-1, $#data ) ];
        $#top = $N-1;    # keep only the top $N
    }
    # print "@top";
};
__END__
C:\test>1052498-3 -T=10e6 -B=1e1
Add 1 and sort; add 1 and sort ...:     Wall:89.400528 secs; cpu:89.047000 secs
Add 1e1 and sort; add 1e1 and sort ...: Wall:18.483192 secs; cpu:18.391000 secs

C:\test>1052498-3 -T=10e6 -B=1e2
Add 1 and sort; add 1 and sort ...:     Wall:87.353784 secs; cpu:86.438000 secs
Add 1e2 and sort; add 1e2 and sort ...: Wall: 9.385142 secs; cpu: 9.234000 secs

C:\test>1052498-3 -T=10e6 -B=2e2
Add 1 and sort; add 1 and sort ...:     Wall:87.380 secs; cpu:87.046 secs
Add 2e2 and sort; add 2e2 and sort ...: Wall: 9.135 secs; cpu: 9.156 secs

C:\test>1052498-3 -T=10e6 -B=1e3
Add 1 and sort; add 1 and sort ...:     Wall:87.436786 secs; cpu:86.625000 secs
Add 1e3 and sort; add 1e3 and sort ...: Wall: 9.377329 secs; cpu: 9.329000 secs

C:\test>1052498-3 -T=10e6 -B=1e4
Add 1 and sort; add 1 and sort ...:     Wall:87.435 secs; cpu:86.298 secs
Add 1e4 and sort; add 1e4 and sort ...: Wall:10.077 secs; cpu: 9.843 secs
Now, if someone decides to try that where the data is greater than memory, and so coming from an external file, they'll probably find that the IO-limited nature means that 1 or 1000 at a time makes little or no difference to the overall elapsed time.
But if you then look at the CPU used, the 1-at-a-time approach will be an order of magnitude, or more, higher. And cycles cost money. There is little purpose in burning extra cycles to achieve the same outcome in the same time.
Yes, but if you first compare the new item with the smallest of the 10 already stored, you will very rarely add it to the set of 10 and re-sort. This would be a "conditionally discard, add, and re-sort" algorithm.
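A minimal sketch of that guard, assuming @top is held sorted descending with the smallest score last (the variable names and field layout are my assumptions, not from the thread):

use strict;
use warnings;

my @top = map { [ 0, '' ] } 1 .. 10;    # seeded low; kept sorted descending
while ( my $line = <> ) {
    my $score = ( split ' ', $line )[-1];
    next unless defined $score;
    # only the rare record that beats the current minimum triggers a re-sort
    if ( $score > $top[-1][0] ) {
        $top[-1] = [ $score, $line ];                  # replace the smallest
        @top     = sort { $b->[0] <=> $a->[0] } @top;  # restore the order
    }
}
print $_->[1] for @top;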
Re: Limit the size of a hash (use an array)
by Anonymous Monk on Sep 05, 2013 at 07:01 UTC
1) use an array (say presized to 12k)
2) when you've read in 12k elements,
3) sort the array
4) and discard all but the highest 10 (splice)
Then put it in a hash or whatever you need to do.
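A minimal sketch of that outline, assuming whitespace-separated lines with the score last (the batch size and field layout are my assumptions):

use strict;
use warnings;

my $batch = 12_000;
my ( @buf, @top );

sub winnow {    # sort the batch plus the current leaders, keep the top 10
    @top = ( sort { ( split ' ', $b )[-1] <=> ( split ' ', $a )[-1] } @top, @buf )[ 0 .. 9 ];
    @top = grep defined, @top;    # in case fewer than 10 lines seen so far
    @buf = ();
}

while ( my $line = <> ) {
    push @buf, $line;
    winnow() if @buf >= $batch;
}
winnow() if @buf;    # flush the final partial batch
print @top;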