in reply to Re^2: Heap structure for lookup?
in thread Heap structure for lookup?
This means (binary) trees generally are out of the question, correct?
Looking in a different direction: is there any structure on your values that might be leveraged? Or are they more or less random 64-bit integers?
Re^4: Heap structure for lookup?
by BrowserUk (Patriarch) on May 27, 2015 at 09:28 UTC
> This means (binary) trees generally are out of the question, correct?

Certainly any CPAN modules that store generic Perl scalars are out. A custom implementation that stored just a U64 value, giving 24 * 150e6 ~ 3.5GB, would be possible; but it's the need for two 64-bit pointers per node that is the main problem, hence my mind drifting to heaps. They self-order as you insert, and avoid the pointers. But are any of the variations O(log N) searchable?

> Is there any structure on your values that might be leveraged? Or are they more or less random 64-bit integers?

No. They can take on any of the 2^64 values and are likely to be distributed right across the range.
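For concreteness, the ~3.5GB figure above follows directly from the node layout; a minimal sketch (the struct is illustrative, not taken from any actual module):

```c
#include <stdint.h>
#include <stdio.h>

/* A pointer-based binary tree node for a bare U64 key: 8 bytes of value
   plus two 8-byte child pointers = 24 bytes per node on a 64-bit platform. */
struct node {
    uint64_t     value;
    struct node *left, *right;
};

int main(void) {
    const size_t n = 150000000;   /* 150e6 values */
    printf("node size: %zu bytes\n", sizeof(struct node));               /* 24 */
    printf("total: %.2f GB\n", (double)(n * sizeof(struct node)) / 1e9); /* 3.60 */
    return 0;
}
```

24 bytes * 150e6 is 3.6e9 bytes, about 3.35 GiB, in line with the rough 3.5GB quoted; two thirds of that is pointer overhead, which is exactly what a pointer-free structure like a heap avoids.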
by hdb (Monsignor) on May 27, 2015 at 10:44 UTC
If you know the number of values upfront (or an upper bound on the number), and if the values are roughly equally distributed across the full range (with some clustering, of course), you could pre-allocate an array and then compute a guess at where a particular value belongs (the value divided by 2**64, or some algorithm based on the binary representation of that value), then do some local sorting. If you made the array somewhat larger than needed, you would have fewer clashes.
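A minimal C sketch of this idea (names are illustrative, and the "local sorting" step is simplified here to forward probing): the initial slot is the value's position within the full 2^64 range, scaled to the table size.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Pre-allocate a table somewhat larger than the expected count; 0 doubles
   as the empty marker, so a real value of 0 would need special-casing. */
static uint64_t *table;
static size_t    nslots;

/* Initial guess: (value / 2^64) * nslots, done in integer arithmetic
   via a 128-bit multiply (a GCC/Clang extension). */
static size_t guess_slot(uint64_t v) {
    return (size_t)(((__uint128_t)v * nslots) >> 64);
}

static void insert(uint64_t v) {
    size_t i = guess_slot(v);
    while (table[i] != 0 && table[i] != v)  /* clash: probe forward */
        i = (i + 1) % nslots;
    table[i] = v;
}

int main(void) {
    nslots = 16;                            /* tiny table just for the demo */
    table  = calloc(nslots, sizeof *table);
    insert(0x8000000000000000ULL);          /* mid-range value -> slot 8 */
    insert(0x8000000000000001ULL);          /* clashes, lands in slot 9 */
    for (size_t i = 0; i < nslots; i++)
        if (table[i]) printf("slot %zu: %llx\n", i, (unsigned long long)table[i]);
    free(table);
    return 0;
}
```

Because nearby values map to nearby slots, this keeps keys in roughly sorted order, at the cost of clustering when the input isn't uniform.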
by BrowserUk (Patriarch) on May 27, 2015 at 11:42 UTC
What I've come up with is a very simple & compact hash implementation.
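The posted code isn't reproduced in the thread text above; the following is a minimal sketch of one way such a scheme could look. The hash function, probe strategy, empty-slot marker, and exact table size are all assumptions, not BrowserUk's actual choices.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* SLOTS: "just over 200e6" entries of 8 bytes each, about 1.6GB total,
   with no per-element pointers. EMPTY assumes the value 0 never occurs
   in the real data; a separate "zero present" flag would fix that. */
#define SLOTS 201000000ull
#define EMPTY 0ull

static uint64_t *tbl;

/* Any good 64-bit mixer works as the hash; this is the splitmix64 finalizer. */
static uint64_t mix(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static void insert(uint64_t v) {
    size_t i = (size_t)(mix(v) % SLOTS);
    while (tbl[i] != EMPTY && tbl[i] != v)   /* linear probing on collision */
        i = (i + 1) % SLOTS;
    tbl[i] = v;
}

static int lookup(uint64_t v) {
    size_t i = (size_t)(mix(v) % SLOTS);
    while (tbl[i] != EMPTY) {
        if (tbl[i] == v) return 1;
        i = (i + 1) % SLOTS;
    }
    return 0;                                /* empty slot ends the probe run */
}

int main(void) {
    tbl = calloc(SLOTS, sizeof *tbl);        /* all slots start EMPTY */
    if (!tbl) return 1;
    insert(0xDEADBEEFCAFEF00DULL);
    printf("%d %d\n", lookup(0xDEADBEEFCAFEF00DULL), lookup(42));  /* 1 0 */
    free(tbl);
    return 0;
}
```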
The insertion rate for 150e6 values (into a fixed-size array of just over 200e6 slots) is less than 1/4 microsecond per insert, or about 37 seconds for them all, which is fine. If I push the number of inserts to 190e6 you can visibly see the insertion rate slow down, watching the trace of every 1e6th insertion; but even then it stays below 1/2 a microsecond per insert.

The lookups, currently tested manually by feeding some of the traced values back in to check they are found, plus a few off-the-top-of-my-head numbers to check they aren't, take around 20 microseconds each, including the perl->XS->perl transitions and conversions. That is also very acceptable.
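The slowdown as the table fills is what the standard open-addressing analysis predicts. Assuming linear probing at load factor α (the actual probe strategy isn't shown in the thread), Knuth's classic estimates for the expected number of probes are:

```latex
C_{\text{hit}} \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right),
\qquad
C_{\text{miss}} \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)
```

At 150e6 values in ~200e6 slots, α ≈ 0.75 and an insert (which costs a miss-style probe run) touches about 8.5 slots on average; at 190e6, α ≈ 0.95 and that grows to roughly 200 probes, which is why the slowdown becomes visible even though the quoted per-insert times, being averages over the whole run, move less dramatically.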
Altogether, I'm really surprised that such a simple hash implementation should work quite so well, even with only 25% down to 10% headroom in the array.