in reply to Array vs. Hash for sparsely integer-indexed data
1. What (if any) storage inefficiency (relative to hashes) is created in perl when using arrays for sparsely indexed data?
It all depends on 'how sparse'":
$a[ $_*2 ] = $_ for 1 .. 1e5;; $h{ $_*2 } = $_ for 1 .. 1e5;; print total_size( $_ ) for \( @a, %h );; 4497312 7393098 undef( @a ); undef( %h );; $a[ $_*20 ] = $_ for 1 .. 1e5;; $h{ $_*20 } = $_ for 1 .. 1e5;; print total_size( $_ ) for \( @a, %h );; 19177376 7493098 undef( @a ); undef( %h );; $a[ $_*50 ] = $_ for 1 .. 1e5;; $h{ $_*50 } = $_ for 1 .. 1e5;; print total_size( $_ ) for \( @a, %h );; 69509024 7526431
The hash size remains static for a given number of entries, regardless of how sparse they are.
At 1 in 2, the array uses a little over half the space of the hash; but by 1 in 20, almost 3 times as much and by 1 in 50 nearly 10 times as much.
And remember, if your integer range starts at 1 billion; the array would require ~12GB (of basically wasted space), to hold the lowest value. Whilst you could wrap that over in an api to subtract 1 billion from each index, the additional subroutine call overhead would completely negate the array's lookup speed advantage; and then some.
2. What (if any) lookup inefficiency (relative to arrays) is created in perl when using hashes for positive integer indexed keys?
Test:
$a[ $_ * 20 ] = $_ for 1 .. 1e5;; $found = 0; say time; defined( $a[ $_ ] ) and ++$found for 1 .. 20e5; +say time; say $found;; 1359141040.60138 1359141041.00958 100000 $h{ $_ * 20 } = $_ for 1 .. 1e5;; $found = 0; say time; exists( $h{ $_ } ) and ++$found for 1 .. 20e5; s +ay time; say $found;; 1359141122.04252 1359141123.06596 100000
The array takes 0.4 seconds to do 2 million lookups of which 5% are found and 95% not.
The hash takes 1.02 seconds to do the same lookups.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: Array vs. Hash for sparsely integer-indexed data (bit vectors)
by BrowserUk (Patriarch) on Jan 25, 2013 at 19:21 UTC |