Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Bidirectional lookup algorithm? (Updated: further info.)

by Eily (Monsignor)
on Jan 07, 2015 at 19:28 UTC ( [id://1112543]=note: print w/replies, xml ) Need Help??


in reply to Bidirectional lookup algorithm? (Updated: further info.)

I don't know if you have found what you were looking for or not, but here is another way to win a little space: hashes use the stringified value of keys, so most 64 bits ints will be written with 18 bytes or more*. But if you pack them before, it will only use 8 bytes.

$i=1; $H{ 2**52 + $i++ } = $_, $G{ pack "Q", 2**52 + $i++ } = $_ for " +aaaa".."zzzz"; say total_size(\%H); say total_size(\%G) __DATA__ 63601240 59945432
This shoudln't change much for values, though it would avoid the scalars from having both a string and int value.

On my computer:

perl -E '$a = 2**53+1; $b = 2**53+2; say "Hello" if $a != $b and $a eq + $b; say "End"' Hello End
perl -Mbignum -E '$a = 2**53+1; $b = 2**53+2; say "Hello" if $a != $b +and $a eq $b; say "End"' End
(numbers past a certain point are stringified to the same thing)

Think: typical variable names!
if that's \w+, that's 63 possible values per byte. So there probably is a possibility of compression here as well, even maybe huffman coding. But if most of those strings are only up to 8 bytes long, I'm not sure there's going to be an actual improvement (though in this case, this will save space both in the keys and the values).

* Numbers below 2^27 may have less than 4 digits in base 10, but that's one number every 2^37. And numbers above 2^30 are all shorters once packed than stringified, and there's 2^34 more numbers above that point than below.

Replies are listed 'Best First'.
Re^2: Bidirectional lookup algorithm? (Updated: further info.)
by graff (Chancellor) on Jan 07, 2015 at 23:06 UTC
    Given that the numbers can range from 1 to 20 (decimal) digits, a simple pack will always make an 8-byte string, but if values less than 10_000_000 happen to predominate (the OP suggests there's an upper bound of 10M elements to be tracked in the application), the result will be a net increase of memory usage.

    Apart from that, there's a potential risk that some 64-bit int values, when packed into an 8-byte string, might collide with actual strings encountered in the data. There's a better way to pack integers into smaller strings.

    The following "encode/decode" functions will always return a string of characters in the range "\x80" - "\xff", and the number of characters will be roughly half the number of decimal digits. (I'm assuming that BrowserUK's strings will always be in the ASCII range, so I'm using non-ASCII characters that happen to be stored internally as single bytes in perl):

    sub int2str { my $int = shift; my $str = ''; while ( $int ) { $str .= chr(( $int % 128 ) + 0x80 ); $int = int( $int / 128 ); } $str; } sub str2int { my $mult = 1; my $int = 0; for ( split( //, shift )) { $int += ( ord() - 0x80 ) * $mult; $mult *= 128; } $int; }
    (There might be more efficient ways to implement that.)

    Even so, the overall space saved seems a bit disappointing - on my (macports, 5.12) version of perl, encoding the numbers like that over the ~914K entries in BrowserUK's single-hash example (strings 'aaaa' .. 'zzzz', numbers incrementing from 1) yielded just 4.1% reduction in the total_size of the hash. (If I use random integers ranging up to 15 digits, it edges closer to 6% reduction.)

Re^2: Bidirectional lookup algorithm? (Updated: further info.)
by BrowserUk (Patriarch) on Jan 07, 2015 at 20:30 UTC
    (numbers past a certain point are stringified to the same thing)

    I have 64-bit ints here, so that isn't a problem for me.

    I did consider packing the keys, but the space saved is only around 15% to 20%; but the performance penalty is considerable. Thanks.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Bidirectional lookup algorithm? (Updated: further info.)
by oiskuu (Hermit) on Jan 09, 2015 at 05:05 UTC

    Exponentiation is implemented via pow internally, so the result is forced to floating point. Better use the shift operator: $a = (1<<53)+1; ...

    Set size of 35e6 has been mentioned, and the desire to push it further. Elements are unique. In decimal notation, certainly more than half of the values are going to be at least eight digits.

    Similar concerns hold for the symbols. 35e6 is just about 25 bits, but there would be no problem if the coding weren't sparse to begin with, so let's add a few bits for good measure. Base64 is 6 bits per char, [a-z] is 4.7 bits. In those cases, most of the symbols are going to be no less than five or six characters, respectively.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1112543]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (4)
As of 2024-03-29 01:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found