comment on

Given that the numbers can range from 1 to 20 (decimal) digits, a simple pack will always make an 8-byte string, but if values less than 10_000_000 happen to predominate (the OP suggests there's an upper bound of 10M elements to be tracked in the application), the result will be a net increase of memory usage.

Apart from that, there's a potential risk that some 64-bit int values, when packed into an 8-byte string, might collide with actual strings encountered in the data. There's a better way to pack integers into smaller strings.

The following "encode/decode" functions will always return a string of characters in the range "\x80" - "\xff", and the number of characters will be roughly half the number of decimal digits. (I'm assuming that BrowserUK's strings will always be in the ASCII range, so I'm using non-ASCII characters that happen to be stored internally as single bytes in perl):

sub int2str {
    my $int = shift;
    my $str = '';
    while ( $int ) {
        $str .= chr(( $int % 128 ) + 0x80 );
        $int = int( $int / 128 );
    }
    $str;
}

sub str2int {
    my $mult = 1;
    my $int = 0;
    for ( split( //, shift )) {
        $int += ( ord() - 0x80 ) * $mult;
        $mult *= 128;
    }
    $int;
}
[download]

(There might be more efficient ways to implement that.)

Even so, the overall space saved seems a bit disappointing - on my (macports, 5.12) version of perl, encoding the numbers like that over the ~914K entries in BrowserUK's single-hash example (strings 'aaaa' .. 'zzzz', numbers incrementing from 1) yielded just 4.1% reduction in the total_size of the hash. (If I use random integers ranging up to 15 digits, it edges closer to 6% reduction.)

In reply to Re^2: Bidirectional lookup algorithm? (Updated: further info.) by graff
in thread Bidirectional lookup algorithm? (Updated: further info.) by BrowserUk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


good chemistry is complicated, and a little bit messy -LW
	PerlMonks