Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

A way to pack string into another light weight structure

by Gangabass (Vicar)
on Dec 29, 2011 at 08:09 UTC ( [id://945460]=perlquestion: print w/replies, xml ) Need Help??

Gangabass has asked for the wisdom of the Perl Monks concerning the following question:

Hi, i need a way to pack/zip a string (10 digit number) into another structure which will takes less memory. My goal is to randomize a big array of numbers (112M of records) but i can't load it into memory right now (only 30M of records). So i think about representing each digit via 4 bits that way i can reduce the size of data structure two times. But i need more compression. So i need some hints. Thanks. Roman Update: trying pack('s', $number) way now...
  • Comment on A way to pack string into another light weight structure

Replies are listed 'Best First'.
Re: A way to pack string into another light weight structure
by BrowserUk (Patriarch) on Dec 29, 2011 at 10:44 UTC
    trying pack('s', $number) way now...

    That won't work. pack template 's' is signed shorts which only handles numbers from -32768 through +32767.

    If your 10 digit numbers can go all the way up to 9,999,999,999, then the smallest pack template you could use is 'Q' -- if your perl supports 64-bit integers -- and that will produce 8-byte values which would only save you 20%.

    Besides which, if you are going to store your packed numbers individually in an array, you won't save much memory as the size of the element itself is insignificant compared to the internals structures that hold them.

    But, if instead, you stored the packed integers into a single string and then shuffled the string, you avoid teh storage overhead of the array.

    If your largest number is less than 4,294,967,296 then you could get away with 'V' (or 'N') which is 4-bytes that would reduce the memory requirement to around 430MB which is doable.

    You could then shuffle that using:

    #! perl -slw use strict; use Time::HiRes qw[ time ]; my $t = time; my $s = chr(0); $s x= 112e6 * 4; substr( $s, $_ * 4, 4 ) = pack 'V', scalar <> for 0 .. 112e6-1; my $n = length( $s ) / 4; for( 0 ..$n ) { my $p = $_ + rand( $n - $_ ); my $x = substr( $s, $_*4, 4 ); substr( $s, $_*4,4 ) = substr( $s, $p*4,4 ); substr( $s, $p*4, 4 ) = $x; } print unpack 'V', substr( $s, $_ * 4, 4 ) for 0 .. $n; print STDERR time - $t;

    It'll take about 10 minutes to shuffle your 112 million numbers, which should be quick enough if this is a one off task.

    Perhaps, if your need for true randomness isn't too crucial, then the simplest way to randomise your list would be to sort it using an external sort utility using an offset other than the first digit. That would give you 9 'random' possibilities.

    If that's not sufficient you could pick two offset/lengths:sort -k 8,9 -k 3,4 in > out which gives you many possibilities. Especially if you realise that your keys can be of variable length and even overlap.

    The downside is that unless your sort is substantially quicker than mine, it'll take well over half an hour. But again that might be okay if it is an infrequent requirement.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: A way to pack string into another light weight structure
by wrog (Friar) on Dec 29, 2011 at 10:12 UTC
    pack 'L',$number but this will fail if you have any numbers bigger than 4294967295 (2**32-1), in which case, barring any knowledge of large, unused ranges, 5 bytes is the best you're going to do, and you may as well be doing pack 'H10',$number

    However, instead of trying to compress things, you might be better off using an off-line algorithm that doesn't require everything being in memory at once, e.g.,

    1. open 16 new files for writing
    2. go through the array sequentially, copy each record to a randomly chosen file of the 16.
    3. close all of the files
    4. randomize each file (~7M) separately
    5. concatenate
Re: A way to pack string into another light weight structure
by choroba (Cardinal) on Dec 29, 2011 at 10:52 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://945460]
Approved by davido
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (6)
As of 2024-04-23 16:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found