Thanks all for the replies. What I want to do is put all the compact strings in a hash, so that it uses less memory than the 'normal' strings would. I have about 2 million strings of around 50 characters each. I want to store these, converted to a compact string format. Once they are all read, I want to loop through each hash key and convert it back to the original string format to compare it with another string.

In short:
- make the string compact and put it in a hash
- get a key of the hash, and convert it back to the original string

To JavaFan: with vec(), do you mean something like the code below?

    my $input_string = "ACGTCAGA";
    $input_string =~ tr/ACGT/0-3/;
    my $packed_string = "";
    my $packing_index = 0;
    foreach (split //, $input_string) {
        vec( $packed_string, $packing_index++, 2 ) = $_;
    }

Do you know how I can then convert $packed_string back to a normal readable text string, i.e. back to the original "ACGTCAGA"?

Re^3: string to more compact format
by BrowserUk (Patriarch) on May 17, 2010 at 11:40 UTC

    Packing the keys will make barely any difference to the storage requirements of your hash, as shown below. In the first case, a hash containing 2e6 x 50-byte keys takes 70MB; in the second, 2e6 x 12-byte keys take 67MB:

    $hash{ sprintf "%050d", $_ } = 1 for 0 .. 2e6;;
    print total_size \%hash;;
    70183684

    $hash{ sprintf "%012d", $_ } = 1 for 0 .. 2e6;;
    print total_size \%hash;;
    67660852

    If all you want to do is store the strings so you can iterate over them, storing the uncompressed strings in an array will save far more space (30MB versus 67/70MB):

    $a[ $_ ] = sprintf "%050d", $_ for 0 .. 2e6;;
    print total_size \@a;;
    30408448

    You also save the cost of compressing and decompressing.

    There are also other ways of storing and iterating your 2e6 strings that take even less memory but are just as easy and efficient to iterate.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hmm, strange that a hash with 50-byte keys is only a fraction larger than one with 12-byte keys. Why is that? I can't store my strings in an array, since some strings occur more than once and I need to keep track of that.
        Hmm strange that storing a 50 byte key hash is just a fraction larger than a 12-byte key hash. Why is that?

        Because for each key/value pair, there are 40 bytes (32-bit, more on 64-bit) of overhead in addition to the key and value data. See Hash structure illustration. So for short keys, most of the space used by a hash is in the internal construction, not the keys & values themselves.

        I can't store my strings into an array. Since sometimes strings occur more than once and i should keep track of that.

        Then use the hash, but don't bother with the compression because you won't gain anything from it. 70MB isn't such a lot these days.



        I can't store my strings into an array. Since sometimes strings occur more than once...

        I'm curious: if the strings can occur "more than once", then how do you use the hash? Would you keep track of the number of occurrences in the value of the hash?

        ack Albuquerque, NM
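For what it's worth, keeping the count in the hash value is the usual Perl idiom for this. A minimal sketch, assuming the strings are already in a list (`@strings` and its contents are made up here for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample input; the real data would be ~2 million strings.
my @strings = qw(ACGT TTGA ACGT CCCC ACGT);

# Use each string as a key and count how often it is seen in the value.
my %count;
$count{$_}++ for @strings;

# Iterate over the distinct strings together with their counts.
for my $key ( sort keys %count ) {
    print "$key occurs $count{$key} time(s)\n";
}
```

With uncompressed keys like this, the comparison against another string can be done on the key directly, with no decoding step.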
Re^3: string to more compact format
by JavaFan (Canon) on May 17, 2010 at 11:39 UTC
    To convert back, just read the packed_string 2 bits at a time, and translate them back. (0 -> "A", 1 -> "C", 2 -> "G", 3 -> "T").
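The round trip described above might look roughly like the sketch below. The helper names `pack_dna` and `unpack_dna` are invented for illustration, and it assumes the input contains only A, C, G and T. One caveat: vec() pads the packed string to whole bytes, so the original length has to be kept somewhere (passed in separately here), otherwise a string whose length is not a multiple of 4 would come back with extra trailing "A"s.

```perl
use strict;
use warnings;

# Pack a DNA string into 2 bits per base with vec().
# Assumes the input consists only of the letters A, C, G and T.
sub pack_dna {
    my ($dna) = @_;
    ( my $digits = $dna ) =~ tr/ACGT/0-3/;
    my $packed = '';
    my $index  = 0;
    vec( $packed, $index++, 2 ) = $_ for split //, $digits;
    return $packed;
}

# Reverse the packing: read 2 bits at a time and map 0..3 back to A, C, G, T.
# $length is the number of bases in the original string.
sub unpack_dna {
    my ( $packed, $length ) = @_;
    my $digits = join '', map { vec( $packed, $_, 2 ) } 0 .. $length - 1;
    ( my $dna = $digits ) =~ tr/0-3/ACGT/;
    return $dna;
}

my $original = 'ACGTCAGA';
my $packed   = pack_dna($original);
my $restored = unpack_dna( $packed, length $original );

print length($packed), " bytes packed\n";    # 2 bytes packed
print "$restored\n";                         # ACGTCAGA
```

If the packed strings are used as hash keys, the length could for instance be stored in the hash value or prepended to the key as a single byte; which of these fits best depends on how the hash is used.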