in reply to string to more compact format

You could make a pass over the string, and use vec() to write two bits for every character encountered. This should give you an instant 75% savings.

There may be better compression possible, but that depends on how much repetition of substrings there is. Assuming the string isn't of some trivial length, you could see how much bzip2 or gzip (or your favourite compression program) compresses it (I would do this after applying the 75% savings trick above).
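A quick way to try this from Perl is sketched below. It assumes the core module IO::Compress::Gzip is available; the helper name gzip_ratio and the sample data are made up for illustration. Note that gzip adds a fixed header and trailer of roughly 18 bytes, so a single 50-character string may well come out larger than it went in.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Return compressed size divided by original size for a string.
# (gzip_ratio is a hypothetical helper, not part of any module.)
sub gzip_ratio {
    my ($data) = @_;
    my $compressed;
    gzip( \$data => \$compressed ) or die "gzip failed: $GzipError";
    return length($compressed) / length($data);
}

# Hypothetical sample data: a highly repetitive sequence.
my $sample = "ACGTCAGA" x 1000;
printf "compressed to %.1f%% of original size\n", 100 * gzip_ratio($sample);
```

Running the same measurement on a representative chunk of the real data should show whether there is enough repetition to make compression worthwhile.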

Re^2: string to more compact format
by Boetsie (Sexton) on May 17, 2010 at 11:15 UTC
    Thanks all for the replies. What I want to do is put all the compact strings in a hash, so that it uses less memory than the 'normal' strings. I have about 2 million strings of around 50 characters each. I want to store these, converted to a compact string format. Once they are all read, I want to loop through each hash key and convert it back to the original string format to do comparisons with another string. In short:

    - make the string compact and put it in a hash
    - get a key of the hash, and convert it back to the original string

    To JavaFan: with vec(), do you mean something like the code below?

        my $input_string = "ACGTCAGA";
        $input_string =~ tr/ACGT/0-3/;
        my $packed_string = "";
        my $packing_index = 0;
        foreach ( split //, $input_string ) {
            vec( $packed_string, $packing_index++, 2 ) = $_;
        }

    Do you know how I can then convert $packed_string back to a normal readable text string? So back to the original "ACGTCAGA".

      Packing the keys will make barely any difference to the storage requirements of your hash, as shown below. In the first case, a hash containing 2e6 x 50-byte keys takes 70MB. In the second case, 2e6 x 12-byte keys take 67MB:

      $hash{ sprintf "%050d", $_ } = 1 for 0 .. 2e6;;
      print total_size \%hash;;
      70183684

      $hash{ sprintf "%012d", $_ } = 1 for 0 .. 2e6;;
      print total_size \%hash;;
      67660852

      If all you want to do is store the strings so you can iterate over them, storing the uncompressed strings in an array will save far more space (30MB versus 67/70MB):

      $a[ $_ ] = sprintf "%050d", $_ for 0 .. 2e6;;
      print total_size \@a;;
      30408448

      You also save the cost of compressing and decompressing.

      There are also other ways of storing and iterating your 2e6 strings that take even less memory but are just as easy and efficient to iterate.
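
      The reply does not say which alternatives it has in mind. One possibility, offered here purely as an assumption, is to append every string to a single scalar with a newline terminator and walk it with a /g match. The helper name each_string and the stand-in data are hypothetical.

```perl
use strict;
use warnings;

# Call $callback once per newline-terminated string in the buffer.
# A /g match in scalar context resumes where the previous one ended,
# so no intermediate list of strings is built.
sub each_string {
    my ( $buffer_ref, $callback ) = @_;
    while ( $$buffer_ref =~ /([^\n]*)\n/g ) {
        my $string = $1;
        $callback->($string);
    }
    return;
}

# Hypothetical stand-in data: 1000 zero-padded 50-character strings,
# appended one at a time as they would be while reading a file.
my $buffer = '';
$buffer .= sprintf( "%050d\n", $_ ) for 0 .. 999;

my $count = 0;
each_string( \$buffer, sub { $count++ } );
printf "%d strings in %d bytes\n", $count, length $buffer;
# prints "1000 strings in 51000 bytes"
```

      The per-string cost here is one extra byte for the terminator, with no per-element scalar overhead. The trade-off is that there is no random access and no lookup by value.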


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Hmm, strange that a hash with 50-byte keys is only a fraction larger than one with 12-byte keys. Why is that? I can't store my strings in an array, since strings sometimes occur more than once and I need to keep track of that.
      To convert back, just read the packed_string 2 bits at a time, and translate them back. (0 -> "A", 1 -> "C", 2 -> "G", 3 -> "T").
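
      A minimal round-trip sketch of that idea follows. The function names pack_bases and unpack_bases are made up for illustration, and the code assumes the input contains only A, C, G and T. One caveat: the packed string does not record how many bases it holds, and since "A" is encoded as 0 it is indistinguishable from the zero padding in the last byte, so the original length has to be kept alongside the packed data.

```perl
use strict;
use warnings;

# Map each base to a 2-bit code and back (same order as in the thread).
my %code_for = ( A => 0, C => 1, G => 2, T => 3 );
my @base_for = qw(A C G T);

# Pack a string of A/C/G/T into 2 bits per base using vec().
sub pack_bases {
    my ($bases) = @_;
    my $packed = '';
    my $i      = 0;
    vec( $packed, $i++, 2 ) = $code_for{$_} for split //, $bases;
    return $packed;
}

# Unpack $length bases by reading the packed string 2 bits at a time.
sub unpack_bases {
    my ( $packed, $length ) = @_;
    return join '', map { $base_for[ vec( $packed, $_, 2 ) ] } 0 .. $length - 1;
}

my $original = "ACGTCAGA";
my $packed   = pack_bases($original);
my $restored = unpack_bases( $packed, length $original );

printf "%d bytes packed, restored: %s\n", length $packed, $restored;
# prints "2 bytes packed, restored: ACGTCAGA"
```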