Boetsie has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, i have alot of DNA strings with only characters of A, C, G or T. This string is now 32 bits, while I only need two bits for all the characters; A = 00 C = 01 G = 10 T = 11 How can i convert, for example, 'AGTCACA' to a more compact string with less bits? Basically, I want to store this string to a hash and be able to compare them, and see if substrings are availabe. If so, it needs to be printed. Cheers, Marten

Replies are listed 'Best First'.
Re: string to more compact format
by moritz (Cardinal) on May 17, 2010 at 08:19 UTC
    How can i convert, for example, 'AGTCACA' to a more compact string with less bits?

    Build a hash of all 256 4-tuples of (A, G, T, C), mapping each tuple to one byte. And then iterate over the string, four characters at a time, and use the hash for lookup.

    Basically, I want to store these string to a hash and be able to compare them, and see if substrings are availabe.

    The compression that you described makes that much harder, unless all substrings are then aligned to byte boundaries (ie blocks of four in the original string).

    Perl 6 - links to (nearly) everything that is Perl 6.
      Could you explain what you are trying to achieve ? For example do you need to be able to match parts of the DNA sequence ? can these partial sequences be offset (eg a sequence 53 bytes into 1 string matching another 27 bytes into another) ?
Re: string to more compact format
by JavaFan (Canon) on May 17, 2010 at 09:30 UTC
    You could make a pass over the string, and use vec() to write two bits for every character encountered. This should give you an instant 75% savings.

    There may be better compression possible. But that will depend on how much repetition there is of substring. Assuming the string isn't some trivial length, you could try to see how much bzip2 or gzip (or your favourite compression program) compresses the string (I would do this after doing the 75% savings trick from above).

      Thanks all for the replies. What i want to do is put all compact strings in a hash, so it uses less memory than the 'normal' strings. I have like 2 million of around 50 character strings. I thus want to store these, and convert them to a compact string format. Once all read, i want to loop through each hash key and convert it back to original string format to do comparisons with another string. In short; -make string compact and put in hash -get key of hash, and convert it to original string. To JavaFan. With vec(), do you mean something like the code below?; my $input_string = "ACGTCAGA"; $input_string =~ tr ACGT 0-3; my $packed_string=""; my $packing_index=0; foreach (split //, $input_string){ vec( $packed_string, $packing_index++, 2 ) = $_; } do you know how i can then convert the $packed_string back to normal readable textstring? So back to original "ACGTCAGA".

        Packing the keys will make barely any difference to the storage requirements of your hash as shown below. In the first case, a hash containing 2e6 x 50-byte keys takes 70MB. In the second case, 2e6 x 12-byte keys 67MB:

        $hash{ sprintf "%050d", $_ } = 1 for 0 .. 2e6;; print total_size \%hash;; 70183684 $hash{ sprintf "%012d", $_ } = 1 for 0 .. 2e6;; print total_size \%hash;; 67660852

        If all you want to do is store the strings so you can iterate over them, storing the uncompressed strings in an array will save far more space (30MB versus 67/70MB):

        $a[ $_ ] = sprintf "%050d", $_ for 0 .. 2e6;; print total_size \@a;; 30408448

        You also save the cost of compressing and decompressing.

        There are also other ways of storing and iterating your 2e6 strings that take even less memory but are just as easy and efficient to iterate.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        To convert back, just read the packed_string 2 bits at a time, and translate them back. (0 -> "A", 1 -> "C", 2 -> "G", 3 -> "T").
Re: string to more compact format
by Utilitarian (Vicar) on May 17, 2010 at 09:57 UTC
    Wouldn't your method mean that the substring would only be found from the first 4-char border the match crossed?

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."