Expanding on what I suggested:

  1. strings which compress to the same number of bytes can be compared directly.

    As per your example:

      ACGT       1b 00
      ACGTA      1b 01
      ACGTAA     1b 02
      ACGTAAA    1b 03
    

  2. strings which compress to different numbers of bytes can be compared directly up to the last byte of the shorter. If that doesn't settle the matter, then you need to use the last bits of the shorter to select a mask...

    Taking your example:

      ACGTAAA    1b 03
      ACGTAAAA   1b 00 00
    
    the strings match up to (but, for the avoidance of doubt, not including) the last byte of the shorter, so the '3' selects the mask 0xFC, under that mask the last corresponding bytes are equal, so the longer string is the larger. Similarly:
      ACGT TGCA AGA    1b F4 23
      ACGT TGCA AGAC   1b F4 21 00
    
    Actually, it's not strictly necessary to apply the mask to both strings, or indeed to use any mask other than 0xFC -- but it seemed tidier.

There is some futzing about to be done to compare the compressed forms, though the decision is likely to be made before the last byte of the shorter.

Of course, if the comparison is only for searching for equality, then it doesn't really matter where you put the length mod 4 (except that if there's little variation in length, putting it at the end keeps it out of the way). Alphabetic order does appeal to a sense of neatness, though.

As far as implementation of packing/unpacking goes, placing the length mod 4 at the end means that all the fiddling about happens at the end, rather than at both ends. I have modified the code I posted previously to do this and included that below. I look forward to improvements :-)

Mind you, if performance is the issue, I'd cast this into C.

# ACGT String compression and decompression # # We compress to fixed length string, 2 bits per symbol. # # In order to recover the length of the string the ls bits of the last + byte contain the # length of the string mod 4 -- this adds a complete zero byte if nece +ssary. my %compress ; # string of up to 4 characters to compressed byte my @expand ; # compressed byte to string (except last byte) my @expand_l ; # compressed last byte to string BEGIN { my @ACGT = qw(A C G T) ; my $s = '' ; my $v = 0 ; foreach my $a (@ACGT) { $s = $a ; $compress{$s."\n"} = $compress{$s} = chr($v+1) ; foreach my $b (@ACGT) { $s = $a.$b ; $compress{$s."\n"} = $compress{$s} = chr($v+2) ; foreach my $c (@ACGT) { $s = $a.$b.$c ; $compress{$s."\n"} = $compress{$s} = chr($v+3) + ; foreach my $d (@ACGT) { $s = $a.$b.$c.$d ; $compress{$s} = chr($v) ; $expand[$v++] = $s ; } ; } ; } ; } ; @expand_l = map substr($expand[$_], 0, ($_ & 3)), (0..255) ; $compress{"\n"} = "\0" ; } ; # Compress ACGT string (all upper case) with or without trailing "\n" # # $compressed = acgt_compress(ACGT_STRING) sub acgt_compress { my $c = '' ; $c .= $compress{$_} for unpack('(a4)*', $_[0]) ; return (length($_[0]) & 3) || (substr($_[0], -1) eq "\n") ? $c : $c +. "\0" ; } ; # Decompress compressed ACGT # # $acgt_string = acgt_expand(COMPRESSED) sub acgt_expand { my $s = '' ; $s .= $expand[$_] for unpack('C*', substr($_[0], 0, -1)) ; return $s . $expand_l[ord(substr($_[0], -1))] ; } ;

In reply to Re^4: Unpack Fail to Decompress String? by gone2015
in thread Unpack Fail to Decompress String? by neversaint

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.