Re^4: Unpack Fail to Decompress String?

Expanding on what I suggested:

strings which compress to the same number of bytes can be compared directly.

As per your example:
```
  ACGT       1b 00
  ACGTA      1b 01
  ACGTAA     1b 02
  ACGTAAA    1b 03
```
strings which compress to different numbers of bytes can be compared directly up to the last byte of the shorter. If that doesn't settle the matter, then you need to use the last bits of the shorter to select a mask...

Taking your example:
```
  ACGTAAA    1b 03
  ACGTAAAA   1b 00 00
```
the strings match up to (but, for the avoidance of doubt, not including) the last byte of the shorter, so the '3' selects the mask 0xFC, under that mask the last corresponding bytes are equal, so the longer string is the larger. Similarly:
```
  ACGT TGCA AGA    1b F4 23
  ACGT TGCA AGAC   1b F4 21 00
```
Actually, it's not strictly necessary to apply the mask to both strings, or indeed to use any mask other than 0xFC -- but it seemed tidier.

There is some futzing about to be done to compare the compressed forms, though the decision is likely to be made before the last byte of the shorter.

Of course, if the comparison is only for searching for equality, then it doesn't really matter where you put the length mod 4 (except that if there's little variation in length, putting it at the end keeps it out of the way). Alphabetic order does appeal to a sense of neatness, though.

As far as implementation of packing/unpacking goes, placing the length mod 4 at the end means that all the fiddling about happens at the end, rather than at both ends. I have modified the code I posted previously to do this and included that below. I look forward to improvements :-)

Mind you, if performance is the issue, I'd cast this into C.

# ACGT String compression and decompression
#
# We compress to fixed length string, 2 bits per symbol.
#
# In order to recover the length of the string the ls bits of the last
+ byte contain the
# length of the string mod 4 -- this adds a complete zero byte if nece
+ssary.

my %compress ;  # string of up to 4 characters to compressed byte
my @expand ;    # compressed byte to string (except last byte)
my @expand_l ;  # compressed last byte to string

BEGIN {
  my @ACGT = qw(A C G T) ;
  my $s = '' ;
  my $v = 0 ;
  foreach my $a (@ACGT) {
    $s = $a ; $compress{$s."\n"} = $compress{$s} = chr($v+1) ;
    foreach my $b (@ACGT) {
      $s = $a.$b ; $compress{$s."\n"} = $compress{$s} = chr($v+2) ;
      foreach my $c (@ACGT) {
        $s = $a.$b.$c ; $compress{$s."\n"} = $compress{$s} = chr($v+3)
+ ;
        foreach my $d (@ACGT) {
          $s = $a.$b.$c.$d ;
          $compress{$s} = chr($v) ;
          $expand[$v++] = $s ;
        } ;
      } ;
    } ;
  } ;

  @expand_l = map substr($expand[$_], 0, ($_ & 3)), (0..255) ;

  $compress{"\n"} = "\0" ;
} ;

# Compress ACGT string (all upper case) with or without trailing "\n"
#
#   $compressed = acgt_compress(ACGT_STRING)

sub acgt_compress {
  my $c = '' ;
  $c .= $compress{$_} for unpack('(a4)*', $_[0]) ;
  return (length($_[0]) & 3) || (substr($_[0], -1) eq "\n") ? $c : $c 
+. "\0" ;
} ;

# Decompress compressed ACGT
#
#   $acgt_string = acgt_expand(COMPRESSED)

sub acgt_expand {
  my $s = '' ;
  $s .= $expand[$_] for unpack('C*', substr($_[0], 0, -1)) ;
  return $s . $expand_l[ord(substr($_[0], -1))] ;
} ;
[download]

Comment on Re^4: Unpack Fail to Decompress String? Download Code