in reply to Re^2: Unpack Fail to Decompress String?
in thread Unpack Fail to Decompress String?
I'd favour putting a count of the number of significant characters in the last byte in the last 2 bits of the that byte. (Which may require a zero last byte.)
Sorry guys, but I don't see any way of making a trailing indicator allow for string comparison.
input Using a significant bit count in the last 2 bits of the last byte ACGT 1b 00 ACGTA 1b 01 ACGTAA 1b 02 ACGTAAA 1b 03 ACGTAAAA 1b 00 00 ACGTAAAAA 1b 00 01
input Using the significant bit count in the first byte ACGT 00 1b 00 ACGTA 01 1b 00 ACGTAA 02 1b 00 ACGTAAA 03 1b 00 ACGTAAAA 00 1b 00 00 ACGTAAAAA 01 1b 00 00
input packed with trailing network (BE) length ACGT 1b 00 00 00 04 ACGTA 1b 00 00 00 00 05 ACGTAA 1b 00 00 00 00 06 ACGTAAA 1b 00 00 00 00 07 ACGTAAAA 1b 00 00 00 00 08 ACGTAAAAA 1b 00 00 00 00 00 09
input packed with leading network (BE) length ACGT 00 00 00 04 1b ACGTA 00 00 00 05 1b 00 ACGTAA 00 00 00 06 1b 00 ACGTAAA 00 00 00 07 1b 00 ACGTAAAA 00 00 00 08 1b 00 ACGTAAAAA 00 00 00 09 1b 00
Update: Even that doesn't work! These two sort in the wrong order.
in: ACGTACGTACGTACGTATT packed( 9): 000000131b1b1b1b3c in: ACGTACGTACGTACGTAAAA packed(10): 000000141b1b1b1b0000
And the penalty of that for what is intended as a compression mechanism is prohibitive for short sequences. You might lower the overhead by using a short instead of a long, but 64k is not enough for real genome work.
That said, the sorting and comparing of packed ACGT (other than simple equality), is a fairly non-useful thing anyway. Sequence and subsequence representations don't have any intrisic ordering.
The more frequent operation is to search one sequence for the presence of another (sub)sequence, and doing that with packed sequences means you're only checking every fourth position, rather than every position. Performing shifts or rotates on bitstring greater than register size is a prohibitively expensive operation. It far outweights any gains you might get from searching quarter sized strings, as you have to perform 4 searches anyway, and you need the expensive shift operation inbetween each search.
I think the only useful use for packed ACGT (2bit) format is to reduce storage overhead. It's almost certainly quicker to just unpack the sequences for searching. In which case, the prefix byte of the significant bits in the last byte is a good compromise between compression level and implementation ease.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^4: Unpack Fail to Decompress String?
by gone2015 (Deacon) on Jan 10, 2009 at 15:00 UTC | |
|
Re^4: Unpack Fail to Decompress String?
by samwyse (Scribe) on Jan 13, 2009 at 15:23 UTC | |
by BrowserUk (Patriarch) on Jan 13, 2009 at 15:32 UTC |