in reply to Unpack Fail to Decompress String?
|
|---|
| Replies are listed 'Best First'. | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Re^2: Unpack Fail to Decompress String?
by gone2015 (Deacon) on Jan 09, 2009 at 17:32 UTC | |||||||||||||||
I'd favour putting a count of the number of significant characters in the last byte in the last 2 bits of the that byte. (Which may require a zero last byte.) The advantage is that it makes comparisions slightly easier... | [reply] | ||||||||||||||
by BrowserUk (Patriarch) on Jan 10, 2009 at 10:56 UTC | |||||||||||||||
I'd favour putting a count of the number of significant characters in the last byte in the last 2 bits of the that byte. (Which may require a zero last byte.) Sorry guys, but I don't see any way of making a trailing indicator allow for string comparison. That said, the sorting and comparing of packed ACGT (other than simple equality), is a fairly non-useful thing anyway. Sequence and subsequence representations don't have any intrisic ordering. The more frequent operation is to search one sequence for the presence of another (sub)sequence, and doing that with packed sequences means you're only checking every fourth position, rather than every position. Performing shifts or rotates on bitstring greater than register size is a prohibitively expensive operation. It far outweights any gains you might get from searching quarter sized strings, as you have to perform 4 searches anyway, and you need the expensive shift operation inbetween each search. I think the only useful use for packed ACGT (2bit) format is to reduce storage overhead. It's almost certainly quicker to just unpack the sequences for searching. In which case, the prefix byte of the significant bits in the last byte is a good compromise between compression level and implementation ease. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] | ||||||||||||||
by gone2015 (Deacon) on Jan 10, 2009 at 15:00 UTC | |||||||||||||||
Expanding on what I suggested: There is some futzing about to be done to compare the compressed forms, though the decision is likely to be made before the last byte of the shorter. Of course, if the comparison is only for searching for equality, then it doesn't really matter where you put the length mod 4 (except that if there's little variation in length, putting it at the end keeps it out of the way). Alphabetic order does appeal to a sense of neatness, though. As far as implementation of packing/unpacking goes, placing the length mod 4 at the end means that all the fiddling about happens at the end, rather than at both ends. I have modified the code I posted previously to do this and included that below. I look forward to improvements :-) Mind you, if performance is the issue, I'd cast this into C. | [reply] [d/l] | ||||||||||||||
by samwyse (Scribe) on Jan 13, 2009 at 15:23 UTC | |||||||||||||||
With a trailing full length, the last comes first and the first last, with those in the middle keeping their correct relative orderingSo, don't use the actual length, use length mod 4.
| [reply] | ||||||||||||||
by BrowserUk (Patriarch) on Jan 13, 2009 at 15:32 UTC | |||||||||||||||
by jethro (Monsignor) on Jan 09, 2009 at 18:46 UTC | |||||||||||||||
string length +1 modulo 4 is the same as the number of significant characters in the last byte. I.e. that is an implementation detail I left out on purpose. If you mean comparision for equality, then putting the count into the first byte will be a lot faster for 75% of different-length strings and not much slower for same-length strings. Only if you want to know the first position at which they differ is the last byte better. The advantage I see for last byte is that you can more easily append or cut from the end of the compressed string
| [reply] | ||||||||||||||
by gone2015 (Deacon) on Jan 09, 2009 at 19:11 UTC | |||||||||||||||
Sure. If you're only interested in equality, then having the length component at the front will help -- unless the strings are generally the same length, in which case it makes things worse. For an "alphabetic" comparision the length is only significant if the strings are equal up to the end of shorter, so I reckon the trick is to put the length information at the far end. | [reply] | ||||||||||||||