Re^2: Efficient 7bit compression
by dragonchild (Archbishop) on Mar 14, 2005 at 15:48 UTC
Some compression algorithms build a dictionary of common substrings and replace each substring with its dictionary index. For example, take the string "My life if I buy a dog". The substrings "y " and "if" are repeated. So, if you replace each occurrence with a 1-byte number and hoist the substrings into a dictionary somewhere else, you can shorten the string.
The problem is two-fold. First, you have to either mark which substrings are in the dictionary or have every substring go through the dictionary. Second, the dictionary itself costs a certain amount of space. So only large strings are likely to amortize the cost of the dictionary enough to realize savings. Smaller strings will not.
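A toy sketch of the idea, assuming the marker bytes never occur in the input and with the substrings picked by hand (a real dictionary coder such as LZ78 builds its dictionary adaptively):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy dictionary "compressor": hoist a couple of repeated substrings into
    # a dictionary and replace each occurrence with a single marker byte.
    # Assumes the marker bytes (\x01, \x02, ...) never appear in the input.
    my @dict   = ('y ', 'if');
    my $string = 'My life if I buy a dog';

    my $packed = $string;
    for my $i (0 .. $#dict) {
        my $marker = chr($i + 1);          # the 1-byte dictionary index
        $packed =~ s/\Q$dict[$i]\E/$marker/g;
    }

    my $dict_size = length join '', @dict;
    printf "original %d bytes, packed %d bytes + %d bytes of dictionary\n",
        length($string), length($packed), $dict_size;

On a string this short the few bytes saved in the body are eaten by the dictionary itself, which is exactly the second problem above.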
Being right does not endow the right to be rude; politeness costs nothing. Being unknowing is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
Ok, I hear ya. I was not taking into account the string-specific character frequency table that is needed to decode a Huffman-encoded string. I guess one could get around that by using a general table based on some character-usage criteria, but that would probably not meet their needs.
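For what it's worth, that per-string table is just the character counts, which the decoder needs alongside the compressed bits. A rough sketch (building the actual Huffman tree is left out):

    use strict;
    use warnings;

    # The string-specific frequency table a Huffman decoder needs.  Either it
    # (or the tree built from it) travels with the compressed string, which is
    # pure overhead on short strings, or both sides agree on a "general" table
    # in advance and give up some compression.
    my $string = 'My life if I buy a dog';
    my %freq;
    $freq{$_}++ for split //, $string;

    printf "%-3s %d\n", ($_ eq ' ' ? "' '" : $_), $freq{$_}
        for sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;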
Re^2: Efficient 7bit compression
by ikegami (Patriarch) on Mar 14, 2005 at 15:52 UTC
You forgot to factor in the size of the dictionary. That's why short strings end up being bigger than the original. For long strings, the size of the dictionary becomes an irrelevant percentage.
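You can see that effect directly with a general-purpose compressor such as Compress::Zlib, where deflate's headers and Huffman tables play the role of the dictionary. A quick sketch, assuming the module is installed:

    use strict;
    use warnings;
    use Compress::Zlib;    # zlib/deflate bindings

    # A short string grows; a long (and repetitive) one shrinks dramatically,
    # because the fixed overhead gets amortized.
    for my $string ('My life if I buy a dog', 'My life if I buy a dog' x 1000) {
        my $compressed = compress($string);
        printf "%7d bytes in -> %7d bytes out\n",
            length($string), length($compressed);
    }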
Re^2: Efficient 7bit compression
by Limbic~Region (Chancellor) on Mar 14, 2005 at 15:40 UTC
perlfan,
I was paraphrasing diotalevi's and hacker's experience. I do not know which specific technique they were using or why it resulted in larger strings (I am assuming bookkeeping overhead that didn't result in any gain). The point was that whatever methods were available/tried didn't help.
Re^2: Efficient 7bit compression
by abell (Chaplain) on Mar 15, 2005 at 14:05 UTC
Any lossless compression algorithm will produce output longer than the input for some inputs.
Here is a sketch of a proof:
There is a finite number of strings of length at most n. If none of them were compressed to a longer string, their compressed values would again be the set of all strings of length at most n (i.e. the compression would be a permutation of said set). Applying this for every n, it follows that every string would be "compressed" to a string of the same length.
In other words, the only injective (i.e. lossless) mappings that never cause length to increase are those that keep length the same.
Update: I changed "1-1" to "injective", which makes the statement a bit stronger and more directly related to the sketch of the proof. As a side note, I should point out that compression algorithms should tend to be 1-1 anyway, so as to avoid "wasting" short strings.
As to MidLifeXis's comment, of course by "compression algorithm" I meant something which strictly compresses at least one input, as pointed out by tilly.
Cheers
Antonio
The stupider the astronaut, the easier it is to win the trip to Vega - A. Tucket
Your nit is incorrect and the proof (at least the one that is there now) is completely correct. If a lossless compression algorithm compresses any string, then there must be some other string that is made longer by that algorithm. If a lossless transform does not make any strings longer, then it also cannot make any shorter and so isn't a compression algorithm.
A more formal version of the proof is as follows. Suppose that a lossless compression algorithm manages to compress some string of length longer than n to length n or less. There are exactly 2**n + 2**(n-1) + ... + 2**0 = 2**(n+1) - 1 strings of length n or less. If the algorithm does not make any string of length n or less longer than n, then at least 2**(n+1) strings must map into this set, and by the pigeonhole principle at least 2 strings are sent to the same result. However, if we get that result, then we can't figure out the original string (because we could have started with 2 different strings), contradicting our assumption that the algorithm was lossless.
Therefore a lossless compression algorithm that manages to compress any string must make some other string longer.
(Of course in practice one hopes that the strings which are made longer are likely to be rare and not made much longer, while the strings that are made shorter are likely to be common and made much shorter...)
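If anyone wants to double-check the count used above, here is a throwaway snippet (it only verifies the identity, not the pigeonhole argument itself):

    use strict;
    use warnings;

    # 2**0 + 2**1 + ... + 2**n really is 2**(n+1) - 1.
    for my $n (0 .. 10) {
        my $sum = 0;
        $sum += 2**$_ for 0 .. $n;
        printf "n=%2d  sum=%5d  2**(n+1)-1=%5d\n", $n, $sum, 2**($n + 1) - 1;
    }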