Efficient 7bit compression

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Efficient 7bit compression by Roy Johnson (Monsignor) on Mar 14, 2005 at 15:49 UTC
It's easy, but does it count as efficient to expand the strings this way? `use strict; use warnings; my $str = 'abcdefghijklmnop'; sub shrink { my $bitstr = unpack('B', $_[0]); $bitstr =~ s/.(.{7})/$1/g; pack('B', $bitstr); } sub grow { my $bitstr = unpack('B', $_[0]); $bitstr =~ s/(.{7})/0$1/g; pack('B', $bitstr); } my $shrunk = shrink($str); printf "Was: %s, Is: %s\n", length($str), length($shrunk); printf "Restored: <%s>\n", grow $shrunk;` [download] Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re^2: Efficient 7bit compression by Limbic~Region (Chancellor) on Mar 14, 2005 at 15:52 UTC
Roy Johnson, This looks similar to my unshown trading memory for time solution. The $bitstr will temporarily be 8 times larger than the original $str. I outlined 2 kinds of efficiency in my post (RAM & runtime). Now if there was a way to have our cake and eat it too. Thanks! Cheers - L~R	[reply]
Re^3: Efficient 7bit compression by Roy Johnson (Monsignor) on Mar 14, 2005 at 16:05 UTC
This has constant overhead by expanding only 8 chars at a time. `sub shrink { my $packed = ''; while ($_[0] =~ /(.{1,8})/gs) { my $bitstr = unpack('B', $1); $bitstr =~ s/.(.{7})/$1/g; $packed .= pack('B', $bitstr); } $packed; } sub grow { my $unpacked = ''; while ($_[0] =~ /(.{1,7})/gs) { my $bitstr = unpack('B', $1); $bitstr =~ s/(.{7})/0$1/g; $unpacked .= pack('B', $bitstr); } $unpacked; }` [download] Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re^4: Efficient 7bit compression by Limbic~Region (Chancellor) on Mar 14, 2005 at 16:14 UTC
Re^5: Efficient 7bit compression by Roy Johnson (Monsignor) on Mar 14, 2005 at 16:36 UTC
Some notes below your chosen depth have not been shown here
Re^3: Efficient 7bit compression by Ven'Tatsu (Deacon) on Mar 14, 2005 at 19:43 UTC
I'm not sure how this fares in reguard to speed, but it only uses 2-5 scalars, tries to avoid generating intermidiate lists, and modifies their arguments inplace. sub shrink { my $dbit = 0; for (my $sbit = 0; $sbit < length($_[0]) * 8; $sbit++) { next if $sbit % 8 == 7; vec($_[0], $dbit++, 1) = vec($_[0], $sbit, 1); } my $dlen = length($_[0]) * 7 / 8; $dlen++ unless $dlen == int($dlen); $dlen = int($dlen); my $extra = length($_[0]) - $dlen; if ($extra > 0) { substr($_[0], $dlen, $extra, ''); } for (my $pbit = $dbit; $pbit < $dlen * 8; $pbit++) { vec($_[0], $pbit, 1) = 0; } } sub grow { my $sbit = int(length($_[0]) * 8 / 7) * 7 - 1; for (my $dbit = int(length($_[0]) * 8 / 7) * 8 - 1; $dbit >= 0; $d +bit--) { vec($_[0], $dbit, 1) = $dbit % 8 == 7 ? 0 : vec($_[0], $sbit-- +, 1); } } [download] I have the nagging fealing there is a better way to implement `ceil` (near the middle of `shrink`). If padding the compressed string with 0 bits is not needed then the second for loop of `shrink` can be omitted to save (on average) 4 bits of time.	[reply] [d/l] [select]
Re^4: Efficient 7bit compression by Roy Johnson (Monsignor) on Mar 14, 2005 at 20:22 UTC
Re: Efficient 7bit compression by talexb (Chancellor) on Mar 14, 2005 at 17:12 UTC
I'm not sure if the original request was for 7-bit compression or whether it's for text strings, which theoretically use values 32-127. If the latter, you can take six values ranging from 1-96 and compress them into a five byte 'chunk' since: `96^6 = 782,757,789,696 256^5 = 1,099,511,627,776` [download] However, there my be better compression rates available depending on what you know about the data ahead of time. I know I spent a happy two or three weeks researching this (in the late 80's, using Borland's Turbo C) -- it's a very neat area to research. One more thought -- if you've got a large enough sample size, you can start to encode pairs ('en', 't ') or triplets ('en ', 'e, '). It's fun stuff. Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds	[reply] [d/l]
Re: Efficient 7bit compression by beauregard (Monk) on Mar 14, 2005 at 17:41 UTC
You might also be able to get it down to a 5 or 6 bit encoding by pre-scanning the string and determining the lowest ASCII value and the range. The if that range is small enough, you can use fewer bits to encode. Of course, since hacker was trying to compress URL's, the range is usually going to be 73 chars (URL's always contain period (ASCII 46) and usually go up to 'w' (ASCII 119)), so 7-bits is about minimum. And by encoding the normalized value and bit count you just added two bytes... c.	[reply]
Re: Efficient 7bit compression by perlfan (Vicar) on Mar 14, 2005 at 15:36 UTC
Traditional compression techniques fail because with short strings they end up being bigger than the original. I am not sure I understand how this would be the case. A worst case Huffman encoding could result in 0% compression, and I don't see how inflation is even possible. I am not familiar with other compression techniques, so I can't say that for all techniques. Nonetheless, this will surely be useful in some applications.	[reply]
Re^2: Efficient 7bit compression by dragonchild (Archbishop) on Mar 14, 2005 at 15:48 UTC
Some compression algorithms build a dictionary of common substrings and replace the substring with the dictionary index. For example, if you have "My life if I buy a dog". The substrings "y " and "if" are repeated. So, if you replace them with a 1byte number and hoist the substrings into a dictionary somewhere else, you can shorten the string. Problem is two-fold. First, you have to either mark which substrings are in the dictionary or have all strings in the dictionary. Second, your dictionary costs a certain amount of space. So, it's only likely that large strings will sufficiently amortize the cost of a dictionary to realize savings. Smaller strings will not. Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply]
Re^3: Efficient 7bit compression by perlfan (Vicar) on Mar 14, 2005 at 15:59 UTC
Ok I hear ya, I was not taking into account the string specific character frequency table that is used to decode a Huffman encoded string - I guess once could get around that by using a general one based on some character usage criteria, but that would probably not meet their needs.	[reply]
Re^2: Efficient 7bit compression by ikegami (Patriarch) on Mar 14, 2005 at 15:52 UTC
You forgot to factor in the size of the dictionary. That's why short strings end up being bigger than the original. For long strings, the size of the dictionary becomes an irrelevant percentage.	[reply]
Re^2: Efficient 7bit compression by Limbic~Region (Chancellor) on Mar 14, 2005 at 15:40 UTC
perlfan, I was paraphrasing diotalevi and hacker experience. I do not know which specific technique they were using or why it resulted in larger strings (I am assuming bookkeeping that didn't result in any gain). The point was that the whatever methods were available/tried didn't help. Cheers - L~R	[reply]
Re^2: Efficient 7bit compression by abell (Chaplain) on Mar 15, 2005 at 14:05 UTC
In general, all lossless compression algorithms will produce output longer than the input for some values of the input. Here is a sketchy proof: there is a finite number of strings of length at most n. If none of them were compressed to a longer string, it would mean that their compressed values would be again the set of all strings of length at most n (i.e. the compression would be a permutation of said set). This implies that any string would be "compressed" to a string of the same length. In other words, the only ~~1-1~~ injective (i.e. lossless) mappings that never cause length to increase are those that keep length the same. Update: I changed 1-1 with injective, which makes the statement a bit stronger and more directly related to the sketch of proof. As a side note, I should point out that compression algorithms should tend to be 1-1 anyway, so as to avoid "wasting" short strings. As to MidLifeXis's comment, of course by "compression algorithm" I meant something which strictly compresses at least one input, as pointed out by tilly. Cheers Antonio The stupider the astronaut, the easier it is to win the trip to Vega - A. Tucket	[reply]
Re^3: Efficient 7bit compression by MidLifeXis (Monsignor) on Mar 15, 2005 at 17:58 UTC
One small nit... In general, all lossless compression algorithms will produce output equal in length or longer than the input for some values of the input. Use an identity function as your compressor. Otherwise, good back of the napkin / handwavey proof. --MidLifeXis	[reply]
Re^4: Efficient 7bit compression by tilly (Archbishop) on Mar 15, 2005 at 21:52 UTC
Re^5: Efficient 7bit compression by MidLifeXis (Monsignor) on Mar 17, 2005 at 13:19 UTC
Re: Efficient 7bit compression by gam3 (Curate) on Mar 14, 2005 at 21:54 UTC
You can get something closer to 30% compression using a hoffman coding. Code at http://www.brettsbsd.net/~estrabd/SCHOOL/CS638/Huffman.pm. Using only 4 urls, as example input, I was able to get between 34% and 40% compression. use strict; use lib qw(./); use Huffman qw(:All); use Data::Dumper; my @word = ( 'http://www.perlmonks.org/index.pl?parent=43933;node_id=3333', 'http://www.coasthome.com/things.html', 'http://validator.w3.org/check', 'http://www.google.com/search?q=emdash+html&sourceid=mozilla-search&st +art=&start=&ie=utf-8&oe=utF-8&client=firefox&rls=org.mozilla:en-US:un +official', 'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ', ); # build tree my @paths = build_encodings(join('', @word)); my $path_hash = build_encodings_hash(join('', @word)); my $tree = build_tree(join('', @word)); use Data::Dumper; print Dumper $path_hash; for my $string (qw ( http://lawyers.findlaw.com/lawyer/firm/Estate-Planning/Fort-Br +agg/North-Carolina http://www.seacrestproperties.com/ http://www.fortbragg.com/businesses_all.lasso ) ) { my $encoded_str; my $str; foreach my $item (split('', $string)) { if (my $ele = $path_hash->{$item}) { $encoded_str .= $ele->{ENCODING}; } else { die "No encoding for $item\n"; } } print "String: $string\n"; print "Encoded:\n$encoded_str\n"; # converting this to a bit vector is left as a exercise. $encoded_str = $encoded_str; my $decoded = decode_str($encoded_str,@paths); print "Decoded:\n".$decoded."\n"; print "Stats: \n"; print "Original chars: ".length($string)."\n"; print "Original size (in bits): ".(8length($string))."\n"; print "Encoded size (in bits): ".length($encoded_str)."\n"; print "Size reduction: %".((1-(length($encoded_str)/(8length($str +ing))))*100)."\n"; } [download] String: http://lawyers.findlaw.com/lawyer/firm/Estate-Planning/Fort-Bragg/North-Carolina Encoded: 110111...0100101111100 Decoded: http://lawyers.findlaw.com/lawyer/firm/Estate-Planning/Fort-Bragg/North-Carolina Stats: Original chars: 80 Original size (in bits): 640 Encoded size (in bits): 420 Size reduction: %34.375 String: http://www.seacrestproperties.com/ Encoded: 1101110...1011000 Decoded: http://www.seacrestproperties.com/ Stats: Original chars: 34 Original size (in bits): 272 Encoded size (in bits): 161 Size reduction: %40.8088235294118 String: http://www.fortbragg.com/businesses_all.lasso Encoded: 1101110...0001100010110 Decoded: http://www.fortDragg.com/Dusinesses_all.lasso Stats: Original chars: 45 Original size (in bits): 360 Encoded size (in bits): 231 Size reduction: %35.8333333333333 A picture is worth a thousand words, but takes 200K.	[reply] [d/l]
Re: Efficient 7bit compression by QM (Parson) on Mar 15, 2005 at 00:06 UTC
Does it make any sense to incorporate several approaches, and make the system adaptive? Let's say we've done some benchmarking (which I haven't), and I have these systems, breakpoints and savings (negative is space saved): `string length system 0-10 plain (+10%) 11-50 7bit packing (-10%) 51-100 huffman (-20%) 101- LZ (-35%)` [download] Now, the first 2 bits (or byte, if plain) indicates which system is being used. Note that the plain text system suffers because of the overhead of the leading indicator. Would a system like this work? What would the breakpoints be? Is it worthwhile to combine any elements together? Or is it the case that benchmarking should be used to find the system that gives the best performance on the typical data, and stick with that? -QM -- Quantum Mechanics: The dreams stuff is made of	[reply] [d/l]


Keep It Simple, Stupid
	PerlMonks