in reply to reversible pack()?

If you convert ASCII text that uses, say, 80 distinct chars into gzipped hex, you effectively drop yourself down to a 16 char alphabet while still spending a full byte of the (extended) ASCII 8 bit space on every char. Each byte of gzip output becomes two hex digits, so you need 2x compression just to break even. Hex throws away half of whatever gzip gains, so this is kinda pointless. Normally you pack your ~80 char alphabet into a 256 slot 8 bit binary space. This is where a significant part of the compression comes from: using all the available bits efficiently. Hex is not the go.

[root@devel3 root]# cat test.pl
#!/usr/bin/perl
use Compress::Zlib;
my $str = "Hello World!";
my $gzip = Compress::Zlib::memGzip( $str );
my $hex_enc = unpack 'H*', $gzip;
my $hex_dec_gzip = pack 'H*', $hex_enc;
my $str_dec = Compress::Zlib::memGunzip( $hex_dec_gzip );
print "
$str
$hex_enc
$str_dec
";
[root@devel3 root]# ./test.pl

Hello World!
1f8b0800000000000003f348cdc9c95708cf2fca49510400a31c291c0c000000
Hello World!
[root@devel3 root]#

cheers

tachyon

Re: reversible pack()?
by jonadab (Parson) on Feb 10, 2004 at 22:31 UTC
    If you convert ASCII text that uses, say, 80 distinct chars into gzipped hex, you effectively drop yourself down to a 16 char alphabet while still spending a full byte of the (extended) ASCII 8 bit space on every char. Each byte of gzip output becomes two hex digits, so you need 2x compression just to break even.

    If the reasoning behind the hex representation is to make it "safe" to treat this data in certain ways (such as transmitting it on Usenet) that might lose control characters and anything over seven bits, then it's possible to do better by using a base larger than 16 but smaller than 256. For example, if you use base 64 with 33 added to each digit when mapping it back into ASCII, you get all nice safe printable characters but manage to store six usable bits in every byte, which is not altogether bad. The encoded data is then 8/6 the size of the compressed data, so if you can get better than 25% compression, you'll have a net gain (though perhaps not a large one). On English text of any significant size, better than 25% gain is very achievable with a simple Huffman tree, and even more so with gzip.
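The base 64 plus offset 33 idea above can be sketched in a few lines of Perl. This is only an illustration under assumptions: the helper names `encode_offset33` and `decode_offset33` are made up for this example and do not come from any module, and the alphabet is not the standard MIME Base64 one. Each 6 bit digit (0..63) is mapped to the character with code 33..96, i.e. '!' through '`'.

```perl
use strict;
use warnings;

# Hypothetical helper: turn arbitrary bytes into printable characters,
# six payload bits per output character, offset by 33.
sub encode_offset33 {
    my ($bytes) = @_;
    my $bits = unpack 'B*', $bytes;                      # bytes -> string of 0/1
    $bits .= '0' x ( ( 6 - length($bits) % 6 ) % 6 );    # pad to a multiple of 6 bits
    return join '', map { chr( 33 + oct("0b$_") ) } $bits =~ /(.{6})/g;
}

# Hypothetical helper: reverse of encode_offset33.
sub decode_offset33 {
    my ($text) = @_;
    my $bits = join '', map { sprintf( '%06b', ord($_) - 33 ) } split //, $text;
    # The input was whole bytes, so anything past the last full byte is padding.
    $bits = substr( $bits, 0, 8 * int( length($bits) / 8 ) );
    return pack 'B*', $bits;
}

my $raw  = "Hello World!";
my $safe = encode_offset33($raw);
printf "%d bytes -> %d characters: %s\n", length($raw), length($safe), $safe;
print decode_offset33($safe), "\n";
```

Twelve input bytes are 96 bits, so the encoded form is 16 characters, which is the 8/6 expansion described above. In practice the output would normally be fed the gzipped bytes rather than plain text.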


    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: Re: reversible pack()?
by monsieur_champs (Curate) on Feb 11, 2004 at 12:39 UTC

    So I need a better approach to get enough compression to pass the break-even point. Any suggestions? Maybe I should leave the gzipped string as is, and hope it doesn't get corrupted while I edit the uncompressed part of the program?

    Can you point me to a nice website or a good book on this subject?

    Thank you very much again!


    "In few words, translating PerlMonks documentation and best articles to other languages is like building a bridge to join other Perl communities into PerlMonks family. This makes the family bigger, the knowledge greater, the parties better and the life easier." -- monsieur_champs

      gzip first, then base64 encode. To decode, undo the steps in the opposite order, i.e. base64 decode and then gunzip. Still a lot fatter than binary but a lot thinner than hex.
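A minimal sketch of that round trip, using `Compress::Zlib` and `MIME::Base64`. The wrapper names `gz_b64_encode` and `gz_b64_decode` are invented for this example. The second argument to `encode_base64` is the line ending; passing an empty string keeps the result on one line.

```perl
use strict;
use warnings;
use Compress::Zlib;
use MIME::Base64;

# Hypothetical wrapper: gzip first, then make the bytes printable.
sub gz_b64_encode {
    my ($text) = @_;
    my $gzip = Compress::Zlib::memGzip($text);
    return encode_base64( $gzip, '' );    # '' means no line breaks in the output
}

# Hypothetical wrapper: undo the steps in the opposite order.
sub gz_b64_decode {
    my ($encoded) = @_;
    my $gzip = decode_base64($encoded);
    return Compress::Zlib::memGunzip($gzip);
}

my $text    = "Hello World! " x 20;
my $encoded = gz_b64_encode($text);
printf "original: %d chars, encoded: %d chars\n", length($text), length($encoded);
print gz_b64_decode($encoded) eq $text ? "round trip ok\n" : "round trip FAILED\n";
```

The encoded string contains only the characters A-Z, a-z, 0-9, '+', '/' and '=', so it should survive being pasted into and edited alongside ordinary program text.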

      If you want to know something, try Google. There are hundreds of websites dealing with compression. Try, say, 'tutorial data compression theory' and find sites like... why don't you have a look yourself.

      It is trivial to test this. Just compress a representative string, encode it, and check the LENGTH. Repeat with another encoding. Compare to the length of the gzipped string and you will see how much you are losing.

      use Compress::Zlib;
      use MIME::Base64;
      my $str    = "Hello World! " x 3;
      my $gzip   = Compress::Zlib::memGzip( $str );
      my $hex    = unpack 'H*', $gzip;
      my $base64 = encode_base64( $gzip );   # base64 of the same gzipped bytes
      my $str_len    = length($str);
      my $gzip_len   = length($gzip);
      my $hex_len    = length($hex);
      my $base64_len = length($base64);      # includes the trailing newline
      # make binary printable ;-)
      $gzip = '#' x $gzip_len;
      printf "%3d: %s\n%3d: %s\n%3d: %s\n%3d: %s\n",
          $str_len, $str, $gzip_len, $gzip, $hex_len, $hex, $base64_len, $base64;
      __DATA__
      39: Hello World! Hello World! Hello World!
      36: ####################################
      72: 1f8b0800000000000003f348cdc9c95708cf2fca495154f0c0c90100b9a8ae3827000000
      49: H4sIAAAAAAAAA/NIzcnJVwjPL8pJUVTwwMkBALmorjgnAAAA

      You will see the value of compression as you increase the string length.
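To see that trend, the same measurement can be repeated over several input lengths. The sketch below is an illustration only: `encoded_lengths` is a made-up helper name, and a string built by repeating "Hello World! " compresses far better than real text would, so the absolute numbers are optimistic. What carries over is the relationship between the columns: hex is always twice the gzip length, base64 about 4/3 of it.

```perl
use strict;
use warnings;
use Compress::Zlib;
use MIME::Base64;

# Hypothetical helper: lengths of the raw, gzipped, hex and base64 forms.
sub encoded_lengths {
    my ($str) = @_;
    my $gzip = Compress::Zlib::memGzip($str);
    return (
        length($str),
        length($gzip),
        length( unpack( 'H*', $gzip ) ),
        length( encode_base64( $gzip, '' ) ),    # '' means no trailing newline
    );
}

printf "%6s %6s %6s %6s\n", qw(raw gzip hex base64);
for my $repeat ( 1, 3, 10, 100, 1000 ) {
    printf "%6d %6d %6d %6d\n", encoded_lengths( "Hello World! " x $repeat );
}
```

For the shortest inputs the fixed gzip header and trailer dominate, so every encoded form is longer than the raw string; the encoded columns only drop below the raw length once the input grows.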

      cheers

      tachyon