Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Compressing a string, but leaving it as a string

by hacker (Priest)
on Mar 11, 2005 at 20:49 UTC ( [id://438806]=perlquestion: print w/replies, xml ) Need Help??

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I have a need to compress and encode a string of characters, in a way that will be easy to reverse later. I'm not trying to obfuscate the string, just make it easier to manipulate later. Turning the string into a series of numerals would be ideal, but alphanumeric would also work.

The string I'm dealing with is a series of param() values coming back from an HTML form. One example of the string looks like this:

/h/plkr/3/www.plkr.org/rss.pl

This breaks down into:

/h = Scheme (http in this case) /plkr = Format (Output format) /3 = Fetch limit /www.plkr.org/rss.pl = Feed url

I tried IO::Zlib, Compress::Zlib, Digest::MD4 and Digest::MD5, and others... in the hopes that I could compress the string, then encode it, but it still gives me an ascii string that is longer than the original input (not enough redundant characters to make compression worthwhile).

Is there a way to do something like this? The shortest I can get the encoded string is 45 characters, with a 30-character input string, using this code:

my $string = '/h/plkr/3/www.plkr.org/rss.pl'; my ($type, $format, $limit, $feed) = (split '/', $string, 5)[1..4]; my @tokens = split('/', $string); my $compressed = compress($string) ; my $encoded = encode_base64($compressed);

The reason why I'm trying to do this, is so that I can present this url as a value in a URL passed into my application later on, such as: index.pl?eJzTz9AvyMku0jfWLy8v1wMx9fKL0vWLiouBHACWdQpk

I'm trying to make it easier for the user to use and bookmark these URLs for later use with my application. I'm storing the unique URL in a database, but I can't store the depth and output format in the db, because hundreds of users could use the same feed url, but apply different depths or output formats to it, and these are random/anonymous users, so I can't store this in a user table in the db.

I could build a set of hashes that have numeric lookups for output format and scheme, but that doesn't really gain me much in terms of making the value sitting in the URI field any smaller.

Any useful hints or tips on how I can optimize this further?

Replies are listed 'Best First'.
Database lookup, or Huffman coding
by dave0 (Friar) on Mar 11, 2005 at 22:46 UTC
    If you're already storing the unique URL in a database, why not use a unique key from the db for the URL portion? That would let you turn
    my $string = '/h/plkr/3/www.plkr.org/rss.pl';
    into
    my $string = '/h/plkr/3/426933';
    where 426933 would be the unique ID for retrieving that URI from the database. It has the added bonus of letting the end user tweak things like depth and output format without having to decode/re-encode the string.

    If you really want to encode the string as-is, without having to store anything in a DB, you might be able to use something like Algorithm::Huffman in CPAN to generate a short bit vector for your data, and then use some other method (encode_base64 or somesuch) to make it safe for use in a URL. You'll need to grab a good sample of the sort of strings you want to encode, and do a frequency count on them so that Algorithm::Huffman can build up a good dictionary. Here's an untested example:

    my $token_counts = { '/' => 10000, '.' => 3000, 'w' => 2342, 'e' => 2000, 't' => 2000, # ... etc, for any character that might appear in the input strin +g }; my $string = '/h/plkr/3/www.plkr.org/rss.pl'; my $huff = Algorithm::Huffman->new($token_counts); my $huff_encoded = $huff->encode($string); my $printable = encode_base64($huff_encoded);
    I have no idea what sort of size savings this will get you, though. Base-64 encoding costs you one character of output for every 6 bits of input, so as long as your Huffman encoding saves you more bits than Base 64 encoding wastes, you should end up with a URL-safe encoding that's shorter than the input.
Re: Compressing a string, but leaving it as a string
by Limbic~Region (Chancellor) on Mar 12, 2005 at 00:13 UTC
    hacker,
    I'm not trying to obfuscate the string,...

    I couldn't resist. This is a 8bit <-> 7bit conversion utility so only ASCII 0 - 127 is supported.

    #!/usr/bin/perl use strict; use warnings; my $foo = compress( join '', 'a' .. 'z' ); print length $foo, "\n"; print grow( $foo ); sub compress {pack'b*',join'',map{sprintf'%b',ord}split//,pop} sub grow {join'',map{chr oct"0b$_"}grep$_,split/(.{7})/,join'',map{vec +"@_",$_,1}0..length("@_")*8-1}
    With a reduction of 12.5%, I doubt you really want to use it though.

    Cheers - L~R

      It seems the compress function cannot handle spaces and digits, etc.

      Should the sprintf not always print 7 digits? In that case, I suggest the following compress function:

      sub compress {pack'b*',join'',map{sprintf'%07b',ord}split//,pop}
        Anonymous Monk,
        Nice catch. As you could probably tell, I posted mostly as a joke and didn't use proper sample data in my testing. I wouldn't use this in any production code.

        Cheers - L~R

Re: Compressing a string, but leaving it as a string
by TedPride (Priest) on Mar 12, 2005 at 04:14 UTC
    MD5 hash your data. The data can be as long as you want. Then store the data as a database record using the MD5 hash as key. The user can then access a URL like this:
    index.pl?eJzTz9AvyMku0jfWLy8v1wMx9fKL0vWLiouBHACWdQpk
    ...and your script will search for the database record with that key and return the data in its original form.

    This was already said above, but I thought I'd explain in more detail.

Re: Compressing a string, but leaving it as a string
by dragonchild (Archbishop) on Mar 11, 2005 at 21:00 UTC
    What's wrong with generating a random MD5 and living with the fact that you're going to be 32 chars long? That's a standard practice. It's not till you start to hit 2048 characters that you run into an IE bug, so use a few characters. It's not even your space you're wasting! :-)

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      "What's wrong with generating a random MD5 and living with the fact that you're going to be 32 chars long?"

      A random MD5 doesn't help, because I can't take that MD5 string and reverse it back to the original tokens.

        Use the MD5 as a key into a DB file on the server side that has the real data saved using Storable or YAML.

Re: Compressing a string, but leaving it as a string
by perlfan (Vicar) on Mar 12, 2005 at 01:01 UTC
    You can use a Huffman encoding scheme.

    Here is an example I wrote a while back:

    http://www.brettsbsd.net/~estrabd/SCHOOL/CS638/huffman.php

    You'll have to modify it because it because it just returns a the 1's and 0's as a string. It also doesn't handle 1's and 0's when decoding a string that originally contained 1's and 0's...I intend on fixing these issues when I have time, but it does *most* of what you might want...
Re: Compressing a string, but leaving it as a string
by CountZero (Bishop) on Mar 11, 2005 at 21:45 UTC
    I would just drop the compression part and only do the encoding. It wouldn't make much difference for the length of the URL.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://438806]
Approved by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (6)
As of 2024-04-16 15:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found