in reply to Encoding/compress CGI GET parameters

Encoded URL

First, define two routines that can do the grunt work of crushing and uncrushing your parameters, which for the sake of argument are being stored in a HASH:
    use MIME::Base64;

    # Append the NUL-joined, Base64-encoded arguments to the base URL.
    sub Crush {
        my $base = shift;
        return $base . MIME::Base64::encode(join("\x00", @_), "");
    }

    # Recover the list from the path info of a CGI object.
    sub Uncrush {
        my ($q) = @_;
        (my $packed = $q->path_info()) =~ s{^/}{};    # drop the leading "/"
        return split(/\x00/, MIME::Base64::decode($packed), -1);
    }
This is, of course, assuming you don't have any NUL (ASCII 0) characters in your data. If you do, this code will break, but it will work fine on normal ASCII text. Additionally, it uses "path_info()" from CGI.pm, which you should be using anyway, so the encoded data has to follow the script name as an extra path component rather than a "?" query string.

You could tie them into your program like so:
    use CGI;

    my (%data) = ( 'x' => 'y' );    # etc.
    my $url = Crush("http://www.xyzco.com/foo.cgi/", %data);

    # Or, on the receiving end...
    my ($q) = CGI->new;
    my (%data) = Uncrush($q);
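It all ends up as a great big pile of goo as far as the user is concerned, but it isn't encrypted to any great degree. If you wanted, you could encrypt it with PGP, or whatever strikes your fancy, before MIME::Base64::encode(), with the opposite on the receiving end, of course. MD5 would not work for this, since it is a one-way digest that cannot be reversed; it is only useful for checking that the data has not been altered.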
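Standard Base64 output can contain "+", "/" and "=", which are awkward inside a URL path. Here is a minimal sketch of a URL-safe variant, assuming you are free to translate those characters. The names CrushSafe and UncrushSafe are hypothetical, and they work on plain strings rather than a CGI object so the round trip can be tried on its own:

```perl
use strict;
use warnings;
use MIME::Base64;

# Hypothetical URL-safe variant of Crush(): same NUL-joined payload,
# but "+" and "/" become "-" and "_", and the "=" padding is dropped.
sub CrushSafe {
    my $packed = encode_base64(join("\x00", @_), "");
    $packed =~ tr{+/}{-_};
    $packed =~ s/=+\z//;
    return $packed;
}

# Reverse of CrushSafe(): restore the alphabet and the padding, then split.
sub UncrushSafe {
    my ($packed) = @_;
    $packed =~ tr{-_}{+/};
    $packed .= "=" x ((4 - length($packed) % 4) % 4);
    return split(/\x00/, decode_base64($packed), -1);
}

my %data  = (x => 'y', mode => 'view');
my $token = CrushSafe(%data);
my %back  = UncrushSafe($token);
print "$token\n";
```

The token can then be appended to the script URL and handed to UncrushSafe() after the leading "/" of the path info has been removed.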

Server Side Data

An intelligent alternative to this "encoding" is to keep the data on the server. As you mentioned, long URLs are a problem for some e-mail programs, and for even more users. To keep the URL to an absolute minimum, you could store all of the data in a database on the server side and pass only a key to the client.

Basically, your URL would contain a text key like "AxZLkFlG" which is a randomly created string that the server would use to identify that session. You could then store all of your data server side.
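As a rough sketch of the idea, the following uses an in-memory hash as a stand-in for the server-side database. NewKey, SaveState, LoadState and %session_store are hypothetical names, and rand() is used only for illustration; it is not a strong source of randomness:

```perl
use strict;
use warnings;

my @KEY_CHARS = ('A' .. 'Z', 'a' .. 'z', '0' .. '9');
my %session_store;    # stand-in for a real server-side database table

# Build a random key of the requested length (8 by default).
sub NewKey {
    my ($length) = @_;
    $length ||= 8;
    return join "", map { $KEY_CHARS[rand @KEY_CHARS] } 1 .. $length;
}

# Store the parameters under a fresh, unused key and return that key.
sub SaveState {
    my (%data) = @_;
    my $key;
    do { $key = NewKey() } while exists $session_store{$key};
    $session_store{$key} = { %data };
    return $key;
}

# Look the parameters up again; returns an empty list for unknown keys.
sub LoadState {
    my ($key) = @_;
    return %{ $session_store{$key} || {} };
}

my $key = SaveState(Action => 'view', Area => 12345);
print "http://www.xyzco.com/foo.cgi/$key\n";
```

In a real application the hash would be replaced by a database table or DBM file so that the data survives between requests.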

The downside to this approach is that the data has to be preserved for extended periods of time, because if the server data is "expired", the URL becomes virtually useless. If you expect the users to re-visit six or eight months from now, that would translate to a six or eight month history of data, which can get quite large, depending on your application.

Additionally, if a user sends a copy of the URL to five friends, they will all be modifying the same database entry, which can lead to some unsavory variable "bleed" between their sessions. This can be very dangerous, especially for e-commerce applications.

If you have no idea when the user is going to re-visit, and you want to preserve the state of the program indefinitely, you have to pack all the data into the URL. Base64 expands the content by about a third (four output characters for every three input bytes), so the URLs will always be longer using this method, but this can be minimized if you compress the data before encoding (e.g. with DEFLATE, the algorithm used by gzip).
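A minimal sketch of compressing before encoding, assuming Compress::Zlib is installed. CrushCompressed and UncrushCompressed are hypothetical names. Compression adds a few bytes of overhead, so it only pays off once the data is reasonably long or repetitive:

```perl
use strict;
use warnings;
use Compress::Zlib;
use MIME::Base64;

# Join the arguments with NUL bytes, deflate them, then Base64-encode.
sub CrushCompressed {
    my $deflated = compress(join("\x00", @_));
    return encode_base64($deflated, "");
}

# Reverse the steps: Base64-decode, inflate, split on NUL.
sub UncrushCompressed {
    my ($packed) = @_;
    my $joined = uncompress(decode_base64($packed));
    return split(/\x00/, $joined, -1);
}

my %data  = (Action => 'view', Notes => 'abc' x 50);
my $plain = encode_base64(join("\x00", %data), "");
my $small = CrushCompressed(%data);
printf "plain: %d characters, compressed: %d characters\n",
    length $plain, length $small;
```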

Re: Re: Encoding/compress CGI GET parameters
by snellm (Monk) on Jan 17, 2001 at 20:47 UTC

    Hi tadman,

    That's a pretty good summary, but it leaves my original question:

    What is the best way to Crush (to use your sub name) CGI parameters? MIME::Base64 is not suitable because it increases the length of the URL, and I want to decrease it. I think a solution would have to take advantage of the format of CGI parameters.

    Something that just popped into my head:

    A scheme as mentioned by Dave, but instead of storing the whole parameter string against a unique ID, store the parameter names, order and format (string/integer). Then encode the URL as the ID, followed by the parameter values encoded depending on format.

    For example, if the hash contains

    Key: 1 Value: Action=<string>,Area=<int>,SubArea=<int>

    Then the URL:

    http://www.server.com/cgi-bin/script/script.pl?Action=view&Area=12345&SubArea=12345

    Could be encoded to:

    http://www.server.com/cgi-bin/script/script.pl/SKLJSD

    where "SKLJSD" can be decoded to 1,view,12345,12345

    Comments? This avoids the problem of having to expire hash entries, because the hash contains only formats, which are likely to be a fairly small set.
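One possible way to sketch this scheme uses pack() with a per-format template. Everything here is an assumption for illustration: the %FORMATS table, the choice of "w" (variable-length unsigned integer) for ints and "C/a*" (one-byte length prefix) for strings, and the names EncodeParams and DecodeParams:

```perl
use strict;
use warnings;
use MIME::Base64;

# Hypothetical format table: ID => ordered list of [name, type] pairs.
my %FORMATS = (
    1 => [ [ Action => 'string' ], [ Area => 'int' ], [ SubArea => 'int' ] ],
);

# pack() template fragment for each parameter type.
my %TEMPLATE = (int => 'w', string => 'C/a*');

# Build the full template: the format ID first, then one item per field.
sub TemplateFor {
    my ($id) = @_;
    my $format = $FORMATS{$id} or die "Unknown format $id";
    return join ' ', 'w', map { $TEMPLATE{ $_->[1] } } @$format;
}

# Pack the ID and the values (in table order), then Base64-encode.
sub EncodeParams {
    my ($id, %params) = @_;
    my $template = TemplateFor($id);
    my @values = map { $params{ $_->[0] } } @{ $FORMATS{$id} };
    return encode_base64(pack($template, $id, @values), "");
}

# Read the ID, look up its format, and rebuild the name/value pairs.
sub DecodeParams {
    my ($token) = @_;
    my $binary = decode_base64($token);
    my ($id) = unpack('w', $binary);
    my (undef, @values) = unpack(TemplateFor($id), $binary);
    my @names = map { $_->[0] } @{ $FORMATS{$id} };
    my %params;
    @params{@names} = @values;
    return %params;
}

my $token = EncodeParams(1, Action => 'view', Area => 12345, SubArea => 12345);
print "$token\n";
```

Because the parameter names live only in the table, the token carries just the format ID and the values.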

    -- Michael Snell
    -- michael@snell.com

      If you're feeling ambitious, which it sounds like you are, you can always compact your data before sending it. Consider using pack() on your data to reduce the size, and then possibly MIME encoding it to handle the encoding for the URL. Base64 is good for your application since it is fully e-mail compatible.

      UTF-5 is also a possibility. It was proposed as a way to "encode" Unicode for DNS purposes, mapping two-byte characters into the very limited, case-insensitive DNS space a-z0-9-. Fortunately, there is a little more "bitwidth" in the URL specification, something that could be better exploited with careful analysis and testing.

      Instead of having a parameter like "mode=view" or "mode=edit", consider using an ENUM() type parameter, where you have a table of modes and their associated "tiny" values. As long as you have a small number of variations, there is no need to report the entire thing verbatim. A single byte can carry a lot of information, as long as the context of this byte is understood.
      my (@possible_values) = qw(view edit modify delete nuke);
      my (%possible_values) = do {
          my $n = 0;
          map { $_ => $n++ } @possible_values;
      };

      # Encode: mode name -> small integer. Decode: integer -> mode name.
      my $encoded_param = $possible_values{'view'};
      my $decoded_param = $possible_values[$encoded_param];


      Numbers, likewise, can be squished into "packed binary", which can reduce 10-digit numbers (up to 4294967295) into 4-byte values, or 6 characters after Base64 once the "=" padding is stripped, which is a moderate but valuable decrease.
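A small sketch of that arithmetic, using hypothetical PackNumber and UnpackNumber helpers. It assumes a 32-bit unsigned value packed in network byte order ("N") and strips the trailing "=" padding, which is restored before decoding:

```perl
use strict;
use warnings;
use MIME::Base64;

# Pack a 32-bit unsigned integer into 4 bytes, Base64-encode it,
# and strip the "=" padding.
sub PackNumber {
    my ($n) = @_;
    my $token = encode_base64(pack("N", $n), "");
    $token =~ s/=+\z//;
    return $token;
}

# Restore the padding, decode, and unpack the 4 bytes again.
sub UnpackNumber {
    my ($token) = @_;
    $token .= "=" x ((4 - length($token) % 4) % 4);
    return unpack("N", decode_base64($token));
}

my $token = PackNumber(1234567890);
printf "%s (%d characters)\n", $token, length $token;
```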

      Here's a compactor that I just sketched out. Use for entertainment purposes only, as it is untested. It takes in a SCALAR and returns a squished up version with a type identification byte which can be used to desquish it properly later.
      sub Squish {
          my ($what) = @_;

          if ($what =~ /^-?[0-9]+$/) {
              # Integers: use the narrowest type that fits.
              # S, L, s and l use native byte order, so the decoder must match.
              if ($what >= 0) {
                  if ($what <= 255) {
                      return pack("CC", 0x01, $what);
                  }
                  elsif ($what <= 65535) {
                      return pack("CS", 0x02, $what);
                  }
                  elsif ($what <= 4294967295) {
                      return pack("CL", 0x04, $what);
                  }
              }
              elsif ($what >= -128 && $what <= 127) {
                  return pack("Cc", 0x09, $what);
              }
              elsif ($what >= -32768 && $what <= 32767) {
                  return pack("Cs", 0x0A, $what);
              }
              elsif ($what >= -2147483648 && $what <= 2147483647) {
                  return pack("Cl", 0x0B, $what);
              }
              # Integers outside these ranges fall through to the final
              # return and are stored as strings.
          }
          elsif ($what =~ /^-?[0-9]+(?:\.[0-9]+)?(?:e[+-]\d+)?$/) {
              return pack("Cd", 0x0C, $what);
          }
          elsif (length($what) < 16) {
              # Short string: length in the high nibble, 0x0F in the low
              # nibble so the tag cannot collide with the 0x0C float tag.
              return pack("C", 0x0F | (length($what) << 4)) . $what;
          }
          elsif (length($what) <= 255) {
              return pack("CC", 0x0D, length($what)) . $what;
          }
          return pack("CS", 0x0E, length($what)) . $what;
      }
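For the "desquish" direction, a matching decoder might look roughly like this. It is equally untested in production. The name Unsquish is hypothetical, and the sketch assumes short strings are tagged with 0x0F in the low nibble and their length in the high nibble, so that they cannot be confused with the 0x0C float tag:

```perl
use strict;
use warnings;

# Map each fixed-width type byte to the pack() template of its payload.
my %NUMERIC_TEMPLATE = (
    0x01 => "C", 0x02 => "S", 0x04 => "L",    # unsigned 8/16/32-bit
    0x09 => "c", 0x0A => "s", 0x0B => "l",    # signed 8/16/32-bit
    0x0C => "d",                              # double-precision float
);

sub Unsquish {
    my ($packed) = @_;
    my $type = unpack("C", $packed);
    my $body = substr($packed, 1);

    # Fixed-width numbers: decode the payload with the matching template.
    if (exists $NUMERIC_TEMPLATE{$type}) {
        return unpack($NUMERIC_TEMPLATE{$type}, $body);
    }
    # Short string (assumed tag): the length lives in the high nibble.
    if (($type & 0x0F) == 0x0F) {
        return substr($body, 0, $type >> 4);
    }
    # Medium string: one length byte, then the text.
    if ($type == 0x0D) {
        my $len = unpack("C", $body);
        return substr($body, 1, $len);
    }
    # Long string: native 16-bit length, then the text.
    if ($type == 0x0E) {
        my $len = unpack("S", $body);
        return substr($body, 2, $len);
    }
    die sprintf("Unknown type byte 0x%02X", $type);
}

print Unsquish(pack("CC", 0x01, 200)), "\n";
```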