glwtta has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

Here's my bright idea: I need to pass some data to a perl script that will return an image generated from it. Because the data will be embedded into the image's src attribute (i.e. src='script.pl?data=whatever'), I need to make sure I encode it as something appropriate for a URL, and I would like to compress it as well.

Right now I have:

use Convert::Base32 qw();
use Compress::Zlib qw();

$data = Convert::Base32::encode_base32( Compress::Zlib::compress($text) );
So, first question - is Convert::Base32 the way to go? Or are there better alternatives?

Secondly, I am not getting great results with the compression - the data has a very small alphabet (it's basically a DNA sequence - 'ACTG' - plus numbers, some whitespace, and a bit of markup - <> and []) so I was thinking it would shrink to very little; is there a more appropriate library I should be using?
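A quick way to see what each stage costs is just to print the length after each step; the sample string below is only a stand-in for the real data:

use Convert::Base32 qw(encode_base32);
use Compress::Zlib  qw(compress);

# stand-in data, roughly the right shape (real sequences will differ)
my $text = ('ACTG' x 50) . ' <m>[12..34]</m> 42';

my $compressed = compress($text);
my $encoded    = encode_base32($compressed);

printf "raw: %d  compressed: %d  base32-encoded: %d\n",
       length($text), length($compressed), length($encoded);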

Re: compress data to pass in url? (here is a simple encoder and decoder)
by grinder (Bishop) on Jan 17, 2003 at 17:48 UTC

    Well, at a first glance, you want to encode A C G T. 4 possible characters, that's nice, you can encode that in two bits. You could therefore stuff 3 characters in 6 bits, which means 64 separate characters.

    AAA => 0
    AAC => 1
    AAG => 2
    AAT => 3
    ACA => 4
    ...
    TTT => 63

    Add those numbers to a reasonable offset, and we can coax nice ASCII characters out of them with the help of chr and recover them with ord.

    #! /usr/bin/perl -w
    use strict;

    # build our encoder and decoder
    my $offset = 0;
    my %encode;
    my %decode;
    for my $x( qw/A C G T/ ) {
        for my $y( qw/A C G T/ ) {
            for my $z( qw/A C G T/ ) {
                $encode{"$x$y$z"} = $offset;
                $decode{$offset}  = "$x$y$z";
                $offset++;
            }
        }
    }

    # encode the string
    my $in  = shift || 'AAAACTGACCGTTTT';
    my $enc = '';
    while( $in =~ /(...)/g ) {
        $enc .= chr( 48 + $encode{$1} );
    }
    print "encoded: $enc\n";

    # and back again
    my $dec = '';
    for( split //, $enc ) {
        $dec .= $decode{ ord($_) - 48 };
    }
    print "decoded: $dec\n";

    That almost works, as a proof of concept, except it contains some nasty encoded characters (at least as far as URLs are concerned) such as ? / ; and \. These are going to do Weird Things. In any event, it encodes the sample string AAAACTGACCGTTTT to 07QKo. You'll need to build up the encode and decode hashes from 0-9, a-z, A-Z and - and _ or some other safe characters (you want 64, remember?)

    Secondly, it has a bad feature in that the strings must be exact multiples of three. What you would have to do is pad out the string with Cs or Ts or whatever (at most two extra chars will be needed), and add a clip parameter equal to 0, 1 or 2 to tell how many chars to lop off the decoded string.

    This may or may not be of help, or use, but it was fun to write :)

    update: This was so much fun I couldn't stop thinking about it over dinner, and so here is a new and improved version. It offers URL-friendly encoding and it deals correctly with sequences that are not multiples of three (in the way I describe above). In re-reading the initial question, I see you want to use additional characters as well. In that case this scheme starts to fall apart. But you should be able to encode your markup in another parameter, and reference the parameter that holds this sequence in a templatish sort of way.

    #! /usr/bin/perl -w
    use strict;

    # build our encoder and decoder
    my @char = ( 0..9, 'a'..'z', 'A'..'Z', qw/- _/ );
    my %encode;
    my %decode;
    for my $x( qw/A C G T/ ) {
        for my $y( qw/A C G T/ ) {
            for my $z( qw/A C G T/ ) {
                my $ch = shift @char;
                $encode{"$x$y$z"} = $ch;
                $decode{$ch}      = "$x$y$z";
            }
        }
    }

    # encode the string
    my $in  = shift || 'AAAACTGACCGTTTT';
    my $enc = '';
    my $chunk;
    my $chunk_len;
    while( defined( $chunk = substr( $in, 0, 3, '' ))) {
        $chunk_len = length( $chunk );
        last if $chunk_len < 3;
        $enc .= $encode{$chunk};
    }

    # deal with leftover chunk
    if( $chunk_len ) {
        my $pad = 3 - $chunk_len;
        $enc = $pad . $enc . $encode{$chunk . 'A' x $pad};
    }
    else {
        $enc = "0$enc";
    }
    print "encoded: $enc\n";

    # and back again
    my $dec = '';
    my $pad = substr( $enc, 0, 1, '' );
    $dec .= $decode{$_} for split //, $enc;
    $dec = substr( $dec, 0, -$pad ) if $pad;
    print "decoded: $dec\n";

    As an added bonus, there's no more need to deal with ord and chr (of course, that would have been possible in the initial version too).

    All those uses of the magic number 3 should probably be abstracted to a constant, but I better stop now.


    print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: compress data to pass in url?
by John M. Dlugosz (Monsignor) on Jan 17, 2003 at 17:02 UTC
    Come up with your own compression system that maps a couple of consecutive DNA letters to one character. If your system uses only URL-friendly characters, there will be no need for base32. If there are only a few chars outside the legal URL range, then use URL escaping on just those, rather than puffing up the entire string.
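    For the "escape just those few" part, URI::Escape's uri_escape takes an optional second argument naming exactly which characters to encode; a minimal sketch, using the character set from the original question:

    use URI::Escape qw(uri_escape uri_unescape);

    my $seq = 'ACTG ACTG <m>[12..34]</m>';        # made-up sample data

    # percent-encode only whitespace and the markup characters;
    # the bases and digits pass through untouched
    my $escaped = uri_escape( $seq, '<>\[\] ' );

    my $back = uri_unescape($escaped);            # round-trips cleanly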

    You could use a Huffman table encoding system with fixed tables that you design based on feeding it some typical data. I think a context-aware scheme would be able to compress better, though.
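    For illustration, here is a tiny hand-picked prefix code standing in for a real Huffman table; the code lengths are invented rather than derived from actual frequencies, and the alphabet is deliberately limited to the four bases plus a space:

    use strict;
    use warnings;

    # hand-picked prefix-free codes: short for the common bases,
    # longer for the rarer space character
    my %code   = ( A => '00', C => '01', G => '10', T => '110', ' ' => '111' );
    my %decode = reverse %code;

    my $in   = 'ACTG ACTG';
    my $bits = join '', map { $code{$_} } split //, $in;

    # pack the bit string into bytes; remember the bit count so the
    # padding bits that pack() adds can be thrown away again
    my $packed = pack 'B*', $bits;
    my $nbits  = length $bits;

    # decode: walk the bits, emitting a character whenever the buffer
    # matches one of the (prefix-free) codes
    my ($out, $buf) = ('', '');
    for my $b ( split //, substr( unpack('B*', $packed), 0, $nbits ) ) {
        $buf .= $b;
        if ( exists $decode{$buf} ) { $out .= $decode{$buf}; $buf = ''; }
    }
    print "$out\n";    # ACTG ACTG

    The packed bytes would still need Base64 or URI escaping before they could ride in a URL, but the fixed table is where the shrinking happens.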

    How about posting the actual grammar? Maybe you'll get some concrete suggestions.

    —John

(jeffa) Re: compress data to pass in url?
by jeffa (Bishop) on Jan 17, 2003 at 17:13 UTC
    URI::Escape will handle the appropriate URL encoding, but having to do so seems a bit overkill ... have you looked into bioperl (in particular, Bio::Graphics) yet? Lots of wheels ready to go in that package.
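    For completeness, the drop-in version of that suggestion looks roughly like this (URI::Escape in place of Convert::Base32, everything else as in the original post); note that escaping raw zlib output turns most bytes into three characters, which eats back much of the compression:

    use Compress::Zlib qw(compress uncompress);
    use URI::Escape    qw(uri_escape uri_unescape);

    my $text  = 'ACTG' x 100;                     # stand-in data
    my $param = uri_escape( compress($text) );    # ready for ?data=...

    # and on the image script's side:
    my $text_back = uncompress( uri_unescape($param) );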

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      yes, I do use bioperl extensively (it's a life-saver, in fact), but in this case the actual image generation is trivial and not really biology specific.
Re: compress data to pass in url?
by hardburn (Abbot) on Jan 17, 2003 at 17:04 UTC

    Is there any reason why you can't use a POST instead of a GET? With POST, you wouldn't have to encode the data at all. You might still want compression, though.

      yes, I want to pass this data from the display page to a separate script that generates the images on the fly; this saves me from having temporary image files lying around.

      In other words, my script outputs an html page that has links to these "images" which the browser gets - the user doesn't do anything, so there is no opportunity for a POST.

        On the contrary, there are ways you can finagle this into a POST. One thing I've learned in my tiny dabble into more-interesting-than-HTML webpages is that you can use JavaScript to do un-formy forms. Here's a concept snippet of what your html could look like:

        <form name="image1" id="image1" method="post" action="image_gen.pl"> <input name="button1" type="button" style="HEIGHT: 28px; WIDTH: 120px" + value="View Image 1" LANGUAGE="javascript" onclick="return button1 +_onclick()"> <input type="hidden" name="image_info" value=" $perl "> </form>
        ...where '$perl' is your info for that first image, retrieved from your post'ed form via whatever method you love best. The trick to making it work without real data to enter, and to turning that generic button into a submit button, is in your headers' javascript section:
        function button1_onclick() {
            window.document.image1.submit();
        }
        Now I'm completely lacking any real javascript knowledge, and I've done scant amounts of web programming, but that's the $3.50 I have to offer your Loch Ness Monster, as unperl as it is.
        Hope that helps,
        -=rev=-
Re: compress data to pass in url?
by Fletch (Bishop) on Jan 17, 2003 at 17:16 UTC

    Rather than passing the entire chunk of data back, you might use Cache::Cache or some variant thereof to keep the actual data on the server side and instead pass back a shorter key of some sort as the parameter (for example, if the user's authenticated, perhaps username-#####). When the request comes back, use the key to retrieve the full data from the cache.
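    A minimal sketch of that flow, using Cache::FileCache from the Cache::Cache distribution; the MD5 key, namespace, expiry time and script name are made up for illustration:

    use strict;
    use warnings;
    use Cache::FileCache;
    use Digest::MD5 qw(md5_hex);

    my $cache = Cache::FileCache->new( {
        namespace          => 'seq_images',    # arbitrary namespace
        default_expires_in => 600,             # keep entries for ten minutes
    } );

    # page-generation side: stash the data, emit only the short key
    my $data = 'ACTG' x 200;                   # stand-in for the real payload
    my $key  = md5_hex($data);
    $cache->set( $key, $data );
    print qq{<img src="image_gen.pl?key=$key">\n};

    # image-script side: look the payload back up from the key it was given
    my $payload = $cache->get($key);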

Re: compress data to pass in url?
by Anonymous Monk on Jan 17, 2003 at 18:35 UTC
    Another item to note is that if your web server is HTTP 1.0 only, there is a limit to the length of the URL. This limitation is also present in IE up to and including version 5. I'd have to recommend using the POST method.
Re: compress data to pass in url?
by Aristotle (Chancellor) on Jan 18, 2003 at 16:02 UTC
    Use URI::Escape instead of Convert::Base32. Other than that you're fine, although you can spend hours coming up with more elaborate solutions.

    Makeshifts last the longest.

Re: compress data to pass in url?
by Anonymous Monk on Jan 17, 2003 at 22:27 UTC
    Just curious. What image processor are you using?
      All I need to do here is create a "thumbnail" view of the sequence and mark a few locations on it - GD.pm does just fine (though I am not sure it's technically an image "processor").
Re: compress data to pass in url?
by osama (Scribe) on Jan 20, 2003 at 06:27 UTC

    You should know that encoding the data in a text-readable format will increase its size, so you're basically losing the advantages of compression.

    how big is your data? how about a sample?

    How about using MIME::Base64? For binary data, Base32 carries 5 bits per character (a 60% size increase), while Base64 carries 6 bits per character (a 33% size increase).
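    A rough sketch of the Base64 route for comparison; encode_base64's second argument suppresses the newlines it would otherwise insert, and the tr swaps the two URL-hostile characters (the trailing '=' padding would still need escaping or stripping):

    use Compress::Zlib qw(compress uncompress);
    use MIME::Base64   qw(encode_base64 decode_base64);

    my $text = 'ACTG' x 100 . ' <m>[12..34]</m>';     # stand-in data

    my $b64 = encode_base64( compress($text), '' );   # '' => no line breaks
    $b64 =~ tr{+/}{-_};                               # URL-friendly alphabet

    # going the other way:
    ( my $raw = $b64 ) =~ tr{-_}{+/};
    my $text_back = uncompress( decode_base64($raw) );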