in reply to Re^2: URL string compression?
in thread URL string compression?

If you have a case-sensitive system, that's 64 safe characters. If you can compress the data (by packing numbers or otherwise) down to 33 bytes (floor((90/2) * (log2(64)/8))), you could use the following to convert it to safe characters:

use MIME::Base64;

# Turn compressed bytes into URL- and filename-safe characters.
sub encode {
    my ($compressed) = @_;
    # The empty second argument stops encode_base64 from adding newlines.
    my $encoded = encode_base64($compressed, '');
    $encoded =~ tr{+/}{-_};
    return $encoded;
}

# Reverse of encode(): restore the Base64 alphabet, then decode.
sub decode {
    my ($encoded) = @_;
    $encoded =~ tr{-_}{+/};
    return decode_base64($encoded);
}
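Newer releases of MIME::Base64 can do the same character swap for you. The sketch below assumes your installed version exports encode_base64url and decode_base64url on request (I believe they arrived in 3.11, but check your copy). They use - and _ in place of + and /, add no newlines, and drop the = padding.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64url decode_base64url);

# 33 bytes of stand-in "compressed" data (all 0xFB, chosen so the
# output would contain '+' in plain Base64).
my $compressed = "\xfb" x 33;

my $encoded = encode_base64url($compressed);
my $decoded = decode_base64url($encoded);

printf "%d bytes -> %d safe characters\n",
    length($compressed), length($encoded);
print "round trip ok\n" if $decoded eq $compressed;
```

With 33 bytes of input the result is 44 characters, all drawn from A-Z, a-z, 0-9, - and _.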

Update: On second thought, if people are going to save these files on their own PCs, you'll need to be case-insensitive. That leaves 38 safe characters. If you write a Base32 encoder modeled on Base64 (a simple task), you'll have to compress the data down to 28 bytes (floor((90/2) * (log2(32)/8))).
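Here is a minimal sketch of such a Base32 encoder. The alphabet (lowercase a-z plus 2-7, with no padding characters) is my own assumption, borrowed from the usual Base32 table; any 32 case-insensitive safe characters would do. The function names are made up for illustration.

```perl
use strict;
use warnings;

my @ALPHABET = ('a' .. 'z', '2' .. '7');
my %VALUE;
@VALUE{@ALPHABET} = (0 .. 31);

sub encode_base32 {
    my ($bytes) = @_;
    my $bits = unpack('B*', $bytes);
    # Pad with zero bits so the length is a multiple of 5.
    $bits .= '0' x ((5 - length($bits) % 5) % 5);
    return join '', map { $ALPHABET[ oct("0b$_") ] } $bits =~ /(.{5})/g;
}

sub decode_base32 {
    my ($text) = @_;
    # lc() makes decoding case-insensitive.
    my $bits = join '', map { sprintf('%05b', $VALUE{$_}) } split //, lc $text;
    # Drop the zero bits that padded the final 5-bit group.
    $bits = substr($bits, 0, length($bits) - length($bits) % 8);
    return pack('B*', $bits);
}

my $encoded = encode_base32('x' x 28);
printf "28 bytes -> %d characters\n", length($encoded);
```

Because each character carries 5 bits, 28 bytes (224 bits) come out as 45 characters.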

Update: Fixed atrocious math.

Re^4: URL string compression?
by punch_card_don (Curate) on Feb 13, 2006 at 23:52 UTC
    Original filename was 125 characters.

    "Compressed" filename is 175 characters.




    Forget that fear of gravity,
    Get a little savagery in your life.

      Sorry, I wasn't clear.

      You have two problems. The first is compression. The second is encoding the compressed result into safe characters. I was addressing the latter problem.

      If you use my suggested encoding method, you first need to compress your data down to 28 bytes. Base32 will convert your compressed data into 45 (ceil(28*(8/log2(32)))) safe characters.
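      The arithmetic can be checked with a few lines of Perl. This is only a sketch: the helper names are made up, and it takes the bits per character directly (6 for Base64, 5 for Base32) so everything stays in integer math.

```perl
use strict;
use warnings;

# Largest number of bytes that fits in $chars output characters.
sub max_bytes {
    my ($chars, $bits_per_char) = @_;
    return int($chars * $bits_per_char / 8);
}

# Number of output characters needed for $bytes input bytes (rounded up).
sub encoded_length {
    my ($bytes, $bits_per_char) = @_;
    return int(($bytes * 8 + $bits_per_char - 1) / $bits_per_char);
}

printf "Base64: %d bytes fit in 45 characters\n", max_bytes(45, 6);      # 33
printf "Base32: %d bytes fit in 45 characters\n", max_bytes(45, 5);      # 28
printf "Base32: 28 bytes need %d characters\n",   encoded_length(28, 5); # 45
```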

      What information is contained in the original file name? Consistently compressing by 78% (1 - 28/125) will be hard, and will only be possible with intimate knowledge of the data to compress.

        OH, OK. Encode is for AFTER we've compressed to make sure that the compressed filename is made up of only "safe" characters.

        OK, but now that compression...

        The filename is made up of codes that tell what is in the file. Fortunately, the structure is well defined. The order of the codes is fixed. But they may be of varying lengths and composition.

        For example:

        2006-asdf-qwerty-123_456_789-

        where we know in advance that the filename field separator is a dash, the first code is the year the file was created, the second is a code for the subject, the third is a code for the author, and the fourth will be references to chemicals mentioned in the work, but this list can have a varying number of members. For example.
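        A fixed field order like this lends itself to pack. The sketch below is only an illustration under assumptions of mine: that the chemical references are non-negative integers separated by underscores, and that the subject and author codes are at most 255 bytes long. pack_name and unpack_name are hypothetical names.

```perl
use strict;
use warnings;

# Template: n    = year as a 16-bit integer
#           C/a* = length-prefixed string (subject, then author)
#           w*   = chemical references as BER-compressed integers
my $TEMPLATE = 'n C/a* C/a* w*';

sub pack_name {
    my ($name) = @_;
    # split drops the empty field after the trailing dash.
    my ($year, $subject, $author, $chems) = split /-/, $name;
    my @refs = split /_/, $chems // '';
    return pack($TEMPLATE, $year, $subject, $author, @refs);
}

sub unpack_name {
    my ($packed) = @_;
    # Returns (year, subject, author, ref1, ref2, ...).
    return unpack($TEMPLATE, $packed);
}

my $packed = pack_name('2006-asdf-qwerty-123_456_789-');
printf "%d bytes\n", length($packed);
print join(' ', unpack_name($packed)), "\n";
```

        For the sample name this gives 19 bytes, down from 29 characters. The numbers shrink well, but the subject and author codes are stored verbatim. Getting anywhere near 28 bytes for a 125-character name would need something like lookup tables that replace each code with a small integer, which depends on knowing the full set of codes.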



