in reply to URL string compression?

What characters are valid in the uncompressed filename?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: URL string compression?
by punch_card_don (Curate) on Feb 13, 2006 at 23:34 UTC
    Currently using only letters, any case; numbers; dashes, underscores. That's it.

    But other characters would be acceptable if they make compression possible.





    Forget that fear of gravity,
    Get a little savagery in your life.
      If you have a case-sensitive system, that's 64 safe characters. If you can compress (by packing numbers or otherwise) the data down to 33 bytes (floor((90/2) * (log2(64)/8))), you could use the following to convert to safe characters:
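      use MIME::Base64;

      # Base64-encode, then swap '+' and '/' for the filename-safe '-' and '_'.
      sub encode {
          my ($compressed) = @_;
          # The empty second argument stops encode_base64 from adding newlines.
          my $encoded = encode_base64($compressed, '');
          $encoded =~ s{\+}{-}g;
          $encoded =~ s{/}{_}g;
          return $encoded;
      }

      # Undo the substitution, then Base64-decode.
      sub decode {
          my ($encoded) = @_;
          $encoded =~ s{-}{+}g;
          $encoded =~ s{_}{/}g;
          my $compressed = decode_base64($encoded);
          return $compressed;
      }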

      Update: On second thought, if people are going to save these files on their own PCs, you'll need to be case-insensitive. That leaves 38 safe characters. If you wrote a Base32 encoder modeled on Base64 (a simple task), you'd have to compress the data down to 28 bytes (floor((90/2) * (log2(32)/8))).
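      For illustration, here is a minimal sketch of what such a Base32 pair might look like. The names encode32 and decode32 are made up for this example, and the alphabet (a-z plus 2-7, the RFC 4648 set in lower case) is just one possible choice of 32 symbols from the 38 safe characters. It does not add padding and does not validate its input.

```perl
use strict;
use warnings;

# 32 case-insensitive, filename-safe symbols (an assumed choice).
my @ALPHABET = ('a' .. 'z', '2' .. '7');
my %VALUE_OF;
@VALUE_OF{@ALPHABET} = (0 .. 31);

sub encode32 {
    my ($compressed) = @_;
    my $bits = unpack('B*', $compressed);            # bytes -> string of 0s and 1s
    $bits .= '0' x ((5 - length($bits) % 5) % 5);    # pad to a multiple of 5 bits
    return join '', map { $ALPHABET[ oct("0b$_") ] } $bits =~ /(.{5})/g;
}

sub decode32 {
    my ($encoded) = @_;
    # lc() makes decoding case-insensitive.
    my $bits = join '', map { sprintf('%05b', $VALUE_OF{$_}) } split //, lc $encoded;
    $bits = substr($bits, 0, 8 * int(length($bits) / 8));   # drop the padding bits
    return pack('B*', $bits);
}
```

      With this sketch, 28 bytes of compressed data come out as 45 characters, which matches the 90/2 used in the formula above.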

      Update: Fixed atrocious math.

        Original filename was 125 characters.

        "Compressed" filename is 175 characters.





      As you've seen, with 64 characters in the input alphabet each character carries 6 bits, so 90 characters * 6 bits = 540 bits = 67.5 (mostly unacceptable) 8-bit chars is the best a "simple transform" compression can do. That only shrinks the name to 3/4 of its original length, and only if all the 8-bit chars were acceptable in a filename, which they aren't.

      Your best hope is if your filenames can be split into various fields that can each be represented by a number that is shorter than the field's text representation. For example: if one component of the name was one of 'North', 'NorthEast', 'East', 'SouthEast', 'South', 'SouthWest', 'West', 'NorthWest', that same field could be replaced by a digit 0-7, or maybe just 4 bits in conjunction with some other field of up to 3 bits.
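      As a rough sketch of that idea, the compass example could be coded as below. The function names shorten_direction and expand_direction are hypothetical, and the numbering simply follows the order in which the values are listed above. Your real fields will need their own tables.

```perl
use strict;
use warnings;

# The eight values from the example, numbered 0-7 in listed order.
my @DIRECTIONS = qw(North NorthEast East SouthEast South SouthWest West NorthWest);
my %CODE_OF;
@CODE_OF{@DIRECTIONS} = (0 .. $#DIRECTIONS);

# Replace a direction name with its single-digit code.
sub shorten_direction {
    my ($name) = @_;
    die "unknown direction '$name'" unless exists $CODE_OF{$name};
    return $CODE_OF{$name};
}

# Recover the direction name from its code.
sub expand_direction {
    my ($code) = @_;
    die "unknown code '$code'" unless $code =~ /^[0-7]$/;
    return $DIRECTIONS[$code];
}
```

      Here 'SouthWest' (9 characters) becomes the single digit 5, and the digit can be turned back into the name on the way out.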

      Without seeing examples of the filenames, and the range of values the fields within represent, it's hard to be more helpful.



      Edit: g0n - reparented at author's request