in reply to Imploding URLs

Call me crazy, but aren't there already highly efficient methods for compressing files, such as gzip?


($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Re^2: Imploding URLs
by mobiGeek (Beadle) on Jun 10, 2005 at 02:59 UTC
    You are crazy. :-)

    So here's the bigger gist. I am improving a special-purpose HTTP proxy server that rewrites URLs in the pages it fetches so that they all point back to itself (e.g. the URL "http://www.yahoo.com/" gets rewritten as "http://my_proxy?url=http://www.yahoo.com/"). So though I have a large collection of URLs (from my logs), I need to "implode" URLs on a one-by-one basis. GZip and the like don't do very much on a single URL.
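    A minimal sketch of that rewriting step (the "http://my_proxy" host is the hypothetical one from the example above; note that the embedded URL should really be escaped, which the example as written glosses over):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Hypothetical proxy address, taken from the example in the post.
my $proxy = 'http://my_proxy';

# Rewrite a URL so it points back at the proxy, with the original
# URL carried (percent-escaped) in the query string.
sub rewrite {
    my $url = shift;
    return "$proxy?url=" . uri_escape($url);
}

print rewrite('http://www.yahoo.com/'), "\n";
# prints http://my_proxy?url=http%3A%2F%2Fwww.yahoo.com%2F
```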

    Finding the collection of "top" substrings has already reduced my downloads by 20% on a given page, but that was done by hand for a single test page with only 30 or so URLs in it.

    So the problem as stated stands...I wish it were as simple as GZip/Compress. In fact, I tried those, and in many cases the URLs actually came out larger (for short URLs)...especially once the data is encrypted and base64'ed...
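    To put numbers on that, here is a rough sketch (using the core IO::Compress::Gzip and MIME::Base64 modules) of why gzip loses on a single short URL:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use MIME::Base64 qw(encode_base64);

my $url = 'http://www.yahoo.com/';
gzip \$url => \my $zipped or die "gzip failed: $GzipError";
my $b64 = encode_base64($zipped, '');    # '' = no line breaks

printf "original: %d bytes, gzip+base64: %d bytes\n",
    length($url), length($b64);
# The gzip header and trailer alone cost 18 bytes, and base64
# then inflates the result by a further third, so a 21-byte URL
# ends up longer than it started.
```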

    mG.

      So though I have a large collection of URLs (from my logs), I need to "implode" URLs on a one-by-one basis.

      Why? Is the space savings that significant?

      If all you have is a handful of substitutions, you can probably hand-pick the strings:

      http www. .com .org .net :// index .htm .jpg .gif google yahoo mail news ebay
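      A minimal sketch of that hand-picked approach, mapping each substring in the list above to a single spare high-bit byte (the byte range and the longest-match ordering are my assumptions, not anything from the post):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-picked substrings, sorted longest first so that longer
# matches win in the alternation below.
my @tokens = sort { length($b) <=> length($a) }
    qw(http www. .com .org .net :// index .htm .jpg .gif
       google yahoo mail news ebay);

# Map each token to a single high-bit byte unlikely to appear
# in a plain URL (an assumption: real URLs here are ASCII).
my %implode = map { ($tokens[$_], chr(0x80 + $_)) } 0 .. $#tokens;
my %explode = reverse %implode;

my $match = join '|', map { quotemeta } @tokens;

sub implode_url { (my $u = shift) =~ s/($match)/$implode{$1}/g; $u }
sub explode_url { (my $u = shift) =~ s/([\x80-\x9f])/$explode{$1}/g; $u }

my $url  = 'http://www.yahoo.com/index.htm';
my $tiny = implode_url($url);
printf "%d -> %d bytes\n", length($url), length($tiny);
die "round-trip failed" unless explode_url($tiny) eq $url;
```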
        Yes, the savings is quite significant. From a hand-selected list of one user's habits, I was able to reduce some pages by more than 20%.

        The other thing is that this is not a collection of URLs from across the entire web. The URLs being crawled vary, but the proxy is part of a kind of "portal". So there are potentially thousands of URLs, but they come from a select list of sites. Hence the reason I am looking for a weighting of substrings.

        If one URL or one particular site (i.e. a particular substring) is crawled extremely frequently, then imploding that string might save much more bandwidth than simply imploding "http://" in all URLs.

        mG.