r1n0 has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks,

I am looking for information about HTTP::Proxy's ability to dish out files I have stored in a local cache. The files end up in the cache as follows:
1. A self-written LWP agent collects web resources and stores the files in a cache area (already completed)
2. The headers of the resources are stored in the same area (already completed)

I want to start with the following, but add filters so that content I have already stored is pushed back to clients from the cache. For anything else, the request should pass through and the content be pulled from the site. I have no intention of caching content that I don't already have cached: if a client's request passes through to the site, the content is not stored at all.

    use HTTP::Proxy;

    # initialisation
    my $proxy = HTTP::Proxy->new( port => 3128 );

    # alternate initialisation
    my $proxy = HTTP::Proxy->new;
    $proxy->port( 3128 ); # the classical accessors are here!

    # this is a MainLoop-like method
    $proxy->start;

I have been testing HTTP::Proxy more and more, and I like how it works with complex sites, too. I already have tens of millions of resources in the cache, so I would prefer not to have to refetch them all. I have absolutely no experience setting up or writing HTTP::Proxy filters, so any pointers on how to get this to work would be most appreciated.

Thanks in advance for your help.

Replies are listed 'Best First'.
Re: HTTP::Proxy filter for dishing out own files
by almut (Canon) on Oct 05, 2009 at 21:31 UTC

    Here's a snippet that will hopefully get you started:

        #!/usr/bin/perl
        use strict;
        use warnings;

        use HTTP::Proxy;
        use HTTP::Proxy::HeaderFilter::simple;
        use HTTP::Response;
        use Digest::MD5 qw(md5_hex);

        my $proxy = HTTP::Proxy->new( port => 3128 );

        my $filter = HTTP::Proxy::HeaderFilter::simple->new(
            sub {
                my ($self, $headers, $request) = @_;

                my $uri = $request->uri();
                my $cache_path = md5_hex($uri);

                # (depending on the file system being used, you might want to
                # have the cache files spread across several directory levels;
                # you'd of course need to have stored them that way in the first place...)
                substr($cache_path, $_, 0) = "/" for (2,5,8);

                # prepend base path
                $cache_path = "/path/to/cache/$cache_path";

                # debug (the query part is undefined for URIs without one)
                printf "URI = %s\n", $uri;
                printf "host = %s\n", $uri->host();
                printf "path = %s\n", $uri->path();
                printf "query = %s\n", $uri->query() // '';
                printf "cache_path = %s\n", $cache_path;

                if (-s $cache_path) { # is it in the cache?

                    # this would of course be the content read from the cache file...
                    my $content = "<html><body>...yadda yadda...</body></html>";

                    # create response
                    my $res = HTTP::Response->new(200);
                    $res->content_type('text/html');
                    $res->content($content);

                    # send back (short-circuit normal content fetching)
                    $self->proxy()->response($res);
                }
            }
        );

        $proxy->push_filter( request => $filter );

        $proxy->start;

    For example, when fetching your PM node via that proxy, the debug output showing the request URI and cache path would look like:

        URI = http://perlmonks.org/?node_id=799308
        host = perlmonks.org
        path = /
        query = node_id=799308
        cache_path = /path/to/cache/0c/12/7f/026c96a135e45706194ba5b1f8

    And if you had stored associated content under the cache path 0c/12/7f/026c96a135e45706194ba5b1f8, it would be served instead of the real remote content...
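The mapping from request URI to cache path can be pulled out into a small function, which makes it easier to reuse in whatever script fills the cache. This is only a sketch: the name `cache_path_for` and the `$base` argument are made up for illustration, and it assumes the cache was populated with the same MD5-based layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper (not part of HTTP::Proxy): map a URI to its
# location in the cache. $base is an assumed cache root directory.
sub cache_path_for {
    my ($uri, $base) = @_;

    # stringify in case a URI object is passed; yields 32 hex characters
    my $path = md5_hex("$uri");

    # insert "/" at offsets 2, 5 and 8; the later offsets already
    # account for the slashes inserted before them
    substr($path, $_, 0) = "/" for (2, 5, 8);

    return "$base/$path";
}

print cache_path_for("http://perlmonks.org/?node_id=799308", "/path/to/cache"), "\n";
```

Inside the filter, the lookup would then reduce to something like `my $cache_path = cache_path_for($request->uri(), "/path/to/cache");`.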

      Thank you very much for the response! :-)

      I have a follow on question before I try what you sent. For the response code:
          if (-s $cache_path) { # is it in the cache?

              # this would of course be the content read from the cache file...
              my $content = "<html><body>...yadda yadda...</body></html>";

              # create response
              my $res = HTTP::Response->new(200);
              $res->content_type('text/html');
              $res->content($content);

              # send back (short-circuit normal content fetching)
              $self->proxy()->response($res);
          }

      I would like to know how I can assign the contents of an entire file to the HTTP::Response. The response will have the necessary header fields in it if I can just throw the entire file back to the client.
      I might need to supply my own response code, but that will be easy enough.
      I would prefer not to have to read in all of the original headers and assign to each component of HTTP::Response, if I can avoid it.

      Thanks in advance for your help.
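One possible way to do this, sketched under a big assumption: each cache file holds the complete raw response exactly as it came off the wire (status line, header lines, a blank line, then the body). In that case `HTTP::Response->parse` can build the response object from the file contents in one go, without assigning headers individually. The helper name `response_from_cache` is made up for illustration. If the headers are stored in a separate file, or the status line is missing, the two pieces would have to be joined first.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: read a cached file and turn it into an
# HTTP::Response. Assumes the file contains the raw response message:
# status line, headers, blank line, body.
sub response_from_cache {
    my ($cache_path) = @_;

    # read the whole file as bytes; give up quietly if it can't be opened
    open my $fh, '<:raw', $cache_path or return;
    local $/;               # slurp mode
    my $raw = <$fh>;
    close $fh;

    # parse() splits off the status line and headers and keeps the
    # remainder as the content
    return HTTP::Response->parse($raw);
}
```

In the filter above, the body of the `if (-s $cache_path)` block could then become `my $res = response_from_cache($cache_path); $self->proxy()->response($res) if $res;`. If a different status is wanted, it can still be overridden afterwards with `$res->code(...)`.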