Given a script that downloads 20-30 fresh Web pages (text only; no images, and no spidering of links (update)) once each morning for one user, and logs the time it was run, how would you figure out the ideal time to pre-fetch those Web pages?

Update: I guess I wasn't clear. This question is not "how can I download a lot of Web pages quickly" or "how do I cache a Web page" or "how do I check a Web resource most efficiently using HTTP." (Although I do appreciate the efforts made and answers issued along these lines.)

The question is: given a collection of download times, how would you determine the best "typical" time to download a collection of Web pages? I offer my planned approach below; if you can think of something better, I'd love to hear about it. /Update

I have a script like the one described above, and do indeed run it once each morning. It takes 5-10 seconds to download the Web pages and parse (in the case of RSS/Atom feeds) or scrape (in the case of HTML) the pages and amalgamate the bits of info I care about into one page.

I recently got greedy and thought, "how can I make this even faster?" Like, make it run in 2 seconds or less.

I thought perhaps of caching the results of my script's HTTP fetches, so that subsequent runs of the script are faster. But since most of the Web pages change on a daily basis, and I rarely check more than once each day, and am the sole user (for now), this seemed like a waste of time. The cache would always be out of date.

Then I realized I usually check at roughly the same time every day. In a given 20 weekdays, I might check within 15 minutes of 8:30 am 15 times, closer to 6:30 am once, closer to 7:30 am once, and closer to 10 am three times.

The ideal time to pre-fetch those Web pages would probably be about 8 am -- early enough to be before 18 of the visits, so I hit the cache, and late enough that the results are less than 45 minutes old 15 times, so the cache is really fresh.

I am presently thinking about a simple, rough approach -- take the last two weeks of downloads, compute a time that would come before 80 percent of the downloads, and subtract 30 minutes from that.

I have two concerns:

1. Am I reinventing the wheel? I have looked into tools like Squid but do not believe I have found any existing tool, inside or outside the Perl world, that does what I want.

2. Is there a more flexible approach to be had without adding too much complexity or having to go back to university for proper math/stats/CS/AI schooling (I do not program for a living)? I have looked at the AI::Fuzzy* modules (see, for example, AI::FuzzyInference) but have not played around with them yet.

Flexibility could help if my needs change. For example, what if I start adding new collections of aggregated Web pages that I check more than once per day, or less than once per day? Obviously I would need a more sophisticated system.

Or I might add another user who turns out to be much less predictable. It might be nice if the script could say, "bleh, you are too random, let's not pre-fetch at all and suck unneeded bandwidth."
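That "too random" bail-out could be as simple as measuring the spread of a user's recent run times and refusing to pre-fetch when it is too wide. A sketch (again in Python; the 60-minute standard-deviation threshold is an arbitrary number I made up):

```python
# Hypothetical sketch: decide whether a user's check times are regular
# enough to justify pre-fetching at all. If the spread of recent run
# times is too wide, skip pre-fetching rather than waste bandwidth.

import statistics

def should_prefetch(run_minutes, max_stdev=60):
    """run_minutes: recent run times as minutes-since-midnight."""
    if len(run_minutes) < 2:
        return False  # not enough history to predict anything
    return statistics.stdev(run_minutes) <= max_stdev

regular = [505, 507, 510, 512, 508]      # clustered near 08:30
random_ish = [390, 505, 720, 950, 1130]  # scattered across the day
print(should_prefetch(regular))     # → True
print(should_prefetch(random_ish))  # → False
```

A fancier version could scale the freshness buffer with the spread instead of using a hard cutoff, but the yes/no check already covers the "bleh, you are too random" case.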

After all, if I *only* want this pre-fetching to help just me in this one use scenario, I can just eyeball my own script invocations, pick a time (like 8 am) and implement the cache. I'd like to come up with something that can be fast for other people.

Any general thoughts appreciated. Obviously, I am not yet at the coding stage.


In reply to Predictive HTTP caching in Perl by ryantate
