Predictive HTTP caching in Perl
by ryantate (Friar) on May 03, 2006 at 03:51 UTC
ryantate has asked for the wisdom of the Perl Monks concerning the following question:
Given a script that downloads 20-30 fresh Web pages (text only, no images and not spidering links (update)) once each morning for one user, and logs the time it was run, how would you figure out the ideal time to pre-fetch those Web pages?
Update: I guess I wasn't clear. This question is not "how can I download a lot of Web pages quickly" or "how do I cache a Web page" or "how do I check a Web resource most efficiently using HTTP." (Although I do appreciate the efforts made and answers issued along these lines.)
The question is given a collection of download times, how would you determine the best "typical" time to download a collection of Web pages. I offer my planned approach below, if you can think of something better, I'd love to hear about it. /Update
I have a script like the one described above, and do indeed run it once each morning. It takes 5-10 seconds to download the Web pages and parse (in the case of RSS/Atom feeds) or scrape (in the case of HTML) the pages and amalgamate the bits of info I care about into one page.
I recently got greedy and thought, "how can I make this even faster?" Like, make it run in 2 seconds or less.
I thought perhaps of caching the results of my script's HTTP fetches, so that subsequent runs of the script are faster. But since most of the Web pages change on a daily basis, and I rarely check more than once each day, and am the sole user (for now), this seemed like a waste of time. The cache would always be out of date.
Then I realized I usually check at roughly the same time every day. Over a given 20 weekdays, I might check within 15 minutes of 8:30 am 15 times, closer to 6:30 am once, closer to 7:30 am once, and closer to 10 am three times.
The ideal time to pre-fetch those Web pages would probably be about 8 am -- early enough to be before 18 of the visits, so I hit the cache, and late enough that the results are less than 45 minutes old 15 times, so the cache is really fresh.
I am presently thinking about a simple, rough approach -- take the last two weeks of downloads, compute a time that would come before 80 percent of the downloads, and subtract 30 minutes from that.
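The rough approach above can be sketched in a few lines. This is only an illustration of the idea, not a finished implementation: the log data is hypothetical, and "a time that comes before 80 percent of the downloads" is taken to mean the 20th-percentile check time (minutes past midnight), with 30 minutes then subtracted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Given recent download times as minutes past midnight, find a time
# that precedes 80% of them, then back off 30 minutes. A sketch of
# the rough approach described above, on made-up data.
sub prefetch_minute {
    my @minutes = sort { $a <=> $b } @_;

    # 20th-percentile index: 80% of the checks fall at or after this one.
    my $i = int( 0.2 * @minutes );
    return $minutes[$i] - 30;
}

# Hypothetical two-week log: mostly near 8:30 am (510), one early
# 6:30 am check (390), a few late-morning stragglers.
my @log = ( 390, 450, 505, 508, 510, 511, 512, 514, 515, 516,
            518, 520, 600, 605 );
my $m = prefetch_minute(@log);
printf "Pre-fetch at %02d:%02d\n", int( $m / 60 ), $m % 60;
```

On this sample data the script settles on 07:55, which matches the intuition above: early enough to beat the typical 8:30 visit, but not so early that the cache goes stale.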
I have two concerns:
1. Am I reinventing the wheel? I have looked into tools like squid but do not believe I have found any existing tool, inside or outside the Perl world, that does what I want.
2. Is there a more flexible approach to be had without adding too much complexity or having to go back to university for a proper math/stats/cs/ai schooling (I do not program for a living)? I have looked at AI::Fuzzy* modules (see for example AI::FuzzyInference) but not played around with them yet.
Flexibility could help if my needs change. For example, suppose I start adding new collections of aggregated Web pages that I check more than once per day, or less than once per day. Obviously I would need a more sophisticated system.
Or I might add another user who turns out to be much less predictable. It might be nice if the script could say, "bleh, you are too random, let's not pre-fetch at all and suck unneeded bandwidth."
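One cheap way to let the script say "bleh, you are too random" might be to look at the spread of a user's check times and bail out of pre-fetching when it is too wide. The sketch below uses the standard deviation of check times; the 90-minute cutoff is an arbitrary assumption, and the sample users are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decide whether a user's check times (minutes past midnight) are too
# scattered to be worth pre-fetching for. Threshold is a guess.
sub too_random {
    my @minutes = @_;
    my $n    = @minutes;
    my $mean = 0;
    $mean += $_ / $n for @minutes;
    my $var = 0;
    $var += ( $_ - $mean )**2 / $n for @minutes;
    return sqrt($var) > 90;    # more than ~1.5 hours of spread
}

# A predictable user (clustered near 8:30 am) vs. one all over the clock.
print too_random( 505, 510, 512, 515, 600 )   ? "skip\n" : "pre-fetch\n";
print too_random( 300, 505, 700, 900, 1200 )  ? "skip\n" : "pre-fetch\n";
```

For the clustered user the spread is about 36 minutes, so pre-fetching goes ahead; the scattered user blows well past the cutoff and is skipped, saving the unneeded bandwidth.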
After all, if I *only* want this pre-fetching to help just me in this one use scenario, I can just eyeball my own script invocation and pick a time (like 8 am) and implement the cache. I'd like to come up with something that can be fast for other people.
Any general thoughts appreciated. Obviously, I am not yet at the coding stage.