ryantate has asked for the wisdom of the Perl Monks concerning the following question:
Update: I guess I wasn't clear. This question is not "how can I download a lot of Web pages quickly" or "how do I cache a Web page" or "how do I check a Web resource most efficiently using HTTP." (Although I do appreciate the efforts made and answers issued along these lines.)
The question is: given a collection of download times, how would you determine the best "typical" time to download a collection of Web pages? I offer my planned approach below; if you can think of something better, I'd love to hear about it. /Update
I have a script like the one described above, and do indeed run it once each morning. It takes 5-10 seconds to download the Web pages and parse (in the case of RSS/Atom feeds) or scrape (in the case of HTML) the pages and amalgamate the bits of info I care about into one page.
I recently got greedy and thought, "how can I make this even faster?" Like, make it run in 2 seconds or less.
I thought perhaps of caching the results of my script's HTTP fetches, so that subsequent runs of the script are faster. But since most of the Web pages change on a daily basis, and I rarely check more than once each day, and am the sole user (for now), this seemed like a waste of time. The cache would always be out of date.
Then I realized I usually check at roughly the same time every day. In a given 20 weekdays, I might check within 15 minutes of 8:30 am 15 times, closer to 6:30 am once, closer to 7:30 am once, and closer to 10 am three times.
The ideal time to pre-fetch those Web pages would probably be about 8 am -- early enough to be before 18 of the visits, so I hit the cache, and late enough that the results are less than 45 minutes old 15 times, so the cache is really fresh.
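To make that arithmetic checkable, here is a minimal sketch that scores a candidate pre-fetch time against a log of visit times. The `score_prefetch` name and the sample data are assumptions for illustration: the 20 visits are pinned to the exact clock times mentioned above rather than taken from a real log, and the cache is assumed to be empty (or too stale to count) before each day's pre-fetch.

```perl
use strict;
use warnings;

# Visit times in minutes after midnight, approximating the 20 weekdays
# described above (the exact minutes are assumed, not logged data).
my @visits = (
    (8 * 60 + 30) x 15,    # 15 checks near 8:30 am
    6 * 60 + 30,           # one near 6:30 am
    7 * 60 + 30,           # one near 7:30 am
    (10 * 60) x 3,         # three near 10 am
);

# Count visits that come at or after the pre-fetch (cache hits) and, of
# those, how many see a copy no older than $max_age minutes.
sub score_prefetch {
    my ($prefetch, $max_age, @times) = @_;
    my ($hits, $fresh) = (0, 0);
    for my $t (@times) {
        next if $t < $prefetch;    # visit came before the pre-fetch: a miss
        $hits++;
        $fresh++ if $t - $prefetch <= $max_age;
    }
    return ($hits, $fresh);
}

my ($hits, $fresh) = score_prefetch(8 * 60, 45, @visits);
printf "8:00 am pre-fetch: %d of %d visits hit the cache, %d within 45 minutes\n",
    $hits, scalar @visits, $fresh;
```

With this sample data an 8 am pre-fetch scores 18 hits out of 20, 15 of them within 45 minutes, which matches the eyeballed figures. Calling the same function in a loop over candidate times would show how the two counts trade off against each other.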
I am presently thinking about a simple, rough approach -- take the last two weeks of downloads, compute a time that would come before 80 percent of the downloads, and subtract 30 minutes from that.
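A rough sketch of that rule, under stated assumptions: visit times are kept as minutes after midnight, the log never wraps past midnight, and the function name `prefetch_time` and the sample data are invented for illustration. "A time that comes before 80 percent of the downloads" is read here as: sort the times and skip the earliest 20 percent.

```perl
use strict;
use warnings;

# Return a pre-fetch time (minutes after midnight) that precedes at least
# $coverage_pct percent of the logged visits, moved $lead minutes earlier.
sub prefetch_time {
    my ($coverage_pct, $lead, @times) = @_;
    return unless @times;
    my @sorted = sort { $a <=> $b } @times;

    # Number of earliest visits we accept missing (20 percent when
    # $coverage_pct is 80). Integer percentages avoid float rounding.
    my $skip = int(scalar(@sorted) * (100 - $coverage_pct) / 100);
    $skip = $#sorted if $skip > $#sorted;

    my $when = $sorted[$skip] - $lead;
    return $when < 0 ? 0 : $when;    # do not run past midnight backwards
}

# Assumed sample: the 20 weekday visits described earlier, pinned to
# exact clock times rather than taken from a real log.
my @visits = (
    (8 * 60 + 30) x 15,
    6 * 60 + 30,
    7 * 60 + 30,
    (10 * 60) x 3,
);

my $when = prefetch_time(80, 30, @visits);
printf "Suggested pre-fetch time: %02d:%02d\n", int($when / 60), $when % 60;
```

On this sample the rule lands on 08:00, the same time picked by eye. The 80 percent and 30 minute figures are the ones proposed above; both are passed in as parameters so they are easy to tune.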
I have two concerns:
1. Am I reinventing the wheel? I have looked into tools like squid but have not found any existing tool, inside or outside the Perl world, that does what I want.
2. Is there a more flexible approach to be had without adding too much complexity or having to go back to university for a proper math/stats/cs/ai schooling (I do not program for a living)? I have looked at AI::Fuzzy* modules (see for example AI::FuzzyInference) but not played around with them yet.
Flexibility could help if my needs change. For example, say I start adding new collections of aggregated Web pages that I check more than once per day or less than once per day. Then I would need a more sophisticated system.
Or I might add another user who turns out to be much less predictable. It might be nice if the script could say, "bleh, you are too random, let's not pre-fetch at all and suck unneeded bandwidth."
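One simple way to put a number on "too random", offered as a sketch rather than a recommendation: find the largest share of a user's visits that fits inside any single time window, and skip pre-fetching when that share is low. The function names, the 45-minute window, the 50 percent cutoff, and both sample logs are assumptions for illustration. Times are minutes after midnight and wraparound at midnight is ignored.

```perl
use strict;
use warnings;

# Largest share of visits that fall inside any single window of $width
# minutes. A low share suggests the user has no habitual check time.
sub best_window_share {
    my ($width, @times) = @_;
    return 0 unless @times;
    my @sorted = sort { $a <=> $b } @times;
    my ($best, $lo) = (0, 0);
    for my $hi (0 .. $#sorted) {
        # Advance the left edge until the window spans at most $width minutes.
        $lo++ while $sorted[$hi] - $sorted[$lo] > $width;
        my $count = $hi - $lo + 1;
        $best = $count if $count > $best;
    }
    return $best / @sorted;
}

# Pre-fetch only when enough visits cluster together.
sub should_prefetch {
    my ($width, $min_share, @times) = @_;
    return best_window_share($width, @times) >= $min_share;
}

# Assumed sample logs: a habitual user and an unpredictable one.
my @regular = ((8 * 60 + 30) x 15, 6 * 60 + 30, 7 * 60 + 30, (10 * 60) x 3);
my @erratic = (360, 420, 505, 560, 640, 700, 790, 850, 930, 1010);

for my $user ([regular => \@regular], [erratic => \@erratic]) {
    my ($name, $times) = @$user;
    printf "%s: %s\n", $name,
        should_prefetch(45, 0.5, @$times) ? "pre-fetch" : "skip, too random";
}
```

The window width plays the same role as the freshness limit: it is the longest gap between pre-fetch and visit that still counts as a useful cache hit.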
After all, if I want this pre-fetching to help *only* me in this one use scenario, I can just eyeball my own script invocations, pick a time (like 8 am), and implement the cache. I'd like to come up with something that can be fast for other people.
Any general thoughts appreciated. Obviously, I am not yet at the coding stage.
Replies are listed 'Best First'.
Re: Predictive HTTP caching in Perl
by merlyn (Sage) on May 03, 2006 at 05:55 UTC
by ryantate (Friar) on May 03, 2006 at 06:12 UTC
Re: Predictive HTTP caching in Perl
by brian_d_foy (Abbot) on May 03, 2006 at 04:11 UTC
by ryantate (Friar) on May 03, 2006 at 05:53 UTC
by ryantate (Friar) on May 04, 2006 at 01:44 UTC
Re: Predictive HTTP caching in Perl
by kvale (Monsignor) on May 03, 2006 at 04:26 UTC
by ryantate (Friar) on May 03, 2006 at 05:57 UTC
Re: Predictive HTTP caching in Perl
by ioannis (Abbot) on May 03, 2006 at 04:29 UTC
by ryantate (Friar) on May 03, 2006 at 06:01 UTC
Re: Predictive HTTP caching in Perl
by ForgotPasswordAgain (Priest) on May 03, 2006 at 13:03 UTC
by ryantate (Friar) on May 03, 2006 at 14:33 UTC
Re: Predictive HTTP caching in Perl
by perrin (Chancellor) on May 03, 2006 at 16:44 UTC
by ryantate (Friar) on May 03, 2006 at 22:13 UTC
by perrin (Chancellor) on May 03, 2006 at 22:45 UTC