
Predictive HTTP caching in Perl

by ryantate (Friar)
on May 03, 2006 at 03:51 UTC ( #547051=perlquestion )

ryantate has asked for the wisdom of the Perl Monks concerning the following question:

Given a script that downloads 20-30 fresh Web pages (text only, no images and not spidering links (update)) once each morning for one user, and logs the time it was run, how would you figure out the ideal time to pre-fetch those Web pages?

Update: I guess I wasn't clear. This question is not "how can I download a lot of Web pages quickly" or "how do I cache a Web page" or "how do I check a Web resource most efficiently using HTTP." (Although I do appreciate the efforts made and answers issued along these lines.)

The question is: given a collection of download times, how would you determine the best "typical" time to download a collection of Web pages? I offer my planned approach below; if you can think of something better, I'd love to hear about it. /Update

I have a script like the one described above, and do indeed run it once each morning. It takes 5-10 seconds to download the Web pages and parse (in the case of RSS/Atom feeds) or scrape (in the case of HTML) the pages and amalgamate the bits of info I care about into one page.

I recently got greedy and thought, "how can I make this even faster?" Like, make it run in 2 seconds or less.

I thought perhaps of caching the results of my script's HTTP fetches, so that subsequent runs of the script are faster. But since most of the Web pages change on a daily basis, and I rarely check more than once each day, and am the sole user (for now), this seemed like a waste of time. The cache would always be out of date.

Then I realized I usually check at roughly the same time every day. In a given 20 weekdays, I might check within 15 minutes of 8:30 am 15 times, closer to 6:30 am once, closer to 7:30 am once, and closer to 10 am three times.

The ideal time to pre-fetch those Web pages would probably be about 8 am -- early enough to be before 18 of the visits, so I hit the cache, and late enough that the results are less than 45 minutes old 15 times, so the cache is really fresh.

I am presently thinking about a simple, rough approach -- take the last two weeks of downloads, compute a time that would come before 80 percent of the downloads, and subtract 30 minutes from that.
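A minimal sketch of that rule, assuming run times are logged as "HH:MM" strings. The sub names and the sample log are hypothetical; the log uses the nominal times from the 20-weekday example above rather than real data.

```perl
use strict;
use warnings;

# Convert "HH:MM" to minutes after midnight, and back.
sub to_minutes { my ($h, $m) = split /:/, shift; return $h * 60 + $m }
sub to_hhmm    { my $t = shift; return sprintf '%02d:%02d', int($t / 60), $t % 60 }

# Take the latest logged time that is at or before at least
# $coverage_pct percent of the runs, then subtract a lead time (minutes).
sub prefetch_time {
    my ($coverage_pct, $lead_min, @runs) = @_;
    return unless @runs;
    my @sorted = sort { $a <=> $b } map { to_minutes($_) } @runs;
    my $needed = int(($coverage_pct * @sorted + 99) / 100);   # ceiling, integer math
    my $cutoff = $sorted[ @sorted - $needed ];
    return to_hhmm($cutoff - $lead_min);
}

# Hypothetical log modelled on the 20 weekdays described above.
my @log = ('06:30', '07:30', ('08:30') x 15, ('10:00') x 3);
print prefetch_time(80, 30, @log), "\n";   # prints 08:00
```

With this sample the cutoff lands on 8:30, so the rule agrees with the eyeballed 8 am. Working in percentages with integer math avoids floating-point surprises when computing how many runs must be covered.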

I have two concerns:

1. Am I reinventing the wheel? I have looked into tools like squid but do not believe I have found any existing tools, inside or outside the Perl world, to do what I want.

2. Is there a more flexible approach to be had without adding too much complexity or having to go back to university for a proper math/stats/cs/ai schooling (I do not program for a living)? I have looked at AI::Fuzzy* modules (see for example AI::FuzzyInference) but not played around with them yet.

Flexibility could help if my needs change. For example, say I start adding new collections of aggregated Web pages that I check more than once per day or less than once per day? Obviously I would need a more sophisticated system.

Or I might add another user who turns out to be much less predictable. It might be nice if the script could say, "bleh, you are too random, let's not pre-fetch at all and suck unneeded bandwidth."
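One rough way to detect a "too random" user, sketched under assumptions: measure what share of logged runs fall close to the median run time, and skip pre-fetching when that share is low. The 30-minute tolerance, the 0.6 threshold, and the sub names are all hypothetical choices, not anything established in this thread.

```perl
use strict;
use warnings;

# Share of runs (minutes after midnight) that fall within
# $tolerance minutes of the median run time.
sub clustered_share {
    my ($tolerance, @minutes) = @_;
    return 0 unless @minutes;
    my @s      = sort { $a <=> $b } @minutes;
    my $mid    = int(@s / 2);
    my $median = @s % 2 ? $s[$mid] : ($s[$mid - 1] + $s[$mid]) / 2;
    my $near   = grep { abs($_ - $median) <= $tolerance } @s;
    return $near / @s;
}

# Pre-fetch only when enough runs cluster around the median.
sub worth_prefetching {
    my (@minutes) = @_;
    return clustered_share(30, @minutes) >= 0.6;   # assumed threshold
}

my @habitual = (390, 450, (510) x 15, (600) x 3);   # the 20-weekday example
my @erratic  = (360, 600, 840, 1080);               # hypothetical unpredictable user

print worth_prefetching(@habitual) ? "prefetch\n" : "skip\n";
print worth_prefetching(@erratic)  ? "prefetch\n" : "skip\n";
```

The median is used instead of the mean so that a few very early or very late runs do not drag the centre away from the usual time.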

After all, if I *only* want this pre-fetching to help just me in this one use scenario, I can just eyeball my own script invocation and pick a time (like 8 am) and implement the cache. I'd like to come up with something that can be fast for other people.

Any general thoughts appreciated. Obviously, I am not yet at the coding stage.

Replies are listed 'Best First'.
Re: Predictive HTTP caching in Perl
by merlyn (Sage) on May 03, 2006 at 05:55 UTC
      This looks like a useful module, thanks. Not totally in line with my question, but probably the right direction to head in all the same.
Re: Predictive HTTP caching in Perl
by brian_d_foy (Abbot) on May 03, 2006 at 04:11 UTC

    Is it really worth your time to save a couple of seconds?

    You could cache the data, then send conditional GET requests to fetch the resources whenever you need them. If the server tells you the page hasn't been updated, you use the cache. If it has been updated, the server sends you the new data and you use that.
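A minimal sketch of the conditional GET idea, not taken from the thread. It assumes the core HTTP::Tiny module rather than the LWP tools mentioned elsewhere here, and the cache record layout (body, etag, last_modified) and the sub name are hypothetical.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# $entry is a hypothetical cache record: { body, etag, last_modified }.
# $http is an HTTP::Tiny object (or anything with a compatible get method).
sub refresh_entry {
    my ($http, $url, $entry) = @_;
    $entry ||= {};

    # Send the validators saved from the previous response, if any.
    my %headers;
    $headers{'If-None-Match'}     = $entry->{etag}          if defined $entry->{etag};
    $headers{'If-Modified-Since'} = $entry->{last_modified} if defined $entry->{last_modified};

    my $res = $http->get($url, { headers => \%headers });

    return $entry if $res->{status} == 304;   # not modified: cached copy is current
    return $entry unless $res->{success};     # error: keep whatever we already had

    return {
        body          => $res->{content},
        etag          => $res->{headers}{etag},            # HTTP::Tiny lower-cases header names
        last_modified => $res->{headers}{'last-modified'},
    };
}
```

A caller would do something like `$cache{$url} = refresh_entry(HTTP::Tiny->new, $url, $cache{$url})` for each page. When the server answers 304 it sends no body, so an unchanged page costs only a small request and response.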

    brian d foy
    Subscribe to The Perl Review
      That is exactly how I was planning to update the cache, but the question is concerning how I figure out *when* to update the cache.

      I'm also planning to use Keep-Alive in cases where I need multiple items from the same server, if that's of interest to you as well. But it's sort of beside the point.

      And yes, it would be worth it to save a couple of seconds, if I learned lessons that would let me implement such a cache for an arbitrary collection of Web pages for an arbitrary user. The difference between 1-2 seconds response time and 5-10 seconds makes all the difference in the world for a Web application.

      Obviously, with multiple users, the value of a conventional cache goes up. But I am interested in pre-fetching, so I reduced my question to the simplest case (which happens to be the only real one at the moment).

      Having been pointed at conditional GET a second time by perrin below, I want to say thanks for the link. I did plan to use it but did not understand how useful it could be, as one can ping servers more often with it than a conventional GET while still being considered well-mannered. Thanks.
Re: Predictive HTTP caching in Perl
by kvale (Monsignor) on May 03, 2006 at 04:26 UTC
    For RSS feeds, there will be little or no content to cache, so I'd see this approach as a lot of work for uncertain benefit.

    Something that will work is parallelizing the retrieval of the pages/feeds. Create an application, say with Parallel::ForkManager, that creates multiple processes, each one fetching one site and processing it. Then assemble the results from all the children into your composite feed. The time taken will be only a little longer than the slowest website/feed.
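The one-child-per-site idea can be sketched with core Perl only. This uses plain fork and pipe as a stand-in for Parallel::ForkManager, and the sub name and the fetch callback are hypothetical; a real fetcher would download and parse the page inside the callback.

```perl
use strict;
use warnings;

# Run $fetch->($url) for each URL in its own child process and collect
# the returned strings in the parent, keyed by URL.
sub fetch_all {
    my ($fetch, @urls) = @_;
    my %reader;                       # URL => read end of that child's pipe
    for my $url (@urls) {
        pipe(my $r, my $w) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {              # child: do the work, write the result, exit
            close $r;
            print {$w} $fetch->($url);
            close $w;
            exit 0;
        }
        close $w;                     # parent keeps only the read end
        $reader{$url} = $r;
    }
    my %result;
    for my $url (@urls) {
        my $r = $reader{$url};
        local $/;                     # slurp each child's whole output
        $result{$url} = <$r>;
        close $r;
    }
    wait() for @urls;                 # reap one child per URL
    return \%result;
}

# Stub fetcher for illustration only; no network access happens here.
my $pages = fetch_all(sub { "body of $_[0]" }, 'feed-a', 'feed-b');
print "$_ => $pages->{$_}\n" for sort keys %$pages;
```

This starts every child at once, which is tolerable for 20-30 pages; Parallel::ForkManager adds a cap on the number of simultaneous children.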


      Why do you say RSS feeds have no content to cache? There is the title, date and then, well, the content, either in the description element or content:encoded. And even in cases where it's just a title and a date, it takes time to open the connection and download the file.

      The benefit is: once cached, I do not have to connect to the server and download the Web page. When there are 30 pages, this is an issue.

      I'm already parallelizing the retrieval. I'm using LWP::Parallel after finding little additional speed benefit from either POE or HTTP::GHTTP with P::ForkManager.

      Thanks anyway.

Re: Predictive HTTP caching in Perl
by ioannis (Abbot) on May 03, 2006 at 04:29 UTC
    You could arrange to receive the replies in parallel, if it is worth your time. The lesson from this thread should be that the 'best' solution is not necessarily the most technically complete solution. I would opt for a 3-line script that does the job most of the time, rather than a complex mess that I must fix as flaws are found.
      I am already downloading in parallel. Thanks though.
Re: Predictive HTTP caching in Perl
by ForgotPasswordAgain (Priest) on May 03, 2006 at 13:03 UTC

    I think you need to work on your specification more. You said something about averaging the last two weeks of downloads and putting the prefetch time 30 minutes before 80% of them. Why? You haven't specified. Why not just do it before all of those times? For that matter, why not prefetch at midnight the night before? There's presumably some constraint, like you need the latest content possible. If so, then you need to specify the maximum tolerable oldness of the content. Or the maximum average oldness. Then you need to specify whether you care if the user sometimes fetches un-prefetched content, or what percent of the time that's allowed to happen. The way you've presented it here, it seems to me that simply caching the first download would be sufficient, or as others suggested using a normal caching proxy like squid.
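To make those trade-offs concrete, here is a small sketch that scores a candidate prefetch time against a log: the share of runs that would hit the cache, and the mean age of the content on those hits. The sub name and the sample data (nominal times from the original post, in minutes after midnight) are assumptions for illustration.

```perl
use strict;
use warnings;

# For a candidate prefetch time, report the share of runs that would hit
# the cache and the mean age (in minutes) of the content on those hits.
# All times are minutes after midnight.
sub score_candidate {
    my ($prefetch, @runs) = @_;
    my @hits = grep { $_ >= $prefetch } @runs;
    my $age  = 0;
    $age += $_ - $prefetch for @hits;
    return {
        hit_rate => @runs ? @hits / @runs : 0,
        mean_age => @hits ? $age / @hits  : 0,
    };
}

my @runs = (390, 450, (510) x 15, (600) x 3);   # 6:30, 7:30, 8:30 x 15, 10:00 x 3

for my $candidate (0, 480) {                     # midnight versus 8 am
    my $s = score_candidate($candidate, @runs);
    printf "prefetch at %4d min: hit rate %.2f, mean age %.1f min\n",
        $candidate, $s->{hit_rate}, $s->{mean_age};
}
```

On this sample, midnight hits the cache every time but serves content that is on average 514.5 minutes old, while 8 am hits 90 percent of the time with a mean age of 45 minutes. Fixing a maximum acceptable age or a minimum hit rate then reduces the choice to a search over candidates.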

    I think after you've really specified the problem, the solution will probably fall out naturally. It seems to me, however, that you're less concerned about solving a problem than trying to find a problem. If so, then maybe studying up on math and AI really is what you want to do.

      Good questions.

      Why not do it at midnight? Freshness. Note the part where I say "late enough that the results are less than 45 minutes old 15 times, so the cache is really fresh."

      Many of the sources I read update during the night. Think of a page of online newspaper links, typically updated around 3 am in whatever time zone the newspaper is located. But I'm also mixing in blog feeds, updated on a less predictable schedule. So the goal is to cache as soon as possible before a likely visit.

      Simply caching -- with a conventional ttl scheme like you describe -- is, as I explained in my post, not going to cut it. Note I am not dealing with images or other static content that could live happily in such a cache -- or links to other pages, some of which may be static -- only the text of the Web pages, most of which, again, change every single day.

      I appreciate your reply.

Re: Predictive HTTP caching in Perl
by perrin (Chancellor) on May 03, 2006 at 16:44 UTC
    Given that it only takes 5 seconds, I suggest you refresh the cache constantly every minute from 6am to 9am. Use if-modified-since (conditional GET) and you will not need to download things that don't change during that time.
      Wow ... well that's a concrete way to do it, so thanks. I just worry I'd be overloading the dozen-plus servers I check. For my app, I suppose every 15 minutes would be just as good. Thanks!
        If you use if-modified-since, and these are conforming web servers, they probably won't mind how often you hit them.

Node Type: perlquestion [id://547051]
Approved by GrandFather