PerlMonks  

Web Page Expiry

by jai_dgl (Beadle)
on Nov 11, 2008 at 15:44 UTC ( [id://722878]=perlquestion )

jai_dgl has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I use LWP::UserAgent to get a web page's content.
I need to fetch the same page often (say 2 or 3 times a day),
and each time I need to check whether the page has been updated.
If it has been updated, I need to fetch the current page. The first time, I store it in a local file.
After that, I just need to compare the existing file with the page content.
Is there a simple way to do this comparison?

Replies are listed 'Best First'.
Re: Web Page Expiry
by kyle (Abbot) on Nov 11, 2008 at 15:55 UTC

    Maybe LWP::UserAgent::WithCache would do what you want.

    Generally, you should stat the local copy of the page to get its modification time. Pass that through as a header ("If-Modified-Since") in your request to the web server. The server should be able to check the date and either offer up the full page or a short "no change" message (HTTP status 304 Not Modified).
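
    A minimal sketch of that conditional GET, assuming LWP::UserAgent and HTTP::Date are installed. The sub names conditional_headers and refresh_page are made up for illustration, not part of any module:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Date qw(time2str);

# Build the headers for a conditional GET.
# Returns an empty list when there is no local copy yet,
# so the first run fetches the page unconditionally.
sub conditional_headers {
    my ($file) = @_;
    return () unless -e $file;
    my $mtime = (stat $file)[9];    # modification time in epoch seconds
    return ('If-Modified-Since' => time2str($mtime));
}

# Fetch $url into $file only if the server reports a change.
# Returns 1 if the file was rewritten, 0 on "304 Not Modified".
sub refresh_page {
    my ($url, $file) = @_;
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $res = $ua->get($url, conditional_headers($file));

    return 0 if $res->code == 304;
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;

    open my $fh, '>:raw', $file or die "Cannot write $file: $!\n";
    print {$fh} $res->content;
    close $fh or die "Cannot close $file: $!\n";
    return 1;
}
```

    You would call it as refresh_page($url, 'page.html') with your own URL and file name. This only helps if the server sends sensible Last-Modified information; a server that ignores the header will simply return the full page every time.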

Re: Web Page Expiry
by moritz (Cardinal) on Nov 11, 2008 at 15:52 UTC
    If you are only interested in whether the web page has changed, and not in how, you can use File::Compare (core since Perl 5.004).
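
    A small sketch of how that could look, assuming the fresh copy of the page has already been saved next to the old one. The sub name page_changed is only illustrative:

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# compare() returns 0 if the files are identical, 1 if they differ,
# and -1 if an error occurred (for example, a missing file).
sub page_changed {
    my ($old_file, $new_file) = @_;
    my $rc = compare($old_file, $new_file);
    die "Cannot compare $old_file and $new_file: $!\n" if $rc < 0;
    return $rc == 1;
}
```

    One way to use it: fetch the current page into a temporary file (for instance with getstore() from LWP::Simple), call page_changed('page.html', 'page.html.new'), and rename the new file over the old one when it returns true.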
Re: Web Page Expiry
by MidLifeXis (Monsignor) on Nov 11, 2008 at 16:51 UTC

    LWP::Simple::mirror()
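
    A rough sketch of using it, assuming LWP::Simple is installed. mirror() returns the HTTP response code; the two helper subs here are hypothetical names used only for this example:

```perl
use strict;
use warnings;
use LWP::Simple qw(mirror);

# Translate the HTTP status code returned by mirror() into a word.
sub describe_mirror_status {
    my ($rc) = @_;
    return 'unchanged' if $rc == 304;                 # Not Modified
    return 'updated'   if $rc >= 200 && $rc < 300;    # any 2xx success
    return 'failed';
}

# Mirror $url into $file and report what happened.
sub sync_page {
    my ($url, $file) = @_;
    return describe_mirror_status(mirror($url, $file));
}
```

    Calling sync_page($url, 'page.html') two or three times a day would then tell you whether the local copy was rewritten on that run.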

    --MidLifeXis

      Hi, I used the mirror functionality of LWP::Simple.
      But it doesn't compare the file content. It checks the last-modified time against the local cache file.
      If the last-modified time on the server is newer than that of the local file, it rewrites the local file.
      What I need is a module that compares the local file content with the server page content. Is there one?

      thx

        Usually on the web, the If-Modified-Since approach (used behind the scenes by LWP::Simple::mirror) is the preferred solution. However, if that approach is not reliable for you (and therefore not reliable for others either), you will probably need to fetch the file and compare the two copies. You could start your search on CPAN if you want a Perl solution. If you don't limit yourself to Perl, there are OS-level tools that you can use (like diff or rdist).

        On the other hand, if all you want to do is see whether the file needs to be updated locally, and you have to retrieve the file to determine that anyway, why not just update the file?

        As an alternative, is there a checksum (MD5, etc.) file generated for the file on the remote server? If so, you could retrieve that instead (in theory it should be smaller) and compare it against a checksum of your local copy to determine whether you need to download the real file.
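
        A sketch of the checksum idea using the core Digest::MD5 module. It assumes the remote checksum file is in md5sum style (hex digest first, then the file name), which may not match what your site publishes; file_md5 and needs_download are illustrative names:

```perl
use strict;
use warnings;
use Digest::MD5;

# Hex MD5 digest of a file's raw bytes.
sub file_md5 {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "Cannot read $file: $!\n";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# True when the local copy does not match the published checksum.
# $published is the text of the remote checksum file; only its first
# whitespace-separated field is used (assumed md5sum-style layout).
sub needs_download {
    my ($file, $published) = @_;
    my ($remote) = split ' ', $published;
    return 1 unless defined $remote && -e $file;
    return lc($remote) ne file_md5($file);
}
```

        The same file_md5() digest can also be stored after each download and compared with the digest of a freshly fetched copy if the server offers no checksum file at all.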

        I would also see if you can work with the source site to get their timestamps correct so that the mirror process works. That is the RightWay™ to do it. If the timestamps cannot be trusted, then any cache in the way (your workstation's local cache, company, ISP, accelerators on the remote side, etc.) can hose your checks anyway.

        --MidLifeXis

Re: Web Page Expiry
by pjotrik (Friar) on Nov 11, 2008 at 16:27 UTC
    You could remember when you first accessed the document (or stat a saved file) and issue a HEAD request for the page on successive attempts. For example, LWP::Simple::head() returns, among other things, the last-modified time.
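
    A sketch of that check, assuming LWP::Simple is installed. In list context head() returns (content type, document length, modified time, expires, server); is_stale and remote_is_newer are hypothetical helper names:

```perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Decide whether the remote copy is newer than the local one.
# Either time may be undef (no local file, or no Last-Modified
# header from the server); then we cannot tell, so report "stale".
sub is_stale {
    my ($local_mtime, $remote_mtime) = @_;
    return 1 unless defined $local_mtime && defined $remote_mtime;
    return $remote_mtime > $local_mtime;
}

# Issue a HEAD request and compare its modified time with the
# modification time of the saved file.
sub remote_is_newer {
    my ($url, $file) = @_;
    my @info = head($url) or die "HEAD $url failed\n";
    my $remote_mtime = $info[2];            # third element: modified time
    my $local_mtime  = (stat $file)[9];     # undef if there is no local copy
    return is_stale($local_mtime, $remote_mtime);
}
```

    Only when remote_is_newer() returns true would you go on to fetch the full page.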
Re: Web Page Expiry
by JavaFan (Canon) on Nov 11, 2008 at 15:49 UTC
    diff(1)

    There's also a Perl module implementing a similar algorithm.
