PerlMonks  

Downloading continuous updates from webpage

by avid (Novice)
on Feb 16, 2006 at 02:42 UTC

avid has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to download data from a page that takes a little more than an hour to finish after I submit the list of input data. Incremental updates are sent to the browser while it runs. Finally, a summary is printed at the end, and the browser's corner icon stops revolving to let the user know it is done.

How do I capture such incremental updates? Any suggestion or code example will be greatly appreciated.

Re: Downloading continuous updates from webpage
by vladb (Vicar) on Feb 16, 2006 at 03:02 UTC
    I'm not clear: are you trying to build a page that would display download progress in this fashion, or are you instead trying to download a large file, for example?

    There are many tools out there to aid in downloading large files off the web. If you are using the Firefox browser, you may find some of the download extensions useful as well.

    But if you are trying to build a script to fetch files, the Bundle::LWP module could help, as is also explained in this post. This post, in turn, explains how to download multiple files at once.


    _____________________
    "We've all heard that a million monkeys banging on a million typewriters will eventually reproduce
    the entire works of Shakespeare. Now, thanks to the Internet, we know this is not true."

    Robert Wilensky, University of California

Re: Downloading continuous updates from webpage
by BrowserUk (Patriarch) on Feb 16, 2006 at 03:42 UTC

    When you say "incremental updates", does each refresh contain all the preceding information?

    If so, you probably only need the final page, which from your description should be easy to detect because of the presence of summary information.

    Presumably the intermediate pages displayed in the browser are fetched as a result of a meta refresh tag or javascript refresh every few minutes? When automated, you wouldn't need the autorefreshes as you are only going to discard them, but it may be necessary to fetch them anyway as the server may decide to cancel the processing if it doesn't see a refresh request at regular intervals.

    Depending upon the complexity of the page and the refresh mechanism used, you might get away with using LWP::Simple to get the URL successively (at appropriately timed intervals), scanning the content returned and discarding it until it contains the summary information.
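A minimal sketch of the polling idea above, assuming each fetch returns the whole page so far and that the final page can be recognised by some marker. The `poll_until_summary` helper, the `SUMMARY:` marker, the interval and the URL in the comment are all hypothetical. The fetcher is passed in as a code reference so that LWP::Simple's `get` (or anything else) can be plugged in.

```perl
use strict;
use warnings;

# Call the fetcher repeatedly until the returned content matches the
# "done" pattern, or until max_tries fetches have been made.
# Returns the final content, or undef if the marker never appeared.
sub poll_until_summary {
    my (%arg)    = @_;
    my $tries    = $arg{max_tries} // 60;
    my $interval = $arg{interval}  // 120;    # seconds between fetches

    for my $attempt (1 .. $tries) {
        my $content = $arg{fetch}->();        # may be undef on a failed fetch
        return $content if defined $content && $content =~ $arg{done};
        sleep $interval if $attempt < $tries && $interval > 0;
    }
    return;
}

# Hypothetical usage with LWP::Simple (URL and marker are placeholders):
#   use LWP::Simple qw(get);
#   my $page = poll_until_summary(
#       fetch => sub { get('http://example.com/status') },
#       done  => qr/SUMMARY:/,
#   );
```

Intermediate pages are simply discarded; only the page containing the marker is returned.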

    In more complex cases, you may need to scan the content returned by the first submit and extract the refresh URL from embedded JavaScript. It may even be necessary to rescan every partial page returned to extract a different URL.
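As a rough illustration of that scanning step, here is a sketch that looks for a meta refresh tag first and then for a simple `location = '...'` assignment in a script. The `extract_refresh_url` name and the patterns are assumptions; real pages may build the URL in ways a regex will not catch, so treat this as a starting point only.

```perl
use strict;
use warnings;

# Return the refresh target found in a page, or undef if none is found.
sub extract_refresh_url {
    my ($html) = @_;

    # e.g. <meta http-equiv="refresh" content="30; url=/status?job=42">
    if ($html =~ /(<meta\b[^>]*\bhttp-equiv\s*=\s*["']?refresh\b[^>]*>)/i) {
        my $tag = $1;
        return $1
            if $tag =~ /\bcontent\s*=\s*["']\s*\d+\s*;\s*url\s*=\s*['"]?([^"'>\s]+)/i;
    }

    # e.g. window.location.href = '/status?page=3';
    return $2 if $html =~ /\blocation(?:\.href)?\s*=\s*(["'])(.*?)\1/i;

    return;
}
```

The value returned may be a relative URL; `URI->new_abs($found, $base)` from the URI module can resolve it against the URL of the page it came from.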

    It might be easier to use WWW::Mechanize, though I'm not sure whether it copes with embedded JavaScript refreshes.

    Providing a code example is pretty much impossible without seeing the pages involved. If the URL is public, you could post it (or /msg it to a willing responder if you don't want to overtax the server), and you might get a worked example.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      When you say "incremental updates", does each refresh contain all the preceding information?

      From what the poster said, I think there's no refreshing of the page at all.
      I think the server is just printing stuff and the browser renders what it can before the whole page is done downloading. This works fairly well in some scenarios, and even better if you turn on autoflush on the server side.

      However, there are some catches. E.g., AFAIK, IE will only render a table after it gets the closing tag. There may be more glitches of this kind.
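If the server really is streaming one long response, a script can read it the way the browser does, piece by piece. The sketch below assumes the updates are line-oriented text. `make_line_handler` is a hypothetical helper; it builds a callback for LWP::UserAgent's `:content_cb` option, which LWP calls with each chunk of the body as it arrives. Chunks can end in the middle of a line, so partial lines are buffered.

```perl
use strict;
use warnings;

# Returns two code refs: a chunk handler to pass as ':content_cb', and a
# flush routine to call once the request has finished.
sub make_line_handler {
    my ($on_line) = @_;
    my $pending = '';

    my $handler = sub {
        my ($chunk) = @_;    # LWP also passes the response and protocol objects
        $pending .= $chunk;
        # Hand over every complete line; keep any trailing partial line.
        while ($pending =~ s/\A([^\n]*)\n//) {
            $on_line->($1);
        }
    };

    my $flush = sub {
        $on_line->($pending) if length $pending;
        $pending = '';
    };

    return ($handler, $flush);
}

# Hypothetical usage (the URL is a placeholder):
#   use LWP::UserAgent;
#   my ($handler, $flush) = make_line_handler(sub { print "update: $_[0]\n" });
#   my $res = LWP::UserAgent->new->get('http://example.com/long-job',
#                                      ':content_cb' => $handler);
#   $flush->();
#   die $res->status_line, "\n" unless $res->is_success;
```

With a content callback in place the body is not stored in the response object, so anything that must be kept (the final summary, say) has to be saved by the callback itself.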


      acid06
      perl -e "print pack('h*', 16369646), scalar reverse $="
Re: Downloading continuous updates from webpage
by Ultra (Hermit) on Feb 16, 2006 at 06:45 UTC

    By incremental updates, do you mean that your HTTP server accepts the Range header, so that you can ask for pieces of data?

    Dodge This!
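For reference, asking for "everything from byte N onwards" looks roughly like the sketch below. It only helps if the server honours Range requests for this resource, which a dynamically generated page may well not do. `fetch_new_bytes` is a hypothetical helper; it takes an LWP::UserAgent-style object (anything with a compatible `get` method) as its first argument.

```perl
use strict;
use warnings;

# Fetch whatever the resource contains beyond byte $offset.
sub fetch_new_bytes {
    my ($ua, $url, $offset) = @_;
    my $res  = $ua->get($url, Range => "bytes=$offset-");
    my $code = $res->code;

    return $res->content if $code == 206;    # Partial Content: range honoured
    if ($code == 200) {                      # range ignored: full body returned
        my $body = $res->content;
        return length($body) > $offset ? substr($body, $offset) : '';
    }
    return '' if $code == 416;               # range not satisfiable: nothing new
    die "Request failed: ", $res->status_line, "\n";
}

# Hypothetical usage (the URL is a placeholder):
#   use LWP::UserAgent;
#   my $ua   = LWP::UserAgent->new;
#   my $seen = 0;
#   my $new  = fetch_new_bytes($ua, 'http://example.com/output.txt', $seen);
#   $seen   += length $new;
```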
      Thanks to you all for these prompt responses. "browser renders what it can before the whole page is done downloading": I guess this is the case. I cannot do multiple reads to get incremental updates, as the POST also contains input data that would get resubmitted. Is there any timeout in LWP POSTs? If there is none, I can just do a POST and then check the results; the script can simply wait however long it takes the server to calculate. I will read the WWW::Mechanize man pages and check whether it can solve my problem.
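On the timeout question: LWP::UserAgent does have one, defaulting to 180 seconds, and it can be changed with `$ua->timeout($seconds)`. It is an inactivity timeout, so a server that keeps sending output should not trip it, but a long silent stretch could. A sketch of the single-POST approach, assuming a generous timeout is acceptable; `submit_and_wait`, the form field and the URL in the comment are hypothetical.

```perl
use strict;
use warnings;

# POST the form once and wait for the complete response.
# $ua is an LWP::UserAgent (or any object with compatible timeout/post methods).
sub submit_and_wait {
    my ($ua, $url, $form, $timeout) = @_;

    $ua->timeout($timeout);    # seconds of inactivity allowed before giving up
    my $res = $ua->post($url, $form);

    # A timeout or connection error also shows up here as an unsuccessful response.
    die "POST failed: ", $res->status_line, "\n" unless $res->is_success;
    return $res->decoded_content;
}

# Hypothetical usage (URL and field name are placeholders):
#   use LWP::UserAgent;
#   my $ua   = LWP::UserAgent->new;
#   my $page = submit_and_wait($ua, 'http://example.com/job',
#                              { input => $data }, 2 * 60 * 60);
```

Since the input is sent exactly once, nothing gets resubmitted; the whole page, summary included, comes back in one response.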
