hacker has asked for the wisdom of the Perl Monks concerning the following question:
I need to traverse a web tree remotely over HTTP, parse the list of directories that comes back, and grab the latest or second-to-latest one displayed.
Once I have that, I need to fetch some files within that directory by name (each filename includes the date).
For example, I will see something like this:
Parent Directory/                        -  Directory
20060922/         2006-Nov-13 01:11:31   -  Directory
20060927/         2006-Nov-13 01:16:45   -  Directory
20061016/         2006-Dec-25 03:16:32   -  Directory
20061103/         2006-Dec-25 03:18:05   -  Directory
20061202/         2007-Jan-30 18:07:53   -  Directory
20061224/         2007-Feb-13 23:23:44   -  Directory
20070126/         2007-Mar-11 19:16:45   -  Directory
20070208/         2007-Feb-09 03:04:34   -  Directory
20070225/         2007-Feb-25 23:44:05   -  Directory
From here, I can see that I want either 20070225 or 20070208 as the latest and second-to-latest directories in the tree.
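As a rough illustration of that selection step, here is a minimal sketch that pulls the YYYYMMDD names out of a listing and keeps the newest two. It assumes the directory names are always eight digits followed by a slash, and it sorts on the name rather than on the modification time column (in the listing above the two do not always agree, and the names are what matter here). The function name `latest_two_dirs` is made up for this example.

```perl
use strict;
use warnings;

# Return up to two YYYYMMDD directory names from a listing, newest first.
# Works on raw HTML or plain text, since it only looks for eight digits
# followed by a slash.
sub latest_two_dirs {
    my ($listing) = @_;
    my %seen;
    # (?<!\d) keeps the match from starting inside a longer run of digits.
    # The %seen filter drops duplicates such as href="X/" plus link text X/.
    my @dirs = grep { !$seen{$_}++ } $listing =~ m{(?<!\d)(\d{8})/}g;
    # YYYYMMDD names sort chronologically as plain strings.
    my @sorted = sort { $b cmp $a } @dirs;
    my $last = $#sorted < 1 ? $#sorted : 1;
    return @sorted[0 .. $last];
}

my $listing = <<'END';
Parent Directory/                        -  Directory
20061224/         2007-Feb-13 23:23:44   -  Directory
20070126/         2007-Mar-11 19:16:45   -  Directory
20070208/         2007-Feb-09 03:04:34   -  Directory
20070225/         2007-Feb-25 23:44:05   -  Directory
END

my ($latest, $second) = latest_two_dirs($listing);
print "latest: $latest, second-to-latest: $second\n";
# prints "latest: 20070225, second-to-latest: 20070208"
```

A regex is enough for a listing this regular. If the server's HTML turns out to be messier, a real HTML parser would be the safer choice.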
Once I know this, I need to traverse into one of those directories and fetch a series of files, which have the date in the filename. These files are VERY large (tens of gigabytes in size).
What is the best approach to this problem, keeping in mind that it runs remotely over HTTP and that the ability to resume aborted fetches is critical (a la wget -c)?
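For the resume requirement, one possible approach is to send an HTTP Range header starting at the size of the partial local file, which is roughly what wget -c does. The sketch below uses the core HTTP::Tiny module purely to stay dependency-free; that is a substitution on my part, and the same header can be passed to LWP::UserAgent. The function names `resume_headers` and `fetch_resumable` are hypothetical, and the code assumes the server supports range requests.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the request headers needed to continue a partial download.
# Returns an empty hash ref when the file is missing or empty.
sub resume_headers {
    my ($path) = @_;
    my $have = -s $path;    # false if the file is missing or zero-length
    return $have ? { Range => "bytes=$have-" } : {};
}

# Fetch $url into $path, continuing from whatever is already on disk.
# Returns the HTTP::Tiny response hash so the caller can check the status.
sub fetch_resumable {
    my ($url, $path) = @_;
    my $fh;
    my $res = HTTP::Tiny->new->request('GET', $url, {
        headers       => resume_headers($path),
        data_callback => sub {
            my ($chunk, $response) = @_;
            return unless $response->{status} =~ /^2/;   # ignore error bodies
            if (!$fh) {
                # 206 means the server honoured the Range header, so append.
                # A plain 200 means it sent the whole file, so start over.
                my $mode = $response->{status} == 206 ? '>>' : '>';
                open $fh, $mode, $path or die "Cannot open $path: $!";
                binmode $fh;
            }
            print {$fh} $chunk or die "Write to $path failed: $!";
        },
    });
    if ($fh) { close $fh or die "Close of $path failed: $!"; }
    return $res;
}

# Example call (the URL is a placeholder, not taken from the post):
# my $res = fetch_resumable('http://example.com/20070208/file-20070208-001.dat',
#                           'file-20070208-001.dat');
```

Streaming through `data_callback` keeps memory use flat, which matters for files this size. A 416 response most likely means the local copy is already complete, so the caller would want to treat that case separately from a real failure.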
Here is the order of events:
- Connect to directory resource and fetch html page that lists directories available
- Parse the list, sort it, and pick the two most recent directories
- Traverse into one or the other, starting with second-to-latest, and fetch file-$DATE-001.dat .. n, resuming where required from previous aborted fetches.
- Store locally, verifying full transfer, and delete any other local instances of previous directories that remain (thus keeping a "mirror" of only the latest two remote copies).
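The bookkeeping in the last two steps can be sketched with File::Path alone. This is only an outline under some assumptions: the three-digit zero padding is inferred from the 001 in file-$DATE-001.dat, the local mirror is assumed to hold one YYYYMMDD directory per remote directory, and the names `data_file_name` and `prune_mirror` are invented for the example.

```perl
use strict;
use warnings;
use File::Path qw(remove_tree);
use File::Spec;

# Name of the Nth data file for a given date, following the
# file-$DATE-001.dat pattern (three-digit padding is an assumption).
sub data_file_name {
    my ($date, $n) = @_;
    return sprintf 'file-%s-%03d.dat', $date, $n;
}

# Remove every YYYYMMDD directory under $root that is not listed in @keep.
# Anything that does not look like a dated directory is left alone.
# Returns the names that were removed, in sorted order.
sub prune_mirror {
    my ($root, @keep) = @_;
    my %keep = map { $_ => 1 } @keep;

    opendir my $dh, $root or die "Cannot read $root: $!";
    my @names = readdir $dh;
    closedir $dh;

    my @removed;
    for my $name (sort @names) {
        next unless $name =~ /^\d{8}$/ && !$keep{$name};
        my $path = File::Spec->catdir($root, $name);
        next unless -d $path;
        remove_tree($path);
        push @removed, $name;
    }
    return @removed;
}

print data_file_name('20070208', 1), "\n";   # prints "file-20070208-001.dat"
```

Calling `prune_mirror($root, $latest, $second)` only after both directories have been fetched and verified keeps a failed run from deleting the last good copy.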
Which modules should I be exploring, other than the obvious LWP, WWW::Robot, File::Path, Date::Calc, Date::Manip and such?
Are there any canned routines or snippets somewhere that can help? Or in the absence of that, a tutorial that goes through some of this?
Replies are listed 'Best First'.
Re: Traversing directories to get the "most-recent" or "second-to-most-recent" directory contents
by sgifford (Prior) on Mar 13, 2007 at 05:04 UTC
Re: Traversing directories to get the "most-recent" or "second-to-most-recent" directory contents
by ikegami (Patriarch) on Mar 13, 2007 at 03:37 UTC
by hacker (Priest) on Mar 13, 2007 at 03:45 UTC
Re: Traversing directories to get the "most-recent" or "second-to-most-recent" directory contents
by Limbic~Region (Chancellor) on Mar 13, 2007 at 12:50 UTC