PerlMonks  

Subdir globs

by NeverMore (Acolyte)
on May 31, 2000 at 00:36 UTC ( #15504=perlquestion )

NeverMore has asked for the wisdom of the Perl Monks concerning the following question:

How would I go about making a perl script that will connect to an HTTP site; search a certain directory, and all subdirectories for a certain type of file (*.txt files for example); and download all of those files? It's not possible to FTP to the site, by the way.

-NM

Replies are listed 'Best First'.
RE: Subdir globs
by Ozymandias (Hermit) on May 31, 2000 at 01:17 UTC
    btrott has a good point, but you *could* combine elements of a search engine with this script. Basically, you would crawl the site (beginning with the homepage) checking all linked pages within the domain and downloading those with the target extension.

    I'd start with one of the search engine/site crawlers that are available and modify it as I go. Basically, you're looking for any link (so search the returned code of the front page for "<a href=" and work from there to find new pages to crawl). But have it check each link for the extension you're looking for; in this case, when your code pulls out the URL of the link (however you have it do that), check it for the extension:

    if (/\.txt$/i) { ...

    - Ozymandias
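To illustrate the difference that escaping the dot and anchoring the match make in an extension check like the one above, a small self-contained sketch (the link names are made up):

```perl
#!/usr/bin/perl
use strict;

# Hypothetical link names pulled out of a page.
my @urls = ('song.txt', 'notes.txt.bak', 'README.TXT', 'index.html');

# An unescaped, unanchored /.txt/i also catches notes.txt.bak;
# escaping the dot and anchoring at the end keeps only real .txt names.
my @loose  = grep { /.txt/i }    @urls;
my @strict = grep { /\.txt$/i }  @urls;

print "@strict\n";   # song.txt README.TXT
```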

Re: Subdir globs
by btrott (Parson) on May 31, 2000 at 00:42 UTC
    Using HTTP you can't reliably get listings of all files in a directory. Are you talking about a listing like "ls" would give you? You can't get that over HTTP, unless you're implementing both ends of the request--in other words, unless you've implemented some server-side script to print out a directory listing for a given directory.

    In any case, look into LWP for making HTTP requests--you can use this to fetch the files, once you know what files you'd like to fetch.
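A minimal sketch of the LWP fetch btrott suggests. To keep it runnable without a live server, it fetches a scratch file via a file:// URL; for the real task the URL would be an http:// address, but the call is the same.

```perl
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
use File::Temp qw(tempfile);

# Write a scratch file standing in for a remote resource.
my ($fh, $path) = tempfile();
print $fh "hello from the server\n";
close $fh;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get("file://$path");   # same call works for http:// URLs

if ($res->is_success) {
    print $res->content;              # prints the file's contents
} else {
    die "fetch failed: ", $res->status_line, "\n";
}
```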

Re: Subdir globs
by plaid (Chaplain) on May 31, 2000 at 01:13 UTC
    Take a look at wget, located at http://www.gnu.org/software/wget/wget.html. Excerpt from that page:

    The recursive retrieval of HTML pages, as well as FTP sites, is supported: you can use Wget to make mirrors of archives and home pages, or traverse the web like a WWW robot (Wget understands /robots.txt)
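A hedged example of such a recursive retrieval, restricted to a few extensions; the accept list here is illustrative and the command needs network access, so it is shown as a fragment only:

```shell
# -r   recursive retrieval
# -np  never ascend above the starting directory
# -A   comma-separated list of accepted extensions
wget -r -np -A 'tab,btab,crd,pro' http://www.nutz.org/olga/main
```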

Re: Subdir globs
by antihec (Sexton) on May 31, 2000 at 01:19 UTC
    You might want to have a look at lwp-rget from the libwww package;
    adding a $SUFFIX option around line 269 shouldn't be too complicated.

    -- bash$ :(){ :|:&};:
Re: Subdir globs
by NeverMore (Acolyte) on May 31, 2000 at 02:12 UTC
    The thing is, in each directory containing these files, there is an index file. I don't know the name of the index file (I'm actually searching for *.tab, *.btab, *.crd, and *.pro files at http://www.nutz.org/olga/main). I've tried educated guesses such as index.pl, index.cgi, index.php, index.php3, index.shtml, etc., but none of them have worked. Now, the index lists and links all of the subdirectories and files in the current directory. If I can find out the name of the index file, I can probably make a script that will scan the HTML code for links to a subdirectory and/or *.tab file and either switch to the index page of that directory or download the file.

    -NM

      Looking at that site I see that it is organized in a 3-level hierarchy. What you'll probably want to do is something along the lines of this:
      get the main page
      foreach subpage
        get subpage
        foreach sub-sub page
          get sub-sub page
          parse out and get any files you're interested in
      
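A sketch of that nested walk, run against an in-memory stand-in for the site (page name => HTML) so the control flow is visible without a network; a real version would fetch each page with LWP instead of looking it up in the hash.

```perl
#!/usr/bin/perl
use strict;

# Made-up three-level site: main page -> letter page -> artist page -> files.
my %site = (
    'main'    => '<a href="letterA">A</a>',
    'letterA' => '<a href="artist1">artist1</a>',
    'artist1' => '<a href="song.tab">song.tab</a> <a href="bio.html">bio</a>',
);

# Extract all href targets from a chunk of HTML.
sub links { my ($html) = @_; return $html =~ /href="([^"]+)"/gi; }

my @found;
for my $sub (links($site{'main'})) {               # foreach subpage
    for my $subsub (links($site{$sub} || '')) {    # foreach sub-sub page
        for my $link (links($site{$subsub} || '')) {
            push @found, $link if $link =~ /\.(tab|btab|crd|pro)$/i;
        }
    }
}
print "$_\n" for @found;   # song.tab
```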
Re: Subdir globs
by chromatic (Archbishop) on Jun 06, 2000 at 23:43 UTC
    You might have a look at LWP::RobotUA. It's a really good place to start if you want to build a robot, following the recommendation of Ozymandias.
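A minimal construction sketch: LWP::RobotUA behaves like LWP::UserAgent but reads each site's /robots.txt and pauses between requests to the same server. The bot name and contact address below are placeholders.

```perl
#!/usr/bin/perl
use strict;
use LWP::RobotUA;

# new() takes the agent string and a contact address (both required).
my $ua = LWP::RobotUA->new('tab-fetcher/0.1', 'you@example.com');
$ua->delay(0.5);   # minutes to wait between requests to one server

# From here it is used exactly like LWP::UserAgent:
# my $res = $ua->get($url);
print "agent: ", $ua->agent, "\n";
```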
Re: Subdir globs
by lhoward (Vicar) on May 31, 2000 at 01:18 UTC
    btrott is right. However, if the site you are interested in is configured to return a directory listing when there is no index.html (or whatever the site is configured to show as the default page for a directory), then you can just do a GET of the URL and parse the output to see if it has any files you would be interested in.

    This technique is very dependent on server configuration (and it sounds like you're looking for a client-only solution), so you may be out of luck.
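If the server does return an auto-generated listing, parsing it could look like this sketch; the listing below is a made-up Apache-style index standing in for the real GET response.

```perl
#!/usr/bin/perl
use strict;

# Hypothetical auto-index page, as returned when no index.html exists.
my $listing = <<'HTML';
<a href="../">Parent Directory</a>
<a href="stairway.tab">stairway.tab</a>
<a href="wish.crd">wish.crd</a>
<a href="readme.html">readme.html</a>
HTML

my @wanted;
while ($listing =~ /href="([^"]+)"/gi) {
    my $f = $1;
    next if $f =~ m{/$};   # skip subdirectory / parent links
    push @wanted, $f if $f =~ /\.(tab|btab|crd|pro)$/i;
}
print "$_\n" for @wanted;   # stairway.tab, wish.crd
```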
