http://qs321.pair.com?node_id=15504

NeverMore has asked for the wisdom of the Perl Monks concerning the following question:

How would I go about making a Perl script that will connect to an HTTP site, search a certain directory and all of its subdirectories for a certain type of file (*.txt files, for example), and download all of those files? It's not possible to FTP to the site, by the way.

-NM

Replies are listed 'Best First'.
RE: Subdir globs
by Ozymandias (Hermit) on May 31, 2000 at 01:17 UTC
    btrott has a good point, but you *could* combine elements of a search engine with this script. Basically, you would crawl the site (beginning with the homepage) checking all linked pages within the domain and downloading those with the target extension.

    I'd start with one of the search engine/site crawlers that are available, and modify it as I go. Basically, you're looking for any link (so search the returned code of the front page for "<a href=" and work from there to find new pages to search). But have it check each link for the extension you're looking for; in this case, when your code pulls out the URL of the link (however you have it do that), check it for the extension:

    if (/\.txt$/i) { ...
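
    A rough sketch of that link-scan idea might look like the following; it assumes LWP::Simple and URI are installed, the page URL is a placeholder, and the href-matching regex is deliberately crude:

    # Fetch one page, pull out href values, and note any that end in
    # the extension we're after.
    use strict;
    use LWP::Simple qw(get);
    use URI;

    my $page_url = 'http://www.example.com/index.html';   # placeholder page
    my $html     = get($page_url) or die "couldn't fetch $page_url\n";

    while ($html =~ /<a\s+[^>]*href\s*=\s*["']?([^"'\s>]+)/gi) {
        my $url = URI->new_abs($1, $page_url)->as_string;
        if ($url =~ /\.txt$/i) {
            print "found $url\n";        # a candidate file to download
        } else {
            # otherwise treat $url as another page to search for links
        }
    }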

    - Ozymandias

Re: Subdir globs
by btrott (Parson) on May 31, 2000 at 00:42 UTC
    Using HTTP you can't reliably get listings of all files in a directory. Are you talking about a listing like "ls" would give you? You can't get that over HTTP, unless you're implementing both ends of the request--in other words, unless you've implemented some server-side script to print out a directory listing for a given directory.

    In any case, look into LWP for making HTTP requests--you can use this to fetch the files, once you know what files you'd like to fetch.
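
    For example, once you do know the URLs, fetching and saving them with LWP::Simple is only a few lines; the URL below is made up:

    # Fetch known files and save each one under its basename.
    use strict;
    use LWP::Simple qw(getstore);
    use HTTP::Status qw(is_success);

    my @urls = ('http://www.example.com/songs/foo.txt');   # placeholder list
    for my $url (@urls) {
        (my $file = $url) =~ s{.*/}{};   # save under the last path component
        my $status = getstore($url, $file);
        warn "$url failed with status $status\n" unless is_success($status);
    }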

Re: Subdir globs
by plaid (Chaplain) on May 31, 2000 at 01:13 UTC
    Take a look at wget, located at http://www.gnu.org/software/wget/wget.html. Excerpt from that page:

    The recursive retrieval of HTML pages, as well as FTP sites, is supported: you can use Wget to make mirrors of archives and home pages, or traverse the web like a WWW robot (Wget understands /robots.txt)
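
    If wget is installed, you could even drive it from Perl rather than re-implementing the crawl; a rough sketch, where the start URL is a placeholder and the flags are wget's recursive-retrieval and accept-list options:

    # Let wget do the recursive crawl, keeping only the wanted suffixes.
    # Assumes wget is on the PATH.
    use strict;

    my $site = 'http://www.example.com/files/';    # placeholder start URL
    system('wget', '-r', '-np', '-l', 'inf',       # recurse, stay below the start dir
           '-A', 'txt',                            # accept only *.txt files
           $site) == 0
        or die "wget exited with status $?\n";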

Re: Subdir globs
by antihec (Sexton) on May 31, 2000 at 01:19 UTC
    You might want to have a look into lwp-rget from the libwww package;
    Adding a $SUFFIX option around line 269 shouldn't be too complicated.

    -- bash$ :(){ :|:&};:

Re: Subdir globs
by NeverMore (Acolyte) on May 31, 2000 at 02:12 UTC
    The thing is, in each directory containing these files, there is an index file. I don't know the name of the index file (I'm actually searching for *.tab, *.btab, *.crd, and *.pro files at http://www.nutz.org/olga/main). I've tried educated guesses such as index.pl, index.cgi, index.php, index.php3, index.shtml, etc. None of them have worked. Now, the index lists and links all of the subdirectories and files in the current directory. If I can find out the name of the index file, I can probably make a script that will scan the HTML code for links to a subdirectory and/or *.tab file and either switch to the index page of that directory or download the file.

    -NM

      Looking at that site, I see that it is organized in a three-level hierarchy. What you'll probably want to do is something along the lines of the outline below (a rough Perl sketch follows it):
      get the main page
      foreach subpage
        get subpage
        foreach sub-sub page
          get sub-sub page
          parse out and get any files you're interested in
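
      A rough Perl rendering of that outline, assuming LWP and HTML::LinkExtor are available; the main-page URL and the extension list are placeholders:

      # Walk the three-level hierarchy described above and report the
      # file links found on the bottom level.
      use strict;
      use LWP::UserAgent;
      use HTML::LinkExtor;
      use URI;

      my $ua   = LWP::UserAgent->new;
      my $main = 'http://www.example.com/main';    # placeholder main page

      # Return the absolute URLs of all <a href="..."> links on a page.
      sub links_on {
          my $url = shift;
          my $res = $ua->get($url);
          return () unless $res->is_success;
          my $p = HTML::LinkExtor->new;
          $p->parse($res->content);
          my @urls;
          for my $link ($p->links) {
              my ($tag, %attr) = @$link;
              push @urls, URI->new_abs($attr{href}, $url)->as_string
                  if $tag eq 'a' and $attr{href};
          }
          return @urls;
      }

      for my $sub (links_on($main)) {              # foreach subpage on the main page
          for my $subsub (links_on($sub)) {        # foreach sub-sub page
              for my $link (links_on($subsub)) {   # links on the sub-sub page
                  print "would fetch $link\n"      # the files we're after
                      if $link =~ /\.(tab|btab|crd|pro)$/i;
              }
          }
      }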
      
Re: Subdir globs
by chromatic (Archbishop) on Jun 06, 2000 at 23:43 UTC
    You might have a look at LWP::RobotUA. It's a really good place to start if you want to build a robot. That follows the recommendation of Ozymandias.
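
    A minimal LWP::RobotUA setup looks something like this (the robot name, contact address, and URL are placeholders); it honours /robots.txt and paces its requests for you:

    # Polite robot user agent: obeys /robots.txt and waits between hits.
    use strict;
    use LWP::RobotUA;

    my $ua = LWP::RobotUA->new('tab-grabber/0.1', 'you@example.com');
    $ua->delay(0.5);                    # minutes to wait between requests
    my $res = $ua->get('http://www.example.com/');
    die $res->status_line, "\n" unless $res->is_success;
    print $res->content;
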
Re: Subdir globs
by lhoward (Vicar) on May 31, 2000 at 01:18 UTC
    btrott is right. However, if the site you are interested in is configured to return a directory listing when there is no index.html (or whatever the site is configured to show as the home page for a directory), then you can just do a GET of the URL and parse the output to see if it has any files you would be interested in.

    This technique is very dependent on server configuration (and it sounds like you're looking for a client-only solution), so you may be out of luck.
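
    When the server does hand back an auto-generated index, the scraping side can be as simple as this sketch; the directory URL and the extension list are placeholders:

    # GET the directory URL and pick the interesting links out of the
    # auto-generated listing the server returns.
    use strict;
    use LWP::Simple qw(get);
    use URI;

    my $dir  = 'http://www.example.com/files/';    # placeholder directory URL
    my $html = get($dir) or die "no listing returned for $dir\n";

    while ($html =~ /href\s*=\s*["']?([^"'\s>]+)/gi) {
        my $url = URI->new_abs($1, $dir)->as_string;
        print "$url\n" if $url =~ /\.(tab|btab|crd|pro)$/i;
    }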