PerlMonks  

Remote Directory listing

by johnfl68 (Scribe)
on Jul 07, 2012 at 20:57 UTC [id://980518]

johnfl68 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all:

I have been looking for examples to retrieve a list of image files from a remote directory (http - standard apache index) to load into an array so I can get the 10 most recent and copy them to my server (via getstore). There are typically only 20-40 images in this directory at any given time that are being rotated out. There are other things that I am doing after that with imagemagick, but I have that handled.

I have found a few examples, but they require robot or scraping modules; I am on a shared server, so I am trying to do this with the modules that are already installed. I am not really scraping, since I am only hitting the one remote server, and I have permission to do so.

I am usually good at getting things to work if I have good examples to go by, but I've been searching for a week now and not having much luck. Worst case, the files follow a certain format, with a date and time in the file name (they are just not sequential). I could always loop starting from "now" and do a getstore for every possible name until I retrieve 10 existing files, but I would think there has to be some way to get a list of files in a remote directory without getting too carried away.

Thank you for the help!

John

Replies are listed 'Best First'.
Re: Remote Directory listing
by aaron_baugher (Curate) on Jul 07, 2012 at 22:31 UTC
    I would think there has to be some way to get a list of files in a remote directory without getting too carried away.

    Not really. Not many servers are set up to allow directory indexing anymore. If going to the directory's URL in your browser shows you a list of files, then you can get the same list with a program of your own. If it doesn't, you can't. The other issue is that the list is normally returned as an HTML page, and different servers may format it in different ways, so it would be hard to create a consistent way to parse out the filenames. Since you're dealing with a single, known system, though, it might not be too hard in your case.

    If the server will allow you to get the index of a directory, then there are various ways you could do it. You could fetch the URL with one of many HTTP aware modules like LWP::Simple or WWW::Mechanize. You could talk to the HTTP port directly through something like Net::Telnet. You'll still get an HTML page to deal with, though.
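    For what it's worth, here is a minimal sketch of the LWP::Simple route, assuming a stock Apache index page. The URL is a placeholder, and `index_links` is a made-up helper name; the regex relies on Apache's habit of rendering each entry as an `<a href="...">` link, skipping the column-sort links (which begin with `?`) and anything ending in a slash (subdirectories and the parent-directory link):

    ```perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Pull the href targets out of an Apache-style index page,
    # skipping sort-query links (?C=N;O=D) and anything that ends
    # in a slash (subdirectories and the parent-directory link).
    sub index_links {
        my ($html) = @_;
        my @files;
        while ($html =~ /<a\s+href="([^"]+)"/gi) {
            my $href = $1;
            next if $href =~ /^\?/;    # column-sort links
            next if $href =~ m{/$};    # directories
            push @files, $href;
        }
        return @files;
    }

    # Pass the directory URL on the command line to fetch a live listing.
    if (my $url = shift @ARGV) {
        my $html = get($url) // die "Could not fetch $url\n";
        print "$_\n" for index_links($html);
    }
    ```

    This is a sketch, not a robust HTML parser; it assumes the simple anchor format a default Apache index emits, which seems to match the OP's situation.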

    Another option would be to shell out to lynx, and use its -dump option to lose the HTML. My version of lynx returns the list of files like this:

    * [1]Parent Directory
    * [2]1
    * [3]2
    * [4]3
    * [5]abcd

    So parsing that would be easy:

    open my $in, "lynx -dump $myurl |" or die $!;
    while (<$in>) {
        chomp;
        if ( /\[(\d+)\](.+)/ ) {
            next if $1 == 1;
            print "$2\n";
        }
    }

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Remote Directory listing
by ambrus (Abbot) on Jul 08, 2012 at 16:47 UTC

    I have a script that mirrors some files recursively from a directory. This gets the list of filenames from a webserver-generated directory listing. I show the complete listing here, but the relevant part of the code that parses the listing is outside the readmore tags.

    This script does not try to parse the last modified times from the directory listing. Instead I send a request to download every file, and arrange with HTTP magic that files that were not modified since the last complete download are not downloaded again.

    use XML::Twig ();
    use Time::HiRes ();

    wrlog "getting directory " . $edir;
    my $resp = $LWP->get($baseurl . $edir);
    if ($resp->is_success) {
        my $twig = XML::Twig->new;
        if (!$twig->safe_parse($resp->content)) {
            wrlog "error xml parsing directory listing of " . $edir . " as xml: " . $@;
            return;
        }
        my($etitle) = $twig->findnodes("//title");
        if (!$etitle || $etitle->text !~ /\A\s*Index\b/i) {
            wrlog "directory listing has wrong title " . $edir;
            return;
        }
        for my $ea ($twig->findnodes("//a")) {
            my $href = $ea->att("href");
            my $n = unescape($href);
            if (defined($n) && $n !~ m"\A[\?\/]") {
                #wrlog "found link from directory " . $edir . " : " . escape($n);
                my $isdir = $n =~ s"\/+\z"";
                my $abs = $dir . $n;
                my $eabs = escape($abs);
Re: Remote Directory listing
by johnfl68 (Scribe) on Jul 08, 2012 at 05:50 UTC

    Thank you for your input.

    As I said in the first post, I can view the directory as a standard apache index page. It will always be the same server so in theory it should always be that way.

    So as I understand it, I can read that directory page with LWP::Simple's get($url) and then parse it with a regex to get a list of the file names?

    If the page content is read in as a single string using get, do I need to do something to separate it into individual lines before looking for the regex match?

    I kind of understand the pattern matching, and have tested the pattern against the source of the directory page in a regex tester, and I get the correct results (except there are two of every match) as expected.

    I think I am much closer here, just not sure how to use regex to parse the string from the get($url).
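    (As an aside: a global match in list context will pull every filename straight out of the single string that get() returns, so no line-splitting is needed. Matching only the href attribute also avoids the doubled results, since Apache prints each name in both the link target and the link text. A sketch with made-up filenames standing in for the fetched page:)

    ```perl
    use strict;
    use warnings;

    # $html stands in for the string returned by get($url).
    my $html = join '',
        '<a href="img_20120707_1200.jpg">img_20120707_1200.jpg</a>',
        '<a href="img_20120707_1230.jpg">img_20120707_1230.jpg</a>';

    # /g in list context collects every capture in one pass; matching
    # inside href="..." only, so each name is found exactly once.
    my @files = $html =~ /href="([^"]+\.jpg)"/g;
    print "$_\n" for @files;
    ```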

    Again thank you for all your help!

    John

Re: Remote Directory listing
by johnfl68 (Scribe) on Jul 08, 2012 at 10:23 UTC

    OK - I have gotten along much farther now with your help.

    I am able to get the list of files from the remote server, and create a local text file with the 10 most recent files.

    I just need a little more help. I need to apply a regex (I think) as part of the process, as I need the date-and-time section of each of the 10 files for the next step.

    I have this as my last section, giving me the last 10 files in the list:

    open(FILE, "<filelist.txt");
    @file = <FILE>;
    chomp @file;
    close FILE;
    open (MYFILE, '>links.txt');
    print MYFILE "$file[-1]\n";
    print MYFILE "$file[-2]\n";
    print MYFILE "$file[-3]\n";
    print MYFILE "$file[-4]\n";
    print MYFILE "$file[-5]\n";
    print MYFILE "$file[-6]\n";
    print MYFILE "$file[-7]\n";
    print MYFILE "$file[-8]\n";
    print MYFILE "$file[-9]\n";
    print MYFILE "$file[-10]\n";
    close (MYFILE);

    Can I add something to that which will use this regex, "\d\d\d\d\d\d\d\d\_\d\d\d\d", and print just the portion of the filename that matches for each of the 10 files?

    I hope that made sense.

    Thanks again everyone for your help!

    John

      Yes, it's possible to match a regex against these strings. Your code can be optimized like this:
      open(FILE, "<filelist.txt");
      @file = <FILE>;
      chomp @file;
      close FILE;
      open (MYFILE, '>links.txt');
      for my $i (1..10) {
          print MYFILE "$file[-$i]\n";
          print "$1\n" if $file[-$i] =~ m/(\d{8}_\d{4})/;
      }
      close (MYFILE);
      Sorry if my advice was wrong.
Re: Remote Directory listing
by tobyink (Canon) on Jul 07, 2012 at 21:38 UTC

    Yes, even you can use CPAN.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Which module would you suggest for reliably getting the contents of a remote directory via HTTP?

      AFAIK, it is entirely up to the server and the applications running on it to decide how and if a directory is rendered at any given URL. Just because I can get a document at http://someurl.com/documents/mydoc.txt doesn't mean I can get a directory at http://someurl.com/documents/. I might. But I might just get a 403 - Forbidden, 404 - Not Found, or whatever resource the site's author intends to be served specifically at the URL.

      If the OP had said that he can get a directory to render on his browser by entering http://someurl.com/documents/, then we could point him to LWP::Simple. But I think we're missing some information before we can guide him in that direction with any assurance that the advice is going to work for him.


      Dave

        "AFAIK, it is entirely up to the server and the applications running on it to decide how and if a directory is rendered at any given URL. Just because I can get a document at http://someurl.com/documents/mydoc.txt doesn't mean I can get a directory at http://someurl.com/documents/. I might."

        I inferred from his question (where he said, "http - standard apache index") that this was already not a problem. That he has a particular directory in mind which has a known directory listing format.

        "Which module would you suggest for reliably getting the contents of a remote directory via HTTP?"

        Personally I'd use Web::Magic, but I'm biased.

        use 5.010;
        use strict;
        use PerlX::MethodCallWithBlock;
        use Path::Class qw(file dir);
        use Web::Magic -sub => 'web';
        use XML::LibXML 2.0000;

        my $listing     = URI->new('http://buzzword.org.uk/2012/');
        my $destination = dir('/home/tai/tmp/downloaded/');

        # Make sure destination directory exists.
        $destination->mkpath;

        web($listing)
            # Die if 404 or some other error
            -> assert_success
            # Find all the links on the page
            -> querySelectorAll('a[href]')
            # Skip uninteresting links
            -> grep {
                not (
                    /Parent Directory/
                    or $_->{href} =~ m{\?}   # has a query
                    or $_->{href} =~ m{/$}   # ends in slash
                )
            }
            # Expand relative URI references to absolute URIs
            -> map { URI->new_abs($_->{href}, $listing) }
            # Save each to the destination directory
            -> foreach {
                # Figure out name of file to save as
                my $filename = $destination->file( [$_->path_segments]->[-1] );
                # Log a message
                printf STDERR "Saving <%s> to '%s'\n", $_, $filename;
                # Save it!
                web($_)->save_as("$filename");
            }
        perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: Remote Directory listing
by johnfl68 (Scribe) on Jul 14, 2012 at 00:28 UTC

    Thank you everyone for your help and insight with this!

    John

      John - I am looking for something similar and am having zero luck. Can you post your script? Essentially, I am looking for a script to get a list of files from a URL (the site has directory listing enabled) and their file sizes, and push that into a hash of filename -> filesize. Thanks!

Node Type: perlquestion [id://980518]
Approved by davido