PerlMonks  

Remote Directory listing

by johnfl68 (Scribe)
on Jul 07, 2012 at 20:57 UTC [id://980518]

johnfl68 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all:

I have been looking for examples to retrieve a list of image files from a remote directory (http - standard apache index) to load into an array so I can get the 10 most recent and copy them to my server (via getstore). There are typically only 20-40 images in this directory at any given time that are being rotated out. There are other things that I am doing after that with imagemagick, but I have that handled.

I have found a few examples, but they require robot or scraping modules; I am on a shared server, so I am trying to do this with the modules that are already installed. I am not really scraping, since I am only hitting the one remote server, and I have permission to do so.

I am usually good at getting things to work if I have good examples to go by, but I've been searching for a week now and not having much luck. Worst case, the files follow a certain format, with a date and time in the file name (they are just not sequential). I could always loop starting from "now" and do a getstore for every possible name until I retrieve 10 existing files, but I would think there has to be some way to get a list of files in a remote directory without getting too carried away.

Thank you for the help!

John

Replies are listed 'Best First'.
Re: Remote Directory listing
by aaron_baugher (Curate) on Jul 07, 2012 at 22:31 UTC
    I would think there has to be some way to get a list of files in a remote directory without getting too carried away.

    Not really. Not many servers are set up to allow directory indexing anymore. If going to the directory's URL in your browser shows you a list of files, then you can get the same list with a program of your own. If it doesn't, you can't. The other issue is that the list is normally returned as an HTML page, and different servers may format it in different ways, so it would be hard to create a consistent way to parse out the filenames. Since you're dealing with a single, known system, though, it might not be too hard in your case.

    If the server will allow you to get the index of a directory, then there are various ways you could do it. You could fetch the URL with one of many HTTP aware modules like LWP::Simple or WWW::Mechanize. You could talk to the HTTP port directly through something like Net::Telnet. You'll still get an HTML page to deal with, though.
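    For what it's worth, here is a minimal sketch of the LWP::Simple route, assuming a stock Apache index page. The URL is a placeholder, and `index_links` is a made-up helper name; the regex relies on Apache's habit of rendering each entry as an `<a href="...">` link, skipping the column-sort links (which begin with `?`) and anything ending in a slash (subdirectories and the parent-directory link):

    ```perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Pull the href targets out of an Apache-style index page,
    # skipping sort-query links (?C=N;O=D) and anything that ends
    # in a slash (subdirectories and the parent-directory link).
    sub index_links {
        my ($html) = @_;
        my @files;
        while ($html =~ /<a\s+href="([^"]+)"/gi) {
            my $href = $1;
            next if $href =~ /^\?/;    # column-sort links
            next if $href =~ m{/$};    # directories
            push @files, $href;
        }
        return @files;
    }

    # Pass the directory URL on the command line to fetch a live listing.
    if (my $url = shift @ARGV) {
        my $html = get($url) // die "Could not fetch $url\n";
        print "$_\n" for index_links($html);
    }
    ```

    This is a sketch, not a robust HTML parser; it assumes the simple anchor format a default Apache index emits, which seems to match the OP's situation.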

    Another option would be to shell out to lynx, and use its -dump option to lose the HTML. My version of lynx returns the list of files like this:

    * [1]Parent Directory
    * [2]1
    * [3]2
    * [4]3
    * [5]abcd

    So parsing that would be easy:

    open my $in, "lynx -dump $myurl |" or die $!;
    while (<$in>) {
        chomp;
        if ( /\[(\d+)\](.+)/ ) {
            next if $1 == 1;
            print "$2\n";
        }
    }

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Remote Directory listing
by ambrus (Abbot) on Jul 08, 2012 at 16:47 UTC

    I have a script that mirrors some files recursively from a directory. This gets the list of filenames from a webserver-generated directory listing. I show the complete listing here, but the relevant part of the code that parses the listing is outside the readmore tags.

    This script does not try to parse the last modified times from the directory listing. Instead I send a request to download every file, and arrange with HTTP magic that files that were not modified since the last complete download are not downloaded again.

    use XML::Twig ();
    use Time::HiRes ();

    wrlog "getting directory " . $edir;
    my $resp = $LWP->get($baseurl . $edir);
    if ($resp->is_success) {
        my $twig = XML::Twig->new;
        if (!$twig->safe_parse($resp->content)) {
            wrlog "error xml parsing directory listing of " . $edir . " as xml: " . $@;
            return;
        }
        my($etitle) = $twig->findnodes("//title");
        if (!$etitle || $etitle->text !~ /\A\s*Index\b/i) {
            wrlog "directory listing has wrong title " . $edir;
            return;
        }
        for my $ea ($twig->findnodes("//a")) {
            my $href = $ea->att("href");
            my $n = unescape($href);
            if (defined($n) && $n !~ m"\A[\?\/]") {
                #wrlog "found link from directory " . $edir . " : " . escape($n);
                my $isdir = $n =~ s"\/+\z"";
                my $abs = $dir . $n;
                my $eabs = escape($abs);
Re: Remote Directory listing
by johnfl68 (Scribe) on Jul 08, 2012 at 05:50 UTC

    Thank you for your input.

    As I said in the first post, I can view the directory as a standard apache index page. It will always be the same server so in theory it should always be that way.

    So as I understand it, I can read that directory page with LWP::Simple's get($url) and then parse it with a regex to get a list of the file names?

    If the page content is read in as a single string using get, do I need to do something to separate it into individual lines before looking for the regex match?

    I kind of understand the pattern matching, and have tested the pattern against the source of the directory page in a regex tester, and I get the correct results (except there are two of every match) as expected.

    I think I am much closer here, just not sure how to use regex to parse the string from the get($url).
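    (As an aside: a global match in list context will pull every filename straight out of the single string that get() returns, so no line-splitting is needed. Matching only the href attribute also avoids the doubled results, since Apache prints each name in both the link target and the link text. A sketch with made-up filenames standing in for the fetched page:)

    ```perl
    use strict;
    use warnings;

    # $html stands in for the string returned by get($url).
    my $html = join '',
        '<a href="img_20120707_1200.jpg">img_20120707_1200.jpg</a>',
        '<a href="img_20120707_1230.jpg">img_20120707_1230.jpg</a>';

    # /g in list context collects every capture in one pass; matching
    # inside href="..." only, so each name is found exactly once.
    my @files = $html =~ /href="([^"]+\.jpg)"/g;
    print "$_\n" for @files;
    ```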

    Again thank you for all your help!

    John

Re: Remote Directory listing
by johnfl68 (Scribe) on Jul 08, 2012 at 10:23 UTC

    OK - I have gotten along much farther now with your help.

    I am able to get the list of files from the remote server, and create a local text file with the 10 most recent files.

    I just need a little more help. I need to apply a regex (I think) as part of the process, as I need the date-and-time section of each of the 10 files for the next step.

    I have this as my last section, giving me the last 10 files in the list:

    open(FILE, "<filelist.txt");
    @file = <FILE>;
    chomp @file;
    close FILE;
    open (MYFILE, '>links.txt');
    print MYFILE "$file[-1]\n";
    print MYFILE "$file[-2]\n";
    print MYFILE "$file[-3]\n";
    print MYFILE "$file[-4]\n";
    print MYFILE "$file[-5]\n";
    print MYFILE "$file[-6]\n";
    print MYFILE "$file[-7]\n";
    print MYFILE "$file[-8]\n";
    print MYFILE "$file[-9]\n";
    print MYFILE "$file[-10]\n";
    close (MYFILE);

    Can I add something to that which will use this regex, "\d\d\d\d\d\d\d\d\_\d\d\d\d", and print just the portion of the filename that matches for each of the 10 files?

    I hope that made sense.

    Thanks again everyone for your help!

    John

      Yes, it's possible to match a regex against these strings. Your code can be optimized like this:
      open(FILE, "<filelist.txt");
      @file = <FILE>;
      chomp @file;
      close FILE;
      open (MYFILE, '>links.txt');
      for my $i (1..10) {
          print MYFILE "$file[-$i]\n";
          print "$1\n" if $file[-$i] =~ m/(\d{8}_\d{4})/;
      }
      close (MYFILE);
      Sorry if my advice was wrong.
Re: Remote Directory listing
by tobyink (Canon) on Jul 07, 2012 at 21:38 UTC

    Yes, even you can use CPAN.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Which module would you suggest for reliably getting the contents of a remote directory via HTTP?

      AFAIK, it is entirely up to the server and the applications running on it to decide how and if a directory is rendered at any given URL. Just because I can get a document at http://someurl.com/documents/mydoc.txt doesn't mean I can get a directory at http://someurl.com/documents/. I might. But I might just get a 403 - Forbidden, 404 - Not Found, or whatever resource the site's author intends to be served specifically at the URL.

      If the OP had said that he can get a directory to render on his browser by entering http://someurl.com/documents/, then we could point him to LWP::Simple. But I think we're missing some information before we can guide him in that direction with any assurance that the advice is going to work for him.


      Dave

        "AFAIK, it is entirely up to the server and the applications running on it to decide how and if a directory is rendered at any given URL. Just because I can get a document at http://someurl.com/documents/mydoc.txt doesn't mean I can get a directory at http://someurl.com/documents/. I might."

        I inferred from his question (where he said, "http - standard apache index") that this was already not a problem. That he has a particular directory in mind which has a known directory listing format.

        "Which module would you suggest for reliably getting the contents of a remote directory via HTTP?"

        Personally I'd use Web::Magic, but I'm biased.

        use 5.010;
        use strict;
        use PerlX::MethodCallWithBlock;
        use Path::Class qw(file dir);
        use Web::Magic -sub => 'web';
        use XML::LibXML 2.0000;

        my $listing     = URI->new('http://buzzword.org.uk/2012/');
        my $destination = dir('/home/tai/tmp/downloaded/');

        # Make sure destination directory exists.
        $destination->mkpath;

        web($listing)
            # Die if 404 or some other error
            -> assert_success
            # Find all the links on the page
            -> querySelectorAll('a[href]')
            # Skip uninteresting links
            -> grep {
                not (
                    /Parent Directory/
                    or $_->{href} =~ m{\?}   # has a query
                    or $_->{href} =~ m{/$}   # ends in slash
                )
            }
            # Expand relative URI references to absolute URIs
            -> map { URI->new_abs($_->{href}, $listing) }
            # Save each to the destination directory
            -> foreach {
                # Figure out name of file to save as
                my $filename = $destination->file( [$_->path_segments]->[-1] );
                # Log a message
                printf STDERR "Saving <%s> to '%s'\n", $_, $filename;
                # Save it!
                web($_)->save_as("$filename");
            }
        perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: Remote Directory listing
by johnfl68 (Scribe) on Jul 14, 2012 at 00:28 UTC

    Thank you everyone for your help and insight with this!

    John

      John - I am looking for something similar and am having zero luck. Can you post your script? Essentially, I am looking for a script to get a list of files from a URL (the site has directory listing enabled) and their file sizes, and push that into a hash of filename -> filesize. Thanks!

Node Type: perlquestion [id://980518]
Approved by davido