Checking for new files

New Novice has asked for the wisdom of the Perl Monks concerning the following question:

Enlightened Ones,

Some time ago I downloaded a (huge) number of webpages from the internet using a Perl routine. Now I want to check whether there are any files that I missed or that are new.

It is actually more complicated than this (I first have to retrieve information from a database, open webpages, and then extract bits and pieces from the resulting pages). What it boils down to, however, is that I have a list of the files on my computer and that I can generate a list of the webpages. I can link these two by a common piece of information (there is an ID number on the webpages, which I use to construct the file name). Now the question is: how do I create a third list that gives me all the webpages that my new search of the database returned but which I haven't downloaded yet (i.e., which are not contained in the list of files)? In other words, how can I compare the elements of two lists to find those that are contained in one but not the other?

Go with Perl!

Many Thanks in advance!

Re: Checking for new files by Corion (Patriarch) on Jan 28, 2005 at 09:30 UTC
`perldoc -q difference`
How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Use a hash. Here's code to do both and more. It assumes that each element is unique in a given array:
`@union = @intersection = @difference = (); %count = (); foreach $element (@array1, @array2) { $count{$element}++ } foreach $element (keys %count) { push @union, $element; push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element; }`
Note that this is the symmetric difference, that is, all elements in either A or in B but not in both. Think of it as an xor operation.
The example computes the symmetric difference, but most likely you are only interested in the pages that are new on the web and missing from your local copy, so you will want to modify the check as follows so that it gives you only the locally missing items:
`use strict; my (@local) = get_local_ids(); my (@remote) = get_remote_ids(); my %have_local = (); foreach my $element (@local) { $have_local{$element}++ }; foreach my $id (@remote) { next if $have_local{$id}; retrieve($id); $have_local{$id}++; }`
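The one-sided check above can also be tied back to the file names mentioned in the question. What follows is only a minimal sketch under stated assumptions: the `page_<id>.html` naming scheme, the sample data, and the helper `missing_ids` are all hypothetical and stand in for whatever the real download routine uses.

```perl
use strict;
use warnings;

# Hypothetical sample data: file names already on disk and the IDs
# returned by the new database search. The "page_<id>.html" naming
# scheme is an assumption made for this sketch.
my @local_files = qw(page_101.html page_102.html page_104.html notes.txt);
my @remote_ids  = qw(101 102 103 104 105 103);

# Return the IDs in @$remote_ids that have no matching local file,
# keeping their original order and dropping duplicate IDs.
sub missing_ids {
    my ($local_files, $remote_ids) = @_;

    # Map each local file name back to its ID; names that do not
    # follow the assumed scheme are ignored.
    my %have_local;
    for my $file (@$local_files) {
        $have_local{$1} = 1 if $file =~ /\Apage_(\d+)\.html\z/;
    }

    my %seen;
    return grep { !$have_local{$_} && !$seen{$_}++ } @$remote_ids;
}

my @missing = missing_ids(\@local_files, \@remote_ids);
print "$_\n" for @missing;   # prints 103 and 105, one per line
```

Building the lookup hash once keeps each membership test constant-time, which matters when the number of pages is huge; `%seen` only guards against the same ID turning up twice in the remote list.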
Re: Checking for new files by ambrus (Abbot) on Jan 28, 2005 at 10:21 UTC
Sort the two lists and compare with comm (in the files and in the output, each word is on its own line):
`$ cat a` `one two three four five six seven`
`$ cat b` `two four five six seven`
`$ sort a > a.sorted`
`$ sort b > b.sorted`
`$ comm -23 a.sorted b.sorted` `one three`
Update 2009 Sep 2: See Re^2: Joining two files on common field for a list of other nodes where Unix textutils are suggested for merging files.
Re^2: Checking for new files by kelan (Deacon) on Jan 28, 2005 at 14:56 UTC
Use zsh! :) `$ comm -23 <(sort a) <(sort b)`
Re^3: Checking for new files by ambrus (Abbot) on Jan 28, 2005 at 15:02 UTC
That works in bash too, and I'd use it myself; but it's not as portable, so I didn't write it here. Compare Re: comparing arrays and Re: help with lists.

