PerlMonks  

Checking for new files

by New Novice (Sexton)
on Jan 28, 2005 at 09:22 UTC ( [id://425883]=perlquestion )

New Novice has asked for the wisdom of the Perl Monks concerning the following question:

Enlightened Ones,

I downloaded a (huge) number of webpages from the internet using a perl routine some time ago. Now I want to check if there are any files that I missed or that are new.

It is actually more complicated than this (I first have to retrieve information from a database, open the webpages, and then extract bits and pieces from the resulting pages). What it boils down to, however, is that I have a list of the files on my computer and can generate a list of the webpages. I can link the two through a common piece of information: each webpage carries an ID number, which I use to construct the file name. The question is: how do I create a third list containing all the webpages that my new database search returned but that I haven't downloaded yet (i.e., that are not in the list of files)? More generally, how can I compare two lists to find the elements that are contained in one but not the other?

Go with Perl!

Many Thanks in advance!

Replies are listed 'Best First'.
Re: Checking for new files
by Corion (Patriarch) on Jan 28, 2005 at 09:30 UTC
    perldoc -q difference
    How do I compute the difference of two arrays? How do I compute the intersection of two arrays?

    Use a hash. Here's code to do both and more. It assumes that each element is unique in a given array:

        @union = @intersection = @difference = ();
        %count = ();
        foreach $element (@array1, @array2) { $count{$element}++ }
        foreach $element (keys %count) {
            push @union, $element;
            push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
        }

    Note that this is the *symmetric difference*, that is, all elements in either A or in B but not in both. Think of it as an xor operation.
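To make the distinction concrete, here is a minimal sketch that splits the symmetric difference into its two one-sided halves. The sample lists and the names `%in_a`, `%in_b`, `@only_a`, `@only_b` and `@symmetric` are illustrative and not part of the FAQ entry. Like the FAQ code, it assumes each element is unique within its array.

```perl
use strict;
use warnings;

# Illustrative sample data (an assumption, not taken from the FAQ).
my @array1 = qw(one two three four);
my @array2 = qw(two four five);

# Lookup hashes so that each membership test is a single hash access.
my %in_a = map { $_ => 1 } @array1;
my %in_b = map { $_ => 1 } @array2;

# One-sided differences; grep keeps the original order of each list.
my @only_a = grep { !$in_b{$_} } @array1;   # in A but not in B
my @only_b = grep { !$in_a{$_} } @array2;   # in B but not in A

# The symmetric difference is simply both halves combined.
my @symmetric = (@only_a, @only_b);

print "only in A: @only_a\n";
print "only in B: @only_b\n";
print "symmetric: @symmetric\n";
```

For the question asked here, only one half is of interest: the IDs that are in the remote list but not in the local one.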

    The example computes the symmetric difference, but most likely you are only interested in the pages that are new on the web and missing from your local copy. In that case, modify the check as follows so that it only gives you the locally missing items:

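        use strict;

        my @local  = get_local_ids();
        my @remote = get_remote_ids();

        # Mark every ID that already exists locally.
        my %have_local;
        foreach my $element (@local) { $have_local{$element}++ }

        # Retrieve only the remote IDs that are not yet present locally.
        foreach my $id (@remote) {
            next if $have_local{$id};
            retrieve($id);
            $have_local{$id}++;    # also guards against duplicate remote IDs
        }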
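The reply leaves `get_local_ids()`, `get_remote_ids()` and `retrieve()` undefined. As a rough sketch of how the local side might look, the code below derives IDs from file names and then filters the remote list. The naming scheme `page_<ID>.html`, the subs `ids_from_filenames` and `missing_ids`, and the sample data are all hypothetical; the original post does not say how the file names are actually built.

```perl
use strict;
use warnings;

# Extract the numeric ID from each file name.
# ASSUMPTION: files are named like "page_<ID>.html". Adjust the pattern
# to whatever scheme the download routine really used.
sub ids_from_filenames {
    my @files = @_;
    # In list context a successful match returns its captures;
    # a failed match returns the empty list, so other files are skipped.
    return map { /^page_(\d+)\.html$/ } @files;
}

# Return the remote IDs that have no local counterpart,
# preserving the order of the remote list.
sub missing_ids {
    my ($local, $remote) = @_;              # two array references
    my %have_local = map { $_ => 1 } @$local;
    return grep { !$have_local{$_} } @$remote;
}

# Hypothetical sample data standing in for a directory listing
# and for the result of a fresh database query.
my @files  = qw(page_101.html page_102.html page_104.html notes.txt);
my @remote = (101, 102, 103, 104, 105);

my @local   = ids_from_filenames(@files);
my @missing = missing_ids(\@local, \@remote);

print "still to download: @missing\n";
```

In practice `@files` could be filled with `glob` or `readdir`, and `@remote` with the IDs returned by the database query.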
Re: Checking for new files
by ambrus (Abbot) on Jan 28, 2005 at 10:21 UTC

    Sort the two lists and compare with comm:

        $ cat a
        one
        two
        three
        four
        five
        six
        seven
        $ cat b
        two
        four
        five
        six
        seven
        $ sort a > a.sorted
        $ sort b > b.sorted
        $ comm -23 a.sorted b.sorted
        one
        three
        $

    Update 2009 sep 2.

    See Re^2: Joining two files on common field for a list of other nodes where unix textutils is suggested to merge files.

      Use zsh! :)

      $ comm -23 <(sort a) <(sort b)

Approved by jfroebe