Finding un-paired files in a directory

inman has asked for the wisdom of the Perl Monks concerning the following question:

I am working with a directory that contains both data files and marker files. The marker files have the same base name as the data files but a different extension. They are used as markers that identify the data files that are no longer used. As a result, the files exist as pairs unless they are current in which case only the data file is present. An example directory listing would contain:

00000001.did
00000001.mrg
00000002.did
00000002.mrg
00000003.did
[download]

I want to work with all of the current data files so I am using the code below to find the list of data files that does not have a corresponding marker file.

sub unmergedFiles
{
    my $dir = shift;
    my %merged;
    my @files;
    
    # create the hash containing the merged files
    foreach (glob ($dir."/*.mrg"))
    {
        if ( /(\S+)\.mrg/i )
        {
            $merged{lc ($1 . ".did")} = "1";
        }
    }
    
    foreach (glob ($dir."/*.did"))
    {
        unless (exists $merged{lc($_)})
        {
            push @files, $_;
        }
    }
    # return style as per node 311537!
    return wantarray ? @files : \@files;
}
[download]

This code works just fine but I would like the opinion of more learned monks as to how this code can be improved. A couple of things spring to mind:

There is probably a CPAN module to do this already but I haven't looked as writing the code was educational.
I am really just comparing the contents of two arrays to find the difference. Have I missed out on a more concise method?
I am creating a hash by iterating over an array and assigning new hash elements. Is there a better way of doing this?

All comments gratefully received!

inman

Comment on Finding un-paired files in a directory Select or Download Code

Replies are listed 'Best First'.
Re: Finding un-paired files in a directory by Corion (Patriarch) on Dec 02, 2003 at 11:42 UTC
Personally, I would use a different approach by restating the problem: You are interested in all `.mrg` files that have no corresponding `.did` file : `opendir DIR, $dir or die "Couldn't open directory '$dir' : $!"; my @files = grep { /(.)\.mrg$/ and not -f "$dir/$1.did" } readdir DIR +; closedir DIR;` [download] My method might be a bit slower, as for each `.mrg` file, an additional call to `stat` will be made, which can be very slow on full directories, but I think that the shorter code makes up for the slower code. If speed should really become an issue, I'd `readdir` the directories contents into a hash and then check for existence in the hash much like your example: `opendir DIR, $dir or die "Couldn't read '$dir' : $!"; my @all_files = map { lc $_ } readdir DIR; closedir DIR; my %did = map { /(.)\.did$/ and ($1 => 1) } grep { /\.did$/ } @all_ +files; my @files = grep { /(.)\.mrg$/ and not $did{$1} } @all_files;` [download] Also, I don't think that production code should contain references to Perlmonks node IDs, but rather an explanation of what happens : `# return a list or a reference to an array, depending # on what the caller wants: return wantarray ? @files : \@files;` [download] Update:* Added "faster" alternative Update 2: Fixed code in response to merlyns bugfinding `perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web` [download]	[reply] [d/l] [select]
•Re: Re: Finding un-paired files in a directory by merlyn (Sage) on Dec 02, 2003 at 12:13 UTC
`opendir DIR, $dir or die "Couldn't open directory '$dir' : $!"; my @files = grep { /(.*)\.mrg$/ and not -f "$1.did" } readdir DIR; closedir DIR` [download] No, that's testing "-f" on a file in the current directory for a name that should be checked in a different directory. A traditional `readdir` mistake. {grin} -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.	[reply] [d/l]
Re^3: Finding un-paired files in a directory by Coruscate (Sexton) on Dec 02, 2003 at 21:26 UTC
Which is why globing can be so lovely, as it includes the file path for you :) I'm not sure if this is any faster than Abigail-II's solution. At least this does only check for the existence of each file once rather than two checks. `my $dir = '/path/to/dir'; print join ", ", grep { /(.)\.mrg\z/ and not -f "$1.did" } <$dir/\.mrg>'` [download]	[reply] [d/l]
Re: Finding un-paired files in a directory by Roger (Parson) on Dec 02, 2003 at 12:10 UTC
Here's my implementation - `use strict; use Data::Dumper; chomp(my @files = <DATA>); # find unique basenames my @basenames = keys %{{ map { (split/\./)[0] => 1 } @files }}; # find unpaired files foreach my $basename (@basenames) { my @matches = grep (/$basename/, @files); print "$matches[0]\n" if $#matches == 0; } __DATA__ 00000001.did 00000001.mrg 00000002.did 00000002.mrg 00000003.did 00000004.mrg` [download] And the output - `00000004.mrg 00000003.did` [download]	[reply] [d/l] [select]
•Re: Re: Finding un-paired files in a directory by merlyn (Sage) on Dec 02, 2003 at 12:16 UTC
`use strict; use Data::Dumper; chomp(my @files = <DATA>); # find unique basenames my @basenames = keys %{{ map { (split/\./)[0] => 1 } @files }}; # find unpaired files foreach my $basename (@basenames) { my @matches = grep (/$basename/, @files); print "$matches[0]\n" if $#matches == 0; }` [download] No, this solution breaks if any filenames have regex-significant metachars, or are partially within a larger filename. In my opinion, the original solution is actually the most direct, as this is a classic set difference problem. -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.	[reply] [d/l]
Re: Finding un-paired files in a directory by Abigail-II (Bishop) on Dec 02, 2003 at 13:13 UTC
I'd use something like (untested): `opendir my $dh => $dir or die; my %files; @files {map {lc} readdir} = (); closedir $dir; my @missing = grep {/^(.*)\.did$/s && !exists $files {"$1.mrg"} keys % +files;` [download] Abigail	[reply] [d/l]
Re: Finding un-paired files in a directory by Anonymous Monk on Dec 02, 2003 at 19:01 UTC
You want the "*.did" files that have no corresponding ".mrg" file, right? Hmmm... let me type something in here... `use strict; use File::Find; my %temp; find( sub { s/mrg$/did/; /did$/ and $temp{$_}++; }, '.' ); my @files = grep { $temp{$_} == 1 } keys %temp; print "@files";` [download] This code takes advantage of the fact that you will have only one or two files with the same basename (and that there are only two extensions of interest). So, just count them up. It doesn't handle the case of a lone ".mrg" file. To fix that... `use strict; use File::Find; my %temp; find( sub { m/^(.+)\.(mrg\|did)/ and push(@{$temp{"$1.did"}}, $2) }, '.' ); my @files = grep { @{ $temp{$_} } == 1 and $temp{$_}->[0] eq 'did' } sort keys %temp; print "@files";` [download] The beauty of these code snippets (actually, they are working scripts) is that they don't do any filetests. File::Find (or readdir if you prefer) have already established the existence of the files -- checking for that again just slows down your code. :) -Dave	[reply] [d/l] [select]
Re: Finding un-paired files in a directory by Jenda (Abbot) on Dec 02, 2003 at 15:21 UTC
`opendir my $DIR, '.' or die "Can't opendir . : $!\n"; my %single; my %other = (did => 'mrg', mrg => 'did'); while (my $file = readdir $DIR) { #print "$file\n"; my ($name,$ext) = ($file =~ /^(.)\.(did\|mrg)$/) or next; if (exists $single{"$name.$other{$ext}"}) { delete $single{"$name.$other{$ext}"}; } else { $single{$file} = undef; } } print "Unpaired files:\n"; foreach my $file (keys %single) { print "\t$file\n"; }` [download] A bit longer than most but if the readdir gives you the filenames in order (my does), there is a lot of such files and only a minimum unpaired I believe it would be the quickest and least memory hungry. Update:* Of course if your filesystem is case insensitive you should lowercase the filename to make sure you look for the pairs case insensitively. You may want store the original case in the %single hash in such case. Jenda Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. -- Rick Osborne Edit by castaway: Closed small tag in signature	[reply] [d/l]
Re: Finding un-paired files in a directory by Necos (Friar) on Dec 03, 2003 at 09:21 UTC
Well, I'll give this problem a go even though it has been beaten to death already. It's a bit long, but it works all the same. `use strict; my %files; my @files; @files =<DATA>; @files = map { s/\n//; $_ } @files; close(DATA); foreach (@files) { $files{substr($_,0,(length()-4))}++; } foreach (keys %files) { print "$_ is not matched and has a count of $files{$_}\n" if $file +s{$_} == 1; } __DATA__ 001.did 001.mrg 002.did 002.mrg 003.did 004.did 005.did 005.mrg` [download] The output of this is: C:\configs>perl test.pl 003 is not matched and has a count of 1 004 is not matched and has a count of 1 Of course, this code is nothing more than a modified "seen" loop. Not too fancy. One thing I should point out is that if you use `foreach (<*.mrg>)`, it'll do the same thing as your glob (because it is the same thing as far as I know). It's just shorter. Hope that sparks a few ideas. Theodore Charles III Network Administrator Los Angeles Senior High email->secon_kun@hotmail.com `perl -e "map{print++$_}split//,Mdbnr;"`	[reply] [d/l] [select]
Re: Finding un-paired files in a directory by mce (Curate) on Dec 03, 2003 at 12:49 UTC
Hi, If this is the only thing you want to do, I would use bash `for i in *did; do [ ! -f ${i%\.did}.mrg ] && echo $i; done` [download] (Don't take this post to serious, I would -of course- go for perl if I need to do more than just display the files) --------------------------- Dr. Mark Ceulemans Senior Consultant BMC, Belgium	[reply] [d/l]

Back to Seekers of Perl Wisdom