http://qs321.pair.com?node_id=311585

inman has asked for the wisdom of the Perl Monks concerning the following question:

I am working with a directory that contains both data files and marker files. The marker files have the same base name as the data files but a different extension. They are used as markers that identify the data files that are no longer used. As a result, the files exist as pairs unless they are current in which case only the data file is present. An example directory listing would contain:

00000001.did 00000001.mrg 00000002.did 00000002.mrg 00000003.did

I want to work with all of the current data files so I am using the code below to find the list of data files that does not have a corresponding marker file.

sub unmergedFiles { my $dir = shift; my %merged; my @files; # create the hash containing the merged files foreach (glob ($dir."/*.mrg")) { if ( /(\S+)\.mrg/i ) { $merged{lc ($1 . ".did")} = "1"; } } foreach (glob ($dir."/*.did")) { unless (exists $merged{lc($_)}) { push @files, $_; } } # return style as per node 311537! return wantarray ? @files : \@files; }

This code works just fine but I would like the opinion of more learned monks as to how this code can be improved. A couple of things spring to mind:

All comments gratefully received!

inman

Replies are listed 'Best First'.
Re: Finding un-paired files in a directory
by Corion (Patriarch) on Dec 02, 2003 at 11:42 UTC

    Personally, I would use a different approach by restating the problem: You are interested in all *.mrg files that have no corresponding *.did file :

    opendir DIR, $dir or die "Couldn't open directory '$dir' : $!"; my @files = grep { /(.*)\.mrg$/ and not -f "$dir/$1.did" } readdir DIR +; closedir DIR;

    My method might be a bit slower, as for each .mrg file, an additional call to stat will be made, which can be very slow on full directories, but I think that the shorter code makes up for the slower code. If speed should really become an issue, I'd readdir the directories contents into a hash and then check for existence in the hash much like your example:

    opendir DIR, $dir or die "Couldn't read '$dir' : $!"; my @all_files = map { lc $_ } readdir DIR; closedir DIR; my %did = map { /(.*)\.did$/ and ($1 => 1) } grep { /\.did$/ } @all_ +files; my @files = grep { /(.*)\.mrg$/ and not $did{$1} } @all_files;

    Also, I don't think that production code should contain references to Perlmonks node IDs, but rather an explanation of what happens :

    # return a list or a reference to an array, depending # on what the caller wants: return wantarray ? @files : \@files;

    Update: Added "faster" alternative

    Update 2: Fixed code in response to merlyns bugfinding

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      opendir DIR, $dir or die "Couldn't open directory '$dir' : $!"; my @files = grep { /(.*)\.mrg$/ and not -f "$1.did" } readdir DIR; closedir DIR
      No, that's testing "-f" on a file in the current directory for a name that should be checked in a different directory. A traditional readdir mistake. {grin}

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        Which is why globing can be so lovely, as it includes the file path for you :) I'm not sure if this is any faster than Abigail-II's solution. At least this does only check for the existence of each file once rather than two checks.

        my $dir = '/path/to/dir'; print join ", ", grep { /(.*)\.mrg\z/ and not -f "$1.did" } <$dir/*\.mrg>'

Re: Finding un-paired files in a directory
by Roger (Parson) on Dec 02, 2003 at 12:10 UTC
    Here's my implementation -
    use strict; use Data::Dumper; chomp(my @files = <DATA>); # find unique basenames my @basenames = keys %{{ map { (split/\./)[0] => 1 } @files }}; # find unpaired files foreach my $basename (@basenames) { my @matches = grep (/$basename/, @files); print "$matches[0]\n" if $#matches == 0; } __DATA__ 00000001.did 00000001.mrg 00000002.did 00000002.mrg 00000003.did 00000004.mrg
    And the output -
    00000004.mrg 00000003.did
      use strict; use Data::Dumper; chomp(my @files = <DATA>); # find unique basenames my @basenames = keys %{{ map { (split/\./)[0] => 1 } @files }}; # find unpaired files foreach my $basename (@basenames) { my @matches = grep (/$basename/, @files); print "$matches[0]\n" if $#matches == 0; }
      No, this solution breaks if any filenames have regex-significant metachars, or are partially within a larger filename.

      In my opinion, the original solution is actually the most direct, as this is a classic set difference problem.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

Re: Finding un-paired files in a directory
by Abigail-II (Bishop) on Dec 02, 2003 at 13:13 UTC
    I'd use something like (untested):
    opendir my $dh => $dir or die; my %files; @files {map {lc} readdir} = (); closedir $dir; my @missing = grep {/^(.*)\.did$/s && !exists $files {"$1.mrg"} keys % +files;

    Abigail

Re: Finding un-paired files in a directory
by Anonymous Monk on Dec 02, 2003 at 19:01 UTC
    You want the "*.did" files that have no corresponding ".mrg" file, right?

    Hmmm... let me type something in here...
    use strict; use File::Find; my %temp; find( sub { s/mrg$/did/; /did$/ and $temp{$_}++; }, '.' ); my @files = grep { $temp{$_} == 1 } keys %temp; print "@files";
    This code takes advantage of the fact that you will have only one or two files with the same basename (and that there are only two extensions of interest). So, just count them up.

    It doesn't handle the case of a lone ".mrg" file. To fix that...
    use strict; use File::Find; my %temp; find( sub { m/^(.+)\.(mrg|did)/ and push(@{$temp{"$1.did"}}, $2) }, '.' ); my @files = grep { @{ $temp{$_} } == 1 and $temp{$_}->[0] eq 'did' } sort keys %temp; print "@files";
    The beauty of these code snippets (actually, they are working scripts) is that they don't do any filetests. File::Find (or readdir if you prefer) have already established the existence of the files -- checking for that again just slows down your code.

    :)

    -Dave
Re: Finding un-paired files in a directory
by Jenda (Abbot) on Dec 02, 2003 at 15:21 UTC
    opendir my $DIR, '.' or die "Can't opendir . : $!\n"; my %single; my %other = (did => 'mrg', mrg => 'did'); while (my $file = readdir $DIR) { #print "$file\n"; my ($name,$ext) = ($file =~ /^(.*)\.(did|mrg)$/) or next; if (exists $single{"$name.$other{$ext}"}) { delete $single{"$name.$other{$ext}"}; } else { $single{$file} = undef; } } print "Unpaired files:\n"; foreach my $file (keys %single) { print "\t$file\n"; }

    A bit longer than most but if the readdir gives you the filenames in order (my does), there is a lot of such files and only a minimum unpaired I believe it would be the quickest and least memory hungry.

    Update: Of course if your filesystem is case insensitive you should lowercase the filename to make sure you look for the pairs case insensitively. You may want store the original case in the %single hash in such case.

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

    Edit by castaway: Closed small tag in signature

Re: Finding un-paired files in a directory
by Necos (Friar) on Dec 03, 2003 at 09:21 UTC
    Well, I'll give this problem a go even though it has been beaten to death already. It's a bit long, but it works all the same.

    use strict; my %files; my @files; @files =<DATA>; @files = map { s/\n//; $_ } @files; close(DATA); foreach (@files) { $files{substr($_,0,(length()-4))}++; } foreach (keys %files) { print "$_ is not matched and has a count of $files{$_}\n" if $file +s{$_} == 1; } __DATA__ 001.did 001.mrg 002.did 002.mrg 003.did 004.did 005.did 005.mrg
    The output of this is:

    C:\configs>perl test.pl
    003 is not matched and has a count of 1
    004 is not matched and has a count of 1

    Of course, this code is nothing more than a modified "seen" loop. Not too fancy. One thing I should point out is that if you use  foreach (<*.mrg>), it'll do the same thing as your glob (because it is the same thing as far as I know). It's just shorter. Hope that sparks a few ideas.

    Theodore Charles III
    Network Administrator
    Los Angeles Senior High
    email->secon_kun@hotmail.com
    perl -e "map{print++$_}split//,Mdbnr;"
Re: Finding un-paired files in a directory
by mce (Curate) on Dec 03, 2003 at 12:49 UTC
    Hi,

    If this is the only thing you want to do, I would use bash

    for i in *did; do [ ! -f ${i%\.did}.mrg ] && echo $i; done
    (Don't take this post to serious, I would -of course- go for perl if I need to do more than just display the files)
    ---------------------------
    Dr. Mark Ceulemans
    Senior Consultant
    BMC, Belgium