Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hello dear Monks,
I need to rename many files (~5,000) in a directory. They have the following structure:
8_2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf
and I need to strip off the first 7 numbers (userID and then year-month-day-hour-minute-second).
The challenge here is that some of them might actually have the same filename remaining after this operation, like the following example:
8_2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf
8_2007_5_22_22_34_12_Table_-_2007522_XYZ_W3.pdf
So, in this case, I would like to keep only the most recent one. To do this, I can use the following simple code I wrote:
use Date::Parse;
$file1 = '2019-3-25T14:40:22';
$file2 = '2019-6-12T22:34:8';
if (str2time( $file1 ) - str2time( $file2 ))
{
print "$file1 is more recent than $file2\n";
}
after transforming the dates into the above-mentioned command.
My problem is, how to go about finding and comparing the ones that actually have the same filename, after stripping out the unnecessary information from the beginning - so that I know which ones I need to compare.
any help will be greatly appreciated.
Thanks!
Re: Batch file renaming - on identical name, keep only most recent file, based on dates (updated)
by haukex (Archbishop) on Nov 05, 2019 at 14:39 UTC
|
Here's how I might have coded this. Disclaimer: I've only done a small amount of testing on this, use at your own risk!
#!/usr/bin/env perl
use warnings;
use strict;
use Time::Piece;
use File::Spec::Functions qw/ no_upwards catfile /;
use Getopt::Long qw/ HelpMessage :config posix_default gnu_compat
bundling auto_help /;
use Data::Dumper;
$Data::Dumper::Quotekeys=0;
$Data::Dumper::Useqq=$Data::Dumper::Sortkeys=1;
=head1 SYNOPSIS
myrename.pl [OPTIONS] PATH
OPTIONS:
-r | --run - Actually perform actions
-v | --verbose - With --run, report actions
-q | --quiet - Suppress warning messages
-d | --debug - Enable debugging (overrides -v and -q)
=cut
GetOptions(
'r|run' => \( my $RUN ),
'v|verbose' => \( my $VERBOSE ),
'q|quiet' => \( my $QUIET ),
'd|debug' => \( my $DEBUG ),
version => sub { print q$myrename.pl v0.01$,"\n"; exit },
) or HelpMessage(-exitval=>255);
HelpMessage(-exitval=>255) unless @ARGV==1;
if ( $DEBUG ) { $VERBOSE=1; $QUIET=0; }
my $PATH = $ARGV[0];
print STDERR Data::Dumper->Dump([$PATH],['PATH']) if $DEBUG;
opendir my $dh, $PATH or die "$PATH: $!";
my @FILES = sort grep { -f catfile($PATH,$_) } no_upwards readdir $dh;
closedir $dh;
print STDERR Data::Dumper->Dump([\@FILES],['*FILES']) if $DEBUG;
my %files;
FILE: for my $origfile (@FILES) {
my ($uid,$time,$file) = $origfile =~ /\A(\d+_)((?:\d+_){6})(.+)\z/
or do { warn "No match, skipping $origfile\n"
unless $QUIET; next FILE };
print STDERR Data::Dumper->Dump([$uid,$time,$file],
[qw/uid time file/]) if $DEBUG;
$time = Time::Piece->strptime($time, '%Y_%m_%d_%H_%M_%S_')->epoch;
push @{ $files{$file} }, { origfile => $origfile, time=>$time };
}
@$_ = sort { $b->{time} <=> $a->{time} } @$_ for values %files;
print STDERR Data::Dumper->Dump([\%files],['*files']) if $DEBUG;
for my $file (sort keys %files) {
my $keep = shift @{ $files{$file} };
my $srcfile = catfile($PATH,$keep->{origfile});
my $dstfile = catfile($PATH,$file);
print "Rename $srcfile to $dstfile\n" if !$RUN || $VERBOSE;
die "Destination file exists: $dstfile\n" if -e $dstfile;
# NOTE: There is a possible race condition between -e and rename
if ($RUN) {
rename($srcfile, $dstfile)
or die "rename($srcfile, $dstfile): $!";
}
for my $drop ( @{ $files{$file} } ) {
my $dropfile = catfile($PATH,$drop->{origfile});
print "Drop $dropfile\n" if !$RUN || $VERBOSE;
if ($RUN) {
unlink($dropfile) or die "unlink($dropfile): $!";
}
}
}
warn "This was a dry-run, no actions performed\n" unless $RUN;
For a set of files ( "2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf", "8_2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf", "8_2007_5_22_15_34_23_Table_-_2008522_XYZ_W3.pdf", "8_2007_5_22_22_34_12_Table_-_2007522_XYZ_W3.pdf" ), the output is:
No match, skipping 2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf
Rename x/8_2007_5_22_22_34_12_Table_-_2007522_XYZ_W3.pdf to x/Table_-_
+2007522_XYZ_W3.pdf
Drop x/8_2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf
Rename x/8_2007_5_22_15_34_23_Table_-_2008522_XYZ_W3.pdf to x/Table_-_
+2008522_XYZ_W3.pdf
Update: I guess a few words of explanation would be helpful. First off, note that this loads the entire list of files into memory, but with 5k files, I think that should be fine. Most of the first half of the script is just setting up and reading the list of files from the directory. The interesting stuff happens in the %files hash: it's a hash where the keys are the target filenames, and each value is an array of the original filenames, along with their datetimes parsed into UNIX timestamps (use the --debug switch to see the data structures). This allows me to simply sort each list of files (the @$_ = sort ... step) so that the first element of each array is the latest filename. Then, I loop over all the files again, taking the first element of each array as the file to keep and rename to the target filename, and I delete all the other files. I hope this makes sense, and feel free to ask if anything is unclear. (Note I used core modules only.) | [reply] [d/l] [select] |
Re: Batch file renaming - on identical name, keep only most recent file, based on dates
by Fletch (Bishop) on Nov 05, 2019 at 14:12 UTC
|
my @candidate_files = obtain_candidates( ); ## File::Find::Rule, opend
+ir/readdir, ...
my %seen_files;
my @to_delete;
for my $file ( @candidate_files ) {
## regex out, use str2date, yadda yadda yadda
my( $real_file, $date ) = parse_filename( $file );
if( not exists $seen_files{ $real_file } ) {
$seen_files{ $real_file } = [ $file, $date ];
} else {
my $prior_date = $seen_files{ $real_file }->[1];
if( $date - $prior_date > 0 ) {
$seen_files{ $real_file } = [ $file, $date ];
} else {
push @to_delete, $file;
}
}
}
process_deletions( @to_delete );
The cake is a lie.
The cake is a lie.
The cake is a lie.
| [reply] [d/l] |
Re: Batch file renaming - on identical name, keep only most recent file, based on dates
by siberia-man (Friar) on Nov 05, 2019 at 16:41 UTC
|
| [reply] |
Re: Batch file renaming - on identical name, keep only most recent file, based on dates
by jcb (Parson) on Nov 06, 2019 at 02:13 UTC
|
That you are stripping off a userID field seems suspicious to me. Are you sure that you want to conflate files from different users?
| [reply] |
Re: Batch file renaming - on identical name, keep only most recent file, based on dates
by Lotus1 (Vicar) on Nov 06, 2019 at 04:55 UTC
|
jcb beat me to the point about collisions from different users. (Edit: it's late here and I got things reversed in my head. No need for the reverse for the sort.) Here is a simple solution that works except it will clobber any collisions between users with the user with the lowest highest number. This solution takes a reverse sort of the filenames and then uses File::Copy::move to rename them so that the newest collision file is the last one to be renamed. If you specify what to do by user collisions this could be modified to watch for the user number.
Perl has a built in rename function but the documentation suggests using File::Copy::move since it is more portable across operating systems.
use warnings;
use strict;
use File::Copy;
use File::Glob ':glob';
mkdir 'test' unless -d 'test';
#foreach( reverse sort glob( "*_Table_*.pdf" ) ){
foreach( sort glob( "*_Table_*.pdf" ) ){
print "$_\n";
my $newname = $_;
$newname =~ s/^[0-9_]+//;
print "--$newname\n";
## using copy for testing.
copy $_, "./test/$newname" or print "Error copying <$_> $!\n";
#move $_, "$newname" or print "Error renaming <$_> $!\n";
}
__DATA__
test files:
8_2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf
8_2007_5_22_22_34_12_Table_-_2007522_XYZ_W3.pdf
7_2007_5_22_16_35_23_Table_-_2007522_XYZ_W3.pdf
7_2007_5_22_23_36_12_Table_-_2007522_XYZ_W3.pdf
output file:
Table_-_2007522_XYZ_W3.pdf
Edit: I forgot to add the program output. Also, I had an extra file in my test files.
8_2007_5_22_22_34_12_Table_-_2007522_XYZ_W3.pdf
--Table_-_2007522_XYZ_W3.pdf
8_2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf
--Table_-_2007522_XYZ_W3.pdf
7_2007_5_22_22_34_12_Table_-_2007522_XYZ_W3.pdf
--Table_-_2007522_XYZ_W3.pdf
7_2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf
--Table_-_2007522_XYZ_W3.pdf
7_2007_12_22_15_34_23_Table_-_20071222_XYZ_W3.pdf
--Table_-_20071222_XYZ_W3.pdf
Edit: updated output without reverse sort.
7_2007_12_22_15_34_23_Table_-_20071222_XYZ_W3.pdf
--Table_-_20071222_XYZ_W3.pdf
7_2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf
--Table_-_2007522_XYZ_W3.pdf
7_2007_5_22_22_34_12_Table_-_2007522_XYZ_W3.pdf
--Table_-_2007522_XYZ_W3.pdf
8_2007_5_22_15_34_23_Table_-_2007522_XYZ_W3.pdf
--Table_-_2007522_XYZ_W3.pdf
8_2007_5_22_22_34_12_Table_-_2007522_XYZ_W3.pdf
--Table_-_2007522_XYZ_W3.pdf
| [reply] [d/l] [select] |
|
| [reply] [d/l] [select] |
|
8_2007_10_22_15_34_23_Table_-_20071022_XYZ_W3.pdf
8_2007_9_22_15_34_23_Table_-_2007922_XYZ_W3.pdf
The new filenames for these two will be:
Table_-_20071022_XYZ_W3.pdf
Table_-_2007922_XYZ_W3.pdf
Having the sorting correct only matters among the filenames that start with the same YYYY?MDD dates. The HH_mm_ss format seems to be consistent and therefore sortable with the default sort. I'm assuming that from the limited example data we were given.
| [reply] [d/l] [select] |
|
|
|
Re: Batch file renaming - on identical name, keep only most recent file, based on dates
by Anonymous Monk on Nov 05, 2019 at 14:23 UTC
|
Ok, so I brought it here:
I first get all the filenames into one text file and then I run the following, on the file:
use strict;
use warnings;
use Date::Parse;
my @AoA_files=();
open IN, $ARGV[0];
while(<IN>)
{
my $infile=$_;
chomp $infile;
if($infile=~/^(\d+)\_(\d+)_(\d+)_(\d+)_(\d+)_(\d+)_(\d+)_(.*)/)
{
my $uid=$1;
my $year=$2;
my $month=$3;
my $day=$4;
my $hour=$5;
my $minute=$6;
my $second=$7;
my $actual_filename=$8;
my $timestamp = "$year-$month-$day"."T"."$hour:$minute:$second
+";
if ( grep { $_->[1] eq $actual_filename } @AoA_files ) #if fil
+ename exists, keep only the most recent one
{
#compare the timestamps
my $existing_timestamp = $_->[0];
my $difference_in_seconds = str2time($timestamp) - str2tim
+e($existing_timestamp);
if($difference_in_seconds > 0) #current date is newer than
+ the one in the AoA - use this one
{
push @AoA_files, [ $timestamp, $actual_filename];
}
}
else #add it normally to the AoA
{
push @AoA_files, [ $timestamp, $actual_filename];
}
}
}
My idea here is to eventually create a second file with the full filenames that I need to keep.
Does that make sense? | [reply] [d/l] |
|
- ALWAYS check the return value from open and show the error: open( ... ) or die qq{Problem opening '$ARGV[0]': $!\n};
- If you'd use a hash like I suggested rather than an AoA it'd be more efficient; you're doing a linear search over all of your filenames repeatedly whereas a hash can just do exists
Those nitpicks aside, seeing as that's basically my suggestion above . . . looks sane. (Sanity not guaranteed. Contents may have settled during shipping. Prices may be higher in AK and HI.)
Edit: tweaked wording of second item slightly. Me no make sense much early morning do.
The cake is a lie.
The cake is a lie.
The cake is a lie.
| [reply] [d/l] |
Re: Batch file renaming - on identical name, keep only most recent file, based on dates
by shadowsong (Pilgrim) on Nov 06, 2019 at 23:02 UTC
|
Hi Anon
Here's another approach
#!perl -sl
use strict;
use warnings FATAL => 'all';
my $PATH = "c:/code/test";
my $processed_hash_ref;
chdir $PATH; # filetests using relative path
opendir DH, "." or die "Couldn't open directory $PATH: $!";
while ($_ = readdir (DH)) {
next unless -f $_
&& m/^\d_\d{4}_\d\d?_\d\d?_\d\d?_\d\d?_\d\d?_table/i;
my $fname = join ('_', (split /_/)[ 7..11 ]);
# already seen a more recent file with the name we're
# planning to use? skip current file
next if defined $processed_hash_ref->{$fname} &&
$processed_hash_ref->{$fname}->{AGE} < -M _;
# file safe to be renamed... for now
$processed_hash_ref->{$fname}->{OLDNAME} = $_;
$processed_hash_ref->{$fname}->{AGE} = -M _;
}
closedir DH;
# now proceed to rename items
map {
print "[!] rename $processed_hash_ref->{$_}->{OLDNAME} to $_";
rename $processed_hash_ref->{$_}->{OLDNAME}, $_;
} keys %$processed_hash_ref;
Cheers Shadow | [reply] [d/l] |
|
-M _
I think it's very much worth noting that this relies on the file's modification time being the same as the timestamp in the filename, which is IMHO is pretty risky. For example, if the timestamp in the filename is the time a log file was started, its modification time may very well be much later than that. Or, if the files were copied into this directory from somewhere else, the modification times may be completely unreliable.
| [reply] [d/l] |
|
++, but the timestamp in the original file name doesn't necessarily refer to creation/update of the corresponding file, either.
| [reply] |
Re: Batch file renaming - on identical name, keep only most recent file, based on dates
by shadowsong (Pilgrim) on Dec 19, 2019 at 00:45 UTC
|
For posterity,
Here's another approach that avoids loading the entire list in memory and also neglects the use of -M (it relies specifically on the naming convention of the candidate target files in OP's description).
hauxex points out some caveats using the -M file tests that could render its use unreliable. Gabor Szabo also highlights its weirdness.
Sharing the code below to illustrate the idea behind the computation of the psuedo_mtime field which was used for comparisons.
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use Getopt::Long qw ( :config no_ignore_case bundling );
use Pod::Usage qw ( pod2usage );
# parse options
my (%opt,$dir);
GetOptions(\%opt, "help|h", "verbose|v", "test|t", "version", "directo
+ry|d=s" => \$dir) ||
pod2usage(-verbose => 0);
pod2usage(-verbose => 1, -exitval => 1) if $opt{'help'};
pod2usage(-verbose => 99, -exitval => 0, -sections => [qw(VERSION)]) i
+f $opt{'version'};
pod2usage(-verbose => 0, -message => "$0: No directory given\n") if !
+$dir;
my @AGE_RATIONALISER = (31104000,2592000,86400,3600,60,1); # yy,mm,dd,
+hh,mm,ss converted to secs
my $processed_hash_ref;
opendir DH, $dir or die "Couldn't open directory $dir: $!";
chdir $dir; # filetests using relative path
while ($_ = readdir (DH)) {
next unless -f $_
&& m/^\d_\d{4}_\d\d?_\d\d?_\d\d?_\d\d?_\d\d?_table/i;
my @old_fname = split /_/;
my $new_fname = join ('_', @old_fname[ 7..11 ]);
my $psuedo_mtime = 0; # divine a fake but consistent mtime f
+rom the filename
$psuedo_mtime += $old_fname[$_+1] * $AGE_RATIONALISER[$_] for (0..
+$#AGE_RATIONALISER);
# already seen a more recent file with the name we're
# planning to use? skip current file
next if defined $processed_hash_ref->{$new_fname} &&
$processed_hash_ref->{$new_fname}->{AGE} > $psuedo_mtime;
# file safe to be renamed... for now
$processed_hash_ref->{$new_fname}->{OLDNAME} = $_;
$processed_hash_ref->{$new_fname}->{AGE} = $psuedo_mtime;
}
closedir DH;
# now proceed to rename items
map {
print "[!] rename $processed_hash_ref->{$_}->{OLDNAME} to $_\n" if
+ $opt{'verbose'};
rename $processed_hash_ref->{$_}->{OLDNAME}, $_ unless $opt{'test'
+};
} keys %$processed_hash_ref;
__END__
=head1 NAME
=over 4
rename.pl - file renaming utility
=back
=head1 SYNOPSIS
=over 4
rename.pl [-t | --test] {-d log-file-directory}
rename.pl [-v | --verbose] {-d log-file-directory}
rename.pl [-h | --help]
rename.pl [--version]
=back
=head1 OPTIONS
=over 4
=item B<-t, --test>
Test run - no renaming takes place - effectively a simulation
=item B<-v, --verbose>
Verbose - printout of files to be renamed
=item B<-h, --help>
Help - prints this manual
=back
=head1 DESCRIPTION
B<This program> will read the given directory and rename a
contingent of the files within said directory. Files to be
renamed that share the same target name - will have only
the most recent one renamed.
=head1 AUTHORS
Shadowsong
=head1 VERSION
0.0.1-beta
=cut
| [reply] [d/l] [select] |
|
|