Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: [OT] Tar file with non-identical duplicate files and no paths?

by johngg (Canon)
on Sep 13, 2008 at 10:50 UTC ( [id://711085]=note: print w/replies, xml ) Need Help??


in reply to [OT] Tar file with non-identical duplicate files and no paths?

I don't know if this is any help at all but I had to write a script to restore files from archives supposedy written in tar format on a Pr1me mini. The trouble was, the utility had truncated all of the filenames to some arbitrary length so there were loads of duplicates to resolve.

#!/usr/local/bin/perl -w # # Script requires two arguments, the name of the tar archive and the # name of the directory into which the archive should be restored. # if($#ARGV != 1) {die "Usage: $0 tar-file dest-dir\n"} $tar_file = shift @ARGV; $dest_dir = shift @ARGV; # # Check that we can open the archive and that the directory is valid. +Open # a log file to record events. # use FileHandle; open(TAR, "$tar_file") || die "Can't open tar archive\n"; die "$dest_dir is not a directory\n" unless(-d $dest_dir); open(LOG, ">extract_log") || die "Can't open log file\n"; LOG->autoflush(1); # # Unset flags 'cos we haven't got to end of archive yet and we haven't # found a bad checksum. Set up string values to check that we have # reached the end of data for each file in the archive. # $eot = 0; $bad_checksum = 0; $eodata_str1 = pack "CC", 0, 128; $eodata_str2 = pack "CC", 128, 128; $eodata_str3 = pack "C", 0; $eodata_str4 = pack "C", 128; # # Set flag so we look for a header at beginning of archive, initialise # hash of objects in archive with duplicate names (because the utility # on the Prime mini that produced the archives truncated filenames # after 14 characters). Archive written in blocks of 512 bytes so # read each block into a buffer. Fatal error if unable to read 512 byt +es. # $need_hdr = 1; %dups = (); while(($len = read TAR, $buf, 512) && ! $bad_checksum) { die "Error reading archive\n" if $len != 512; # # Looking for header? # if($need_hdr) { # # Archive terminates with two blocks of Ascii NULLs, wind up # if second consecutive NULL block found when a header block i +s # expected, set flag if first block. Blocks of NULLs in data d +o # not matter as the data is copied out until the byte count in # the header is satisfied. # unless(unpack "C512", $buf) { if($eot) { print LOG "\nEnd of archive reached\n\n"; last; # Finished reading block +s. } else { $eot = 1; next; # Get next block. } } # # Name of object in archive is held as a C-style NULL terminat +ed # string in the first 100 bytes of the header. Strip NULL onwa +rds # and strip any trailing "/" because, unlike UNIX tar files, t +he # Prime version does not seem to have separate entries for # directories. # $object = substr($buf, 0, 100); $object =~ s/\x00.*//; $object =~ s/\/$//; # # Size of file held as octal value. # $size = oct(substr($buf, 124, 11)); # # Header checksum held as octal vaue. Calculated as sum of Asc +ii # values of all bytes in the header assuming that the checksum # field is all blanks (spaces). Save checksum and replace with # spaces. Initialse calculated checksum, split header into # individual characters and sum their ordinal values. # $chksum = substr($buf, 147, 8); substr($buf, 147, 8) = " "; $calc_chksum = 0; @hdr_chars = split('', $buf); foreach $char (@hdr_chars) { $calc_chksum += ord($char); } # # Print name and size of object to log file, print indication # of whether checksum was OK or not. If it was bad we don't # want to carry on because data could also be corrupt. # print LOG "$object <-> $size "; if($calc_chksum == oct($chksum)) { print LOG "CHKOK "; } else { print LOG "CHKNO \n"; $bad_checksum = 1; last; } # # Initialise target directory from destination directory given + as # an argument. Split object into path elements on '/', pop fil +e # name off end as last element. # $target_dir = $dest_dir; @tree = split('/', $object); $file = pop(@tree); # # Check that file name is not "." or "..", apparently legal in # Primos, since we found one. Change to "__dot" or "__dotdot" # which are less significant under Unix. Hopefully, there will # not be any genuine files called either of those. # if($file eq ".") { $file = "__dot"; print LOG "ill-> $file "; } if($file eq "..") { $file = "__dotdot"; print LOG "ill-> $file "; } # # Get directory elements one at a time concatenating them to t +he # target directory, make non-existent directories if necessary +. # while($dir = shift(@tree)) { $target_dir .= "/"; $target_dir .= $dir; mkdir("$target_dir", 0755) unless(-x "$target_dir"); } # # Because of the Prime filename truncation problem, check if t +he # object we are about to restore already exists (the target # directory specified in the arguments should always be empty # as a precaution). If it does exist, increment the value in # the hash indexed by object; use value to differentiate names +. # if(-e "$target_dir/$file") { $dups{"$object"}++; $file .= "__dup"; $file .= $dups{"$object"}; print LOG "dup-> $file "; } # # Open file to be extracted from archive (will create). If the # size field in header indicates that data will follow, unset # the flag because next block should no be a header. If size i +s # zero however, file was empty so print message and close file # again, leaving flag alone. # open(OUT, ">$target_dir/$file") || die "Couldn't open output\n +"; if($size) { $need_hdr = 0; } else { print LOG "EMPTY\n"; close(OUT) || die "Couldn't close output\n"; } } else { # # This is a data block. # if($size < 512) { # # Are there less than 512 bytes of data remaining to be # extracted from archive? Print only required bytes to # file using "substr". Check that next pair of bytes (or # single byte if 511 bytes were written) match possible # strings defined above. Message either way, close file. # Set header needed flag. # print OUT substr($buf, 0, $size); if((index $buf, $eodata_str1, $size) == $size || (index $buf, $eodata_str2, $size) == $size || ($size == 511 && ((pack "C", ord(substr($buf, 511, 1))) eq $eodata_str3 +|| (pack "C", ord(substr($buf, 511, 1))) eq $eodata_str4)) + || $size == 0) { print LOG "OK\n"; } else { print LOG "NO\n"; } close(OUT) || die "Couldn't close output\n"; $need_hdr = 1; } else { # # Print the whole of the buffer to the extracted file. If # bytes remaining after being decremented is zero this was # last block and was all data. Print message, close file # and set header needed flag. # print OUT $buf; $size -= 512; unless($size) { print LOG "OK\n"; close(OUT) || die "Couldn't close output\n"; $need_hdr = 1; } } } } # # If there are no more blocks to read but the last file being restored # was not completed, put out error message. # unless($need_hdr) { print LOG "FAILED - $size bytes lost\n\n", "ERROR - tar archive terminated abnormally\n\n"; print "ERROR - tar archive terminated abnormally\n", "$object truncated, $size bytes lost\n"; } # # If the checksum of the last header processed was bad, print some # indication of the fact. # if($bad_checksum) { print LOG "\nERROR - header checksum failed, aborting\n\n"; print "ERROR - header checksum failed, $object\n"; } # # Close tar archive, examine hash containing the objects that had # duplicate names. Second and subsequent duplicate objects had "__dup1 +", # "__dup2" etc. appended. Now rename the original with "__dup0" append +ed. # Print some stats # close(TAR) || die "Can't close tar archive\n"; foreach $orig (sort keys %dups) { $new = $orig . "__dup0"; print LOG "$orig had $dups{\"$orig\"} duplicate names, renaming to + $new "; if(rename "$dest_dir/$orig", "$dest_dir/$new") { print LOG "OK\n"; } else { print LOG "FAILED: $!\n"; } } close(LOG);

Obviously this was written for the peculiarities of the Pr1me utility but you might be able to adapt it. It was written a long time ago before I had heard of strictures etc.

I hope this is of use.

Cheers,

JohnGG

  • Comment on Re: [OT] Tar file with non-identical duplicate files and no paths?
  • Download Code

Replies are listed 'Best First'.
Re^2: [OT] Tar file with non-identical duplicate files and no paths?
by RMGir (Prior) on Sep 13, 2008 at 12:21 UTC
    Now that deserves all the ++'s it can get!

    Mike

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://711085]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (3)
As of 2024-04-24 23:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found