Re: [OT] Tar file with non-identical duplicate files and no paths?

I don't know if this is any help at all but I had to write a script to restore files from archives supposedy written in tar format on a Pr1me mini. The trouble was, the utility had truncated all of the filenames to some arbitrary length so there were loads of duplicates to resolve.
#!/usr/local/bin/perl -w
#
# Script requires two arguments, the name of the tar archive and the
# name of the directory into which the archive should be restored.
#
if($#ARGV != 1) {die "Usage: $0 tar-file dest-dir\n"}
$tar_file = shift @ARGV;
$dest_dir = shift @ARGV;
#
# Check that we can open the archive and that the directory is valid. 
+Open
# a log file to record events.
#
use FileHandle;
open(TAR, "$tar_file") || die "Can't open tar archive\n";
die "$dest_dir is not a directory\n" unless(-d $dest_dir);
open(LOG, ">extract_log") || die "Can't open log file\n";
LOG->autoflush(1);
#
# Unset flags 'cos we haven't got to end of archive yet and we haven't
# found a bad checksum. Set up string values to check that we have
# reached the end of data for each file in the archive.
#
$eot = 0;
$bad_checksum = 0;
$eodata_str1 = pack "CC", 0, 128;
$eodata_str2 = pack "CC", 128, 128;
$eodata_str3 = pack "C", 0;
$eodata_str4 = pack "C", 128;
#
# Set flag so we look for a header at beginning of archive, initialise
# hash of objects in archive with duplicate names (because the utility
# on the Prime mini that produced the archives truncated filenames
# after 14 characters). Archive written in blocks of 512 bytes so
# read each block into a buffer. Fatal error if unable to read 512 byt
+es.
#
$need_hdr = 1;
%dups = ();
while(($len = read TAR, $buf, 512) && ! $bad_checksum)
{
    die "Error reading archive\n" if $len != 512;
    #
    # Looking for header?
    #
    if($need_hdr)
    {
        #
        # Archive terminates with two blocks of Ascii NULLs, wind up
        # if second consecutive NULL block found when a header block i
+s
        # expected, set flag if first block. Blocks of NULLs in data d
+o
        # not matter as the data is copied out until the byte count in
        # the header is satisfied.
        #
        unless(unpack "C512", $buf)
        {
            if($eot)
            {
                print LOG "\nEnd of archive reached\n\n";
                last;                         # Finished reading block
+s.
            }
            else
            {
                $eot = 1;
                next;                         # Get next block.
            }
        }
        #
        # Name of object in archive is held as a C-style NULL terminat
+ed
        # string in the first 100 bytes of the header. Strip NULL onwa
+rds
        # and strip any trailing "/" because, unlike UNIX tar files, t
+he
        # Prime version does not seem to have separate entries for
        # directories.
        #
        $object = substr($buf, 0, 100);
        $object =~ s/\x00.*//;
        $object =~ s/\/$//;
        #
        # Size of file held as octal value.
        #
        $size = oct(substr($buf, 124, 11));
        #
        # Header checksum held as octal vaue. Calculated as sum of Asc
+ii
        # values of all bytes in the header assuming that the checksum
        # field is all blanks (spaces). Save checksum and replace with
        # spaces. Initialse calculated checksum, split header into
        # individual characters and sum their ordinal values.
        #
        $chksum = substr($buf, 147, 8);
        substr($buf, 147, 8) = "        ";
        $calc_chksum = 0;
        @hdr_chars = split('', $buf);
        foreach $char (@hdr_chars)
        {
            $calc_chksum += ord($char);
        }
        #
        # Print name and size of object to log file, print indication
        # of whether checksum was OK or not. If it was bad we don't
        # want to carry on because data could also be corrupt.
        #
        print LOG "$object <-> $size ";
        if($calc_chksum == oct($chksum))
        {
            print LOG "CHKOK ";
        }
        else
        {
            print LOG "CHKNO \n";
            $bad_checksum = 1;
            last;
        }
        #
        # Initialise target directory from destination directory given
+ as
        # an argument. Split object into path elements on '/', pop fil
+e
        # name off end as last element.
        #
        $target_dir = $dest_dir;
        @tree = split('/', $object);
        $file = pop(@tree);
        #
        # Check that file name is not "." or "..", apparently legal in
        # Primos, since we found one. Change to "__dot" or "__dotdot"
        # which are less significant under Unix. Hopefully, there will
        # not be any genuine files called either of those.
        #
        if($file eq ".")
        {
            $file = "__dot";
            print LOG "ill-> $file ";
        }
        if($file eq "..")
        {
            $file = "__dotdot";
            print LOG "ill-> $file ";
        }
        #
        # Get directory elements one at a time concatenating them to t
+he
        # target directory, make non-existent directories if necessary
+.
        #
        while($dir = shift(@tree))
        {
            $target_dir .= "/";
            $target_dir .= $dir;
            mkdir("$target_dir", 0755) unless(-x "$target_dir");
        }
        #
        # Because of the Prime filename truncation problem, check if t
+he
        # object we are about to restore already exists (the target
        # directory specified in the arguments should always be empty
        # as a precaution). If it does exist, increment the value in
        # the hash indexed by object; use value to differentiate names
+.
        #
        if(-e "$target_dir/$file")
        {
            $dups{"$object"}++;
            $file .= "__dup";
            $file .= $dups{"$object"};
            print LOG "dup-> $file ";
        }
        #
        # Open file to be extracted from archive (will create). If the
        # size field in header indicates that data will follow, unset
        # the flag because next block should no be a header. If size i
+s
        # zero however, file was empty so print message and close file
        # again, leaving flag alone.
        #
        open(OUT, ">$target_dir/$file") || die "Couldn't open output\n
+";
        if($size)
        {
            $need_hdr = 0;
        }
        else
        {
            print LOG "EMPTY\n";
            close(OUT) || die "Couldn't close output\n";
        }
    }
    else
    {
        #
        # This is a data block.
        #
        if($size < 512)
        {
            #
            # Are there less than 512 bytes of data remaining to be
            # extracted from archive? Print only required bytes to
            # file using "substr". Check that next pair of bytes (or
            # single byte if 511 bytes were written) match possible
            # strings defined above. Message either way, close file.
            # Set header needed flag.
            #
            print OUT substr($buf, 0, $size);
            if((index $buf, $eodata_str1, $size) == $size ||
               (index $buf, $eodata_str2, $size) == $size ||
               ($size == 511 &&
               ((pack "C", ord(substr($buf, 511, 1))) eq $eodata_str3 
+||
               (pack "C", ord(substr($buf, 511, 1))) eq $eodata_str4))
+ ||
               $size == 0)
            {
                print LOG "OK\n";
            }
            else
            {
                print LOG "NO\n";
            }
            close(OUT) || die "Couldn't close output\n";
            $need_hdr = 1;
        }
        else
        {
            #
            # Print the whole of the buffer to the extracted file. If
            # bytes remaining after being decremented is zero this was
            # last block and was all data. Print message, close file
            # and set header needed flag.
            #
            print OUT $buf;
            $size -= 512;
            unless($size)
            {
                print LOG "OK\n";
                close(OUT) || die "Couldn't close output\n";
                $need_hdr = 1;
            }
        }
    }
}
#
# If there are no more blocks to read but the last file being restored
# was not completed, put out error message.
#
unless($need_hdr)
{
    print LOG "FAILED - $size bytes lost\n\n",
       "ERROR - tar archive terminated abnormally\n\n";
    print "ERROR - tar archive terminated abnormally\n",
       "$object truncated, $size bytes lost\n";
}
#
# If the checksum of the last header processed was bad, print some
# indication of the fact.
#
if($bad_checksum)
{
    print LOG "\nERROR - header checksum failed, aborting\n\n";
    print "ERROR - header checksum failed, $object\n";
}
#
# Close tar archive, examine hash containing the objects that had
# duplicate names. Second and subsequent duplicate objects had "__dup1
+",
# "__dup2" etc. appended. Now rename the original with "__dup0" append
+ed.
# Print some stats
#
close(TAR) || die "Can't close tar archive\n";
foreach $orig (sort keys %dups)
{
    $new = $orig . "__dup0";
    print LOG "$orig had $dups{\"$orig\"} duplicate names, renaming to
+ $new ";
    if(rename "$dest_dir/$orig", "$dest_dir/$new")
    {
        print LOG "OK\n";
    }
    else
    {
        print LOG "FAILED: $!\n";
    }
}
close(LOG);
[download]
Obviously this was written for the peculiarities of the Pr1me utility but you might be able to adapt it. It was written a long time ago before I had heard of strictures etc.
I hope this is of use.
Cheers,
JohnGG
Comment on Re: [OT] Tar file with non-identical duplicate files and no paths? Download Code

Syntactic Confectionery Delight
	PerlMonks