Pulling out data from one file thats not in another

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I have two files. For the most part they hold the same data, but there are some items in one that are not in the other. I would like to pull out those items and place them in a text file for further processing. Here is the script as I have it right now:

#!/usr/local/bin/perl -w

use Exception;
use strict;
use warnings;
use English;
use FileHandle;

my $master = shift;

my $completed = shift;

open(MASTER, "$master") || die "Unable to open $master file :$!\n";
#open(COMPLETE, "$completed") || die "Unable to open $completed file :$!\n";

# compare the two files and then print out
# those in master file that are not in completed file


if (! $completed || ! -e $completed)
{
    die new Exception("Unable to  open $completed file :$!\n");
}

warn "# Reading 'COMPLETED' data";
my $hComData = getComData($completed);
warn "# Got 'COMPLETED' data: ".scalar (keys %$hComData);

# lets go though file

while(<MASTER>)
{
    my @entries = split(/\n/, $_);

    my $entry = $entries[0];

    # lets do a test print here

    #print "$entry\n";

    my $h = $hComData->{$entry};

    if(!$h)
    {
    print "$entry\n";

    }
    #else
    #{
    #print "$entry\n";
    #}


}


#################################

sub getComData
{
    my ($fIn) = @ARG;
    
    my $fh = new FileHandle($fIn)
    or die "";
    
    my $hData = {};

    my $check = 1;
    
    while (my $line = $fh->getline) 
    {
    my @cols = split(/\n/,$line);

    #print "test $cols[0]\n";         

    my $hEntry = {
        'chain' => $cols[0],
        'exists' => $check++,
        };

    #print "$check\n";

    my ($chn, $ex) = sort ($hEntry->{chain}, $hEntry->{exists});
    
    $hData->{$chn} = $hEntry;
    
    }

    
    return $hData;

}

In a nutshell the script isn't working properly and I can't see quite what I'm doing wrong. Here is a 'test' master file

1ab4A
1ao8A
1aoeA
1bjtA
1jkxA
1juvA
1mejA
1meoA
1n0uA
1obhA
1pjqA
1qzfA

and test 'completed' file

1ab4A
1aoeA
1bjtA
1obhA
1qzfA

Here is the output of the program as it is at the moment:

1ab4A
1ao8A
1jkxA
1juvA
1mejA
1meoA
1n0uA
1pjqA

This is wrong: I need to print only those items in the master file that are not in the completed file and, as you can see, that's not what is happening here (1ab4A is in both files but is still printed). All thoughts/advice much appreciated.

Re: Pulling out data from one file thats not in another by toolic (Bishop) on Apr 27, 2010 at 15:08 UTC
The problem with your code is that your `getComData` sub does not populate your hash ref as you expect it to. Prove this to yourself by printing the data structure using Data::Dumper (Tip #4 from the Basic debugging checklist): `use Data::Dumper; print Dumper($hComData);` Your sort looks very strange: `my ($chn, $ex) = sort ($hEntry->{chain}, $hEntry->{exists});` If you don't need a Hash-of-Hashes, just use a simple hash.
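To make that diagnosis concrete, here is a minimal sketch (the sub names `buggy_key` and `load_completed` are hypothetical, not from the original script). `sort` without a comparison block compares its arguments as strings, so for the first line it compares "1ab4A" with the counter value 1, and "1" sorts first. The entry is therefore stored under the key `1` rather than `1ab4A`, which is why 1ab4A shows up in the output. For the later lines of this test data the counter ("2", "3", ...) sorts after the ID, so those keys happen to come out right.

```perl
use strict;
use warnings;

# Reproduces the key selection from getComData: a string sort of the
# ID and the running counter, keeping whichever sorts first.
sub buggy_key {
    my ($chain, $counter) = @_;
    my ($key) = sort ($chain, $counter);
    return $key;
}

# A simple hash keyed directly on the chomped line is enough.
sub load_completed {
    my (@lines) = @_;
    my %seen;
    for my $line (@lines) {
        chomp $line;
        $seen{$line} = 1;
    }
    return \%seen;
}

my @completed = ("1ab4A\n", "1aoeA\n", "1bjtA\n", "1obhA\n", "1qzfA\n");
my @master    = qw(1ab4A 1ao8A 1aoeA 1bjtA);

print buggy_key('1ab4A', 1), "\n";   # prints 1, not 1ab4A
print buggy_key('1aoeA', 2), "\n";   # prints 1aoeA

my $seen = load_completed(@completed);
print "$_\n" for grep { !$seen->{$_} } @master;   # prints 1ao8A
```

Using `chomp` instead of `split(/\n/, ...)` is a design choice here: each line read from a file holds exactly one record, so there is nothing to split.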
Re: Pulling out data from one file thats not in another by kennethk (Abbot) on Apr 27, 2010 at 14:55 UTC
The simple answer for your problem would appear to be diff, a standard *nix command and available as a GUI in Windows as WinDiff. Is there a reason you cannot use these rather than reinventing the wheel? If you need to do it in Perl, this is a FAQ: How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Re^2: Pulling out data from one file thats not in another by Angharad (Pilgrim) on Apr 27, 2010 at 14:59 UTC
I tried 'diff'; the trouble there is that the items in the two files don't always appear in the same order, so it simply doesn't work. I'll take a peek at the links you suggested though.
Re^3: Pulling out data from one file thats not in another by kennethk (Abbot) on Apr 27, 2010 at 15:14 UTC
By using a hash as per the FAQ, the intersection/difference calculation will be order-independent. You will have to compare the resulting hash (called `%count` in the FAQ) against a given file's content to determine which file lacked the line in question. Note that the FAQ's code fails if either array has repeat entries. Alternatively, you can use bit operations rather than simple incrementation to encode a little extra info. The FAQ code structure is more immediately obvious, but this may do more of what you want:

#!/usr/bin/perl
use strict;
use warnings;

my $master = shift;
my $completed = shift;

open my $mh, '<', $master or die "Open fail on $master: $!";
my @master_lines = <$mh>;
chomp @master_lines;

open my $ch, '<', $completed or die "Open fail on $completed: $!";
my @completed_lines = <$ch>;
chomp @completed_lines;

my %count;

for my $element (@master_lines) {
    $count{$element} |= 1;
}
for my $element (@completed_lines) {
    $count{$element} |= 2;
}

print "$master only:\n";
for my $element (@master_lines) {
    next if $count{$element} & 2;
    print "$element\n";
}

print "$completed only:\n";
for my $element (@completed_lines) {
    next if $count{$element} & 1;
    print "$element\n";
}
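To illustrate the remark about repeat entries, here is a small sketch using made-up in-memory lists (the sub names are hypothetical, and the counting sub is only an approximation of the FAQ's approach). With plain incrementing, an item listed twice in one file reaches a count of 2 and is mistaken for an item present in both files. OR-ing in one bit per file is unaffected by repeats.

```perl
use strict;
use warnings;

# FAQ-style counting: items whose count is exactly 1 are assumed
# to occur in only one of the two lists.
sub only_one_by_count {
    my ($list_a, $list_b) = @_;
    my %count;
    $count{$_}++ for @$list_a, @$list_b;
    my @only = grep { $count{$_} == 1 } sort keys %count;
    return @only;
}

# Bitmask version: bit 1 = seen in list A, bit 2 = seen in list B.
sub only_one_by_bits {
    my ($list_a, $list_b) = @_;
    my %mask;
    $mask{$_} |= 1 for @$list_a;
    $mask{$_} |= 2 for @$list_b;
    my @only = grep { $mask{$_} != 3 } sort keys %mask;
    return @only;
}

my @master    = qw(1ab4A 1ao8A 1ao8A 1aoeA);   # 1ao8A deliberately repeated
my @completed = qw(1ab4A 1aoeA);

# Counting misses 1ao8A because its count is 2; the bitmask finds it.
print "count: @{[ only_one_by_count(\@master, \@completed) ]}\n";
print "bits:  @{[ only_one_by_bits(\@master, \@completed) ]}\n";
```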
Re^3: Pulling out data from one file thats not in another by rubasov (Friar) on Apr 27, 2010 at 15:32 UTC
There are already several tools to achieve what you want, so writing your own is probably unnecessary. A standard Unix-like solution (works under bash):

$ diff <( sort master ) <( sort completed ) | grep '^<' | cut -d ' ' -f2-

Depending on your needs you may want to use `sort -u` instead of a simple `sort`. Or if you're under some Debian-derivative distro just install the `moreutils` package and use `combine`:

$ combine master not completed
1ao8A
1jkxA
1juvA
1mejA
1meoA
1n0uA
1pjqA

Hope that helps.
Re^4: Pulling out data from one file thats not in another by choroba (Cardinal) on Apr 27, 2010 at 15:53 UTC
Re: Pulling out data from one file thats not in another by sierpinski (Chaplain) on Apr 27, 2010 at 14:57 UTC
Any reason why you can't sort them and just use 'diff'? I mean if it needs to be Perl, you could still use your OS's diff command inside of Perl. `diff file1 file2 > diff.out`
Re: Pulling out data from one file thats not in another by ig (Vicar) on Apr 27, 2010 at 16:02 UTC
I would probably do something like the following:

#!/usr/bin/perl
#
use strict;
use warnings;

(@ARGV == 2) or die "USAGE: $0 master complete\n";

my ($master, $complete) = @ARGV;

open(my $fh_complete, '<', $complete) or die "$complete: $!";
open(my $fh_master, '<', $master ) or die "$master: $!";

my %complete = map { $_=> 1 } <$fh_complete>;

my @incomplete = grep { !$complete{$_} } <$fh_master>;

print @incomplete;

update: kennethk pointed out in a private message that the requirement might be an XOR operation (to paraphrase: all lines from both files (master and complete) that are not in both files), rather than all lines in the master file that are not also in the complete file. The program above provides the latter.

update: maybe it would be clearer to say: all lines that appear in either and only one of the two files. It is difficult to be both simple and unambiguous with English.
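The two readings of the requirement can be put side by side in a short sketch. The sub names and the extra ID `9zzzZ` are invented for illustration. With the test files from the question the two readings give the same answer, because every completed item is also in the master file, so the example adds one item that exists only in the completed list.

```perl
use strict;
use warnings;

# Lines of @$master that do not occur in @$complete (one-sided difference).
sub master_only {
    my ($master, $complete) = @_;
    my %in_complete = map { $_ => 1 } @$complete;
    my @missing = grep { !$in_complete{$_} } @$master;
    return @missing;
}

# Lines that occur in exactly one of the two lists (the XOR reading).
sub either_only {
    my ($master, $complete) = @_;
    return (master_only($master, $complete), master_only($complete, $master));
}

my @master   = qw(1ab4A 1ao8A 1aoeA);
my @complete = qw(1ab4A 1aoeA 9zzzZ);   # 9zzzZ is invented for the example

print join(',', master_only(\@master, \@complete)), "\n";   # prints 1ao8A
print join(',', either_only(\@master, \@complete)), "\n";   # prints 1ao8A,9zzzZ
```

The lists here are already chomped. When reading real files, chomping both inputs first avoids a mismatch if the last line of one file lacks a trailing newline.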
Re: Pulling out data from one file thats not in another by almut (Canon) on Apr 27, 2010 at 16:31 UTC
$ perl -nle '(1..eof)?$h{$_}++:$h{$_}||print' completed master
1ao8A
1jkxA
1juvA
1mejA
1meoA
1n0uA
1pjqA

(sorry, couldn't resist golfing :)
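For readers who want the one-liner unpacked, here is an un-golfed sketch of the same idea (the sub name `not_in_first` is hypothetical). In the one-liner, `-n` loops over every line of both files and `-l` chomps input and appends a newline on `print`. `1..eof` is the scalar flip-flop operator: the constant `1` is compared against the line number `$.`, so the expression turns true on the first line and stays true until `eof` fires at the end of the first file. Since `$.` keeps counting across the files on the command line, it never equals 1 again, and every line of the second file takes the `$h{$_}||print` branch.

```perl
use strict;
use warnings;

# Remember every line of the first input, then return the lines of the
# second input that were never seen.
sub not_in_first {
    my ($first_fh, $second_fh) = @_;
    my %seen;
    while (my $line = <$first_fh>) {
        chomp $line;        # -l chomps each input line
        $seen{$line}++;     # the (1..eof) branch: still inside the first file
    }
    my @missing;
    while (my $line = <$second_fh>) {
        chomp $line;
        push @missing, $line unless $seen{$line};   # the $h{$_}||print branch
    }
    return @missing;
}

# In-memory filehandles stand in for the two files on the command line.
my $completed_text = "1ab4A\n1aoeA\n";
my $master_text    = "1ab4A\n1ao8A\n1aoeA\n";
open my $completed, '<', \$completed_text or die "in-memory open failed: $!";
open my $master,    '<', \$master_text    or die "in-memory open failed: $!";

print "$_\n" for not_in_first($completed, $master);   # prints 1ao8A
```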
Re: Pulling out data from one file thats not in another by graff (Chancellor) on Apr 27, 2010 at 18:37 UTC
The problem that the OP wants to solve is just one instance of a very general problem: doing "set arithmetic" on two lists: intersection, union, exclusive-or. I tend to encounter problems of this type so often in my work that I wrote a very general tool to address it -- you can find it here: cmpcol. It allows that one or both input lists may actually be flat tables (possibly with varied field delimiters), where just one or more columns are of interest for doing the set-arithmetic.
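The idea of comparing on a single column of a flat table can be sketched as follows. This is not cmpcol's code or interface; the sub name, the 0-based column arguments, the whitespace delimiter, and the sample rows are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Returns the rows of @$rows_a whose key column does not appear as a key
# in @$rows_b. Columns are 0-based; fields are split on whitespace.
sub rows_missing_by_column {
    my ($rows_a, $col_a, $rows_b, $col_b) = @_;
    my %keys_b;
    for my $row (@$rows_b) {
        my @fields = split ' ', $row;
        $keys_b{ $fields[$col_b] } = 1 if defined $fields[$col_b];
    }
    my @missing = grep {
        my @fields = split ' ', $_;
        defined $fields[$col_a] && !$keys_b{ $fields[$col_a] };
    } @$rows_a;
    return @missing;
}

# Invented sample rows: the ID sits in a different column in each table.
my @master    = ('1ab4A 120', '1ao8A 98', '1aoeA 77');
my @completed = ('done 1ab4A', 'done 1aoeA');

# Prints the one master row whose ID is absent from @completed: "1ao8A 98"
print "$_\n" for rows_missing_by_column(\@master, 0, \@completed, 1);
```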
