PerlMonks  

Pulling out data from one file thats not in another

by Angharad (Pilgrim)
on Apr 27, 2010 at 14:39 UTC

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

Hi there. I have two files. For the most part they hold the same data, but there are some items in one that are not in the other. I would like to pull out those items and place them in a text file for further processing. Here is the script as I have it right now:
#!/usr/local/bin/perl -w
use Exception;
use strict;
use warnings;
use English;
use FileHandle;

my $master = shift;
my $completed = shift;

open(MASTER, "$master") || die "Unable to open $master file :$!\n";
#open(COMPLETE, "$completed") || die "Unable to open $completed file :$!\n";

# compare the two files and then print out
# those in master file that are not in completed file

if (! $completed || ! -e $completed)
{
    die new Exception("Unable to open $completed file :$!\n");
}

warn "# Reading 'COMPLETED' data";
my $hComData = getComData($completed);
warn "# Got 'COMPLETED' data: ".scalar (keys %$hComData);

# lets go though file
while(<MASTER>)
{
    my @entries = split(/\n/, $_);
    my $entry = $entries[0];
    # lets do a test print here
    #print "$entry\n";
    my $h = $hComData->{$entry};
    if(!$h)
    {
        print "$entry\n";
    }
    #else
    #{
        #print "$entry\n";
    #}
}

#################################
sub getComData
{
    my ($fIn) = @ARG;
    my $fh = new FileHandle($fIn) or die "";
    my $hData = {};
    my $check = 1;
    while (my $line = $fh->getline)
    {
        my @cols = split(/\n/,$line);
        #print "test $cols[0]\n";
        my $hEntry = {
            'chain' => $cols[0],
            'exists' => $check++,
        };
        #print "$check\n";
        my ($chn, $ex) = sort ($hEntry->{chain}, $hEntry->{exists});
        $hData->{$chn} = $hEntry;
    }
    return $hData;
}
In a nutshell, the script isn't working properly and I can't quite see what I'm doing wrong. Here is a 'test' master file:
1ab4A
1ao8A
1aoeA
1bjtA
1jkxA
1juvA
1mejA
1meoA
1n0uA
1obhA
1pjqA
1qzfA
and a test 'completed' file:
1ab4A
1aoeA
1bjtA
1obhA
1qzfA
Here is the output of the program as it is at the moment:
1ab4A
1ao8A
1jkxA
1juvA
1mejA
1meoA
1n0uA
1pjqA
Which is wrong, as I need to print out all those items in the master file that are not in the completed file as well and, as you can see, that's not happening here. All thoughts/advice much appreciated.

Replies are listed 'Best First'.
Re: Pulling out data from one file thats not in another
by toolic (Bishop) on Apr 27, 2010 at 15:08 UTC
    The problem with your code is that your getComData sub does not populate your hash ref as you expect it to. Prove this to yourself by printing the data structure using Data::Dumper (Tip #4 from the Basic debugging checklist):
    use Data::Dumper;
    print Dumper($hComData);

    Your sort looks very strange:

    my ($chn, $ex) = sort ($hEntry->{chain}, $hEntry->{exists});

    If you don't need a Hash-of-Hashes, just use a simple hash.
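A minimal sketch of the simple-hash idea, assuming one entry per line. The helper names `read_seen` and `not_seen` are made up for illustration and are not from the original script. The last two lines also show what the sort does to the first entry: the default sort compares as strings, so the counter `1` sorts ahead of `1ab4A` and ends up as the hash key.

```perl
use strict;
use warnings;

# Build a lookup hash from a list of lines (hypothetical helper).
sub read_seen {
    my @lines = @_;
    chomp @lines;
    my %seen = map { $_ => 1 } @lines;
    return \%seen;
}

# Return the entries of @$master that are not keys of %$seen, in order.
sub not_seen {
    my ($master, $seen) = @_;
    return grep { !$seen->{$_} } @$master;
}

# Sample data from the question.
my @master    = qw(1ab4A 1ao8A 1aoeA 1bjtA 1jkxA 1juvA 1mejA 1meoA 1n0uA 1obhA 1pjqA 1qzfA);
my @completed = qw(1ab4A 1aoeA 1bjtA 1obhA 1qzfA);

my $seen = read_seen(@completed);
print "$_\n" for not_seen(\@master, $seen);

# Why the original key was wrong for the first line of the completed file:
# string comparison puts the counter 1 before the entry name.
my ($first) = sort ('1ab4A', 1);
print "first after sort: $first\n";
```

With the sample data this prints the seven master-only entries, and then shows that `$first` is `1`, which is why `1ab4A` never made it into the hash as a key.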

Re: Pulling out data from one file thats not in another
by kennethk (Abbot) on Apr 27, 2010 at 14:55 UTC
      I tried 'diff'; the trouble there is that the items in the two files don't always appear in the same order. It simply doesn't work. I'll take a peek at the links you suggested though.
        By using a hash as per the FAQ, the intersection/difference calculation will be order-independent. You will have to compare the resulting hash (called %count in the FAQ) against a given file's content to determine which file lacked the line in question. Note that the FAQ's code fails if either array has repeat entries.
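For reference, the counting approach can be sketched roughly as follows. This assumes the FAQ meant here is the perlfaq4 entry on the difference and intersection of two arrays; it is adapted from memory rather than copied, and the wrapper `set_ops` is a made-up name. The "difference" it yields is symmetric (elements in exactly one list), which is why a second pass against one file's content is needed.

```perl
use strict;
use warnings;

# Count how often each element occurs across both lists.
# Assumes neither list contains duplicates, as noted above.
sub set_ops {
    my ($list1, $list2) = @_;
    my %count;
    $count{$_}++ for @$list1, @$list2;

    my (@intersection, @difference);
    for my $element (sort keys %count) {
        # Seen twice: in both lists. Seen once: in exactly one of them.
        push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
    }
    return (\@intersection, \@difference);
}

# Sample data from the question.
my @master    = qw(1ab4A 1ao8A 1aoeA 1bjtA 1jkxA 1juvA 1mejA 1meoA 1n0uA 1obhA 1pjqA 1qzfA);
my @completed = qw(1ab4A 1aoeA 1bjtA 1obhA 1qzfA);

my ($both, $one_only) = set_ops(\@master, \@completed);

# Second pass: keep only the "one list only" elements that came from master.
my %in_master   = map { $_ => 1 } @master;
my @master_only = grep { $in_master{$_} } @$one_only;
print "$_\n" for @master_only;
```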

        Alternatively, you can use bit operations rather than simple incrementation to encode a little extra info. The FAQ code structure is more immediately obvious, but this may do more of what you want:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $master = shift;
        my $completed = shift;

        open my $mh, '<', $master or die "Open fail on $master: $!";
        my @master_lines = <$mh>;
        chomp @master_lines;

        open my $ch, '<', $completed or die "Open fail on $completed: $!";
        my @completed_lines = <$ch>;
        chomp @completed_lines;

        my %count;
        for my $element (@master_lines) {
            $count{$element}|=1;
        }
        for my $element (@completed_lines) {
            $count{$element}|=2;
        }

        print "$master only:\n";
        for my $element (@master_lines) {
            next if $count{$element} & 2;
            print "$element\n";
        }

        print "$completed only:\n";
        for my $element (@completed_lines) {
            next if $count{$element} & 1;
            print "$element\n";
        }

        There are already several tools that do what you want, so writing your own is probably needless.

        A standard Unix-like solution (works under bash):
        $ diff <( sort master ) <( sort completed ) | grep '^<' | cut -d ' ' -f2-

        Depending on your needs you may want to use sort -u instead of a simple sort.

        Or if you're under some Debian-derivative distro just install the moreutils package and use combine:

        $ combine master not completed
        1ao8A
        1jkxA
        1juvA
        1mejA
        1meoA
        1n0uA
        1pjqA

        Hope that helps.

Re: Pulling out data from one file thats not in another
by sierpinski (Chaplain) on Apr 27, 2010 at 14:57 UTC
    Any reason why you can't sort them and just use 'diff'? I mean if it needs to be Perl, you could still use your OS's diff command inside of Perl.
    diff file1 file2 > diff.out
Re: Pulling out data from one file thats not in another
by ig (Vicar) on Apr 27, 2010 at 16:02 UTC

    I would probably do something like the following:

    #!/usr/bin/perl
    #
    use strict;
    use warnings;

    (@ARGV == 2) or die "USAGE: $0 master complete\n";

    my ($master, $complete) = @ARGV;

    open(my $fh_complete, '<', $complete) or die "$complete: $!";
    open(my $fh_master, '<', $master ) or die "$master: $!";

    my %complete = map { $_=> 1 } <$fh_complete>;

    my @incomplete = grep { !$complete{$_} } <$fh_master>;

    print @incomplete;

    update: kennethk pointed out in a private message that the requirement might be an XOR operation (to paraphrase: all lines from both files (master and complete) that are not in both files), rather than all lines in the master file that are not also in the complete file. The program above provides the latter.

    update: maybe it would be clearer to say: all lines that appear in either and only one of the two files. It is difficult to be both simple and unambiguous with English.

Re: Pulling out data from one file thats not in another
by almut (Canon) on Apr 27, 2010 at 16:31 UTC
    $ perl -nle '(1..eof)?$h{$_}++:$h{$_}||print' completed master
    1ao8A
    1jkxA
    1juvA
    1mejA
    1meoA
    1n0uA
    1pjqA

    (sorry, couldn't resist golfing :)
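For anyone puzzling over the one-liner, here is an ungolfed sketch of what it appears to do. As far as I can tell, `1..eof` is the flip-flop form of the range operator: it turns true on input line 1 and stays true through the last line of the first file (`completed`), so those lines are counted in `%h`; every later line (from `master`) is printed unless it was counted. The sub name `master_only` and the in-memory filehandles are illustrative stand-ins, not part of the original.

```perl
use strict;
use warnings;

# Return the lines of $master_fh that do not occur in $completed_fh.
sub master_only {
    my ($completed_fh, $master_fh) = @_;

    # First pass (the "1..eof" branch): remember every completed line.
    my %h;
    while (my $line = <$completed_fh>) {
        chomp $line;
        $h{$line}++;
    }

    # Second pass (the other branch): keep master lines never seen before.
    my @out;
    while (my $line = <$master_fh>) {
        chomp $line;
        push @out, $line unless $h{$line};
    }
    return @out;
}

# In-memory filehandles stand in for the two files here.
my $completed_text = join("\n", qw(1ab4A 1aoeA 1bjtA 1obhA 1qzfA)) . "\n";
my $master_text    = join("\n", qw(1ab4A 1ao8A 1aoeA 1bjtA 1jkxA 1juvA
                                   1mejA 1meoA 1n0uA 1obhA 1pjqA 1qzfA)) . "\n";
open my $completed_fh, '<', \$completed_text or die "in-memory open failed: $!";
open my $master_fh,    '<', \$master_text    or die "in-memory open failed: $!";

print "$_\n" for master_only($completed_fh, $master_fh);
```

Note that the argument order matters in both versions: the completed file has to be read first.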

Re: Pulling out data from one file thats not in another
by graff (Chancellor) on Apr 27, 2010 at 18:37 UTC
    The problem that the OP wants to solve is just one instance of a very general problem: doing "set arithmetic" on two lists: intersection, union, exclusive-or.

    I tend to encounter problems of this type so often in my work that I wrote a very general tool to address it -- you can find it here: cmpcol.

    It allows that one or both input lists may actually be flat tables (possibly with varied field delimiters), where just one or more columns are of interest for doing the set-arithmetic.
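To illustrate the column idea only (this is not cmpcol and makes no claim about its interface), a rough sketch that compares two lists of whitespace-delimited rows on a single column. The sub name `rows_only_in_first`, the zero-based `$col` argument, and the two-column sample rows are all made up for the example.

```perl
use strict;
use warnings;

# Return rows of @$rows_a whose key column does not appear in @$rows_b.
# $col is a zero-based column index; fields are split on whitespace.
sub rows_only_in_first {
    my ($rows_a, $rows_b, $col) = @_;

    # Collect the key column of every row in the second list.
    my %seen;
    for my $row (@$rows_b) {
        my @fields = split ' ', $row;
        $seen{ $fields[$col] } = 1 if defined $fields[$col];
    }

    # Keep rows from the first list whose key was not collected.
    return grep {
        my @fields = split ' ', $_;
        defined $fields[$col] && !$seen{ $fields[$col] };
    } @$rows_a;
}

# Made-up rows; only the first column is compared.
my @master    = ('1ab4A done', '1ao8A pending', '1aoeA done');
my @completed = ('1ab4A 2010-04-27', '1aoeA 2010-04-26');
print "$_\n" for rows_only_in_first(\@master, \@completed, 0);
```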
