Trying to remove duplicate rows using hashes

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have a file from which I need to remove rows that partly duplicate other rows.
Example file (tiny compared to the real thing):

d1 c1.1  f1  d1.1
d1 c1.1  f2  d1.2
d2 c1.1  f1  d1.1
d3 c1.1  f1  d1.1
d4 c1.1  f1  d1.1
d4 c1.1  f2  d1.2
d5 c1.1  f4  d1.4
d6 c1.1  f5  d1.5

For each c1.? group, I want a printout in which all rows with a duplicate d1.? entry are removed. In other words, I'm after something like this:

d1 c1.1  f1  d1.1
d1 c1.1  f2  d1.2
d5 c1.1  f4  d1.4
d6 c1.1  f5  d1.5

The printout should include all four columns. Here is what I've attempted so far:

#!/usr/bin/perl -w

use strict;
use warnings;
use English;
use FileHandle;

use Exception;

my ($fIn) = $ARGV[0];
    
open(FILE, "$fIn") || die "ERROR: Can't open $fIn  file: $!\n";
    
my %hash;

my $c_id;
my $d_id;
my $f_var;

while(<FILE>)
{
   chomp;

   my @data = split(/\s+/, $_);

   $c_id = $data[1];
   $d_id = $data[3];
   $f_var = $data[2];

   if(!$hash{$c_id}{$f_var})
   {
     $hash{$c_id}{$f_var} = $d_id;

    }

 }

while (( my $k1, my $k2) = each %hash)
{
    print "$k1 ";
    while (( $k2, my $k3) = each %$k2)
    {

    print "$k2 $k3 ";
    }
    print "\n";

}

But sadly I'm getting an error about not being able to use a string as a HASH ref while 'strict refs' are in use. Could someone please point me in the right direction? Thanks


Replies are listed 'Best First'.
Re: Trying to remove duplicate rows using hashes by kyle (Abbot) on Oct 21, 2008 at 15:52 UTC
When you output, you use the value of `$k2` as a hash reference, but the second time through the inner loop, its value has been replaced by the key that `each` returned the first time through. You need something more like this: `while ( my ($k1, $href) = each %hash ) { while ( my ($k2, $k3) = each %{ $href } ) { } }` Incidentally, you `use English` without the important `-no_match_vars` option, and you `use warnings` as well as giving the `-w` switch, which is a bit redundant. Also, your sample output is in the order that the lines were received, but you won't get your output in that order if you're using `each` and hashes for storage. There are ways of coping with that, but I can't tell if that's one of your requirements or not.
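To make the loop above concrete, here is a minimal sketch of the corrected traversal. The sub name `dump_nested` and the sample hash are made up for illustration, and the `sort` calls are only there to give repeatable output; none of it comes from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: walks a two-level hash and returns one line per outer key.
sub dump_nested {
    my ($hash) = @_;
    my @lines;
    while ( my ($c_id, $inner) = each %$hash ) {
        my @pairs;
        # $inner stays a hash reference for the whole inner loop, because
        # the inner each() assigns to its own variables, not to $inner.
        while ( my ($f_var, $d_id) = each %$inner ) {
            push @pairs, "$f_var $d_id";
        }
        # each() returns pairs in no particular order, so sort for stable output.
        push @lines, join ' ', $c_id, sort @pairs;
    }
    return sort @lines;
}

# Sample data shaped like the OP's %hash (assumed, not taken from a real run).
my %hash = ( 'c1.1' => { f1 => 'd1.1', f2 => 'd1.2' } );
print "$_\n" for dump_nested(\%hash);
```

With that sample hash it should print `c1.1 f1 d1.1 f2 d1.2`. Note that it still lacks the first column, since the OP's hash never stores it.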
Re^2: Trying to remove duplicate rows using hashes by Angharad (Pilgrim) on Oct 21, 2008 at 15:55 UTC
Thanks for all your help so far. Much appreciated. As regards the printout, do you mean I have to use some kind of sort function if I want it to look like my example?
Re^3: Trying to remove duplicate rows using hashes by kyle (Abbot) on Oct 21, 2008 at 15:58 UTC
Yes, you could sort them before you output them, or you can sort them separately in another process. The UNIX sort command is very effective for tabular data such as yours.
Re: Trying to remove duplicate rows using hashes by jettero (Monsignor) on Oct 21, 2008 at 15:45 UTC
Offtopic, sorry, I'm sure an actual answer will be along very shortly. You don't really use any of these... Why load them? `use English; use FileHandle; use Exception;` Oh, I have an actual answer, I guess. It seems you're stomping on your `$k2` with your second `each()` call... -Paul
Re^2: Trying to remove duplicate rows using hashes by Angharad (Pilgrim) on Oct 21, 2008 at 15:47 UTC
Oh, well, I guess I just cut and pasted those from other scripts, but yes, in this instance I guess they are rather redundant. EDIT: What do you mean by "stomping"?
Re^3: Trying to remove duplicate rows using hashes by jettero (Monsignor) on Oct 21, 2008 at 18:07 UTC
> EDIT: what do you mean by the "stomping"?

I'm sure others have already answered this... I mean that your second `each` call is overwriting your `$k2` value, so your next call to `each` operates on the value you replaced `$k2` with instead of the hashref you meant to have there. "Stomping" is a colloquial expression, by which I mean to say "accidentally replacing" or "stepping on." -Paul
Re: Trying to remove duplicate rows using hashes by ccn (Vicar) on Oct 21, 2008 at 16:06 UTC
One line:

ccn@laptop:~$ perl -lane '$H{$F[1].$F[3]}++ or print' file.txt
d1 c1.1 f1 d1.1
d1 c1.1 f2 d1.2
d5 c1.1 f4 d1.4
d6 c1.1 f5 d1.5
ccn@laptop:~$
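For anyone who would rather have that as a script, here is a rough long-hand sketch of the same idea: keep a row only the first time its (column 2, column 4) pair is seen. The sub name `dedupe_rows` is hypothetical, and joining the key with a `"\0"` separator is an addition of this sketch so that two different pairs cannot concatenate to the same string.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the first row seen for each (column 2, column 4) pair,
# in input order.
sub dedupe_rows {
    my @rows = @_;
    my %seen;
    my @kept;
    for my $row (@rows) {
        my @f = split ' ', $row;          # split on runs of whitespace
        next unless @f >= 4;              # skip blank or short lines
        my $key = join "\0", @f[1, 3];    # c column plus d column
        push @kept, $row unless $seen{$key}++;
    }
    return @kept;
}

# The sample rows from the original question.
my @sample = (
    'd1 c1.1  f1  d1.1',
    'd1 c1.1  f2  d1.2',
    'd2 c1.1  f1  d1.1',
    'd3 c1.1  f1  d1.1',
    'd4 c1.1  f1  d1.1',
    'd4 c1.1  f2  d1.2',
    'd5 c1.1  f4  d1.4',
    'd6 c1.1  f5  d1.5',
);
print "$_\n" for dedupe_rows(@sample);
```

On the sample data this keeps the d1/f1, d1/f2, d5 and d6 rows, which matches the output the OP asked for. Because rows are pushed as they arrive, no sorting is needed to preserve the input order.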
Re: Trying to remove duplicate rows using hashes by mje (Curate) on Oct 21, 2008 at 15:57 UTC
I think you'll find you've at least got an error here: `if(!$hash{$c_id}{$f_var}) { $hash{$c_id}{$f_var} = $d_id; }` as I think a) that will always test true (I think you want to look at `perldoc -f exists`) and b) you are only going to get the first c_id/f_var and you didn't say that is what you wanted - you said remove duplicates.
Re^2: Trying to remove duplicate rows using hashes by driver8 (Scribe) on Oct 21, 2008 at 19:59 UTC
I think that you are wrong about both of those. Did you test it?
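Point (a) is easy to check with a small sketch. The sub name `store_first` and the value `d9.9` are invented for the test; the condition is the same truth test as in the OP's `if`.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the OP's test: store $d_id only if the
# slot does not already hold a true value. Returns 1 if stored, 0 if skipped.
sub store_first {
    my ($hash, $c_id, $f_var, $d_id) = @_;
    return 0 if $hash->{$c_id}{$f_var};   # slot already holds a true value
    $hash->{$c_id}{$f_var} = $d_id;
    return 1;
}

my %hash;
for my $d_id ('d1.1', 'd9.9') {
    my $stored = store_first(\%hash, 'c1.1', 'f1', $d_id);
    print $stored ? "stored $d_id\n" : "skipped $d_id\n";
}
```

The first call stores `d1.1` and the second is skipped, so the test does not always come out true. The truth test and `exists` would only behave differently if a stored value could be false, such as `0` or the empty string, which is not the case for the sample data.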
Re^2: Trying to remove duplicate rows using hashes by Angharad (Pilgrim) on Oct 21, 2008 at 16:18 UTC
Yes, it's duplicates I would like removed.
Re: Trying to remove duplicate rows using hashes by swampyankee (Parson) on Oct 21, 2008 at 16:27 UTC
~~On a machine with a decent set of command line tools, you could do something like~~ `sort -u file > sorted_file_without_duplicates` ~~where `file` has your data.~~ As kyle pointed out, this suggestion doesn't meet the OP's needs. Sorry, and please disregard. -emc
Re^2: Trying to remove duplicate rows using hashes by kyle (Abbot) on Oct 21, 2008 at 16:30 UTC
The OP is not trying to remove identical lines but rather lines that have two of four fields equivalent. In the example given, the lines removed differ in the first field, so "`sort -u`" would not remove them.

