
comparing two files for duplicate entries

by Angharad (Pilgrim)
on Sep 27, 2006 at 16:18 UTC ( #575187=perlquestion )

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

Hi there
I need to compare two text files. They look something like this
object1 45.88
object2 45.12
And so on for file 1
object4 23.12
object1 21.56
And so on for file 2.
I am interested in printing out to another file all those cases where the same 'object' (for example, object1) is in both files plus the scores associated with that object.
for example - I want to be able to pull out and print off
object1 45.88 21.56
But not the info for object2 and 4 as they are not present in both files.
What I usually do in such cases is open one file and then, for each variable of interest in that file, search through the entire contents of the other file for that one variable, then go on to the next variable in the first list, and so on. However, I am aware that this is not a terribly efficient way of doing things and I would appreciate any suggestions as to how to write a better program for this task.

Re: comparing two files for duplicate entries
by Melly (Hermit) on Sep 27, 2006 at 16:34 UTC

    I don't know how well it would scale, but I'd build a hash from the first file ($objects{'object1'}=45.88;), then scan the second file. If the object in the second file has a defined hash entry, then print out the hash key and both values.

    Untested code... and I'm assuming a space delim. as per your examples..

    open(FILE, "file1");
    while(<FILE>){
      if(/(\S+)\s+(\S+)/){
        $hash{$1} = $2;
      }
    }
    close FILE;
    open(FILE, "file2");
    while(<FILE>){
      if(/(\S+)\s+(\S+)/){
        print "$1: $hash{$1} $2\n" if defined $hash{$1};
      }
    }
    Tom Melly,
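For anyone who wants something closer to a finished script, here is a minimal sketch of the same idea (hash from the first file, one pass over the second). The sub names read_scores and common_entries are made up for illustration, and it assumes one "name score" pair per line. It uses exists instead of defined, lexical filehandles, and checks that each open succeeded.

```perl
use strict;
use warnings;

# Read "name score" lines from $path into a hash (name => score).
sub read_scores {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my %score;
    while (my $line = <$fh>) {
        # Skip lines that do not look like "name score".
        next unless $line =~ /^\s*(\S+)\s+(\S+)/;
        $score{$1} = $2;
    }
    close $fh;
    return \%score;
}

# Return "name score1 score2" for every name in $file2 that is also in $file1.
sub common_entries {
    my ($file1, $file2) = @_;
    my $first = read_scores($file1);
    open(my $fh, '<', $file2) or die "Cannot open $file2: $!";
    my @out;
    while (my $line = <$fh>) {
        my ($name, $value) = $line =~ /^\s*(\S+)\s+(\S+)/ or next;
        # exists() is true even when the stored score is 0.
        push @out, "$name $first->{$name} $value" if exists $first->{$name};
    }
    close $fh;
    return @out;
}

# Usage: perl script.pl file1 file2
if (@ARGV == 2) {
    print "$_\n" for common_entries(@ARGV);
}
```

If a name occurs more than once in the first file, the last score wins, same as in the code above.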

      Yup, it's really just that simple (well, maybe exists rather than defined; but that's a minor nit). If your files are really, really big you probably want to use something like Berkeley_DB or one of the other DBM modules rather than reading everything into memory, but that's just an implementation detail; the basic algorithm remains the same.

      or, just slightly different:
      open(FILE1, "file1");
      open(FILE2, "file2");
      while(<FILE1>){
        /(\S*)\s+(\S*)/;
        $hash{$1}=$2;
      }
      while(<FILE2>){
        /(\S*)\s+(\S*)/;
        print "$1 $hash{$1} $2\n" if $hash{$1}
      }

      "one who asks a question is a fool for five minutes; one who does not ask a question remains a fool forever."

      mk at perl dot org dot br
Re: comparing two files for duplicate entries
by davido (Cardinal) on Sep 27, 2006 at 17:45 UTC

    This is even less memory efficient, but I couldn't resist turning your problem into a golfed one-liner. I'm sure someone else will squeeze a few extra characters out of it:

    perl -ane "push@{$h{$F[0]}},$_;END{while(($k,$v)=each%h){print@{$v}if@{$v}>1}}" dat1.txt dat2.txt

    -a = autosplit into @F. -n means wrap the -e code in a while(<>){.....} loop. So as this one-liner iterates over the two (or more) files, it pushes each line into an anonymous array held in a hash where the keys are the "objectN" (the first element of @F).

    After the implicit while loop (-n) finishes, the END{} block is executed. Here we test each hash element to see if its anonymous array holds more than one element. If it does, print the array. We're taking advantage of the fact that each array element still contains the \n newline from the original file's line endings, and that's why "print @ARRAY" results in one element per print-out line.

    I hope my description of this solution helps, but you can also brush up on perlrun for more details. There are a couple of caveats with this one-liner. First, both files are slurped into a hash in their entirety. Second, the output is in no particular order.
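In case the golfed version is hard to read, here is a rough un-golfed sketch of the same hash-of-arrays idea. The sub name repeated_lines is invented for this example. Two things differ from the one-liner and are my own choices: blank lines are skipped, and the keys are sorted so the output order is repeatable.

```perl
use strict;
use warnings;

# Group lines by their first whitespace-separated field and return the
# lines of every group that has more than one member.
sub repeated_lines {
    my (@lines) = @_;
    my %lines_for;
    for my $line (@lines) {
        my ($key) = split ' ', $line;    # same field that -a puts in $F[0]
        next unless defined $key;        # skip blank lines
        push @{ $lines_for{$key} }, $line;
    }
    my @out;
    # Sorting gives a repeatable order; each() in the one-liner does not.
    for my $key (sort keys %lines_for) {
        my $group = $lines_for{$key};
        push @out, @$group if @$group > 1;
    }
    return @out;
}

# Usage: perl script.pl dat1.txt dat2.txt
print repeated_lines(<>) if @ARGV;
```

Like the one-liner, this keeps every line of every file in memory, and a key that repeats within a single file also counts as a match.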


      based on my original code! (:
      perl -ne '/(\S*)\s+(\S*)/;(!$h{$1})?$h{$1}=$2:print "$1 $h{$1} $2\n";' file1 file2

      "one who asks a question is a fool for five minutes; one who does not ask a question remains a fool forever."

      mk at perl dot org dot br

        That one formats nicely. It does have a quirk with regard to always printing the "value" of the first find next to the current find for each of multiple repeats. That's a mouthful, let me demonstrate with a contrived data set:

        file1.....
        test1 abc
        test2 def
        test3 ghi
        test4 jkl
        file2.....
        test1 ghi
        test3 jkl
        test3 mno

        And the output.....

        test1 abc ghi
        test3 ghi jkl
        test3 ghi mno

        As you can see, test3's "ghi" (the first sequence found) gets repeated for each 'test3' found. Not that there's anything wrong with that. ;)

        If you use the -a switch, you will shave off a few more keystrokes from your solution though, and that's got to be worth something!

        perl -ane '($a,$b)=@F;!$h{$a}?$h{$a}=$b:print"$a $h{$a} $b\n"' file1 file2

        I do like your solution since it preserves order and formats nicely. Good job.
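To make the first-find quirk described above easier to see, here is a small sketch that spells out the same logic as the one-liners and runs it on the contrived data set. The sub name pair_with_first is hypothetical. One deliberate difference: it tests with exists rather than truthiness, so a stored value of 0 would still count as seen.

```perl
use strict;
use warnings;

# Remember the first value seen for each key and pair it with every
# later occurrence of that key, in input order.
sub pair_with_first {
    my (@lines) = @_;
    my (%first, @out);
    for my $line (@lines) {
        my ($key, $value) = split ' ', $line;
        next unless defined $value;    # skip blank or one-field lines
        if (exists $first{$key}) {
            push @out, "$key $first{$key} $value";
        }
        else {
            $first{$key} = $value;
        }
    }
    return @out;
}

my @file1 = ("test1 abc", "test2 def", "test3 ghi", "test4 jkl");
my @file2 = ("test1 ghi", "test3 jkl", "test3 mno");
print "$_\n" for pair_with_first(@file1, @file2);
# prints:
# test1 abc ghi
# test3 ghi jkl
# test3 ghi mno
```

Both test3 lines from the second file are paired with "ghi", the first value stored for that key.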


Re: comparing two files for duplicate entries
by fmerges (Chaplain) on Sep 27, 2006 at 16:57 UTC


    Updated: removed my code because the first code example may fit better, but with defined replaced by exists.

    If you go with some kind of db file solution, I would try DBM first: BerkeleyDB needs its libraries to be installed, whereas DBM comes with standard Perl.
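As a rough illustration of the DBM route, here is a sketch that ties the hash to an on-disk file with SDBM_File, which is bundled with Perl. The sub name common_entries_dbm and the $dbm_path argument are hypothetical, and the sketch assumes $dbm_path does not already hold data from an earlier run. SDBM has a small per-record size limit, which should be fine for short names and scores but is worth checking for anything larger.

```perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT
use SDBM_File;    # ships with Perl

# Same algorithm as the in-memory hash, but the hash lives on disk.
# $dbm_path is a scratch file name; SDBM creates .pag/.dir files from it.
sub common_entries_dbm {
    my ($file1, $file2, $dbm_path) = @_;
    tie(my %score, 'SDBM_File', $dbm_path, O_RDWR | O_CREAT, 0666)
        or die "Cannot tie $dbm_path: $!";

    # Pass 1: store every "name score" pair from the first file.
    open(my $in1, '<', $file1) or die "Cannot open $file1: $!";
    while (my $line = <$in1>) {
        my ($name, $value) = $line =~ /^\s*(\S+)\s+(\S+)/ or next;
        $score{$name} = $value;
    }
    close $in1;

    # Pass 2: report names from the second file that were stored in pass 1.
    my @out;
    open(my $in2, '<', $file2) or die "Cannot open $file2: $!";
    while (my $line = <$in2>) {
        my ($name, $value) = $line =~ /^\s*(\S+)\s+(\S+)/ or next;
        push @out, "$name $score{$name} $value" if exists $score{$name};
    }
    close $in2;

    untie %score;
    return @out;
}
```

Only the tie line differs from the plain hash version, so switching to another DBM module later should mostly be a matter of changing that call.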


    fmerges at

Node Type: perlquestion [id://575187]
Approved by chargrill