PerlMonks  

Re: comparing two files for duplicate entries

by davido (Cardinal)
on Sep 27, 2006 at 17:45 UTC ( [id://575204]=note )


in reply to comparing two files for duplicate entries

This is even less memory efficient, but I couldn't resist turning your problem into a golfed one-liner. I'm sure someone else will squeeze a few extra characters out of it:

perl -ane "push@{$h{$F[0]}},$_;END{while(($k,$v)=each%h){print@{$v}if@{$v}>1}}" dat1.txt dat2.txt

-a autosplits each input line into @F, and -n wraps the -e code in a while(<>){...} loop. As the one-liner iterates over the two (or more) files, it pushes each line onto an anonymous array held in a hash keyed on the "objectN" field (the first element of @F).

After the implicit while loop (-n) finishes, the END{} block is executed. Here we test each hash element to see if its anonymous array holds more than one element. If it does, we print the array. We're taking advantage of the fact that each array element still ends with the \n newline from the original file's line, which is why "print @ARRAY" produces one element per output line.
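For readers who find the golfed form hard to follow, the same logic can be sketched as an ordinary subroutine. This is only an illustration: the name duplicate_groups is made up here, and skipping blank lines is an addition that the one-liner does not make.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): group lines by
# their first whitespace-separated field and return every group that
# holds more than one line.
sub duplicate_groups {
    my @lines = @_;
    my %h;
    for my $line (@lines) {
        my @F = split ' ', $line;        # the split that -a performs per line
        next unless @F;                  # assumption: ignore blank lines
        push @{ $h{ $F[0] } }, $line;    # autovivifies the anonymous array
    }
    my @out;
    while ( my ( $k, $v ) = each %h ) {  # same traversal as the END{} block
        push @out, @{$v} if @{$v} > 1;
    }
    return @out;
}

# Possible use from a script: print duplicate_groups(<>);
```

Because the lines keep their trailing newlines, printing the returned list gives one entry per line, just as in the one-liner. The order of the groups still depends on the hash traversal.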

I hope my description of this solution helps, but you can also brush up on perlrun for more details. There are a couple of caveats with this one-liner. First, both files are slurped into a hash in their entirety. Second, the output is in no particular order.
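If the unordered output is a problem, one possible workaround is to walk the keys in sorted order instead of using each. The sketch below assumes a plain string sort on the key is acceptable; duplicate_groups_sorted is a hypothetical name, and the memory caveat still applies because every line is held in the hash.

```perl
use strict;
use warnings;

# Hypothetical variant: same grouping as the one-liner, but the groups
# come back in sorted key order rather than hash order.
sub duplicate_groups_sorted {
    my @lines = @_;
    my %h;
    for my $line (@lines) {
        my ($key) = split ' ', $line;   # first field only
        next unless defined $key;       # assumption: ignore blank lines
        push @{ $h{$key} }, $line;
    }
    # Keep only keys seen more than once, then flatten their lines.
    return map { @{ $h{$_} } }
           grep { @{ $h{$_} } > 1 }
           sort keys %h;
}
```

Within each group the lines stay in the order they were read, so entries from the first file come before entries from the second.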


Dave

Replies are listed 'Best First'.
Re^2: comparing two files for duplicate entries
by mk. (Friar) on Sep 27, 2006 at 21:21 UTC
    based on my original code! (:
    perl -ne '/(\S*)\s+(\S*)/;(!$h{$1})?$h{$1}=$2:print "$1 $h{$1} $2\n";' file1 file2


    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    "one who asks a question is a fool for five minutes; one who does not ask a question remains a fool forever."

    mk at perl dot org dot br

      That one formats nicely. It does have a quirk: for each of multiple repeats, it always prints the "value" of the first find next to the current find. That's a mouthful, so let me demonstrate with a contrived data set:

      file1.....
      test1 abc
      test2 def
      test3 ghi
      test4 jkl

      file2.....
      test1 ghi
      test3 jkl
      test3 mno

      And the output.....

      test1 abc ghi
      test3 ghi jkl
      test3 ghi mno

      As you can see, test3's "ghi" (the first sequence found) gets repeated for each 'test3' found. Not that there's anything wrong with that. ;)
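The behaviour is easier to see when the logic is spelled out. The following is a rough sketch, not mk.'s actual code: report_repeats is a hypothetical name, and it tests with exists rather than the truthiness check in the one-liner, so a value of "0" is handled slightly differently.

```perl
use strict;
use warnings;

# Hypothetical expansion of the -ne one-liner: remember the first value
# seen for each key and report it next to every later value.
sub report_repeats {
    my @lines = @_;
    my ( %first, @out );
    for my $line (@lines) {
        my ( $key, $value ) = split ' ', $line;
        next unless defined $value;      # assumption: skip malformed lines
        if ( !exists $first{$key} ) {
            $first{$key} = $value;       # first sighting: store, print nothing
        }
        else {
            # later sightings always pair with the FIRST stored value
            push @out, "$key $first{$key} $value\n";
        }
    }
    return @out;
}
```

Since the stored value is never updated, the second and third test3 lines are both reported against "ghi".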

      If you use the -a switch, you will shave off a few more keystrokes from your solution though, and that's got to be worth something!

      perl -ane '($a,$b)=@F;!$h{$a}?$h{$a}=$b:print"$a $h{$a} $b\n"' file1 file2

      I do like your solution since it preserves order and formats nicely. Good job.


      Dave
