PerlMonks  

comparing two huges files

by vamsikrishna (Initiate)
on Jan 28, 2008 at 03:47 UTC ( #664622=perlquestion )

vamsikrishna has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm having trouble comparing two huge files on a key column. One file has 10k records and the other has 18 million. Both files are | (pipe) delimited. I compare on the first column of each file and write the results to two separate output files. If the key columns match, the record from the 10k-record file has to go to one file. If they don't match, i.e. the key is present in the 18-million-record file but not in the 10k-record file, the record has to go to another file.

Here is the script I have written; it takes too long to run.

    #!/usr/bin/perl

    $oldFile1 = $ARGV[0];
    $newFile1 = $ARGV[1];
    $oldFile2 = "unchanged.txt";
    $changes1 = "changed.txt";

    if (!$oldFile1) {
        printf("Error: No old file is provided\n");
        exit(1);
    }
    if (!$newFile1) {
        printf("Error: No new file is provided\n");
        exit(1);
    }

    open(OLDFILE1, $oldFile1) or die("Can't open $oldFile1 for input: $!");
    open(NEWFILE1, $newFile1) or die("Can't open $newFile1 for input: $!");
    open(OLDFILE2, ">$oldFile2") or die("Can't open $oldFile2 for output: $!");
    open(CHANGES1, ">$changes1") or die("Can't open $changes1 for output: $!");

    foreach (<OLDFILE1>) {
        $line1 = $_;
        chomp $line1;
        ($x,$y,$z,$a,$b,$c)=split(/\|/,$line1);
        foreach (<NEWFILE1>) {
            $line2 = $_;
            chomp $line2;
            ($x1,$y1,$z1,$a1,$b1,$c1)=split(/\|/,$line2);
            if ($x eq $x1) {
                print OLDFILE2 "$line1\n";
            }
            else {
                print CHANGES1 "$line2\n";
            }
        }
    }

    close(OLDFILE1);
    close(NEWFILE1);
    close(OLDFILE2);
    close(CHANGES1);

Replies are listed 'Best First'.
Re: comparing two huges files
by GrandFather (Sage) on Jan 28, 2008 at 03:55 UTC

    This sounds very like "how to find differences between two huge files". Maybe you could read that thread and find what you need, or perhaps you should talk to your workmate/classmate and see how he solved it?

    Update: hmm, on second thoughts it's not the same - it was hard to tell because of your rubbish formatting, sorry.

    Build a hash from the first (smaller) file, then use it in a single pass through the second (larger) file to figure out where stuff goes. Consider:

        use strict;
        use warnings;

        #Hello,
        #
        #Currently I'm facing problem with comparing two huges files on a particular key
        #column. One file consists of 10k records and other one 18million records. Both
        #files are | (pipe) delimited. I am comparing based on the first column in the
        #two files and redirecting them to two separate files.
        #If Key columns are same it has to pick the record from 10k records file and send
        #it to one file.
        #If the Key columns are not matching ie., the key column is present in 18 million
        #records file but not in 10k records file, it has to go into another file.
        #
        #Here I'm pasting the query what I have written, taking more time.

        my $oldFile1 = <<DAT;
        1|oldFile|another field
        5|oldFile
        DAT

        my $newFile1 = <<DAT;
        1|newFile1|z
        2|newFile1|x
        3|newFile1|y
        4|newFile1|p
        DAT

        my $oldFile2;
        my $changes1;

        open OLDFILE1, '<', \$oldFile1;

        # Build the reference hash from the 'small' file
        my %oldKeys;
        while (<OLDFILE1>) {
            chomp;
            my ($key, $tail) = split /\|/, $_, 2;
            if (exists $oldKeys{$key}) {
                warn "Key $key duplicated. Duplicate ignored!\n";
                next;
            }
            $oldKeys{$key} = $tail;
        }
        close OLDFILE1;

        # Process the new file
        open NEWFILE1, '<', \$newFile1;
        open OLDFILE2, '>', \$oldFile2;
        open CHANGES1, '>', \$changes1;

        while (<NEWFILE1>) {
            chomp;
            my ($key, $tail) = split /\|/, $_, 2;
            if (exists $oldKeys{$key}) {
                print OLDFILE2 "$key|$oldKeys{$key}\n";
            }
            else {
                print CHANGES1 "$key|$tail\n";
            }
        }

        close (NEWFILE1);
        close (OLDFILE2);
        close (CHANGES1);

        print "OLDFILE2:\n$oldFile2\n\n";
        print "CHANGES1:\n$changes1\n\n";

    prints:

        OLDFILE2:
        1|oldFile|another field

        CHANGES1:
        2|newFile1|x
        3|newFile1|y
        4|newFile1|p

    Perl is environmentally friendly - it saves trees
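
The demonstration above reads from in-memory strings. As a rough sketch only, the same hash-then-single-pass idea could be adapted to files on disk along the following lines. The sub name `split_by_key` and its four path arguments are hypothetical and not from the thread; the default output names are borrowed from the original script. It assumes the small file fits comfortably in memory, and it keeps the first record when a key is repeated in the small file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): index the small file by its
# first column, then stream the large file once and route each record.
sub split_by_key {
    my ($small_path, $large_path, $match_path, $other_path) = @_;

    # Build the lookup table from the small (10k-record) file.
    my %small;
    open my $small_fh, '<', $small_path
        or die "Can't open $small_path for input: $!";
    while (my $line = <$small_fh>) {
        chomp $line;
        my ($key) = split /\|/, $line, 2;
        next unless defined $key;          # skip empty lines
        $small{$key} //= $line;            # keep the first record per key
    }
    close $small_fh;

    open my $large_fh, '<', $large_path
        or die "Can't open $large_path for input: $!";
    open my $match_fh, '>', $match_path
        or die "Can't open $match_path for output: $!";
    open my $other_fh, '>', $other_path
        or die "Can't open $other_path for output: $!";

    # Single pass over the large file, holding one line in memory at a time.
    my ($matched, $unmatched) = (0, 0);
    while (my $line = <$large_fh>) {
        chomp $line;
        my ($key) = split /\|/, $line, 2;
        next unless defined $key;
        if (exists $small{$key}) {
            print {$match_fh} "$small{$key}\n";   # record from the small file
            $matched++;
        }
        else {
            print {$other_fh} "$line\n";          # record from the large file
            $unmatched++;
        }
    }
    close $large_fh;
    close $match_fh or die "Can't close $match_path: $!";
    close $other_fh or die "Can't close $other_path: $!";

    return ($matched, $unmatched);
}

# Run only when exactly two input paths are given on the command line.
if (@ARGV == 2) {
    my ($matched, $unmatched) =
        split_by_key(@ARGV, 'unchanged.txt', 'changed.txt');
    print "matched: $matched, unmatched: $unmatched\n";
}
```

Each file is read exactly once, so the work grows with the total number of lines rather than with the product of the two file sizes, and only the small file's records are held in memory.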
Re: comparing two huges files
by grep (Monsignor) on Jan 28, 2008 at 04:08 UTC
    It's a little difficult to understand the specific details (that and it's hard to read w/o <code> tags), but I think I can figure out enough to help you.

    Create a hash. Read the 10K file first. Use the 1st col as the key and the rest of the record as the value.

    Then when you loop over the second file if the first col exists in the original hash, write whatever record you want to the file.

    Here is some code. It may not do exactly what you want, because I'm guessing at your spec.

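        ## UNTESTED
        use strict;

        # Index the smaller file: first column is the key, the rest is the value.
        my %hash;
        open(FH,'<','oldfile') or die "$!\n";
        while (<FH>) {    # read line by line instead of loading the whole file
            chomp;
            my ($key,$data) = split(/\|/,$_,2);
            $hash{$key} = $data;
        }
        close FH;

        open(NEW,'<','newfile') or die "$!\n";
        open(OUT,'>','outfile') or die "$!\n";

        # Single pass over the larger file; write the stored record on a key match.
        while (<NEW>) {
            chomp;
            my ($key) = split(/\|/,$_,2);
            if ( exists $hash{$key} ) {
                print OUT "$hash{$key}\n";
            }
        }
        close NEW;
        close OUT;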
    grep
    One dead unjugged rabbit fish later...
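
Both replies pull the key out with a three-argument `split`. As a small illustration, the snippet below shows what the limit of 2 does compared with an unlimited split. The sample record is made up for the example and is not taken from the poster's data.

```perl
use strict;
use warnings;

# Hypothetical sample record; the field values are invented for illustration.
my $record = '1042|Smith|NY|2008-01-28';

# A limit of 2 stops splitting after the first pipe: the key comes back
# on its own and the remaining fields stay joined in a single string.
my ($key, $rest) = split /\|/, $record, 2;

# Without a limit, every field becomes a separate list element.
my @fields = split /\|/, $record;

print "key:    $key\n";                    # key:    1042
print "rest:   $rest\n";                   # rest:   Smith|NY|2008-01-28
print "fields: ", scalar(@fields), "\n";   # fields: 4
```

Because only the first column is needed for the lookup, the limited split avoids breaking every record of the 18-million-line file into all of its fields.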
