Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Delimited files Comparision

by harishnuti (Beadle)
on Jul 28, 2008 at 03:35 UTC ( [id://700448]=perlquestion: print w/replies, xml ) Need Help??

harishnuti has asked for the wisdom of the Perl Monks concerning the following question:


Hello Monks i need your advice on my below requirement and the suggestions on effort i put to achieve it

* i have 2 delimited files which needs to be compared
* The headers for this delimited files is present in another file
* i need to skip some fields while comparision
* finally i should print the differences between these 2 delimited files i.e. only fields which are different along with header in good report format.
A Extract of File1 shown below
!A90000!,!B00500!,!1!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!2!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!3!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!4!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!I912ST!,!SG!,!30/08/2004!,!288!,!0!
!I912ST!,!SN!,!30/08/2004!,!249!,!0!
!I912ST!,!TR!,!30/08/2004!,!56!,!0!
!I912ST!,!WC!,!30/08/2004!,!211!,!0!

A Extract of file2 shown below
!A90000!,!B00500!,!0!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!3!,!0!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!3!,!4!,!2!,!23/02/1998!,!2!,!1!,!B005AU!
!A90000!,!B00500!,!4!,!0!,!8!,!23/02/1998!,!2!,!1!,!B005AU!
!I912ST!,!SG!,!30/08/2004!,!248!,!0!
!I912ST!,!SN!,!30/08/2004!,!290!,!0!
!I912ST!,!FZ!,!30/08/2004!,!56!,!0!
!I912ST!,!WC!,!30/08/2004!,!211!,!1!

Header file will be below format
A90000,CT03,1, 1,00240
A90000,CT03,1, 2,00020
A90000,CT03,1, 3,00090
A90000,CT03,1, 4,00690
A90000,CT03,0, 5,01040
A90000,CT03,0, 6,00241
A90000,CT03,0, 7,00243
A90000,CT03,0, 8,00710
A90000,CT03,0, 9,00242
I91200,CW5Y,1, 1,17447
I91200,CW5Y,0, 2,01040
I91200,CW5Y,0, 3,04010
I91200,CW5Y,0, 4,04620
and so on ..... A90000 , I91200 etc , need to generalize this solution
The last column are header for the data in above files
my Code below
# Read header file and store it in hash open(FILE,"header.txt"); while(<FILE>){ chomp; next if (/^\s*$/); my @fields = split(/\,/,$_); push(@{$hashkeys{$fields[0]}{fields}},$fields[4]); # Iam pushing all headers for each table i.e. for field[0] into +hash array } CLOSE FILE; # open file1 and file2 and readthem into Hashes %hash1 and %hash2 as b +elow open FILE,"file1"; while ( <FILE> ){ my @arr = split ( /\,/,$_ ); my $key = join("-->",@arr); $hash1{$key} = 1; } # llly for file2 i.e. %hash2 contains above. # The below arrays gives me lines which are different my @missing1 = grep ! exists $hash1{ $_ }, keys %hash2; my @missing2 = grep ! exists $hash2{ $_ }, keys %hash1; # now iam reformatting the arrays w.r.t header file # Now i want to show output as below.. # Cols which are different in file1 00240 00020 00090 00690 01040 00241 00243 00710 00242 1 2 4 8 # Similarly for file2 # i have done it in lengthy way , with no luck and no formatting

i can paste the whole code, but its in pretty bad shape, if there is any better way to achieve above formatting, i love to know.
iam happy even if any suggestions are given to achieve above formatting, i can try experimenting in my code

Replies are listed 'Best First'.
Re: Delimited files Comparision
by TedPride (Priest) on Jul 28, 2008 at 04:43 UTC
    Not exactly sure what the numbers 1, 2, 4, 8 represent, but here's a 5 or 10 minute hack job:
    use strict; my ($handle, @header, @file1, @file2, $row, $col, @output); ### Load header data open ($handle, 'header.txt') || die "Can't load header file.\n"; while (<$handle>) { @_ = split /,/, $_; $_ = pop @_; chomp; push @header, $_; } close ($handle); ### Load file 1 as array of arrays open ($handle, 'file1.txt') || die "Can't load data file 1.\n"; while (<$handle>) { push @file1, [split /,/, $_]; } close ($handle); ### Load file 2 open ($handle, 'file2.txt') || die "Can't load data file 2.\n"; while (<$handle>) { push @file2, [split /,/, $_]; } close ($handle); ### Output header line print join "\t", @header; print "\n"; ### For each line, output x for fields with a difference for ($row = 0; $row <= $#file1; $row++) { @output = (); for ($col = 0; $col <= $#header; $col++) { push @output, $file1[$row][$col] ne $file2[$row][$col] ? 'x' : ''; } print join "\t", @output; print "\n"; }

      Well , thats Amazing, i would like to thanq for effort and spending some time in looking into my problem..
      sorry to say , the solution works absolutely fine for the above data, but rather i have 2 more points as below.

      * The file's shown is only an extract of large file, actual file has got 1 million records of which A9000 is one entry.
      * The file1 and file2 and header contains different tables data, i have Appended one more set of data ..
      How do i extend the same solution, Can we use Hash to store and then loop on differences?

      i have one more requirement, the date field i should skip based on some position in another file, i can handle this one..
      can i have some idea on the updated set of data, i will try my best to extend your solution.

        Thanks iam able to resolve the issue using methods of hash

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://700448]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (11)
As of 2024-04-23 21:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found