Comparing two files line by line and exporting the differences from the first file

jzelkowsz has asked for the wisdom of the Perl Monks concerning the following question:

I have two files. One is an HR record of the user's values; the other is a network export of their attributes. I am trying to compare the two files and find the differences attribute by attribute. The sole reliable key is the samaccountname which is present and consistent in every record. I am trying to produce a file like this:

barsu991,title,Director of Cooks
zingk072,symphonyemployeetype,IKP
zingk072,employeenumber,zingk072
zingk072,manager,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"

Each line of the produced file holds the samaccountname, the attribute that is incorrect, and the correct value of that attribute from the HR record, with one mistake per line.

I have tried to do this with loops like below but the best I get is a comparison with the last line and not all of them.

open(HR, "<hr.txt") || die "can't open hr";
open(AD, "<ad.txt") || die "can't open ad";
open(COMMAND, ">com.txt") || die "can't open com.txt";

while(<HR>)
{
($samaccountnameHR,$givennameHR,$snHR,$initialsHR,
$employeenumberHR,$symphonyemployeetypeHR,$mailHR,
$titleHR,$departmentHR,$companyHR,$lHR,
$physicaldeliveryofficeHR,$streetaddressHR,$stHR,
$postalcodeHR,$telephonenumberHR,$managerHR)=split(/,$/);
    


while(<AD>)
{
 ($samaccountnameAD,$givennameAD,$snAD,$initialsAD,$employeenumberAD,
 $symphonyemployeetypeAD,$mailAD, $titleAD,$departmentAD,$companyAD,
 $lAD,$physicaldeliveryofficeAD,$streetaddressAD,$stAD,$postalcodeAD,
 $telephonenumberAD,$managerAD)=split(/,$/);
        
    if ($employeenumberHR != $employeenumberAD)
    {
        print "$samaccountnameHR $samaccountnameAD\n";
    }
  }
}

HR Data:
samaccountname,givenname,sn,initials,employeenumber,
symphonyemployeetype,mail,title,department,company,l,
physicaldeliveryoffice,streetaddress,st,postalcode,
telephonenumber,manager
barsu991,Uttiam,Barski,K,20114598,IKP,
Uttiam.Barski@pulse.org,Director of Cooks,Day Kitchen,
MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,
555-555-5555,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
walkl003,Lreblemet,Walker,J,20178941,IKP,
Lreblemet.Walker@pulse.org,Head Cook,Day Kitchen,MILIFO,Alpena,
Kitchen of the World,400 Baker,WI,50987,555-555-5551,
"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
karss001,Sovyetk,Karsten,Y,20146598,IKP,Sovyetk.Karsten@pulse.org,
Dishwasher,Day Kitchen,MILIFO,Alpena,Kitchen of the World,
205 Willy B. Temple,WI,50987,,
"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
zingk072,Kovon,Zingerman,K,20113578,IKP,Kovon.Zingerman@pulse.org,
Baker,Day Kitchen,MILIFO,Alpena,Kitchen of the World,
205 Willy B. Temple,WI,50987,,
"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
peizs194,Synthia,Smite,B,20134743,IKP,Synthia.Peizer@pulse.org,
Broiler Man,Day Kitchen,MILIFO,Alpena,
Kitchen of the World,205 Willy B. Temple,
WI,50987,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
hutcy231,Yello,Hutchinson,W,20145712,IKP,
Yello Hutchinson,@pulse.org,
Bottle Washer,Day Kitchen,MILIFO,Alpena,
Kitchen of the World,400 Baker,WI,50987,
,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
haserz221,Zebediah,Haserkrilk,L,20125471,IKP,
Zebediah.Haserkrilk@kit.org,
Purchaser,Day Kitchen,MILIFO,Alpena,
Kitchen of the World,400 Baker,WI,50987,
,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"

AD data:
samaccountname,givenname,sn,initials,employeenumber,
symphonyemployeetype,mail,title,department,company,l,
physicaldeliveryoffice,streetaddress,st,postalcode,
telephonenumber,manager
barsu991,Uttiam,Barski,K,20114598,IKP,
William.Barski@pulse.org,Chief of Cooks,Day Kitchen,
MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,
555-555-5555,
"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
walkl003,Larry,Walker,J,,IKP,Larry.Walker@pulse.org,
Cook,Day Kitchen,MILIFO,Alpena,Kitchen of the World,
400 Baker,WI,50987,555-555-5551,
"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
karss001,Steven,Karsten,Y,20146598,IKP,
Steven.Karsten@pulse.org,Dishw,Day Kitchen,MILIFO,
Alpena,Sully's Kitchen,48720 Belcard,IL,34567,,
"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
zingk072,Kevin,Zingerman,K,,,Kevin.Zingerman@pulse.org,
Baker,Day Kitchen,MILIFO,Alpena,Kitchen of the World,
205 Willy B. Temple,WI,50987,,
peizs194,Samantha,Smith,B,20134743,IKP,
Samantha.Smith@pulse.org,"Man, Broiler",Day Kitchen,
MILIFO,Alpena,Kitchen of the World,205 Willy B. Temple,
WI,50987,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
hutcy231,Yaren,Hutchinson,W,20145712,IKP,
Yaren Hutchinson,@pulse.org,Bottle Washer,Day Kitchen,MILIFO,
Alpena,Kitchen of the World,400 Baker,WI,50987,,
"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
haserz221,Zebediah,Hasermann,L,,IKP,
Zebediah.Haserman@pulse.org,Purchaser,Day Kitchen,MILIFO,
Alpena,Kitchen of the World,400 Baker,WI,50987,555-555-5555,
"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"


Re: Comparing two files line by line and exporting the differences from the first file by Tux (Canon) on Jul 23, 2018 at 12:04 UTC
Using more recentish functionality of Text::CSV_XS, you get quite readable code IMHO:

    use feature "say";
    use Text::CSV_XS "csv";
    my $hr = csv (
        in           => "hr.txt",
        key          => "samaccountname",
        keep_headers => \my @keys,
        );
    my $aoh = csv (in => "ad.txt", bom => 1, on_in => sub {
        my $sam = $_[1]{samaccountname} or die "No name in AD";
        my $ahr = $hr->{$sam};
        unless ($ahr) {
            warn "I got AD data for $sam, not in HR\n";
            return;
            }
        my @diff = map { [ $_, $ahr->{$_}, $_[1]{$_} ] } grep { $ahr->{$_} ne $_[1]{$_} } @keys;
        @diff or return;
        say "Changes for samaccount $sam";
        printf " %-22s %-27.27s -> %s\n", @$_ for @diff;
        });

with the two datafiles you provided,

    $ perl test.pl
    Changes for samaccount barsu991
     mail                   Uttiam.Barski@pulse.org     -> William.Barski@pulse.org
     title                  Director of Cooks           -> Chief of Cooks
    Changes for samaccount walkl003
     givenname              Lreblemet                   -> Larry
     employeenumber         20178941                    ->
     mail                   Lreblemet.Walker@pulse.org  -> Larry.Walker@pulse.org
     title                  Head Cook                   -> Cook
    Changes for samaccount karss001
     givenname              Sovyetk                     -> Steven
     mail                   Sovyetk.Karsten@pulse.org   -> Steven.Karsten@pulse.org
     title                  Dishwasher                  -> Dishw
     physicaldeliveryoffice Kitchen of the World        -> Sully's Kitchen
     streetaddress          205 Willy B. Temple         -> 48720 Belcard
     st                     WI                          -> IL
     postalcode             50987                       -> 34567
    Changes for samaccount zingk072
     givenname              Kovon                       -> Kevin
     employeenumber         20113578                    ->
     symphonyemployeetype   IKP                         ->
     mail                   Kovon.Zingerman@pulse.org   -> Kevin.Zingerman@pulse.org
     manager                cn=manager1,ou=users,ou=Kit ->
    Changes for samaccount peizs194
     givenname              Synthia                     -> Samantha
     sn                     Smite                       -> Smith
     mail                   Synthia.Peizer@pulse.org    -> Samantha.Smith@pulse.org
     title                  Broiler Man                 -> Man, Broiler
    Changes for samaccount hutcy231
     givenname              Yello                       -> Yaren
     mail                   Yello Hutchinson            -> Yaren Hutchinson
    Changes for samaccount haserz221
     sn                     Haserkrilk                  -> Hasermann
     employeenumber         20125471                    ->
     mail                   Zebediah.Haserkrilk@kit.org -> Zebediah.Haserman@pulse.org
     telephonenumber                                    -> 555-555-5555

It is up to you to mold that into a report of your liking.

Update: If you want to store the changes in a CSV file, change it like this:

    my @diff;
    my $aoh = csv (in => "ad.txt", bom => 1, on_in => sub {
        my $sam = $_[1]{samaccountname} or die "No name in AD";
        my $ahr = $hr->{$sam} or die "I got AD data for $sam, not in HR\n";
        push @diff, map { [ $sam, $_, $ahr->{$_}, $_[1]{$_} ] } grep { $ahr->{$_} ne $_[1]{$_} } @keys;
        });
    csv (in => \@diff, out => "diff.csv");

Enjoy, Have FUN! H.Merijn
Re: Comparing two files line by line and exporting the differences from the first file by kcott (Archbishop) on Jul 23, 2018 at 10:58 UTC
G'day jzelkowsz,

Here's a solution using Text::CSV (if you have Text::CSV_XS installed it will run faster) and in-memory files (see open). The input data I used is a verbatim copy of what you posted here.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Text::CSV;

    my ($hr_file, $ad_file, $com_file) = qw{hr.txt ad.txt com.txt};
    my (@col_index, %hr_record_for);

    my $csv = Text::CSV::->new({quote_space => 0})
        or die "Can't instantiate a Text::CSV object: ",
            Text::CSV::->error_diag();

    {
        open my $mem_fh, '<', canonicalise_file_in_memory($hr_file)
            or die "Can't read in-memory file: $!";

        @col_index = @{$csv->getline($mem_fh)};

        while (my $row = $csv->getline($mem_fh)) {
            $hr_record_for{$row->[0]} = $row;
        }
    }

    {
        open my $mem_fh, '<', canonicalise_file_in_memory($ad_file)
            or die "Can't read in-memory file: $!";

        open my $out_fh, '>', $com_file
            or die "Can't write '$com_file': $!";

        (undef) = $csv->getline($mem_fh);

        while (my $row = $csv->getline($mem_fh)) {
            for my $i (1 .. $#col_index) {
                if ($hr_record_for{$row->[0]}[$i] ne $row->[$i]) {
                    $csv->say($out_fh, [
                        $row->[0], $col_index[$i], $hr_record_for{$row->[0]}[$i]
                    ]);
                }
            }
        }
    }

    sub canonicalise_file_in_memory {
        my ($file) = @_;

        open my $fh, '<', $file or die "Can't read '$file': $!";

        my $canon;

        while (<$fh>) {
            chomp if /,$/;
            $canon .= $_;
        }

        return \$canon;
    }

Output:

    $ cat com.txt
    barsu991,mail,Uttiam.Barski@pulse.org
    barsu991,title,Director of Cooks
    walkl003,givenname,Lreblemet
    walkl003,employeenumber,20178941
    walkl003,mail,Lreblemet.Walker@pulse.org
    walkl003,title,Head Cook
    karss001,givenname,Sovyetk
    karss001,mail,Sovyetk.Karsten@pulse.org
    karss001,title,Dishwasher
    karss001,physicaldeliveryoffice,Kitchen of the World
    karss001,streetaddress,205 Willy B. Temple
    karss001,st,WI
    karss001,postalcode,50987
    zingk072,givenname,Kovon
    zingk072,employeenumber,20113578
    zingk072,symphonyemployeetype,IKP
    zingk072,mail,Kovon.Zingerman@pulse.org
    zingk072,manager,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    hutcy231,givenname,Yello
    hutcy231,mail,Yello Hutchinson
    haserz221,sn,Haserkrilk
    haserz221,employeenumber,20125471
    haserz221,mail,Zebediah.Haserkrilk@kit.org
    haserz221,telephonenumber,

-- Ken
Re: Comparing two files line by line and exporting the differences from the first file by AnomalousMonk (Archbishop) on Jul 23, 2018 at 16:06 UTC
Other monks have posted solutions based on Text::CSV to which I think you should pay close attention. This post is not about the OPed code per se, but about a general approach to debugging code.

Are you using warnings and strict with your code? I suspect not. If not, do so (see example code below), then fix the problems these thinking-aids reveal. These modules are useful for all Perl programmers, but especially for novice Perlers.

After being sure warnings and strict are enabled, the next thing to do is to be sure you are getting the data you think you're getting. The statement `($samaccountnameAD,$givennameAD,...,$managerAD)=split(/,$/);` splits a string on a comma that is at the end of the string. The `$` in the `/,$/` regex is an end-of-string anchor; see perlre, perlretut, and perlrequick. You cannot get more than two fields from this `split`, but you're trying to get quite a few fields.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use warnings;
     use strict;
     ;;
     use Data::Dumper;
     ;;
     $_ = 'vv,WWWW,xxx,YY,zzzz,';
     ;;
     my ($v, $w, $x, $y, $z) = split(/,$/);
     print Dumper($v, $w, $x, $y, $z);
     ;;
     print qq{'$v' '$w' '$x' '$y' '$z'};
    "
    $VAR1 = 'vv,WWWW,xxx,YY,zzzz';
    $VAR2 = '';
    $VAR3 = undef;
    $VAR4 = undef;
    $VAR5 = undef;

    Use of uninitialized value in concatenation (.) or string at -e line 1.
    Use of uninitialized value in concatenation (.) or string at -e line 1.
    Use of uninitialized value in concatenation (.) or string at -e line 1.
    'vv,WWWW,xxx,YY,zzzz' '' '' '' ''

You see in this example the exact output of the `split` operation; probably not what you wanted and expected. Try this example again with a string that does not end in a comma character; there is a tiny but significant difference. Try it with an empty string as input.

The example above uses Data::Dumper. This utility for visualizing data can be a sanity-saver. It is a core module (i.e., has been made a part of the standard Perl distribution; see corelist for getting info on core modules). I prefer Data::Dump, but it is not core.

This post addresses just one, small aspect of debugging; there are many more. Good luck.

Give a man a fish: `<%-{-{-{-<`
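To make the `split` point concrete, here is a minimal sketch that contrasts the anchored pattern with a plain comma pattern. The sample string is the one from the one-liner above with a newline added; the variable names are only illustrative. It assigns to arrays rather than to a fixed list of scalars, so `split` drops trailing empty fields unless a negative limit is given.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample record in the style of the one-liner above (assumed, not real data).
my $line = "vv,WWWW,xxx,YY,zzzz,\n";
chomp $line;

# /,$/ matches only a comma at the very end of the string, so the
# record is never broken into its individual fields.
my @anchored = split /,$/, $line;

# /,/ matches every comma; the negative limit keeps trailing empty fields.
my @fields = split /,/, $line, -1;

printf "anchored: %d field(s)\n", scalar @anchored;   # anchored: 1 field(s)
printf "plain:    %d field(s)\n", scalar @fields;     # plain:    6 field(s)
```

A plain `split /,/` is still not enough for the posted data, because the quoted manager value contains commas; that is where the Text::CSV solutions come in.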
Re: Comparing two files line by line and exporting the differences from the first file by Cristoforo (Curate) on Jul 23, 2018 at 02:58 UTC
jzelkowsz,

To get the data in what I believe is the way it is probably presented, I removed the newline immediately following a comma. That way, the entire employee record is on one line. I did not parse the file using Text::CSV as I probably should've. Your data is as follows:

HR data

    samaccountname,givenname,sn,initials,employeenumber,symphonyemployeetype,mail,title,department,company,l,physicaldeliveryoffice,streetaddress,st,postalcode,telephonenumber,manager
    barsu991,Uttiam,Barski,K,20114598,IKP,Uttiam.Barski@pulse.org,Director of Cooks,Day Kitchen,MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,555-555-5555,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    walkl003,Lreblemet,Walker,J,20178941,IKP,Lreblemet.Walker@pulse.org,Head Cook,Day Kitchen,MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,555-555-5551,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    karss001,Sovyetk,Karsten,Y,20146598,IKP,Sovyetk.Karsten@pulse.org,Dishwasher,Day Kitchen,MILIFO,Alpena,Kitchen of the World,205 Willy B. Temple,WI,50987,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    zingk072,Kovon,Zingerman,K,20113578,IKP,Kovon.Zingerman@pulse.org,Baker,Day Kitchen,MILIFO,Alpena,Kitchen of the World,205 Willy B. Temple,WI,50987,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    peizs194,Synthia,Smite,B,20134743,IKP,Synthia.Peizer@pulse.org,Broiler Man,Day Kitchen,MILIFO,Alpena,Kitchen of the World,205 Willy B. Temple,WI,50987,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    hutcy231,Yello,Hutchinson,W,20145712,IKP,Yello Hutchinson,@pulse.org,Bottle Washer,Day Kitchen,MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    haserz221,Zebediah,Haserkrilk,L,20125471,IKP,Zebediah.Haserkrilk@kit.org,Purchaser,Day Kitchen,MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"

AD data

    samaccountname,givenname,sn,initials,employeenumber,symphonyemployeetype,mail,title,department,company,l,physicaldeliveryoffice,streetaddress,st,postalcode,telephonenumber,manager
    barsu991,Uttiam,Barski,K,20114598,IKP,William.Barski@pulse.org,Chief of Cooks,Day Kitchen,MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,555-555-5555,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    walkl003,Larry,Walker,J,,IKP,Larry.Walker@pulse.org,Cook,Day Kitchen,MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,555-555-5551,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    karss001,Steven,Karsten,Y,20146598,IKP,Steven.Karsten@pulse.org,Dishw,Day Kitchen,MILIFO,Alpena,Sully's Kitchen,48720 Belcard,IL,34567,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    zingk072,Kevin,Zingerman,K,,,Kevin.Zingerman@pulse.org,Baker,Day Kitchen,MILIFO,Alpena,Kitchen of the World,205 Willy B. Temple,WI,50987,,peizs194,Samantha,Smith,B,20134743,IKP,Samantha.Smith@pulse.org,"Man, Broiler",Day Kitchen,MILIFO,Alpena,Kitchen of the World,205 Willy B. Temple,WI,50987,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    hutcy231,Yaren,Hutchinson,W,20145712,IKP,Yaren Hutchinson,@pulse.org,Bottle Washer,Day Kitchen,MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"
    haserz221,Zebediah,Hasermann,L,,IKP,Zebediah.Haserman@pulse.org,Purchaser,Day Kitchen,MILIFO,Alpena,Kitchen of the World,400 Baker,WI,50987,555-555-5555,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"

This solution assumes 'samaccountname' is correct between the 2 files and checks that the headers are the same for each file. Also I noticed there were some entries in the AD file that weren't in the HR file. I didn't try to compare those since they weren't in both files. I didn't know how you would handle this situation.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $hr_file = 'HR.txt';
    open my $fh, '<', $hr_file or die $!;

    my (undef, @hdr_hr) = split /,/, <$fh>;
    chomp @hdr_hr;

    my %hr_data;
    while (<$fh>) {
        chomp;
        my ($id, @rest) = split /,/;
        @{ $hr_data{$id} }{@hdr_hr} = @rest;
    }
    close $fh or die $!;

    my $ad_file = 'AD.txt';
    open $fh, '<', $ad_file or die $!;

    my (undef, @hdr_ad) = split /,/, <$fh>;
    chomp @hdr_ad;

    @hdr_ad ~~ @hdr_hr or die "Uncompatible headers between HR and AD files\n";

    my %ad_data;
    while (<$fh>) {
        chomp;
        my ($id, @rest) = split /,/;
        @{ $ad_data{$id} }{@hdr_ad} = @rest;
    }
    close $fh or die $!;

    for my $id (sort keys %hr_data) {
        next unless exists $ad_data{$id};
        for my $hdr (@hdr_hr) {
            my $description_hr = $hr_data{$id}{$hdr};
            my $description_ad = $ad_data{$id}{$hdr};
            print "$id,$hdr,$description_hr\n"
                unless $description_hr eq $description_ad;
        }
    }

Output I got is:

    barsu991,mail,Uttiam.Barski@pulse.org
    barsu991,title,Director of Cooks
    haserz221,sn,Haserkrilk
    haserz221,employeenumber,20125471
    haserz221,mail,Zebediah.Haserkrilk@kit.org
    haserz221,telephonenumber,
    hutcy231,givenname,Yello
    hutcy231,mail,Yello Hutchinson
    karss001,givenname,Sovyetk
    karss001,mail,Sovyetk.Karsten@pulse.org
    karss001,title,Dishwasher
    karss001,physicaldeliveryoffice,Kitchen of the World
    karss001,streetaddress,205 Willy B. Temple
    karss001,st,WI
    karss001,postalcode,50987
    walkl003,givenname,Lreblemet
    walkl003,employeenumber,20178941
    walkl003,mail,Lreblemet.Walker@pulse.org
    walkl003,title,Head Cook
    zingk072,givenname,Kovon
    zingk072,employeenumber,20113578
    zingk072,symphonyemployeetype,IKP
    zingk072,mail,Kovon.Zingerman@pulse.org
    zingk072,manager,"cn=manager1
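The last output line above is cut off at `"cn=manager1` because `split /,/` also splits on the commas inside the quoted manager value. The following is a minimal sketch of a quote-aware splitter, only to illustrate the difference. `split_csv_line` is a hypothetical helper, not part of any module, and it assumes one record per line with balanced double quotes. For real data the Text::CSV approach from the other replies is the safer choice.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one CSV record into
# fields, treating commas inside double quotes as data. It assumes the
# record is on a single line and that its quotes are balanced.
sub split_csv_line {
    my ($line) = @_;
    my @fields;

    # Append a comma so every field, including the last, ends with one.
    $line .= ',';

    while ($line =~ /\G(?:"((?:[^"]|"")*)"|([^,"]*)),/g) {
        my ($quoted, $plain) = ($1, $2);
        if (defined $quoted) {
            $quoted =~ s/""/"/g;    # a doubled quote stands for one quote
            push @fields, $quoted;
        }
        else {
            push @fields, $plain;
        }
    }
    return @fields;
}

# Shortened sample record in the style of the HR data.
my $record = 'zingk072,Baker,,"cn=manager1,ou=users,ou=Kitchen,dc=Kitchen,dc=net"';

my @naive  = split /,/, $record;
my @quoted = split_csv_line($record);

print scalar(@naive),  " fields from split /,/\n";        # 8
print scalar(@quoted), " fields from split_csv_line\n";   # 4
print "manager: $quoted[-1]\n";
```

With the sample record, the naive split produces eight pieces while the quote-aware version returns four fields, the last one being the complete manager value.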
Re^2: Comparing two files line by line and exporting the differences from the first file (updated) by AnomalousMonk (Archbishop) on Jul 23, 2018 at 03:18 UTC
`@hdr_ad ~~ @hdr_hr or die "Uncompatible headers between HR and AD files\n";`

The `~~` smartmatch operator is not encouraged for use in production code because it is "experimental." See Terminology in perlpolicy.

Give a man a fish: `<%-{-{-{-<`
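One way to do the header check without smartmatch is an explicit element-by-element comparison. This is a minimal sketch: `same_headers` is a hypothetical helper name, and the sample header lists are shortened for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: true when both array refs hold the same strings
# in the same order.
sub same_headers {
    my ($left, $right) = @_;
    return 0 unless @$left == @$right;
    for my $i (0 .. $#$left) {
        return 0 if $left->[$i] ne $right->[$i];
    }
    return 1;
}

# Shortened header lists, assumed for illustration only.
my @hdr_hr = qw(givenname sn initials);
my @hdr_ad = qw(givenname sn initials);

same_headers(\@hdr_ad, \@hdr_hr)
    or die "Uncompatible headers between HR and AD files\n";
print "headers match\n";
```

The comparison is order-sensitive, which matches what the original check was used for: both files are expected to list the same columns in the same order.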

