udvk009 has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks, is it possible to compare the contents of two arrays using a checksum algorithm?
Objective: let's say I have two multidimensional arrays that may each have a few hundred thousand rows (say 500,000), and I want to compare them using a checksum to find out whether they differ, e.g. array-1 may have a row-x that is missing from array-2. Conceptually, I want to understand why my checksum function returns the same result for two different arrays. I have tried some sample code with a very small set of data; please advise why the checksum is the same for both arrays.
#!C:\Perl5.16\bin\perl.exe
use Data::Dumper;
use Digest::MD5 qw(md5 md5_hex md5_base64);
my @array1 = (
[1,'John','ABXC12132328'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322']
);
my @array2 = (
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322']
);
#print Dumper(\@array1);
my $ref_array1 = @array1;
my $ref_array2 = @array2;
my $str = md5($ref_array1);
my $str2 = md5($ref_array2);
print "md-check-sum for array1 :: ".unpack('L', $str)."\n";
print "md-check-sum for array2 :: ".unpack('L', $str2)."\n";
The output is shown below:
md-check-sum for array1 :: 2134629092
md-check-sum for array2 :: 2134629092
Re: Checksum on Multidimentional Array - how does it work
by BrowserUk (Patriarch) on Mar 26, 2015 at 11:55 UTC
Because you're not checksumming the arrays. You are checksumming the lengths of the arrays (which are the same):
my $ref_array1 = @array1; ## Assigns the length of @array1 to $ref_array1!!!
my $ref_array2 = @array2; ## Ditto!
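A minimal sketch of the scalar-context behaviour described above, using a cut-down two-row version of the arrays. The shortened data and the names $count1 and $count2 are assumptions for illustration only:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Two arrays with different contents but the same number of rows.
my @array1 = ( [1, 'John', 'ABXC12132328'], [0, 'John', 'ABXC12132322'] );
my @array2 = ( [0, 'John', 'ABXC12132322'], [0, 'John', 'ABXC12132322'] );

# Assigning an array to a scalar evaluates the array in scalar context,
# which yields its element count, not its contents.
my $count1 = @array1;
my $count2 = @array2;

print "count1 = $count1, count2 = $count2\n";    # count1 = 2, count2 = 2

# md5_hex() is therefore handed the string "2" both times.
my $digest1 = md5_hex($count1);
my $digest2 = md5_hex($count2);
print $digest1 eq $digest2 ? "same digest\n" : "different digest\n";
```

Because both arrays have the same number of rows, the two digests are identical no matter what the rows contain.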
To checksum the contents of the arrays, one way would be to serialise them (convert to a string representation):
#!C:\Perl5.16\bin\perl.exe
use Data::Dumper;
use Digest::MD5 qw(md5 md5_hex md5_base64);
my @array1 = (
[1,'John','ABXC12132328'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322']
);
my @array2 = (
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322']
);
#print Dumper(\@array1);
my $ref_array1 = Dumper( \@array1 );
my $ref_array2 = Dumper( \@array2 );
my $str = md5_hex($ref_array1);
my $str2 = md5_hex($ref_array2);
print "md-check-sum for array1 :: " . $str . "\n";
print "md-check-sum for array2 :: " . $str2 . "\n";
__END__
C:\test>1121378
md-check-sum for array1 :: b636a47153af27317478e3bca3632602
md-check-sum for array2 :: a4882627a89775602ab2e33762a70e81
Note also that I've nixed your unpack 'L' stuff, which throws away 3/4 of the information in the 128-bit checksum by converting only the first 32 bits to a number.
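To make the truncation concrete, here is a small sketch. The input string is an arbitrary placeholder, and it uses the big-endian 'N' template rather than the native-order 'L' from the original code, purely so that the number lines up with the start of the hex digest on any platform:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

my $data = 'any serialised data';    # placeholder input

my $raw = md5($data);        # raw digest: 16 bytes = 128 bits
my $hex = md5_hex($data);    # same digest as 32 hex characters

printf "raw digest length: %d bytes\n", length $raw;    # 16
printf "hex digest length: %d chars\n", length $hex;    # 32

# A single 32-bit template ('L' or 'N') consumes only the first 4 bytes;
# the remaining 12 bytes of the digest are ignored.
my ($first32) = unpack 'N', $raw;
my $first32_hex = sprintf '%08x', $first32;

print "first 32 bits : $first32_hex\n";
print "full digest   : $hex\n";
```

The first line of the final pair is just the first eight hex digits of the second, which is all that a 32-bit unpack keeps.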
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Re: Checksum on Multidimentional Array - how does it work
by monkey_boy (Priest) on Mar 26, 2015 at 12:04 UTC
my $ref_array1 = @array1;
my $ref_array2 = @array2;
my $str = md5($ref_array1);
my $str2 = md5($ref_array2);
Both times you are taking a digest of the count of the items in the array, i.e.:
print $ref_array1;
print $ref_array2;
outputs:
5
5
Even if you actually took a reference to these arrays (my $ref_array1 = \@array1;), your solution would still not work, as you would be digesting only the stringified reference (its type plus a memory address) for each of the two named arrays, not their contents.
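A short sketch of why digesting a reference does not help. The one-row arrays and the names $ref2 and $ref3 are assumptions for illustration:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Two separate arrays holding identical data.
my @array2 = ( [0, 'John', 'ABXC12132322'] );
my @array3 = ( [0, 'John', 'ABXC12132322'] );

my $ref2 = \@array2;
my $ref3 = \@array3;

# Used as a string, a reference looks like ARRAY(0x...), where the hex
# part is an address that varies from run to run.
print "ref2 stringifies as $ref2\n";
print "ref3 stringifies as $ref3\n";

# The digests are computed over those address strings, so they differ
# even though the two arrays contain the same rows.
my $digest2 = md5_hex("$ref2");
my $digest3 = md5_hex("$ref3");
print $digest2 eq $digest3 ? "same digest\n" : "different digest\n";
```

So a digest of a reference would report identical arrays as different, which is the opposite failure from digesting the element count.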
The solution here, I suspect, is to serialize the arrays and digest the stringified values, e.g.:
#!/usr/bin/env perl
use Modern::Perl;
use Data::Dumper;
use Digest::MD5 qw(md5 md5_hex md5_base64);
my @array1 = (
[1,'John','ABXC12132328'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322']
);
my @array2 = (
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322'],
[0,'John','ABXC12132322']
);
my @array3 = @array2;
#print Dumper(\@array1);
my $md5_1 = md5_hex(Dumper(\@array1));
my $md5_2 = md5_hex(Dumper(\@array2));
my $md5_3 = md5_hex(Dumper(\@array3));
say 1,' ',$md5_1;
say 2,' ',$md5_2;
say 3,' ',$md5_3;
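A whole-array digest only says whether the two arrays differ. For the original goal of spotting a row that is present in one array but missing from the other, one possible approach is to digest each row separately. This is only a sketch: row_digest and rows_missing_from are hypothetical helper names, and joining fields with "\x1F" assumes that character never occurs in the data and that the data contains no wide characters:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: digest a single row (an array reference).
# Assumes "\x1F" never appears inside a field.
sub row_digest {
    my ($row) = @_;
    return md5_hex( join "\x1F", @$row );
}

# Hypothetical helper: rows of @$left that have no identical row in @$right.
sub rows_missing_from {
    my ($left, $right) = @_;
    my %seen;
    $seen{ row_digest($_) } = 1 for @$right;
    return grep { !$seen{ row_digest($_) } } @$left;
}

my @array1 = (
    [1, 'John', 'ABXC12132328'],
    [0, 'John', 'ABXC12132322'],
    [0, 'John', 'ABXC12132322'],
);
my @array2 = (
    [0, 'John', 'ABXC12132322'],
    [0, 'John', 'ABXC12132322'],
);

my @missing = rows_missing_from( \@array1, \@array2 );
print join( ',', @$_ ), "\n" for @missing;    # 1,John,ABXC12132328
```

Note that this treats the arrays as sets of rows: it ignores row order and how many times a row is repeated.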
This is not a Signature...