Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hello Perl Monks. First time post, long time gleaner of information, which is and has been very appreciated. Anyway, I have an issue that I truly need some assistance with. I am fairly new to hash tables. What I need to do is compare three hash tables, each containing between 14 million and 28 million keys with associated values. Sure, I could nest some loops, let it rip, and come back in a year or so to see if it worked, hoping that no server crashes occurred.
Optimally, I'd like to find the keys common to all three hash tables, gather the values from each of the three hash tables, and create a new, fourth hash table that contains key => [value_hash_1, value_hash_2, value_hash_3]. For a key that does not exist in all three hashes, do nothing; the resulting fourth hash table will have somewhat fewer than 14 million k/v pairs.
Do I make sense? And the solution is most likely posted, but I had difficulty finding it.
Thank you very much for your assistance!
Re: multiple hash compare, find, create
by johngg (Canon) on Dec 10, 2018 at 18:55 UTC
I would choose the smallest of the three hashes and iterate over its keys in a loop, doing an exists on the same key in the other two hashes and moving on to the next key unless it exists in both of the others. If it is in all three, add an array ref under that key in the fourth hash. Something like:-
my %h4;
foreach my $commonKey ( keys %h1 )
{
    next unless exists $h2{ $commonKey } && exists $h3{ $commonKey };
    $h4{ $commonKey } = [
        $h1{ $commonKey }, $h2{ $commonKey }, $h3{ $commonKey }
    ];
}
I hope this is helpful.
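To act on the "choose the smallest hash" advice programmatically, the hash references could be ordered by key count at run time. A minimal sketch, with toy stand-ins for the three big hashes (the names %h1/%h2/%h3 are assumptions):

```perl
use strict;
use warnings;

# Toy stand-ins for the three big hashes; names are assumptions.
my %h1 = ( a => 1, b => 2, c => 3 );
my %h2 = ( a => 4, b => 5, d => 10 );
my %h3 = ( a => 7, b => 8, x => 9 );

# Put the hash with the fewest keys first so we loop over the least data.
my ( $small, $other1, $other2 ) =
    sort { keys %$a <=> keys %$b } \%h1, \%h2, \%h3;

my %h4;
for my $key ( keys %$small ) {
    next unless exists $other1->{$key} && exists $other2->{$key};
    # Values are stored in the original hash order, not scan order.
    $h4{$key} = [ $h1{$key}, $h2{$key}, $h3{$key} ];
}
```

`keys` in scalar context returns the key count, so the `sort` block is a cheap way to find the smallest hash without counting anything by hand.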
Re: multiple hash compare, find, create
by Eily (Monsignor) on Dec 10, 2018 at 18:45 UTC
How do you get such big hashes in the first place? Searching for duplicates shouldn't take much longer than building the hashes (unless done horribly wrong), so if you can build the hashes in a matter of minutes, finding the triplets should also take minutes, not years. Maybe you are actually using tied hashes of some sort, like a database?
Hello Eily. I should have provided a better explanation regarding the creation of the large hash tables. I'm using the Splunk API to pull large datasets overnight. However, it is impossible to pull all of the information we need from Splunk in one query. In addition, the Splunk administrators have incorporated limits on the amount of data that can be pulled and the number of daily searches that any single user can pull. I know, rather draconian. Therefore I am gathering all of the raw data in three separate data pulls from Splunk and making extensive use of Perl to parse the raw flat files into usable .csv files.
I then ingest the three large .csv files and put them into the three hash tables (takes about 6 minutes), hoping to enable efficient grouping into a single hash table with each key associated with three values.
I am stumped
Again:
hash_1 (input)
key => [value_hash_1]
hash_2 (input)
key => [value_hash_2]
hash_3 (input)
key => [value_hash_3]
hash_4 (output)
key => [value_hash_1, value_hash_2, value_hash_3]
I don't know how time sensitive your project is, but even if you can afford to see your program run for 10 minutes, it's always nice to have something faster than that :D. You can actually speed things up a lot by *not* building the three hashes. johngg had a good point when he said that you should base your search on the smallest hash. Going even further you only need to build that one. This would give something like (that's pseudo-perl, but you should get the idea):
my %h1 = $csv1->csv_to_hash; # with $csv1 the smallest file
my %temp;
while (my $line = $csv2->next)
{
    my $key = $line->key;
    # Ignore lines that don't also exist in h1
    # That's way less data to build into the second hash
    next unless exists $h1{$key};
    my $value = $line->value;
    $temp{ $key } = [ $h1{$key}, $value ];
}
%h1 = (); # Free the data from %h1, the matching pairs are all in %temp anyway
my %out;
while (my $line = $csv3->next)
{
    my $key = $line->key;
    # As before, ignore non relevant keys
    next unless exists $temp{$key};
    # We make a copy of the array ref from %temp
    # This means that technically modifying it also changes the content of %temp
    # But it will be deleted anyway
    my $values_array = $temp{$key};
    my $value = $line->value;
    push @$values_array, $value;
    # Add the values into a brand new hash
    # So that it contains only the needed keys
    $out{$key} = $values_array;
}
%temp = (); # The important keys have been copied to %out
You should also consider using Text::CSV if you're not already doing so :)
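Since Text::CSV came up, here is a minimal reading sketch; the file name, the two-column key,value layout, and the sample rows are all assumptions made so the example is self-contained:

```perl
use strict;
use warnings;
use Text::CSV;

# Self-contained sample; in practice this would be one of the large
# Splunk exports. File name and two-column layout are assumptions.
my $file = 'hash_1_sample.csv';
open my $wfh, '>', $file or die "Cannot write $file: $!";
print $wfh "a6fbb013-b75f-4dd7-9d1a-24f566020042,92.1.3\n";
print $wfh "P0118760075,92.1.3\n";
close $wfh;

my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );
open my $fh, '<', $file or die "Cannot open $file: $!";

my %h1;
while ( my $row = $csv->getline($fh) ) {
    my ( $key, $value ) = @$row;
    $h1{$key} = $value;    # last value wins if a key repeats
}
close $fh;
unlink $file;
```

`auto_diag` makes Text::CSV die with a useful message on malformed input instead of silently returning undef, which matters at 14+ million rows.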
Edit: oh, and I set a bad example myself, but you should avoid variable names that are identical except for a number as much as possible. So maybe rename %h1 to %ref, if you don't already have better (distinct) names available for your hashes.
I then ingest the three large .csv files put them into the three hash tables (takes about 6 minutes) ...
I am stumped
If you can fit your three hash tables into memory (and the quoted statement says you're doing just that), then I don't see why Eily's approach here (and johngg's identical approach) would present any implementation problem. The only stumbling block I can see is that the output hash might not also fit into memory. In this case, common key/value-set records could be written out to a file, perhaps a .csv file, for later processing.
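A sketch of that write-as-you-go idea, streaming each common record straight to disk instead of building a fourth in-memory hash (the toy hash contents and the output file name are made up for illustration):

```perl
use strict;
use warnings;

# Toy stand-ins for the three input hashes; names are assumptions.
my %h1 = ( a => 1, b => 2 );
my %h2 = ( a => 4, b => 5, d => 10 );
my %h3 = ( a => 7, b => 8 );

# Write each common record straight to disk instead of building a
# fourth in-memory hash.
my $outfile = 'common.csv';
open my $out, '>', $outfile or die "Cannot open $outfile: $!";
for my $key ( keys %h1 ) {
    next unless exists $h2{$key} && exists $h3{$key};
    print $out join( ',', $key, $h1{$key}, $h2{$key}, $h3{$key} ), "\n";
}
close $out;
```

Memory use stays bounded by the three input hashes; the combined records never exist in RAM at once.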
If the problem you're facing is that the input data is so large that even one input hash won't fit into memory, then an approach of doing an "external" sort of each input file and then merging the sorted input files with a Perl script suggests itself. This approach scales well to input files of enormous size, far larger than anything that could be accommodated in system RAM, and large numbers of input files, and can still be executed in minutes (in most cases) rather than hours.
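The merge step of that sort-then-merge approach can be sketched as follows, assuming each input has already been externally sorted by key (e.g. with the system sort utility), one "key,value" line per record, and each key appearing at most once per file; the demo data stands in for three sorted files on disk:

```perl
use strict;
use warnings;

# Read one "key,value" record from a filehandle; returns undef at EOF.
sub next_rec {
    my ($fh) = @_;
    my $line = <$fh>;
    return undef if !defined $line;    # explicit undef keeps slots aligned
    chomp $line;
    return [ split /,/, $line, 2 ];
}

# Merge three streams pre-sorted by key (a key at most once per file),
# keeping only keys present in all three.
sub merge_sorted {
    my @fhs = @_;
    my @cur = map { next_rec($_) } @fhs;
    my @out;
    while ( $cur[0] && $cur[1] && $cur[2] ) {
        my ($min) = sort map { $_->[0] } @cur;    # smallest current key
        if ( $cur[0][0] eq $min && $cur[1][0] eq $min && $cur[2][0] eq $min ) {
            push @out, [ $min, map { $_->[1] } @cur ];
            $cur[$_] = next_rec( $fhs[$_] ) for 0 .. 2;
        }
        else {
            # Advance only the streams sitting on the smallest key.
            for my $i ( 0 .. 2 ) {
                $cur[$i] = next_rec( $fhs[$i] ) if $cur[$i][0] eq $min;
            }
        }
    }
    return @out;
}

# Tiny in-memory demo standing in for three sorted files on disk:
open my $f1, '<', \"a,1\nb,2\nc,3\n"  or die $!;
open my $f2, '<', \"a,4\nb,5\nd,10\n" or die $!;
open my $f3, '<', \"a,7\nb,8\nx,9\n"  or die $!;
my @common = merge_sorted( $f1, $f2, $f3 );
```

Only three records (one per stream) are held in memory at any moment, which is what lets this scale to inputs far larger than RAM.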
hash_1 (input)
key => [value_hash_1]
If this is the structure of your input hashes, I don't see why you're going to the extra (and superfluous) effort of putting each value into an anonymous array — and then taking it out again when you create the output hash value. In what you've shown us so far, each input hash key has only a single value; why stick it into an array?
Give a man a fish: <%-{-{-{-<
Re: multiple hash compare, find, create
by kcott (Archbishop) on Dec 11, 2018 at 07:54 UTC
Reading through the various bits of information posted here, it seems your current process is:
-
Capture data in 3 raw files
-
Convert those to 3 CSV files
-
Extract data from those into 3 hashes
-
Combine keys with 3 values into 1 new hash
That would seem to be a lot of unnecessary work.
Do you have some additional use for the intermediary CSV files?
Do you have some additional use for the intermediary hashes?
I don't know what your initial raw data looks like.
I dummied up 3 files (pm_1227048_raw_1, pm_1227048_raw_2 & pm_1227048_raw_3)
from the data posted in "Re^2: multiple hash compare, find, create".
Each contains an exact copy of what's posted there; for example,
$ cat pm_1227048_raw_1 | head -3
a6fbb013-b75f-4dd7-9d1a-24f566020042 => 92.1.3
b6a4c433-72a5-4e1a-b378-4a6b72531ded => 92.1.3
P0118760075 => 92.1.3
Unfortunately, while some keys had 2 associated values, none had 3 values.
I've added an extra step, in the script below, to show the technique.
If your real data has keys with 3 values, you can dispense with that extra step.
The comments should make this clear.
Here's the example script to show the technique.
Each raw data file is parsed once to create one hash.
When all data has been collected, one delete statement
removes the unwanted key-value pairs.
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use Data::Dump;

my @files = qw{
    pm_1227048_raw_1
    pm_1227048_raw_2
    pm_1227048_raw_3
};

my %trio;

for my $file (@files) {
    open my $fh, '<', $file;
    while (<$fh>) {
        chomp;
        my ($k, $v) = split / => /;
        push @{$trio{$k}}, $v;
    }
}

# This step for demonstration purposes only
print "All data:\n";
dd \%trio;

# Extra step due to poor input
print "Data with 2 or more values:\n";
delete @trio{grep @{$trio{$_}} < 2, keys %trio};
dd \%trio;

# Only this step required with better input
print "Data with 3 or more values:\n";
delete @trio{grep @{$trio{$_}} < 3, keys %trio};
dd \%trio;
Output:
All data:
{
  "06bbe788-e57d-4eda-98ea-74d8a45a0e56" => ["3.2p10s1"],
  "08141110-f817-4c16-bf7b-8d0e6696a95b" => ["91.1.2"],
  "205dae51-ea2e-4db9-ace1-315b940686e6" => ["91.1.2", 8299600120059407],
  "29568879-fcca-4dc6-86be-3c8c86ef26db" => [8497101420498122],
  "2e530dc0-a164-4c06-ae18-332eb6778ebd" => ["3.2p10s1"],
  "37d6871a-3abc-44ee-819a-eea33440b0a4" => ["3.1p7s7"],
  "55ccc30e-3566-4a00-b219-4b084487384c" => ["92.1.3", 8498350112590600],
  "5f356d12-0213-4d5f-8fe7-08fe1d2a35d9" => ["3.2p10s2"],
  "64bc2611-38a6-4a59-8d80-4f49b7a76f69" => ["92.1.3"],
  "6ad8af7c-c56b-480f-bed6-9591b80cf634" => ["3.0p9s1"],
  "71be5e75-edad-4889-9261-e1ffa89e393f" => ["92.1.3", 8773103940365745],
  "97891097-70d7-4273-b1ae-3b88b460d591" => ["3.2p8s3"],
  "9d986ace-2504-4595-bdbb-1899812e9d54" => ["91.1.2", 8773103910046051],
  "a6fbb013-b75f-4dd7-9d1a-24f566020042" => ["92.1.3", 8499101090018240],
  "a915d30a-541c-4f5e-9b2f-297352f7e19c" => ["92.1.3", 8777703187635225],
  "b2c6e317-2072-4e3a-9278-5f76af49221a" => [8499102590027251],
  "b4206f77-25e9-4ccd-b434-2237360f1f8c" => ["3.1p10s1"],
  "b6a4c433-72a5-4e1a-b378-4a6b72531ded" => ["92.1.3"],
  "c6e0b7c8-4999-4e83-b7d9-c28a62613614" => ["92.1.3", 8495741441414558],
  "c8b2958f-7777-45e2-929a-adbe41f5055f" => ["3.2p10s1"],
  "dff7f963-ec15-440a-9150-b61f55afe8a4" => ["3.2p9s1"],
  "e173efe4-76f8-47fa-9923-500a3fe9715d" => ["92.1.0", 8773102120218526],
  "P0107577526" => ["3.2p3s1"],
  "P0112055731" => ["3.2p10s1"],
  "P0116761501" => ["3.2p10s1"],
  "P0118760075" => ["92.1.3", 8495840020455261],
  "P0127439637" => ["92.1.3", 8155600386311784],
  "P0127646016" => ["3.2p10s1"],
  "P0128132579" => ["3.2p10s1"],
  "P0128193326" => [8993110670064343],
  "P0130482072" => ["92.1.3", 8499100024861022],
}
Data with 2 or more values:
{
  "205dae51-ea2e-4db9-ace1-315b940686e6" => ["91.1.2", 8299600120059407],
  "55ccc30e-3566-4a00-b219-4b084487384c" => ["92.1.3", 8498350112590600],
  "71be5e75-edad-4889-9261-e1ffa89e393f" => ["92.1.3", 8773103940365745],
  "9d986ace-2504-4595-bdbb-1899812e9d54" => ["91.1.2", 8773103910046051],
  "a6fbb013-b75f-4dd7-9d1a-24f566020042" => ["92.1.3", 8499101090018240],
  "a915d30a-541c-4f5e-9b2f-297352f7e19c" => ["92.1.3", 8777703187635225],
  "c6e0b7c8-4999-4e83-b7d9-c28a62613614" => ["92.1.3", 8495741441414558],
  "e173efe4-76f8-47fa-9923-500a3fe9715d" => ["92.1.0", 8773102120218526],
  "P0118760075" => ["92.1.3", 8495840020455261],
  "P0127439637" => ["92.1.3", 8155600386311784],
  "P0130482072" => ["92.1.3", 8499100024861022],
}
Data with 3 or more values:
{}
Re: multiple hash compare, find, create
by stevieb (Canon) on Dec 10, 2018 at 18:37 UTC
Here's one way, although it does iterate over all three hashes twice. My brain hasn't quite kicked in yet this Monday morning, so I'm likely overlooking a more efficient way to do it:
First, create a temporary hash that holds a count for each key; then, using that hash and the appropriate keys (the ones that have been counted at least three times), append the values from all three hashes into the new %combined hash:
use warnings;
use strict;
use Data::Dumper;

my $num_hashes = 3;

my %h1 = (
    a => 1,
    b => 2,
    c => 3,
);
my %h2 = (
    a => 4,
    b => 5,
    d => 10,
);
my %h3 = (
    x => 20,
    p => 15,
    b => 6,
    a => 12,
);

my %key_list;

for my $hash (\%h1, \%h2, \%h3){
    $key_list{$_}++ for keys %$hash;
}

my %combined;

for my $hash (\%h1, \%h2, \%h3){
    for (keys %key_list){
        next if $key_list{$_} < $num_hashes;
        push @{ $combined{$_} }, $hash->{$_};
    }
}

print Dumper \%combined;
Output:
$VAR1 = {
          'b' => [
                   2,
                   5,
                   6
                 ],
          'a' => [
                   1,
                   4,
                   12
                 ]
        };
use strict;
use warnings;
use Data::Dump 'pp';

my %h1 = (
    a => 1,
    b => 2,
    c => 3,
);
my %h2 = (
    a => 4,
    b => 5,
    d => 10,
);
my %h3 = (
    x => 20,
    p => 15,
    b => 6,
    a => 12,
);

my %out;
for my $key (keys %h1)
{
    next unless exists $h2{$key} and exists $h3{$key};
    $out{$key} = [ $h1{$key}, $h2{$key}, $h3{$key} ];
}
pp \%out;

__END__
{ a => [1, 4, 12], b => [2, 5, 6] }
But seeing the size of the input hashes, I'd say it might be better to do that while constructing the hashes (e.g., first build %h1, then iterate through %h2 but only insert the keys that already exist in %h1, and then do the same for %h3), if they aren't tied hashes in the first place.
D'oh!
I don't know how I didn't think of that! If the first hash's keys don't exist in hash2 or hash3, then the key doesn't exist in all three, does it. Likewise, if hash2 or hash3 have keys hash1 doesn't, same. ;) Sheesh, leave me alone and go home, Monday...
OP... This answer by Eily is a far better, more efficient, and more elegant approach than what I presented.
Thanks @stevieb! I appreciate you looking at this. I might build a small test case and try this out. Thanks again! EDIT: Actually, I like this method. Giving it a go now!
Re: multiple hash compare, find, create
by supertaco (Novice) on Dec 11, 2018 at 13:33 UTC
Hello Perl Monks. Thanks for the help yesterday solving my issue. I implemented Cristoforo's solution and it worked like a charm. However, as I have time, I will work the other solutions into my code; I'm learning a lot from everyone here and I greatly appreciate your assistance. Thanks again and I hope to contribute to the knowledge base in the near future.
BTW, the entire script processes three hash tables, each with between 14 million and 28 million k/v pairs, and outputs a fourth hash table in the same size realm in about 20 minutes. Perl Monks rock!
Re: multiple hash compare, find, create
by Cristoforo (Curate) on Dec 10, 2018 at 18:57 UTC
I don't know how well this would scale on sets of 14 million keys, but this might be a possible solution using Set::Scalar.
Yes, Eily had the most efficient solution. There's no need to create the 3 sets :-(
use Set::Scalar;

my %hash_4;
my $set_1 = Set::Scalar->new(keys %hash_1);
my $set_2 = Set::Scalar->new(keys %hash_2);
my $set_3 = Set::Scalar->new(keys %hash_3);

my $intersect = $set_1 * $set_2 * $set_3;

for my $key (@$intersect) {
    push @{$hash_4{$key}}, $hash_1{$key}, $hash_2{$key}, $hash_3{$key};
}
[root@WarGame2 BRRX]# time ./a.pl
26662441 xre hash size -- 26662441 xre hash keys
14700586 rdk hash size -- 14700586 rdk hash keys
23405918 bid hash size -- 23405918 bid hash keys
doing the set scalar stuff
print "doing the set scalar stuff\n";
use Set::Scalar;

my %output;
my $ridxre = Set::Scalar->new(keys %ridxre);
my $ridrdk = Set::Scalar->new(keys %ridrdk);
my $ridbid = Set::Scalar->new(keys %ridbid);

my $intersect = $ridxre * $ridrdk * $ridbid;

for my $key (@$intersect) {
    push @{$output{$key}}, $ridxre{$key}, $ridrdk{$key}, $ridbid{$key};
}

print "writing OUT.txt file\n";
open(OUT, '>', $outfile1) || die("cannot open $outfile1\n");
while (my ($k, $v) = each %output) {
    print OUT "$k ";
    print OUT "$_ " for @{ $output{$k} };
    print OUT "\n";
}
close(OUT);
Re: multiple hash compare, find, create
by LanX (Saint) on Dec 10, 2018 at 23:28 UTC
Re: multiple hash compare, find, create
by thanos1983 (Parson) on Dec 10, 2018 at 22:13 UTC
Hello Anonymous Monk,
I see the monks have replied to your question. I saw the question 2-3 hours ago and started working on it, but I got busy with something else work-related.
Anyway, just for fun (it took so much time to prepare the hashes :P).
Why is the key "a6fbb013-b75f-4dd7-9d1a-24f566020042" present in the output hash when it seems to appear only in the %hash_1 and %hash_3 input hashes and not in the %hash_2 input hash? Why is this key present in the value-set array for this key in the output hash? (Indeed, every key in the output hash seems to be present in each associated value-set array.) This does not seem to comport with the output that supertaco wants.
I would follow the same approach as fellow Monk stevieb.
But stevieb has endorsed Eily's approach.
Give a man a fish: <%-{-{-{-<
#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;

my %h1 = ( a => 1,
           b => 2,
           c => 3, );

my %h2 = ( a => 4,
           b => 5,
           d => 10, );

my %h3 = ( x => 20,
           p => 15,
           b => 6,
           a => 12, );

my %HoA;

foreach my $key (keys %h1) {
    push(@{$HoA{$key}}, $h1{$key}, $h2{$key}, $h3{$key})
        if exists $h2{$key} and exists $h3{$key};
}

print Dumper \%HoA;
__END__
$ perl test.pl
$VAR1 = {
          'b' => [
                   2,
                   5,
                   6
                 ],
          'a' => [
                   1,
                   4,
                   12
                 ]
        };
Though Eily's approach is better: using unless, it checks whether the key exists in the other hashes and, if not, skips ahead, so there is no need to proceed and waste resources.
Thanks for spending the time to check and correct me :)
Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: multiple hash compare, find, create
by Anonymous Monk on Dec 10, 2018 at 17:17 UTC
A bit of sample input and output, please?
Hi there and thanks!
hash_1
a6fbb013-b75f-4dd7-9d1a-24f566020042 => 92.1.3
b6a4c433-72a5-4e1a-b378-4a6b72531ded => 92.1.3
P0118760075 => 92.1.3
9d986ace-2504-4595-bdbb-1899812e9d54 => 91.1.2
55ccc30e-3566-4a00-b219-4b084487384c => 92.1.3
P0127439637 => 92.1.3
08141110-f817-4c16-bf7b-8d0e6696a95b => 91.1.2
71be5e75-edad-4889-9261-e1ffa89e393f => 92.1.3
c6e0b7c8-4999-4e83-b7d9-c28a62613614 => 92.1.3
e173efe4-76f8-47fa-9923-500a3fe9715d => 92.1.0
P0130482072 => 92.1.3
a915d30a-541c-4f5e-9b2f-297352f7e19c => 92.1.3
64bc2611-38a6-4a59-8d80-4f49b7a76f69 => 92.1.3
205dae51-ea2e-4db9-ace1-315b940686e6 => 91.1.2
hash_2
5f356d12-0213-4d5f-8fe7-08fe1d2a35d9 => 3.2p10s2
dff7f963-ec15-440a-9150-b61f55afe8a4 => 3.2p9s1
P0107577526 => 3.2p3s1
P0112055731 => 3.2p10s1
06bbe788-e57d-4eda-98ea-74d8a45a0e56 => 3.2p10s1
P0127646016 => 3.2p10s1
b4206f77-25e9-4ccd-b434-2237360f1f8c => 3.1p10s1
97891097-70d7-4273-b1ae-3b88b460d591 => 3.2p8s3
c8b2958f-7777-45e2-929a-adbe41f5055f => 3.2p10s1
6ad8af7c-c56b-480f-bed6-9591b80cf634 => 3.0p9s1
2e530dc0-a164-4c06-ae18-332eb6778ebd => 3.2p10s1
P0116761501 => 3.2p10s1
37d6871a-3abc-44ee-819a-eea33440b0a4 => 3.1p7s7
P0128132579 => 3.2p10s1
hash_3
P0128193326 => 8993110670064343
a6fbb013-b75f-4dd7-9d1a-24f566020042 => 8499101090018240
29568879-fcca-4dc6-86be-3c8c86ef26db => 8497101420498122
9d986ace-2504-4595-bdbb-1899812e9d54 => 8773103910046051
P0118760075 => 8495840020455261
55ccc30e-3566-4a00-b219-4b084487384c => 8498350112590600
P0127439637 => 8155600386311784
71be5e75-edad-4889-9261-e1ffa89e393f => 8773103940365745
c6e0b7c8-4999-4e83-b7d9-c28a62613614 => 8495741441414558
e173efe4-76f8-47fa-9923-500a3fe9715d => 8773102120218526
P0130482072 => 8499100024861022
a915d30a-541c-4f5e-9b2f-297352f7e19c => 8777703187635225
205dae51-ea2e-4db9-ace1-315b940686e6 => 8299600120059407
b2c6e317-2072-4e3a-9278-5f76af49221a => 8499102590027251
desired output (just blur your eyes and assume each key has the correct values. I faked the output.)
a6fbb013-b75f-4dd7-9d1a-24f566020042 => 92.1.3,3.2p9s1,3399102590027251
b6a4c433-72a5-4e1a-b378-4a6b72531ded => 92.1.3,3.0p10s1,8499102596789209
P0118760075 => 92.1.1,3.2p10s1,8499102592767903
9d986ace-2504-4595-bdbb-1899812e9d54 => 91.1.2,3.2p8s1,8499102592678972
55ccc30e-3566-4a00-b219-4b084487384c => 92.2.2,3.0p10s1,9863807290027251
P0127439637 => 92.1.3,3.0p5s1,8499102590026782
08141110-f817-4c16-bf7b-8d0e6696a95b => 91.1.2,3.2p10s1,7890987678265278
71be5e75-edad-4889-9261-e1ffa89e393f => 92.1.3,3.2p9s1,9090987627625279
c6e0b7c8-4999-4e83-b7d9-c28a62613614 => 92.1.2,3.1p9s4,7265627827892876
e173efe4-76f8-47fa-9923-500a3fe9715d => 92.1.0,3.2p10s1,8420887625190982
P0130482072 => 92.1.3,3.2p10s1,8499102537809876
a915d30a-541c-4f5e-9b2f-297352f7e19c => 92.1.3,3.2p10s2,1876546789087654
64bc2611-38a6-4a59-8d80-4f49b7a76f69 => 92.0.3,3.2p8s1,8492678908765256
205dae51-ea2e-4db9-ace1-315b940686e6 => 91.1.2,3.2p10s2,8499102700098762