multiple hash compare, find, create

by Anonymous Monk
on Dec 10, 2018 at 17:12 UTC ( [id://1227048]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks. First time post, long time gleaner of information, which has been very much appreciated. Anyway, I have an issue that I truly need some assistance with. I am fairly new to hash tables. What I need to do is compare three hash tables, each containing between 14 million and 28 million keys with associated values. Sure, I could nest some loops, let it rip, and come back in a year or so to see if it worked, hoping that no server crashes occurred.

Optimally, I'd like to find the common keys between the three hash tables, gather the values from each of the three, and create a new, fourth hash table that contains key => [value_hash_1, value_hash_2, value_hash_3]. For a key that does not exist in all three hashes, do nothing; the resulting fourth hash table will have somewhat fewer than 14 million k/v pairs.

Do I make sense? And the solution is most likely posted, but I had difficulty finding it.

Thank you very much for your assistance!

Replies are listed 'Best First'.
Re: multiple hash compare, find, create
by johngg (Canon) on Dec 10, 2018 at 18:55 UTC

    I would choose the smallest of the three hashes and iterate over its keys in a loop, doing an exists on the same key in the other two hashes and moving on to the next key unless it exists in both of the others. If it is in all three, add an array ref under that key in the fourth hash. Something like:-

    my %h4;

    foreach my $commonKey ( keys %h1 )
    {
        next unless exists $h2{ $commonKey } && exists $h3{ $commonKey };
        $h4{ $commonKey } =
            [ $h1{ $commonKey }, $h2{ $commonKey }, $h3{ $commonKey } ];
    }

    I hope this is helpful.

    Cheers,

    JohnGG

Re: multiple hash compare, find, create
by Eily (Monsignor) on Dec 10, 2018 at 18:45 UTC

    How do you get such big hashes in the first place? Searching for duplicates shouldn't be much longer than building the hashes (unless done horribly wrong), so if you can build the hashes in a matter of minutes, finding the triplets should also take more minutes, not years. Maybe you are actually using tied hashes of some sort, like a database?

      Hello Eily. I should have provided a better explanation regarding the creation of the large hash tables. I'm using the Splunk API to pull large datasets overnight. However, it is impossible to pull all of the information we need from Splunk in one query. In addition, the Splunk administrators have incorporated limits on the amount of data that can be pulled and the number of daily searches that any single user can run. I know, rather draconian. Therefore I am gathering all of the raw data in three separate data pulls from Splunk and making extensive use of Perl to parse the raw flat files into usable .csv files.

      I then ingest the three large .csv files and put them into the three hash tables (takes about 6 minutes), hoping to enable efficient grouping into a single hash table with each key associated with three values.

      I am stumped

      Again:

      hash_1 (input)

      key => [value_hash_1]

      hash_2 (input)

      key => [value_hash_2]

      hash_3 (input)

      key => [value_hash_3]

      hash_4 (output)

      key => [value_hash_1, value_hash_2, value_hash_3]

        I don't know how time sensitive your project is, but even if you can afford to see your program run for 10 minutes, it's always nice to have something faster than that :D. You can actually speed things up a lot by *not* building the three hashes. johngg had a good point when he said that you should base your search on the smallest hash. Going even further you only need to build that one. This would give something like (that's pseudo-perl, but you should get the idea):

        my %h1 = $csv1->csv_to_hash; # with $csv1 the smallest file

        my %temp;
        while (my $line = $csv2->next)
        {
            my $key = $line->key;
            # Ignore lines that don't also exist in h1
            # That's way less data to build into the second hash
            next unless exists $h1{$key};
            my $value = $line->value;
            $temp{ $key } = [ $h1{$key}, $value ];
        }
        %h1 = (); # Free the data from %h1, the matching pairs are all in %temp anyway

        my %out;
        while (my $line = $csv3->next)
        {
            my $key = $line->key;
            # As before, ignore non relevant keys
            next unless exists $temp{$key};
            # We make a copy of the array ref from %temp
            # This means that technically modifying it also changes the content of %temp
            # But it will be deleted anyway
            my $values_array = $temp{$key};
            my $value = $line->value;
            push @$values_array, $value;
            # Add the values into a brand new hash
            # So that it contains only the needed keys
            $out{$key} = $values_array;
        }
        %temp = (); # The important keys have been copied to %out

        You should also consider using Text::CSV if you're not already doing so :)

        Edit: oh, and I set a bad example myself there, but you should avoid variable names that are identical except for a number as much as possible. So maybe rename %h1 to %ref, if you don't already have better (distinct) names available for your hashes.
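
        For concreteness, here is a minimal runnable sketch of the same streaming idea using Text::CSV. The file names and the two-column (key, value) layout are assumptions for illustration, not something from the original post:

        use strict;
        use warnings;
        use Text::CSV;

        my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );

        # Pass 1: load only the smallest file into a hash.
        my %ref;
        open my $fh1, '<', 'smallest.csv' or die $!;    # assumed file name
        while ( my $row = $csv->getline( $fh1 ) ) {
            $ref{ $row->[0] } = $row->[1];
        }
        close $fh1;

        # Pass 2: keep only keys that also appear in the second file.
        my %temp;
        open my $fh2, '<', 'second.csv' or die $!;      # assumed file name
        while ( my $row = $csv->getline( $fh2 ) ) {
            next unless exists $ref{ $row->[0] };
            $temp{ $row->[0] } = [ $ref{ $row->[0] }, $row->[1] ];
        }
        close $fh2;
        %ref = ();    # free the first hash

        # Pass 3: keys must also appear in the third file.
        my %out;
        open my $fh3, '<', 'third.csv' or die $!;       # assumed file name
        while ( my $row = $csv->getline( $fh3 ) ) {
            next unless exists $temp{ $row->[0] };
            $out{ $row->[0] } = [ @{ $temp{ $row->[0] } }, $row->[1] ];
        }
        close $fh3;
        %temp = ();   # everything of interest is now in %out

        The peak memory footprint is then one full hash for the smallest file plus the shrinking intersection, rather than three full hashes.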

        I then ingest the three large .csv files put them into the three hash tables (takes about 6 minutes) ...

        I am stumped

        If you can fit your three hash tables into memory (and the quoted statement says you're doing just that), then I don't see why Eily's approach here (and johngg's identical approach) would present any implementation problem. The only stumbling block I can see is that the output hash might not also fit into memory. In this case, common key/value-set records could be written out to a file, perhaps a .csv file, for later processing.

        If the problem you're facing is that the input data is so large that even one input hash won't fit into memory, then an approach of doing an "external" sort of each input file and then merging the sorted input files with a Perl script suggests itself. This approach scales well to input files of enormous size, far larger than anything that could be accommodated in system RAM, and large numbers of input files, and can still be executed in minutes (in most cases) rather than hours.
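
        As a rough sketch of that sort/merge approach: assume each input has already been sorted on its key (e.g. with the system sort(1)) into key,value lines with unique keys per file; the file names here are made up. The three sorted streams are read in lockstep, and a key is emitted only when all three streams agree on it:

        use strict;
        use warnings;

        my @paths = qw( sorted_1.csv sorted_2.csv sorted_3.csv );   # assumed names
        my @fh    = map { open my $f, '<', $_ or die "$_: $!"; $f } @paths;

        my ( @key, @val );

        # Read the next key,value pair from stream $i; an undef key signals EOF.
        sub advance {
            my $i    = shift;
            my $line = readline $fh[$i];
            if ( defined $line ) {
                chomp $line;
                ( $key[$i], $val[$i] ) = split /,/, $line, 2;
            }
            else {
                $key[$i] = undef;
            }
        }

        advance( $_ ) for 0 .. 2;

        while ( defined $key[0] && defined $key[1] && defined $key[2] ) {
            my ( $max ) = sort { $b cmp $a } @key;    # largest current key
            if ( $key[0] eq $max && $key[1] eq $max && $key[2] eq $max ) {
                print join( ' ', $max, @val ), "\n";  # key is in all three
                advance( $_ ) for 0 .. 2;
            }
            else {
                # Advance any stream that is still behind the largest key.
                advance( $_ ) for grep { $key[$_] lt $max } 0 .. 2;
            }
        }

        Only one line per file is held in memory at a time, which is what lets this scale past RAM.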

        hash_1 (input)

        key => [value_hash_1]

        If this is the structure of your input hashes, I don't see why you're going to the extra (and superfluous) effort of putting each value into an anonymous array — and then taking it out again when you create the output hash value. In what you've shown us so far, each input hash key has only a single value; why stick it into an array?
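
        In other words (a hedged illustration, using the thread's hypothetical hash names):

        # Store the single value directly ...
        $hash_1{ $key } = $value;          # e.g. "92.1.3"
        # ... not wrapped in a one-element anonymous array:
        # $hash_1{ $key } = [ $value ];    # forces unwrapping later
        # The combined hash can then be built without dereferencing:
        $hash_4{ $key } = [ $hash_1{ $key }, $hash_2{ $key }, $hash_3{ $key } ];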


        Give a man a fish:  <%-{-{-{-<

Re: multiple hash compare, find, create
by kcott (Archbishop) on Dec 11, 2018 at 07:54 UTC

    Reading through the various bits of information posted here, it seems your current process is:

    1. Capture data in 3 raw files
    2. Convert those to 3 CSV files
    3. Extract data from those into 3 hashes
    4. Combine keys with 3 values into 1 new hash

    That would seem to be a lot of unnecessary work. Do you have some additional use for the intermediary CSV files? Do you have some additional use for the intermediary hashes?

    I don't know what your initial raw data looks like. I dummied up 3 files (pm_1227048_raw_1, pm_1227048_raw_2 & pm_1227048_raw_3) from the data posted in "Re^2: multiple hash compare, find, create". Each contains an exact copy of what's posted there; for example,

    $ cat pm_1227048_raw_1 | head -3
    a6fbb013-b75f-4dd7-9d1a-24f566020042 => 92.1.3
    b6a4c433-72a5-4e1a-b378-4a6b72531ded => 92.1.3
    P0118760075 => 92.1.3

    Unfortunately, while some keys had 2 associated values, none had 3 values. I've added an extra step, in the script below, to show the technique. If your real data has keys with 3 values, you can dispense with that extra step. The comments should make this clear.

    Here's the example script to show the technique. Each raw data file is parsed once to create one hash. When all data has been collected, one delete statement removes the unwanted key-value pairs.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    use Data::Dump;

    my @files = qw{ pm_1227048_raw_1 pm_1227048_raw_2 pm_1227048_raw_3 };
    my %trio;

    for my $file (@files) {
        open my $fh, '<', $file;
        while (<$fh>) {
            chomp;
            my ($k, $v) = split / => /;
            push @{$trio{$k}}, $v;
        }
    }

    # This step for demonstration purposes only
    print "All data:\n";
    dd \%trio;

    # Extra step due to poor input
    print "Data with 2 or more values:\n";
    delete @trio{grep @{$trio{$_}} < 2, keys %trio};
    dd \%trio;

    # Only this step required with better input
    print "Data with 3 or more values:\n";
    delete @trio{grep @{$trio{$_}} < 3, keys %trio};
    dd \%trio;

    Output:

    All data:
    {
      "06bbe788-e57d-4eda-98ea-74d8a45a0e56" => ["3.2p10s1"],
      "08141110-f817-4c16-bf7b-8d0e6696a95b" => ["91.1.2"],
      "205dae51-ea2e-4db9-ace1-315b940686e6" => ["91.1.2", 8299600120059407],
      "29568879-fcca-4dc6-86be-3c8c86ef26db" => [8497101420498122],
      "2e530dc0-a164-4c06-ae18-332eb6778ebd" => ["3.2p10s1"],
      "37d6871a-3abc-44ee-819a-eea33440b0a4" => ["3.1p7s7"],
      "55ccc30e-3566-4a00-b219-4b084487384c" => ["92.1.3", 8498350112590600],
      "5f356d12-0213-4d5f-8fe7-08fe1d2a35d9" => ["3.2p10s2"],
      "64bc2611-38a6-4a59-8d80-4f49b7a76f69" => ["92.1.3"],
      "6ad8af7c-c56b-480f-bed6-9591b80cf634" => ["3.0p9s1"],
      "71be5e75-edad-4889-9261-e1ffa89e393f" => ["92.1.3", 8773103940365745],
      "97891097-70d7-4273-b1ae-3b88b460d591" => ["3.2p8s3"],
      "9d986ace-2504-4595-bdbb-1899812e9d54" => ["91.1.2", 8773103910046051],
      "a6fbb013-b75f-4dd7-9d1a-24f566020042" => ["92.1.3", 8499101090018240],
      "a915d30a-541c-4f5e-9b2f-297352f7e19c" => ["92.1.3", 8777703187635225],
      "b2c6e317-2072-4e3a-9278-5f76af49221a" => [8499102590027251],
      "b4206f77-25e9-4ccd-b434-2237360f1f8c" => ["3.1p10s1"],
      "b6a4c433-72a5-4e1a-b378-4a6b72531ded" => ["92.1.3"],
      "c6e0b7c8-4999-4e83-b7d9-c28a62613614" => ["92.1.3", 8495741441414558],
      "c8b2958f-7777-45e2-929a-adbe41f5055f" => ["3.2p10s1"],
      "dff7f963-ec15-440a-9150-b61f55afe8a4" => ["3.2p9s1"],
      "e173efe4-76f8-47fa-9923-500a3fe9715d" => ["92.1.0", 8773102120218526],
      "P0107577526" => ["3.2p3s1"],
      "P0112055731" => ["3.2p10s1"],
      "P0116761501" => ["3.2p10s1"],
      "P0118760075" => ["92.1.3", 8495840020455261],
      "P0127439637" => ["92.1.3", 8155600386311784],
      "P0127646016" => ["3.2p10s1"],
      "P0128132579" => ["3.2p10s1"],
      "P0128193326" => [8993110670064343],
      "P0130482072" => ["92.1.3", 8499100024861022],
    }
    Data with 2 or more values:
    {
      "205dae51-ea2e-4db9-ace1-315b940686e6" => ["91.1.2", 8299600120059407],
      "55ccc30e-3566-4a00-b219-4b084487384c" => ["92.1.3", 8498350112590600],
      "71be5e75-edad-4889-9261-e1ffa89e393f" => ["92.1.3", 8773103940365745],
      "9d986ace-2504-4595-bdbb-1899812e9d54" => ["91.1.2", 8773103910046051],
      "a6fbb013-b75f-4dd7-9d1a-24f566020042" => ["92.1.3", 8499101090018240],
      "a915d30a-541c-4f5e-9b2f-297352f7e19c" => ["92.1.3", 8777703187635225],
      "c6e0b7c8-4999-4e83-b7d9-c28a62613614" => ["92.1.3", 8495741441414558],
      "e173efe4-76f8-47fa-9923-500a3fe9715d" => ["92.1.0", 8773102120218526],
      "P0118760075" => ["92.1.3", 8495840020455261],
      "P0127439637" => ["92.1.3", 8155600386311784],
      "P0130482072" => ["92.1.3", 8499100024861022],
    }
    Data with 3 or more values:
    {}

    — Ken

Re: multiple hash compare, find, create
by stevieb (Canon) on Dec 10, 2018 at 18:37 UTC

    Here's one way, although it does iterate over all three hashes twice. My brain hasn't quite kicked in yet this Monday morning, so I'm likely seriously overlooking a more efficient way to do it:

    First, create a new temporary hash that takes a count of each key's occurrences; then, using that hash and the appropriate keys (the ones counted at least three times), append the values from all three hashes into the new %combined hash:

    use warnings;
    use strict;
    use Data::Dumper;

    my $num_hashes = 3;

    my %h1 = (
        a => 1,
        b => 2,
        c => 3,
    );
    my %h2 = (
        a => 4,
        b => 5,
        d => 10,
    );
    my %h3 = (
        x => 20,
        p => 15,
        b => 6,
        a => 12,
    );

    my %key_list;

    for my $hash (\%h1, \%h2, \%h3){
        $key_list{$_}++ for keys %$hash;
    }

    my %combined;

    for my $hash (\%h1, \%h2, \%h3){
        for (keys %key_list){
            next if $key_list{$_} < $num_hashes;
            push @{ $combined{$_} }, $hash->{$_};
        }
    }

    print Dumper \%combined;

    Output:

    $VAR1 = {
              'b' => [
                       2,
                       5,
                       6
                     ],
              'a' => [
                       1,
                       4,
                       12
                     ]
            };

      Iterating through the first hash and looking for keys that exist in the other two is enough.

      use strict;
      use warnings;
      use Data::Dump 'pp';

      my %h1 = (
          a => 1,
          b => 2,
          c => 3,
      );
      my %h2 = (
          a => 4,
          b => 5,
          d => 10,
      );
      my %h3 = (
          x => 20,
          p => 15,
          b => 6,
          a => 12,
      );

      my %out;
      for my $key (keys %h1) {
          next unless exists $h2{$key} and exists $h3{$key};
          $out{$key} = [ $h1{$key}, $h2{$key}, $h3{$key} ];
      }

      pp \%out;

      __END__
      { a => [1, 4, 12], b => [2, 5, 6] }
      But seeing the size of the input hashes, I'd say it might be better to do that while constructing the hashes (e.g. first build %h1, then iterate through the pairs of %h2 but only insert the keys that already exist in %h1, and then the same for %h3), if they aren't tied hashes in the first place.

        D'oh!

        I don't know how I didn't think of that! If a key in the first hash doesn't exist in hash2 or hash3, then the key doesn't exist in all three, does it? Likewise, if hash2 or hash3 have keys hash1 doesn't: same thing. ;) Sheesh, leave me alone and go home, Monday...

        OP... This answer by Eily is by far a better, more efficient and more elegant approach than what I presented.

      Thanks @stevieb! I appreciate you looking at this. I might build a small test case and try this out. Thanks again! EDIT: Actually, I like this method. Giving it a go now!
Re: multiple hash compare, find, create
by supertaco (Novice) on Dec 11, 2018 at 13:33 UTC

    Hello Perl Monks. Thanks for the help yesterday solving my issue. I implemented Cristoforo's solution and it worked like a charm. However, as I have time, I will work the other solutions into my code; I'm learning a lot from everyone here and I greatly appreciate your assistance. Thanks again and I hope to contribute to the knowledge base in the near future.

    BTW, the entire script processes three hash tables, each with between 14 million and 28 million k/v pairs, and outputs a fourth hash table in the same size realm in about 20 minutes. Perl Monks rock!

Re: multiple hash compare, find, create
by Cristoforo (Curate) on Dec 10, 2018 at 18:57 UTC
    I don't know how well this would scale on sets of 14 million keys, but this might be a possible solution using Set::Scalar.

    Yes, Eily had the most efficient solution. There's no need to create the 3 sets :-(

    use Set::Scalar;

    my %hash_4;
    my $set_1 = Set::Scalar->new(keys %hash_1);
    my $set_2 = Set::Scalar->new(keys %hash_2);
    my $set_3 = Set::Scalar->new(keys %hash_3);

    my $intersect = $set_1 * $set_2 * $set_3;

    for my $key (@$intersect) {
        push @{$hash_4{$key}}, $hash_1{$key}, $hash_2{$key}, $hash_3{$key};
    }

      Hi Cristoforo. I'm trying it anyway :) I appreciate your help! Will look at Eily's solution, too.

      [root@WarGame2 BRRX]# time ./a.pl
      26662441 xre hash size -- 26662441 xre hash keys
      14700586 rdk hash size -- 14700586 rdk hash keys
      23405918 bid hash size -- 23405918 bid hash keys
      doing the set scalar stuff

      Hey Cristoforo, I implemented your solution (without the strikethroughs :)) and it worked like a dream and creates some flexibility for future extensions. I will look into implementing the other solution as I have time. But thanks again!

      print "doing the set scalar stuff\n"; use Set::Scalar; %output; $ridxre = Set::Scalar->new(keys %ridxre); $ridrdk = Set::Scalar->new(keys %ridrdk); $ridbid = Set::Scalar->new(keys %ridbid); $intersect = $ridxre * $ridrdk * $ridbid; for $key (@$intersect) { push @{$output{$key}}, $ridxre{$key}, $ridrdk{$key}, $ridbid{$key} +; } print "writing OUT.txt file\n"; open(OUT,">","$outfile1") || die("cannot open $outfile1\n"); while ( ($k,$v) = each %output ) { print OUT "$k "; print OUT "$_ " for @{ $output{$k} }; print OUT "\n"; } close(OUT);
Re: multiple hash compare, find, create
by LanX (Saint) on Dec 10, 2018 at 23:28 UTC
Re: multiple hash compare, find, create
by thanos1983 (Parson) on Dec 10, 2018 at 22:13 UTC

    Hello Anonymous Monk,

    I see the monks have replied to your question. I saw the question 2-3 hours ago and started working on it, but I got busy with something else work-related.

    Anyway, just for fun (it took so much time to prepare the hashes :P).

      Why is the key "a6fbb013-b75f-4dd7-9d1a-24f566020042" present in the output hash when it seems to appear only in the %hash_1 and %hash_3 input hashes and not in the %hash_2 input hash? Why is this key present in the value-set array for this key in the output hash? (Indeed, every key in the output hash seems to be present in each associated value-set array.) This does not seem to comport with the output that supertaco wants.

      I would follow the same approach as fellow Monk stevieb.

      But stevieb has endorsed Eily's approach.


      Give a man a fish:  <%-{-{-{-<

        Hello AnomalousMonk,

        You are right. After a bit of experimentation I can see that my solution is wrong. The best approach that I could think of is similar to fellow Monk Eily's.

        #!/usr/bin/perl
        use strict;
        use warnings;

        use Data::Dumper;

        my %h1 = (
            a => 1,
            b => 2,
            c => 3,
        );
        my %h2 = (
            a => 4,
            b => 5,
            d => 10,
        );
        my %h3 = (
            x => 20,
            p => 15,
            b => 6,
            a => 12,
        );

        my %HoA;
        foreach my $key (keys %h1) {
            push(@{$HoA{$key}}, $h1{$key}, $h2{$key}, $h3{$key})
                if exists $h2{$key} and exists $h3{$key};
        }

        print Dumper \%HoA;

        __END__
        $ perl test.pl
        $VAR1 = {
                  'b' => [
                           2,
                           5,
                           6
                         ],
                  'a' => [
                           1,
                           4,
                           12
                         ]
                };

        Though Eily's approach is better: using unless, it checks whether the key exists in the other hashes and, if it doesn't, skips it, so there's no need to proceed and waste resources.

        Thanks for spending the time to check and correct me :)

        Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: multiple hash compare, find, create
by Anonymous Monk on Dec 10, 2018 at 17:17 UTC
    a bit of sample I&O please?

      Hi there and thanks!

      hash_1

      a6fbb013-b75f-4dd7-9d1a-24f566020042 => 92.1.3
      b6a4c433-72a5-4e1a-b378-4a6b72531ded => 92.1.3
      P0118760075 => 92.1.3
      9d986ace-2504-4595-bdbb-1899812e9d54 => 91.1.2
      55ccc30e-3566-4a00-b219-4b084487384c => 92.1.3
      P0127439637 => 92.1.3
      08141110-f817-4c16-bf7b-8d0e6696a95b => 91.1.2
      71be5e75-edad-4889-9261-e1ffa89e393f => 92.1.3
      c6e0b7c8-4999-4e83-b7d9-c28a62613614 => 92.1.3
      e173efe4-76f8-47fa-9923-500a3fe9715d => 92.1.0
      P0130482072 => 92.1.3
      a915d30a-541c-4f5e-9b2f-297352f7e19c => 92.1.3
      64bc2611-38a6-4a59-8d80-4f49b7a76f69 => 92.1.3
      205dae51-ea2e-4db9-ace1-315b940686e6 => 91.1.2

      hash_2

      5f356d12-0213-4d5f-8fe7-08fe1d2a35d9 => 3.2p10s2
      dff7f963-ec15-440a-9150-b61f55afe8a4 => 3.2p9s1
      P0107577526 => 3.2p3s1
      P0112055731 => 3.2p10s1
      06bbe788-e57d-4eda-98ea-74d8a45a0e56 => 3.2p10s1
      P0127646016 => 3.2p10s1
      b4206f77-25e9-4ccd-b434-2237360f1f8c => 3.1p10s1
      97891097-70d7-4273-b1ae-3b88b460d591 => 3.2p8s3
      c8b2958f-7777-45e2-929a-adbe41f5055f => 3.2p10s1
      6ad8af7c-c56b-480f-bed6-9591b80cf634 => 3.0p9s1
      2e530dc0-a164-4c06-ae18-332eb6778ebd => 3.2p10s1
      P0116761501 => 3.2p10s1
      37d6871a-3abc-44ee-819a-eea33440b0a4 => 3.1p7s7
      P0128132579 => 3.2p10s1

      hash_3

      P0128193326 => 8993110670064343
      a6fbb013-b75f-4dd7-9d1a-24f566020042 => 8499101090018240
      29568879-fcca-4dc6-86be-3c8c86ef26db => 8497101420498122
      9d986ace-2504-4595-bdbb-1899812e9d54 => 8773103910046051
      P0118760075 => 8495840020455261
      55ccc30e-3566-4a00-b219-4b084487384c => 8498350112590600
      P0127439637 => 8155600386311784
      71be5e75-edad-4889-9261-e1ffa89e393f => 8773103940365745
      c6e0b7c8-4999-4e83-b7d9-c28a62613614 => 8495741441414558
      e173efe4-76f8-47fa-9923-500a3fe9715d => 8773102120218526
      P0130482072 => 8499100024861022
      a915d30a-541c-4f5e-9b2f-297352f7e19c => 8777703187635225
      205dae51-ea2e-4db9-ace1-315b940686e6 => 8299600120059407
      b2c6e317-2072-4e3a-9278-5f76af49221a => 8499102590027251

      desired output (just blur your eyes and assume each key has the correct values. I faked the output.)

      a6fbb013-b75f-4dd7-9d1a-24f566020042 => 92.1.3,3.2p9s1,3399102590027251
      b6a4c433-72a5-4e1a-b378-4a6b72531ded => 92.1.3,3.0p10s1,8499102596789209
      P0118760075 => 92.1.1,3.2p10s1,8499102592767903
      9d986ace-2504-4595-bdbb-1899812e9d54 => 91.1.2,3.2p8s1,8499102592678972
      55ccc30e-3566-4a00-b219-4b084487384c => 92.2.2,3.0p10s1,9863807290027251
      P0127439637 => 92.1.3,3.0p5s1,8499102590026782
      08141110-f817-4c16-bf7b-8d0e6696a95b => 91.1.2,3.2p10s1,7890987678265278
      71be5e75-edad-4889-9261-e1ffa89e393f => 92.1.3,3.2p9s1,9090987627625279
      c6e0b7c8-4999-4e83-b7d9-c28a62613614 => 92.1.2,3.1p9s4,7265627827892876
      e173efe4-76f8-47fa-9923-500a3fe9715d => 92.1.0,3.2p10s1,8420887625190982
      P0130482072 => 92.1.3,3.2p10s1,8499102537809876
      a915d30a-541c-4f5e-9b2f-297352f7e19c => 92.1.3,3.2p10s2,1876546789087654
      64bc2611-38a6-4a59-8d80-4f49b7a76f69 => 92.0.3,3.2p8s1,8492678908765256
      205dae51-ea2e-4db9-ace1-315b940686e6 => 91.1.2,3.2p10s2,8499102700098762
