Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

The Best Hash

by artist (Parson)
on Dec 13, 2002 at 18:32 UTC ( [id://219679]=perlquestion: print w/replies, xml ) Need Help??

artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks
I have multiple hashes with same keys and values in each hash, but not necessarly with the same key-value association. How I can decide which hash is the least differing from other hashes?

Thanks,
Artist

Replies are listed 'Best First'.
Re: The Best Hash
by CountZero (Bishop) on Dec 13, 2002 at 18:47 UTC

    This begs an answer on the (filosophical) question: 'how to calculate the difference between hashes'.

    Is it the number of different key/value combinations or does one take into account the value of the values as well?

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Hi CountZero
      Thanks for the attention.
      Little Clarification: The keys and values are coming from mutually independent sets.

      Artist

        If the value is a string value you may be able to use one of the distance modules on CPAN to calculate the closest match in a loop through all of the hashs if numeric you can put your test in a loop with a local var to see what is closest -- though the way you do this totally depends on the data in the key/value pairs. for example if you have an exact match do you need to stop? or do you need to find ALL exact matches? Do you need to find all close matches that are a certain distance from the mean or all the hash key/val pairs or does the match need to match the outside value? More info will get you a better answer here. You cant kill us with information, you can only make the responces more true to your actual situation.

        -Waswas

        Perhaps the following method could be used:

        1. output each hash to a file, sorted alfabetically by key
        2. then run each file against the file of the target hash (or even each other file, if you need to do a full cross control) through a "diff" program, parsing the output with Perl and calculating the difference (e.g. score = number of lines different + number of lines missing + number of lines added)
        3. sorting the "results" from the diffs and finding the hash which is closest to the target hash

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: The Best Hash
by BrowserUk (Patriarch) on Dec 14, 2002 at 07:59 UTC

    How I can decide which hash is the least differing from other hashes?

    If I understand you correctly, you want to measure the 'fit' of each ordered set of hash values against every other set of hash values and accumulate their scores and sort them via that criteria.

    For this purpose a (A,B,C,D) -v- (A,B,D,C) scores 2, and (A,B,C,D) -v- (B,A,D,C) scores 0.

    This is an O(n^2) problem. Add to this the cost of comparing two arrays of strings and counting the matches and you have a slow and expensive process.

    Those little diagrams above gave me an idea.

    If you XOR a number with itself, you get zero. Perl (uniquely?) can do bitwise operations on strings. If you XOR 2 strings and they are the same, you end up with a string of null bytes. A slow method of comparing two string, and no good at all for comparing arrays of strings, even if you concatenated the arrays into a single string unless all the string were the same length. However, you said that the set of values in each hash is the same, its only the ordering that distinguishes them. Therefore it is possible to represent each set of values by a string with each value represented by a single character (provided that there are less than 255 unique values in each hash).

    Once you have the sets represented this way, deriving the 'fit' of any pair of hashes comes down to XORing 2 substitute strings and counting the number of null bytes. Doing this for all the pairs and accumulating the scores determines a fit value of each hash relative to every other.

    You then need to order these fits whilst retaining the index for which hash they belong to. You can then use the print them out in order of relative fit.

    The code below does this for 50 hashes of 9 values each in a few seconds. You don't say how many hashes or how large your sets, but it may complete in a reasonable time if your numbers aren't too extreme.

    The whole process only takes 6 (fairly complex) lines:)

    #! perl -slw use strict; use Inline::Files; #! Not necessary for the algorithm. #! Just convenience in my test program #! If anyone else gets the following warnings from the #! above module, please let me know. =pod Comment only Use of uninitialized value in length at e:/Perl/site/lib/Inline/Files/ +Virtual.pm line 89, <FILE> chunk 1. Use of uninitialized value in transliteration (tr///) at e:/Perl/site/ +lib/Inline/Files/Virtual.pm line 167, <FILE> chunk 1. =cut #! Some silly hash keys for testing purposes. my @keys = qw(one two three four five six seven eight nine); #! Build an array of hashes from the __DATA__ section. #! The array above as keys, the words from each line as values my @AoH; push @AoH, do{ my $i=0; +{ map{ $keys[$i++] => $_ } split } } while <D +ATA>; #! Build a lookup table from one (any) hash's values. my %lookup = do{ my $i=1; map{ $_ => $i++ } values %{$AoH[0]} }; #! Build an array of substitute strings, mapping each value to a singl +e char. #! I've used an offset of 65 to make things readable when testing. #! Start at chr(1) if your hashes have more than 222 values. #! You might need to specify [no utf-8] if you use values above 127? I +'m not sure. my @subs; push @subs, join'', map{ chr 65+$lookup{$_} } values %{$_} for @AoH; #! XOR every substitute with every substitute, count the nuls in the r +esult #! Accumulate them in an array in the same order as the subs. my @fits = ( (0)x@subs ); for my $i (0 .. $#subs) { $fits[$i] += scalar ($subs[$i] ^ $_) =~ tr/\0// for @subs; } #! The sort is a slightly funky (I wonder where I got that word from:) + version of the ST sort. #! It feeds in and sorts on the values of the array and feeds out the +indices of the sorted values. my @sorted = map{ $_->[1] } sort{ $b->[0] <=> $a->[0] } #! Descending sort, higher is better f +it. map{ [ $fits[$_] => $_ ] } 0 .. $#fits; #! Output the values sets ordered by and prefixed with their relative +'fit'. print OUTPUT "$fits[$_] : @{[ values %{$AoH[$_]} ]}\n" for @sorted;

    test data

    Results

    I hope the results mean something to you, cos I can't see the pattern:)


    Examine what is said, not who speaks.

Re: The Best Hash
by dakkar (Hermit) on Dec 13, 2002 at 19:44 UTC

    You say you have the same keys and the same values in each hash, right? Meaning that doing:

    @keys=keys %hash1; @v1=@hash1{@keys}; @v2=@hash2{@keys}; @v3=@hash3{@keys};
    will result in @v1 etc. being each a permutation of the others?

    If this is the case, I'd use the minimum number of swaps necessary to permute one hash into the other.

    Keep in mind that differences are usually defined between two objects. What do you mean with 'least differing from other hashes'?

    -- 
            dakkar - Mobilis in mobile
    
Re: The Best Hash
by artist (Parson) on Dec 13, 2002 at 21:16 UTC
    Dear Monks,

    Some more answers to assist my question.

    Dakkar:
    Yes, @v1, @v2 etc .. would be permutation of others. The 'difference' between 2 same key-valued hashes is the total number of keys in a hash - total number of common keys-value pairs between two hashes.

    Mojotoad:
    Funny u said that, I cannot exactly figure out what u mean by those words. Still your assumption is close, as I my key/values are just text strings.

    Waswas-fng:
    Yes, the matches need to be exact and the keys and values are same in all the hashes. so there is no need for SOUNDEX like modules. I also explained the idea of the distance here. Which Modules would be helpful from CPAN?

    More Notes:
    I can take one hash 'H' at a time, compared to all others and and count the sum of the differences S(H) and finally can find 'H' with the least value of S(H). Anything better I can do?

    Thanks,
    Artist

      Yes, @v1, @v2 etc .. would be permutation of others. The 'difference' between 2 same key-valued hashes is the total number of keys in a hash - total number of common keys-value pairs between two hashes.

      It sounds like you might actually be trying to cluster the hashes based on relative similarity to each other -- or maybe you just want to identify the two hashes that are most similar to each other.

      Anyway, it might be interesting to treat the key-value pairs as strings, print the contents of each hash as a sorted list of these strings, then do pair-wise diffs on the text files. The pair with the fewest diffs is the closest match. The following has not been tested, but perhaps it will be useful.

      sub printHash { my ( $hash_name, $hash_ref ) = @_; open( LIST, ">", $hash_name ) or die $!; print LIST sort map { "$_ $$hash_ref{$_}\n" } keys %$hash_ref; close LIST; } ... # assume you have an "ur-hash", holding hash_refs keyed by hash_name: my @difflist1 = my @difflist2 = sort keys %all_hashes; foreach my $name ( keys %all_hashes ) { printHash( $name, $all_hashes{$name} ); } my %distances; foreach my $h1 ( @difflist1 ) { shift @difflist2; # dispose of the identical item in list2 foreach my $h2 ( @difflist2 ) { $distances{"$h1:$h2"} = `diff $h1 $h2 | grep -c '^<>'`; } } # then do something with the values in %distances

      (update: fixed to read $$hash_ref{$_} in the subroutine)

Re: The Best Hash
by mojotoad (Monsignor) on Dec 13, 2002 at 20:04 UTC
    Perhaps you could be more specific? I'm not sure if you mean Gold/Red Seal or Power Skunk, or what. Perhaps using the colloquial terms such as jay, boo yah, maui powie, mex, or thai sticks might help narrow it down.

    This might help refine your question:

    Marijuana Varieties from weedguru.com.

    Further resources: Marijuana FAQ, MFAQ from hempfiles.com

    ;)
    Matt

Re: The Best Hash
by Anonymous Monk on Dec 17, 2002 at 02:28 UTC
    Amsterdam, summer of 1984.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://219679]
Approved by vek
Front-paged by BrowserUk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (7)
As of 2024-03-28 19:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found