The Uniqueness of hashes.

by injunjoel (Priest)
on Sep 25, 2005 at 02:37 UTC

Greetings all,
Perl hashes give us a unique ability: the power of uniqueness. Below are a few techniques I have built up over my past few years here at the Monastery; I will do my best to give credit where credit is due.
Also keep in mind that these examples were developed using
perl, v5.8.0 built for MSWin32-x86-multi-thread

Setting up.

It is important to have a way of looking at what is going on under the hood as a result of the transformations that follow, especially when adopting complex data structures like those listed in perldsc. This way you can check your work at any point in the process and better understand what is being created and manipulated. I have included such output after the __OUTPUT__ line within each example.
Let's start by setting up our investigative tool. Though I generally use Dumpvalue, I'll go with Data::Dumper since more people are familiar with its output.
#!/usr/bin/perl
use strict;
use warnings;

sub dump_ref {
    use Data::Dumper;
    $Data::Dumper::Terse    = 1;
    $Data::Dumper::Indent   = ((ref $_[0]) =~ /ARRAY/i) ? 0 : 1;
    $Data::Dumper::Sortkeys = sub {
        no warnings 'numeric';
        return [ sort { $a <=> $b || $a cmp $b } keys %{$_[0]} ];
    };
    $_[1] = "Var" if (!defined $_[1]);
    print "$_[1] = \n";
    print Dumper($_[0]);
    print "\n";
}

A Note about %_

Some may argue that using %_ is a bad idea or that it may be used internally by Perl at some later time. Personally I like using %_ in certain situations; I am on the "it is the same as using $_ and @_..." side of that argument. My only suggestion/caution is that if you adopt the usage of %_, make sure you localize it prior to any action you take with it. This generally restricts its usage to small scopes, much like the do blocks in the examples below.
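For example, here is a minimal sketch (not from the original examples) of how local keeps changes to %_ confined to a do block:

# Illustrative only: local %_ saves the outer contents and restores them
# when the block exits, so the slice trick below cannot leak keys out.
%_ = ( outer => 1 );
my @letters = do {
    local %_;                    # %_ is empty inside this block
    undef @_{ qw(a b a c) };     # hash slice of %_ creates unique keys
    sort keys %_;                # ('a', 'b', 'c')
};
# back outside the block, %_ still holds only ( outer => 1 )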
Here are some nodes relating to %_ and its usage. (compliments of planetscape)

Unique Values

Now say we want the unique values from any number of arrays. Let's use the special hash %_, combined with a do block, to assign the transformation to a new variable, as illustrated in the threaded responses by Aristotle and Juerd to a tutorial about Perl idioms (Perl Idioms Explained - keys %{{map{$_=>1}@list}}). We take advantage of the implied return value of our do block, much like those of map and grep statements.
my @array1 = (1 .. 20);
my @array2 = (10 .. 30);
my @array3 = (19 .. 40);

my @unique_numbers = do {
    local %_;
    undef @_{ @array1, @array2, @array3 };
    sort { $a <=> $b } keys %_;
};
dump_ref(\@unique_numbers, '@unique_numbers');

__OUTPUT__
@unique_numbers =
['1','2','3','4','5','6','7','8','9','10',
 '11','12','13','14','15','16','17','18','19','20',
 '21','22','23','24','25','26','27','28','29','30',
 '31','32','33','34','35','36','37','38','39','40']

For the sake of not polluting %_ we localize it within the do block. Then, using %_ as a hash slice, @_{ (...list...) }, we assign all the values from @array1, @array2 and @array3 as keys. The undef call sets each new key's value to undef, so we save a little space and time. The advantage here is that hashes cannot have duplicate keys: any given value can only exist once as a hash key, so the contributing values get flattened, in a sense, into a list of unique keys. The last statement of the do block returns that list of unique values, obtained from the keys call on %_, sorted numerically into the new array.
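For comparison, two widely used idioms accomplish the same thing without touching %_ at all; a quick sketch (the %seen name is just illustrative):

my @all = (@array1, @array2, @array3);

# The idiom from the tutorial linked above: build an anonymous hash from
# the list and keep only its (unique) keys.
my @unique_a = sort { $a <=> $b } keys %{{ map { $_ => 1 } @all }};

# The classic %seen counter: grep keeps a value only the first time it is seen.
my %seen;
my @unique_b = sort { $a <=> $b } grep { !$seen{$_}++ } @all;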

Collecting duplicates

Now suppose instead we wanted to keep the duplicates. We would need a different structure, and it just so happens that would again be a hash. Why? Because it allows us to use the implicit filtering of a unique set (the hash keys) without losing any of the contributing sets' values. In this case we store array references in the elements of a hash whose keys are the unique values from all of the contributing data, in our example the three previous arrays. This is easily extendable to deeper contributing data: AoH, AoA, HoH, etc.
my %unique_sets;
push @{$unique_sets{$_}}, $_ for (@array1, @array2, @array3);
dump_ref(\%unique_sets, '%unique_sets');

__OUTPUT__
%unique_sets = {
  '1'  => [1],        '2'  => [2],        '3'  => [3],
  '4'  => [4],        '5'  => [5],        '6'  => [6],
  '7'  => [7],        '8'  => [8],        '9'  => [9],
  '10' => [10,10],    '11' => [11,11],    '12' => [12,12],
  '13' => [13,13],    '14' => [14,14],    '15' => [15,15],
  '16' => [16,16],    '17' => [17,17],    '18' => [18,18],
  '19' => [19,19,19], '20' => [20,20,20], '21' => [21,21],
  '22' => [22,22],    '23' => [23,23],    '24' => [24,24],
  '25' => [25,25],    '26' => [26,26],    '27' => [27,27],
  '28' => [28,28],    '29' => [29,29],    '30' => [30,30],
  '31' => [31],       '32' => [32],       '33' => [33],
  '34' => [34],       '35' => [35],       '36' => [36],
  '37' => [37],       '38' => [38],       '39' => [39],
  '40' => [40]
}

Above we didn't use a do block, but after declaring our structure %unique_sets we took advantage of autovivification. Treating $unique_sets{$_} as an array reference lets us store the duplicate values and filter them for uniqueness all in one step.
Side note: autovivification does have its drawbacks, and though they are not illustrated by the example above, This node has tlm's thoughts on the subject and a few cases to watch out for.
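A minimal illustration of the kind of surprise autovivification can cause (not one of tlm's specific cases): merely reading a nested element can create intermediate keys.

my %h;
if ( $h{config}{debug} ) {     # looks like a harmless read...
    print "debugging\n";
}
# ...but the lookup has sprung $h{config} into existence as an empty hash ref
print exists $h{config} ? "vivified\n" : "untouched\n";   # prints "vivified"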

Unique value frequencies

Now let's see if we can get a count of how many times each value occurs within our lists.
my %unique_counts = do {
    local %_;
    $_{$_}++ for (@array1, @array2, @array3);
    %_;
};
dump_ref(\%unique_counts, '%unique_counts');

__OUTPUT__
%unique_counts = {
  '1'  => 1, '2'  => 1, '3'  => 1, '4'  => 1, '5'  => 1,
  '6'  => 1, '7'  => 1, '8'  => 1, '9'  => 1, '10' => 2,
  '11' => 2, '12' => 2, '13' => 2, '14' => 2, '15' => 2,
  '16' => 2, '17' => 2, '18' => 2, '19' => 3, '20' => 3,
  '21' => 2, '22' => 2, '23' => 2, '24' => 2, '25' => 2,
  '26' => 2, '27' => 2, '28' => 2, '29' => 2, '30' => 2,
  '31' => 1, '32' => 1, '33' => 1, '34' => 1, '35' => 1,
  '36' => 1, '37' => 1, '38' => 1, '39' => 1, '40' => 1
}

Again we localize %_, this time using a post-increment to count up the occurrences of each value from the contributing arrays.

Quick filtering

Okay, so we have gone through some quick and easy combinations. Let's try some filtering. Say we are interested only in those values that are duplicates, or only in the non-duplicates for that matter.
my %duplicates = do {
    local %_;
    $_{$_}++ for (@array1, @array2, @array3);
    delete @_{ ( map { ($_{$_} == 1) ? $_ : () } keys %_ ) };
    %_;
};
dump_ref(\%duplicates, '%duplicates');

__OUTPUT__
%duplicates = {
  '10' => 2, '11' => 2, '12' => 2, '13' => 2, '14' => 2,
  '15' => 2, '16' => 2, '17' => 2, '18' => 2, '19' => 3,
  '20' => 3, '21' => 2, '22' => 2, '23' => 2, '24' => 2,
  '25' => 2, '26' => 2, '27' => 2, '28' => 2, '29' => 2,
  '30' => 2
}

my %non_duplicates = do {
    local %_;
    $_{$_}++ for (@array1, @array2, @array3);
    delete @_{ ( map { ($_{$_} > 1) ? $_ : () } keys %_ ) };
    %_;
};
dump_ref(\%non_duplicates, '%non_duplicates');

__OUTPUT__
%non_duplicates = {
  '1'  => 1, '2'  => 1, '3'  => 1, '4'  => 1, '5'  => 1,
  '6'  => 1, '7'  => 1, '8'  => 1, '9'  => 1, '31' => 1,
  '32' => 1, '33' => 1, '34' => 1, '35' => 1, '36' => 1,
  '37' => 1, '38' => 1, '39' => 1, '40' => 1
}

Above we use delete much like we used undef earlier. delete works on a hash slice, which lets us remove keys from our local copy of %_. By calling map within parentheses we create a list of those keys whose values pass the ternary test, (Condition) ? True : False. Returning either the value $_ or an empty list () filters out just the keys of interest for delete to work on.
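The same filter can also be written with grep, since all we really want is a yes/no test per key; a sketch of that variant (not from the original node):

my %duplicates_via_grep = do {
    local %_;
    $_{$_}++ for (@array1, @array2, @array3);
    delete @_{ grep { $_{$_} == 1 } keys %_ };   # drop keys seen exactly once
    %_;
};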

Simple Aggregation

Now what if we wanted to keep our arrays from %unique_sets, but add the counts from %unique_counts into one structure?
my %unique_descriptive = do {
    local %_;
    for (@array1, @array2, @array3) {
        $_{$_}->{count}++;
        push @{$_{$_}->{values}}, $_;
    }
    %_;
};
dump_ref(\%unique_descriptive, '%unique_descriptive');

__OUTPUT__
%unique_descriptive = {
  '1'  => { 'count' => 1, 'values' => [1] },
  '2'  => { 'count' => 1, 'values' => [2] },
  ...clipped for brevity...
  '11' => { 'count' => 2, 'values' => [11,11] },
  '12' => { 'count' => 2, 'values' => [12,12] },
  ...clipped for brevity...
  '19' => { 'count' => 3, 'values' => [19,19,19] },
  '20' => { 'count' => 3, 'values' => [20,20,20] },
  '21' => { 'count' => 2, 'values' => [21,21] },
  ...clipped for brevity...
  '40' => { 'count' => 1, 'values' => [40] }
}

Here we autovivify a hash reference in the local copy of %_ with our $_{$_}->{count}++ call, and the same theme runs through the array reference created on the line that follows. Isn't Perl fun? This allows for quick transformations and filtering of the underlying data. As an example, let's get the highest value that occurs most frequently, followed by the highest value that occurs least frequently. We will use the %unique_descriptive structure we created above.
my %max = do {
    local %_ = %unique_descriptive;
    %{ shift @{[
        map  { $_->[2] }
        sort { $b->[0] <=> $a->[0] || $b->[1] <=> $a->[1] }
        map  { [ $_{$_}->{count}, $_, $_{$_} ] }
        keys %_
    ]} };
};
dump_ref(\%max, '%max');

__OUTPUT__
%max = {
  'count' => 3,
  'values' => [20,20,20]
}

my %min = do {
    local %_ = %unique_descriptive;
    %{ $_{ shift @{[
        sort { $_{$a}->{count} <=> $_{$b}->{count} || $b <=> $a } keys %_
    ]} } };
};
dump_ref(\%min, '%min');

__OUTPUT__
%min = {
  'count' => 1,
  'values' => [40]
}

In the first example the use of the Schwartzian Transform makes the operation possible. We also use the nature of shift as the final filtering step. With the lower map we take in the keys of %_, which in this instance is our list of unique values. Next, as with any ST, we transform the data, this time into an array with the count value at [0], the unique value at [1], and the hash reference stored under that unique key in the contributing hash at [2]. The sort then compares the count values first, falling back to the unique values as a tie-breaker. Next we map the element of interest back out of our newly sorted list, in this case the hash reference stored at [2]. We take the first one off the top with a call to shift (we could just as easily have popped off an element or taken a slice or splice), but since shift expects an array we make do by dereferencing an anonymous array populated by our ST.
shift @{[
    map  { $_->[2] }
    sort { $b->[0] <=> $a->[0] || $b->[1] <=> $a->[1] }
    map  { [ $_{$_}->{count}, $_, $_{$_} ] }
    keys %_
]}

Now shift is happy and so are we. The finishing touch is to dereference the hash returned by shift, which we accomplish by wrapping it in %{ shift @{[ ...ST code... ]} }.

The second example uses a convoluted sort alone to filter the data. Even though this approach might ultimately win a Benchmark race, the syntax is far from intuitive, so let's explain a bit of what is going on.
... = do {
    local %_ = %unique_descriptive;
    %{ $_{ shift @{[
        sort { $_{$a}->{count} <=> $_{$b}->{count} || $b <=> $a } keys %_
    ]} } };
};
Again we localize %_ so we know it's not carrying any baggage from another scope and that it will forget our little tryst once our do block is done.
Next, let's start from the inside out. Knowing that we have unique keys in our local copy of %_, we call keys to get the ball rolling. The sort compares first on the count values of the underlying hash references in %_ ($_{$a}->{count} <=> $_{$b}->{count}) and second on the values of the keys themselves. Knowing that sort returns a list (much like map and grep), we wrap it in an anonymous array constructor so we can dereference it and treat the whole thing like an array for shift to work with; this is in the same vein as the prior example with the ST. Finally we use the value returned by shift as the key of interest in our local copy of %_, so with just a little syntactic sugar we are home free.

A recent real-world example:

(OP shall remain anonymous)
Suppose you have a number of files, "SomeOutPut1.dat", "SomeOutPut2.dat", ..., and in each there is some data of interest. The lines of interest consist of a timestamp (a float) followed by a space and then some text, and the timestamps may contain duplicates. You are in charge of filtering and collapsing the duplicates, and by the end of the day your boss wants a single file with all the data sorted by timestamp. Sounds like a good job for a hash.
#!/usr/bin/perl -w
use strict;

my %d = map {
    # clean up the line
    chomp;
    # does it match?
    /^([\d\.]+) \w+/ ? ($1, $_) : ();
} map {
    # use @ARGV and <> magic to
    # read in each line from each file.
    local @ARGV = ($_);
    <>;
} <SomeOutPut*.dat>;    # glob in all files that match

open(OUTPUT, ">unique.dat") or die "a horrible death... $!";
print OUTPUT $d{$_}."\n" for (sort { $a <=> $b } keys %d);
close OUTPUT;
Above we used a few tricks. First we globbed in the files matching <SomeOutPut*.dat>; this feeds each matching filename into our lower map block. Next we localize @ARGV and set it to the incoming filename from the glob (here is why), and using the diamond operator we feed each line of the current file into our upper map block. There we first clean up the line with a call to chomp, followed by a quick pattern match to see whether the line follows the specified format. Lines that pass have their captured timestamp used to uniquely identify them in the resultant hash %d; returning the list ($1, $_) is functionally equivalent to $1 => $_ for assigning a key => value pair in a hash. In the case of a duplicate only one value will remain, since hash keys are unique, right? Finally we open the eventual output file "unique.dat" for our boss and print each line, sorted numerically by the timestamps in our hash.
Nicely done, get yourself some coffee...

Of course, with the mantra of Perl being TIMTOWTDI, these examples are simply one way among many of accomplishing these types of tasks. I prefer using hashes and the special variables to minimize the number of symbols I have to keep track of, though some might argue that it's too hard to read. My advice is to do what makes the most sense to you, but keep in mind that you can take advantage of the inherent benefits of Perl hashes to quickly organize and later filter your data. And well-packaged data often leads to more efficient programming.


update
Thanks planetscape for the editorial comments & the subject references & the proofing & the patience & .... :)
Thanks jdporter for the lower map suggestion. Much cleaner. Way better than the if(open (IN, $_)){#stuff} I was using.
Thanks again to jdporter for golfing out the former grep into a clean map with a ternary.
Changed the title to remove the "RFC:" since not many are coming my way.

-InjunJoel
"I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo

Replies are listed 'Best First'.
Re: RFC: The Uniqueness of hashes.
by Ctrl-z (Friar) on Sep 25, 2005 at 08:22 UTC
    Nice. A variation on the unique/duplicate stuff:
    # Assuming @foo and @bar don't contain duplicates in themselves...
    $_{$_}++ for @foo;
    $_{$_}-- for @bar;
    The values in %_ are positive if the key is only in @foo, negative if only in @bar, and 0 if it is in both.
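    A small follow-on sketch (not part of the original reply) of pulling the three groups back out once %_ has been populated that way:

    # assuming %_ was filled with ++/-- as above:
    my @only_in_foo = grep { $_{$_} > 0 } keys %_;    # positive counts
    my @only_in_bar = grep { $_{$_} < 0 } keys %_;    # negative counts
    my @in_both     = grep { $_{$_} == 0 } keys %_;   # zero: present in both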



    time was, I could move my arms like a bird and...
      And a variation on that, which handles duplicate values within the arrays as well:
      $_{$_} .= 'a' for @foo;
      $_{$_} .= 'b' for @bar;

      # $_{$key} =~ /ab/ iff $key was in @foo and @bar
      #          =~ /a$/ iff $key was only in @foo
      #          =~ /^b/ iff $key was only in @bar
      # number of occurrences in
      #   $foo_plus_bar = length($_{$key});
      #   $foo_only     = ( $_{$key} =~ tr/a// );
      #   $bar_only     = ( $_{$key} =~ tr/b// );
      Adapting that to handle up to 26 arrays is trivial -- just set a scalar to 'a' for the first array, and increment it for each successive array; using a regex to determine the distribution for a given array value is left as an exercise...

      (I updated the last three comment lines to be more consistent with the first three comment lines.)
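      A rough sketch of that generalization (the @sets and %where names are made up for illustration):

      my @sets = ( \@foo, \@bar );       # up to 26 arrays work this way
      my %where;
      my $tag = 'a';
      for my $set (@sets) {
          $where{$_} .= $tag for @$set;  # append this array's letter per occurrence
          $tag++;                        # string increment: 'a' -> 'b' -> 'c' ...
      }
      # e.g. $where{$x} eq 'ab' means $x appeared once in @foo and once in @bar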

Re: The Uniqueness of hashes.
by dragonchild (Archbishop) on Aug 07, 2008 at 18:38 UTC
    Any discussion of uniqueness should always contain, at the top, a discussion of List::MoreUtils uniq(). (That module is just crazy good anyways, as is List::Util).
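    For reference, a minimal sketch of the earlier unique-values task using uniq (assuming List::MoreUtils is installed):

    use List::MoreUtils qw(uniq);

    # uniq keeps the first occurrence of each value, preserving order;
    # sort numerically afterwards if a sorted list is wanted.
    my @unique_numbers = sort { $a <=> $b } uniq(@array1, @array2, @array3);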

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
