Re^2: A more elegant way to filter a nested hash? Benchmarks!

by Veltro (Hermit)
on Jun 06, 2018 at 21:49 UTC ( [id://1216066] )


in reply to Re: A more elegant way to filter a nested hash? Benchmarks!
in thread A more elegant way to filter a nested hash?

Hi jimpudar, thanks for posting those results. Although the tests are probably fine for the original problem you posted, I had the same concerns as vr: I think the payload is too small for meaningful testing, so I started doing some testing myself.

Just for the fun of it, I created some tests that use much bigger hash structures in memory (my aim was data sizes of roughly 1.2 to 1.8 GB). I also wanted some variance in how the hash is created, e.g. hashes with lots of values and fewer nested hashes vs. hashes with lots of nested hashes containing fewer values, and small filters vs. big filters.
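To make concrete what kind of structure is being filtered, here is a tiny, purely illustrative source/filter pair; the keys and values are made up, and the real test data is generated randomly (and is much bigger) by the createHash sub in the script further down:

my $source = {
    a => {                            # nested hash with plain values
        a1 => '2.1ABC',
        a2 => '2.2ABC',
        a3 => { a31 => '3.1ABC' },    # deeper nesting
    },
    b => 'top-level value',
    c => { c1 => 'never copied' },    # branch absent from the filter
} ;
my $filter = {
    a => { a1 => 1, a3 => { a31 => 1 } },  # keep only part of the 'a' branch
    b => 1,                                # keep this plain value
} ;
# Every function under test should produce:
# { a => { a1 => '2.1ABC', a3 => { a31 => '3.1ABC' } }, b => 'top-level value' }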

Here are the results of my tests:

1. Hash depth = 10. Many values, fewer hashes. Generated in loops that run 17 times. ~300k hashes. The filter covers 88% of the hashes and 70% of the values of the original hash. 10 test cycles. Memory usage approx. 1.4 GB.

Number of Hashes = 305798
Number of Values = 4892785
Number of Hashes Filtered = 268718
Number of Values Filtered = 3435051
Filtering 88% Hashes
Filtering 70% Values
           s/iter map_grep_2 map_grep slice recursive iterative
map_grep_2   6.29         --      -1%   -8%      -10%      -16%
map_grep     6.22         1%       --   -7%       -9%      -15%
slice        5.81         8%       7%    --       -2%       -9%
recursive    5.68        11%      10%    2%        --       -7%
iterative    5.27        19%      18%   10%        8%        --

iterative seems to perform very well under these circumstances. I ran this test multiple times; occasionally recursive or slice wins.

2. Hash depth = 10. Many values, fewer hashes. Generated in loops that run 17 times. ~400k hashes. The filter covers 50% of the hashes and 10% of the values of the original hash. 10 test cycles. Memory usage approx. 1.7 GB.

Number of Hashes = 405929
Number of Values = 6494881
Number of Hashes Filtered = 203883
Number of Values Filtered = 660915
Filtering 50% Hashes
Filtering 10% Values
           s/iter iterative map_grep map_grep_2 slice recursive
iterative    1.67        --      -1%        -1%   -7%      -12%
map_grep     1.65        1%       --        -0%   -6%      -11%
map_grep_2   1.65        1%       0%         --   -6%      -10%
slice        1.55        7%       6%         6%    --       -5%
recursive    1.48       13%      12%        12%    5%        --

3. Hash depth = 10. Many hashes, fewer values. Generated in loops that run 5 times. ~1.6M hashes. The filter covers 87% of the hashes and 69% of the values of the original hash. 10 test cycles. Memory usage approx. 2.4 GB.

Number of Hashes = 1579338
Number of Values = 6317357
Number of Hashes Filtered = 1369612
Number of Values Filtered = 4362177
Filtering 87% Hashes
Filtering 69% Values
           s/iter map_grep_2 slice map_grep recursive iterative
map_grep_2   10.7         --   -0%      -3%       -6%       -6%
slice        10.7         0%    --      -3%       -6%       -6%
map_grep     10.4         3%    3%       --       -3%       -3%
recursive    10.1         6%    6%       3%        --       -0%
iterative    10.0         7%    7%       3%        0%        --

4. Hash depth = 10. Many hashes, fewer values. Generated in loops that run 5 times. ~1.2M hashes. The filter covers 40% of the hashes and 10% of the values of the original hash. 10 test cycles. Memory usage approx. 1.5 GB.

Number of Hashes = 1226812
Number of Values = 4907253
Number of Hashes Filtered = 492351
Number of Values Filtered = 509410
Filtering 40% Hashes
Filtering 10% Values
           s/iter iterative slice map_grep map_grep_2 recursive
iterative    2.56        --  -16%     -17%       -17%      -22%
slice        2.15       19%    --      -1%        -1%       -7%
map_grep     2.13       20%    1%       --        -0%       -6%
map_grep_2   2.12       20%    1%       0%         --       -6%
recursive    1.99       28%    8%       7%         6%        --

The last test shows recursive as the fastest; however, tests 2 and 4 are, as far as I am concerned, less interesting, since they run for a shorter time and only show differences of a couple of tenths of a second.

After testing I think that the recursive function is still a good solution, although my tests show that it is actually iterative that won the most complex schemes. Some of the algorithms run slower or faster under different circumstances: I generated different hash structures and filter structures, and each of these differences seems to favour or hurt the different algorithms to varying degrees. I find this more important, because as far as I am concerned the best algorithm should be chosen based on its worst-case time loss.

In the end I have to ask: what are we actually optimizing? Readable code vs. a win of one or two seconds when processing gigabytes of data?

Sorry for not adding mr_ron's tests; I couldn't get them to work.

I did optimize the slice method though: it had an extra keys call that was not needed, and I also added the check for a valid filter:

sub hash_slice {
    my $n ;
    my @keysToGet = keys %{$_[1]};
    @{$n}{@keysToGet} = @{$_[0]}{@keysToGet};
    foreach ( @keysToGet ) {
        if ( ref $_[1]->{ $_ } eq 'HASH' ) {
            croak "bad filter: on '$_', expected HASH\n"
                unless ( ref $_[0]->{ $_ } eq 'HASH' ) ;
            $n->{$_} = hash_slice( $_[0]->{$_}, $_[1]->{$_} );
        }
    }
    return $n ;
}
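As a quick sanity check, here is a hypothetical call on a small hand-written source/filter pair (not part of the benchmark data; the names are invented for illustration):

my $source = {
    name => 'widget',
    spec => { size => 'L', color => 'red', weight => 3 },
    misc => 'not requested',
} ;
my $filter = {
    name => 1,
    spec => { size => 1, color => 1 },
} ;
my $sliced = hash_slice( $source, $filter ) ;
# $sliced now holds:
# { name => 'widget', spec => { size => 'L', color => 'red' } }

One behavioural note: because the hash slice copies every requested key, a plain (non-hash) filter key that does not exist in the source ends up in the output with an undef value, whereas the exists()-checking variants simply skip it. That difference does not show up in these benchmarks, since the filters are always built from the source.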

Here's the code that I used:

#! /usr/bin/env perl
use strict ;
use warnings ;
use Data::Dumper ;
use Carp ;
use Test::More ;
use Benchmark qw(timethese cmpthese) ;
use Deep::Hash::Utils qw(reach nest deepvalue) ;

BEGIN {
    sub round {
        my $number = shift ;
        return int( $number + .5 * ( $number <=> 0 ) ) ;
    }
    sub _nextUnique {
        my $uniqueNmbr = 1 ;
        my $uniqueVal = -1 ;
        return sub {
            my $unique = '' ;
            ++$uniqueVal ;
            if ( ( int ( $uniqueVal / 256**$uniqueNmbr ) ) == 1 ) {
                ++$uniqueNmbr ;
                --$uniqueVal ;
                for ( my $i = $uniqueNmbr ; $i > 0 ; --$i ) {
                    $unique = $unique . chr( 0 ) ;
                }
                return $unique
            }
            my $copyUniqueVal = $uniqueVal ;
            for ( my $i = $uniqueNmbr ; $i > 0 ; --$i ) {
                my $devider = 256**($i - 1) ;
                ( my $c, $copyUniqueVal ) = ( int $copyUniqueVal / $devider, $copyUniqueVal % $devider ) ;
                $unique = chr( $c ) . $unique ;
            }
            return $unique ;
        }
    }
    *nextUnique = _nextUnique ;
    # Test nextUnique
    # for ( my $i = 0 ; $i < 65538 ; ++$i ) {
    #     my $u = nextUnique() ;
    #     my @ASCII = unpack( "C*", $u ) ;
    #     print $i . " @ASCII" . "\n" ;
    # }
}

sub throwDice {
    my $dice = { # The value and relative chance
        1 => 1,
        2 => 1,
        3 => 1,
        4 => 1,
        5 => 1
    } ;
    my $retitems = [] ;
    my $total = 5 ;
    my $nmbr = 2 ; # Roll two dice
    for( my $i = 1; $i <= $nmbr ; ++$i ) {
        my $rnd = rand( $total ) ;
        my $counter = 0 ;
        foreach ( keys %{ $dice } ) {
            $counter = $counter + $dice->{ $_ } ;
            if ( $rnd < $counter ) {
                push( @{ $retitems }, $_ ) ;
                last ;
            }
        }
    }
    return $retitems ;
}
# Test throwDice
# my $numbers = throwDice() ;
# print "numbers = @{$numbers}\n" ;

my $nH = 0 ; # Number of Hashes
my $nV = 0 ; # Number of Values
sub createHash {
    # This sub is super random in nature, will sometimes take a couple of Mb's and other times Gb's
    my $depth = $_[0] // 0 ;
    ++$depth ;
    my $s = {} ;
    my $f = {} ;
    # Test 1 and 2
    for ( my $i = 1 ; $i <= 17 ; ++$i ) {
    # Test 3 and 4
    # for ( my $i = 1 ; $i <= 5 ; ++$i ) {
        my $rnd = throwDice() ;
        # Test 1 and 2
        # Less hashes, more values
        if ( $rnd->[0] > 4 && $depth <= 10 ) {
        # Test 3 and 4
        # More hashes, less values
        # if ( $rnd->[0] > 1 && $depth <= 10 ) {
            my $key = nextUnique() ;
            my $randval = rand( 1 ) ;
            # $rnd->[1] > ...
            # > 0 means 100% chance to add it in the filter, > 1 = 80%, > 2 = 60%, > 3 = 40%, > 4 = 20%, > 5 = 0%
            # Note that this is not the total chance, the filter may be discarded.
            # To decrease the chance that an ENTIRE hash branch is rejected in the filter,
            # on top of this an additional 10% chance multiplier is added ( See: $randval > 0.1 )
            # Test 1 and 3
            if ( ( $rnd->[1] > 1 ) || ( $randval > 0.1 ) ) {
            # Test 2 and 4
            # if ( ( $rnd->[1] > 4 ) || ( $randval > 0.1 ) ) {
                ++$nH ;
                ( $s->{ $key }, $f->{ $key } ) = createHash( $depth ) ;
                # IF $f->{ $key } IS {} THEN DELETE $f->{ $key } (Not a valid filter for testing)
                if ( !(%{$f->{ $key }}) ) {
                    delete $f->{ $key } ;
                }
            }
            else {
                ++$nH ;
                ( $s->{ $key }, undef ) = createHash( $depth ) ;
                # undef, hash is discarded (not added to the filter)
            }
        }
        else {
            my $key = nextUnique() ;
            my $value = $depth . '.' . $i . "ABCDEFGHIJKLMNOPQRSTUVWXYZ" ;
            ++$nV ;
            $s->{ $key } = $value ;
            # $rnd->[1] > ...
            # > 0 means 100% chance to add it in the filter, > 1 = 80%, > 2 = 60%, > 3 = 40%, > 4 = 20%, > 5 = 0%
            # Note that this is not the total chance, the filter may be discarded
            # Test 1 and 3
            if ( $rnd->[1] > 1 ) {
            # Test 2 and 4
            # if ( $rnd->[1] > 4 ) {
                $f->{ $key } = 1 ;
            }
        }
    }
    return ($s, $f) ;
}

my $nHF = 0 ; # Number of Hashes Filtered
my $nVF = 0 ; # Number of Values Filtered
sub countHashesAndValuesFilter {
    my $f = shift ;
    foreach ( keys %{ $f } ) {
        my $k = $_ ;
        if ( ref $f->{ $k } eq 'HASH' ) {
            ++$nHF;
            countHashesAndValuesFilter( $f->{ $k } ) ;
        }
        else {
            ++$nVF;
        }
    }
}

my ( $source, $filter ) = createHash() ;
countHashesAndValuesFilter( $filter ) ;
print "Created test hash\n" ;
if ( $nH < 100000 || $nV < 100000 ) {
    # Can probably be set higher, but at least it helps a little bit to get better test data
    print "Invalid, test again\n" ;
    exit(0) ;
}
print "Number of Hashes = $nH\n" ;
print "Number of Values = $nV\n" ;
print "Number of Hashes Filtered = $nHF\n" ;
print "Number of Values Filtered = $nVF\n" ;
print "Filtering " . round(($nHF/$nH)*100) . "% Hashes\n" ;
print "Filtering " . round(($nVF/$nV)*100) . "% Values\n" ;
sleep( 2 ) ;

sub hash_filter_recursive {
    my $source = shift;
    my $filter = shift;
    my %output;
    foreach ( keys %$filter ) {
        if ( exists $source->{$_} ) {
            if ( ref $filter->{$_} eq 'HASH' ) {
                croak "bad filter: on '$_', expected HASH\n"
                    unless ( ref $source->{$_} eq 'HASH' );
                $output{$_} = hash_filter_recursive( $source->{$_}, $filter->{$_} );
            }
            else {
                $output{$_} = $source->{$_};
            }
        }
    }
    return \%output;
}

sub hash_filter_iterative {
    my $source = shift;
    my $filter = shift;
    my $output = {};
    my @queue = ( [ $source, $filter, $output ] );
    while ( my $a = shift @queue ) {
        my ( $s, $f, $o ) = @{$a};
        foreach ( keys %$f ) {
            if ( exists $s->{$_} ) {
                if ( ref $f->{$_} eq 'HASH' ) {
                    croak "bad filter: on '$_', expected HASH\n"
                        unless ( ref $s->{$_} eq 'HASH' );
                    $o->{$_} = {};
                    push @queue, [ $s->{$_}, $f->{$_}, $o->{$_} ];
                }
                else {
                    $o->{$_} = $s->{$_};
                }
            }
        }
    }
    return $output;
}

sub hash_filter_delete {
    my $source = shift;
    my $filter = shift;
    foreach ( keys %$source ) {
        if ( exists $filter->{$_} ) {
            if ( ref $filter->{$_} eq 'HASH' ) {
                croak "bad filter: on '$_', expected HASH\n"
                    unless ( ref $source->{$_} eq 'HASH' );
                hash_filter_delete( $source->{$_}, $filter->{$_} );
            }
        }
        else {
            delete $source->{$_};
        }
    }
    return $source;
}

sub hash_slice { # contributed by Veltro
    # Veltro: Made more efficient by using @keysToGet instead of another keys instruction.
    # Also added croak in case filter is not valid.
    my $n ;
    my @keysToGet = keys %{$_[1]};
    @{$n}{@keysToGet} = @{$_[0]}{@keysToGet};
    foreach ( @keysToGet ) {
        if ( ref $_[1]->{ $_ } eq 'HASH' ) {
            croak "bad filter: on '$_', expected HASH\n"
                unless ( ref $_[0]->{ $_ } eq 'HASH' ) ;
            $n->{$_} = hash_slice( $_[0]->{$_}, $_[1]->{$_} );
        }
    }
    return $n ;
}

sub map_grep { # contributed by shmem
    my $source = shift;
    my $filter = shift;
    return {
        map {
            ref $filter->{$_} eq 'HASH'
                ? ref $source->{$_} eq 'HASH'
                    ? ( $_, map_grep( $source->{$_}, $filter->{$_} ) )
                    : croak "bad filter: on '$_', expected HASH\n"
                : ( $_, $source->{$_} )
        } grep { exists $source->{$_} } keys %$filter
    };
}

sub map_grep_2 { # contributed by shmem
    return {
        map {
            ref $_[1]->{$_} eq 'HASH'
                ? ref $_[0]->{$_} eq 'HASH'
                    ? ( $_, map_grep_2( $_[0]->{$_}, $_[1]->{$_} ) )
                    : croak "bad filter: on '$_', expected HASH\n"
                : ( $_, $_[0]->{$_} )
        } grep { exists $_[0]->{$_} } keys %{ $_[1] }
    };
}

sub hash_deep_utils { # contributed by mr_ron
    my ( $source, $filter ) = @_;
    my %rc;
    while ( my @l = reach($filter) ) {
        pop @l;
        if ( defined( my $source_val = deepvalue( $source, @l ) ) ) {
            # hint: work around nest behavior on even vs odd key count
            nest( \%rc, @l )->{ $l[$#l] } = $source_val;
        }
    }
    \%rc;
}

my $t = {
    source => $source,
    filter => $filter,
    output => map_grep_2($source, $filter)
} ;

$nHF = 0 ;
$nVF = 0 ;
# Using this function again to do some additional checks
countHashesAndValuesFilter( $t->{ output } ) ;
print "Values should be the same as filter values:\n" ;
print "Number of hashes in output: $nHF\n" ;
print "Number of values in output: $nVF\n" ;

# Veltro: Removed hash_filter_delete, I don't trust that it will leave the test hash intact
is_deeply( $_->( $t->{source}, $t->{filter} ), $t->{output} )
    foreach ( \&hash_filter_recursive, \&hash_filter_iterative, \&hash_slice, \&map_grep, \&map_grep_2 );

cmpthese( 10, {
    recursive  => sub { hash_filter_recursive( $t->{source}, $t->{filter} ) },
    iterative  => sub { hash_filter_iterative( $t->{source}, $t->{filter} ) },
    # delete   => sub { hash_filter_delete( $t->{source}, $t->{filter} ) },
    map_grep_2 => sub { map_grep_2( $t->{source}, $t->{filter} ) },
    map_grep   => sub { map_grep( $t->{source}, $t->{filter} ) },
    slice      => sub { hash_slice( $t->{source}, $t->{filter} ) },
} );

# Check if the data is still the same after testing
is_deeply( $_->( $t->{source}, $t->{filter} ), $t->{output} )
    foreach ( \&hash_filter_recursive, \&hash_filter_iterative, \&hash_slice, \&map_grep, \&map_grep_2 );

done_testing();

Note: I posted on June 6th, but then found out that the code was not actually working correctly, so I had to fix the code and update the benchmark results and conclusions on the 7th. The code above is the last version I used.

Replies are listed 'Best First'.
Re^3: A more elegant way to filter a nested hash? Benchmarks!
by jimpudar (Pilgrim) on Jun 08, 2018 at 00:08 UTC

    Very nice, thanks for putting this together!

    I think we have proved that for 99% of the time, the recursive solution is best as it is the easiest to understand and maintain.

    Best,

    Jim

    πάντων χρημάτων μέτρον έστιν άνθρωπος. ("Man is the measure of all things.")
