Well, I spent some more time thinking about the problem, and I came up with a slightly better algorithm. Instead of a recursive, depth-first approach, I used an iterative, breadth-first algorithm that I arrived at while trying to optimize my previous attempt into a tail-recursive scheme. Code and benchmarks follow.
#!/usr/bin/perl
use strict; use warnings;
use YAML::XS;
use Benchmark qw(cmpthese);
my $data;
{local $/;$data = <DATA>};
cmpthese( -5, {
recursive => sub {my $d = Load $data; mergekeys_recursive( $d
+); },
tail_call => sub {my $d = Load $data; mergekeys_tail( $d ); },
loop => sub {my $d = Load $data; mergekeys_loop( $d ); },
}
);
sub mergekeys_recursive
{
my ($ref) = @_;
my $type = ref $ref;
if ($type eq 'HASH')
{
my $tmphref = $ref->{'<<'};
if ($tmphref)
{
die "Merge key does not support merging non-hashmaps"
unless (ref $tmphref eq 'HASH');
my %tmphash = %$tmphref;
delete $ref->{'<<'};
%$ref = (%tmphash, %$ref);
}
mergekeys_recursive($_) for (values %$ref);
}
elsif ($type eq 'ARRAY')
{
mergekeys_recursive($_) for (@$ref);
}
return $ref;
}
sub mergekeys_tail
{
my ($ref) = (@_);
_mergekeys($ref);
return $ref;
}
sub _mergekeys
{
my $ref = shift or return;
my $type = ref $ref;
if ($type eq 'HASH')
{
my $tmphref = $ref->{'<<'};
if ($tmphref)
{
die "Merge key does not support merging non-hashmaps"
unless (ref $tmphref eq 'HASH');
my %tmphash = %$tmphref;
delete $ref->{'<<'};
%$ref = (%tmphash, %$ref);
}
push @_, grep {ref eq 'HASH' or ref eq 'ARRAY'} values %$ref;
}
elsif ($type eq 'ARRAY')
{
push @_, grep {ref eq 'HASH' or ref eq 'ARRAY'} @$ref;
}
goto &_mergekeys;
}
sub mergekeys_loop
{
my ($orig) = @_;
while (my $ref = shift)
{
my $type = ref $ref;
if ($type eq 'HASH')
{
my $tmphref = $ref->{'<<'};
if ($tmphref)
{
die "Merge key does not support merging non-hashmaps"
unless (ref $tmphref eq 'HASH');
my %tmphash = %$tmphref;
delete $ref->{'<<'};
%$ref = (%tmphash, %$ref);
}
push @_, grep {ref eq 'HASH' or ref eq 'ARRAY'} values %$r
+ef;
}
elsif ($type eq 'ARRAY')
{
push @_, grep {ref eq 'HASH' or ref eq 'ARRAY'} @$ref;
}
}
return $orig;
}
__DATA__
Data I use for benchmark:
---
key1: &id05
- &id02
name: Curly
lastname: Howard
hair: no
occupation: stooge
-
<<: *id02
name: Larry
lastname: Fine
hair: curly
-
<<: *id02
name: Moe
hair: bowl
-
<<: *id02
name: Shemp
hair: yes
on_again: off_again
key2:
subkey_a:
- foo
- phoo
- ghoo
-
- [[[[[[[[[[[[[[asdf]]]]]]],1],3]],{whatever: works}]]]]
-
sounds:
- &id03
voice: hollow
says: plugh
colors:
- red
- green
- blue
-
<<: *id03
voice: wind
says: do you hear what i hear
-
<<: *id03
voice: stooge
says:
- nyuk
- nyuk
- nyuk
characters: *id05
- zxcv
subkey_b:
- &id01
name: bar
type: variable
weather: sunny
-
<<: *id01
name: baz
hometown: Perth
...
And the results of the benchmark:
Rate recursive tail_call loop
recursive 951/s -- -6% -14%
tail_call 1016/s 7% -- -9%
loop 1111/s 17% 9% --
I was surprised that the loop was that much faster than the tail-recursion, since it amounts to about the same thing (go to top of loop, check condition, execute or return). It is important to note, though, that I was only able to replace the goto with a while loop because my function was intended to be run for its side-effects on the data structure. If I needed to accumulate a return value, the goto would have been the better solution.
print map{(split//,'hark, suPerJacent other l')[$_]}(11,7,6,16,5,1,15,18..23,8..10,24,17,0,12,13,3,14,2,4);