http://qs321.pair.com?node_id=1215517

jimpudar has asked for the wisdom of the Perl Monks concerning the following question:

Hello brothers and sisters of the monastery,

I need a function which will filter a nested hash, removing any fields I'm not interested in.

For example, let's say my application uses the following data structure. The hash values in any given instance of this structure are variable, but the hash keys will always be the same.

my $source = {
    f1 => 'garbage',
    f2 => 'more garbage',
    f3 => 'important data',
    f4 => {
        this => 'sub hash',
        is   => 'garbage'
    },
    f5 => {
        f6 => 'more important data',
        f7 => {
            more => 'garbage',
            f8   => 'important data',
        },
        f9 => 'garbage',
    },
    f10 => [ 'important', 'data' ],
    f11 => [ 'more', 'garbage' ]
};

I am only interested in the important data, and I want to remove all the garbage.

I came up with a data structure I could use to express the parts of the structure I'm interested in.

my $filter = {
    f3 => 1,
    f5 => {
        f6 => 1,
        f7 => { f8 => 1 }
    },
    f10 => 1
};

Given this filter, I want the output of the function to be the following hash:

my $output = {
    f3 => 'important data',
    f5 => {
        f6 => 'more important data',
        f7 => { f8 => 'important data', }
    },
    f10 => [ 'important', 'data' ],
};

A key feature of this function is that it must accept an arbitrary filter, since data which is garbage in one case is important in another.

I came up with the following function which seems to work well:

sub hash_filter {
    my $source = shift;
    my $filter = shift;

    my %output;

    foreach ( keys %$filter ) {
        if ( exists $source->{$_} ) {
            if ( ref $filter->{$_} eq 'HASH' ) {
                croak "bad filter: on '$_', expected HASH\n"
                    unless ( ref $source->{$_} eq 'HASH' );
                $output{$_} = hash_filter( $source->{$_}, $filter->{$_} );
            }
            else {
                $output{$_} = $source->{$_};
            }
        }
    }

    return \%output;
}

I'm very happy with how it works, but since I mean for this function to be reusable, I wanted to hear how you all would do it.

Is there a simpler way to do this with higher order functions like map and grep?

Thank you for your help,

Jim

πάντων χρημάτων μέτρον έστιν άνθρωπος.

Replies are listed 'Best First'.
Re: A more elegant way to filter a nested hash?
by shmem (Chancellor) on May 31, 2018 at 00:44 UTC
    Is there a simpler way to do this with higher order functions like map and grep?

    Of course you can transform a for-loop into a map/grep construct, populating the return value \%output in one fell swoop (why isn't that a reference in the first place, if that is what is returned?) - but I doubt you would save a significant number of lines of code, or improve readability. Job well done, move on :-)

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      Why is this getting so many upvotes? People are still advertising how to slice hashes using map and grep when the Perl language has more elegant ways to slice hashes.

      use strict;
      use warnings;
      use Data::Dumper;

      my $source = {
          f1 => 'garbage',
          f2 => 'more garbage',
          f3 => 'important data',
          f4 => {
              this => 'sub hash',
              is   => 'garbage'
          },
          f5 => {
              f6 => 'more important data',
              f7 => {
                  more => 'garbage',
                  f8   => 'important data',
              },
              f9 => 'garbage',
          },
          f10 => [ 'important', 'data' ],
          f11 => [ 'more', 'garbage' ]
      };

      my $filter = { f3 => 1, f5 => { f6 => 1, f7 => { f8 => 1 } }, f10 => 1 };

      sub sliceTheWholeDarnThing {
          my $s = $_[0];
          my $f = $_[1];
          my $n = {};

          my @keysToGet = keys %{ $f };
          @{ $n }{ @keysToGet } = @{ $s }{ @keysToGet };

          foreach ( keys %{ $n } ) {
              if ( ref $n->{ $_ } eq 'HASH' ) {
                  $n->{ $_ } = sliceTheWholeDarnThing( $s->{ $_ }, $f->{ $_ } );
              }
          }

          return $n;
      }

      my $newHash = sliceTheWholeDarnThing( $source, $filter );
      print Dumper( $newHash );
        Why is this getting so many upvotes?

        Reasons may include:

        1. first answer
        2. pointing out that foreach and map/grep are essentially the same thing (see below)
        3. telling the OP that a working and readable solution is enough
        4. eating my own dog food by not writing too much (but see below)

        Hash slices are syntactic sugar for iterating over a set of keys. I'd suspect the code paths for both are identical (see B, and do benchmarks with a meaningful data set). Elegance is in the eye of the beholder.
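
        A minimal sketch of that equivalence, using a throwaway %h and @wanted of my own invention:

        use strict;
        use warnings;

        my %h      = ( a => 1, b => 2, c => 3 );
        my @wanted = qw( a b );

        # A hash slice pulls several values in one expression...
        my @sliced = @h{@wanted};

        # ...which does the same work as iterating over the keys yourself.
        my @looped;
        push @looped, $h{$_} for @wanted;

        print "@sliced\n";    # 1 2
        print "@looped\n";    # 1 2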

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

        Thanks for the input!

        I do agree this is quite a bit more elegant. Hash slicing is very nicely suited to this problem.

        I will need to put together a set of test cases for this function anyway, so I will try out all the versions posted so far and report back with the Benchmark results.
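
        For what it's worth, a minimal sketch of the harness I have in mind; the data here is trimmed down, and the other candidates from this thread would be added alongside the recursive one:

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $source = { f3 => 'keep', f4 => 'drop', f5 => { f6 => 'keep', f7 => 'drop' } };
        my $filter = { f3 => 1, f5 => { f6 => 1 } };

        # The recursive filter from the OP, minus error checking, as one contestant.
        sub hash_filter {
            my ( $src, $flt ) = @_;
            my %out;
            foreach ( keys %$flt ) {
                next unless exists $src->{$_};
                $out{$_} = ref $flt->{$_} eq 'HASH'
                    ? hash_filter( $src->{$_}, $flt->{$_} )
                    : $src->{$_};
            }
            return \%out;
        }

        # Run each contestant for ~2 CPU seconds and compare rates.
        cmpthese( -2, {
            recursive => sub { hash_filter( $source, $filter ) },
        } );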

        Best,

        Jim

        πάντων χρημάτων μέτρον έστιν άνθρωπος.

        I had the same idea, but you are effectively iterating twice over the keys if you slice first.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Re: A more elegant way to filter a nested hash?
by LanX (Saint) on May 31, 2018 at 01:16 UTC
    Looks good to me; map can't handle recursion.

    The normal approach for nested data is to write a recursive "walker" and that's what you did.

    The only tip I can offer is to try hash slices as filters, and then only if performance matters.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      map can't handle recursion.

      Of course you can have a map block inside a function, and recurse into that from inside the map block:

      sub hash_filter {
          my $source = shift;
          my $filter = shift;
          return {
              map {
                  ref $filter->{$_} eq 'HASH'
                      ? ref $source->{$_} eq 'HASH'
                          ? ( $_, hash_filter( $source->{$_}, $filter->{$_} ) )
                          : croak "bad filter: on '$_', expected HASH\n"
                      : ( $_, $source->{$_} )
              } grep { exists $source->{$_} } keys %$filter
          };
      }

      which btw is the for loop of the OP rewritten in terms of map/grep.

      update: a terse version which eliminates shifting @_ and works directly on the arguments (which aren't altered by the function):

      sub hash_filter {
          return {
              map {
                  ref $_[1]->{$_} eq 'HASH'
                      ? ref $_[0]->{$_} eq 'HASH'
                          ? ( $_, hash_filter( $_[0]->{$_}, $_[1]->{$_} ) )
                          : croak "bad filter: on '$_', expected HASH\n"
                      : ( $_, $_[0]->{$_} )
              } grep { exists $_[0]->{$_} } keys %{ $_[1] }
          };
      }

      Now that's arguably micro-optimized and, at 11 lines against the OP's 18, still far less readable imho.

      More elegant? Perhaps for those who prefer nested ?: statements for simple expressions over if/else blocks...

      update 2: B::Concise shows identical optrees (except line numbers) for the "for" and "map/grep" solutions.
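
      To reproduce that comparison, something like the following should work (the file names are mine; B::Concise is core):

      perl -MO=Concise,hash_filter for_version.pl     > for.optree
      perl -MO=Concise,hash_filter mapgrep_version.pl > mapgrep.optree
      diff for.optree mapgrep.optree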

      perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

        Thanks for the example of how this can be done. I was scratching my head trying to figure out how to handle the recursion with map.

        πάντων χρημάτων μέτρον έστιν άνθρωπος.

Re: A more elegant way to filter a nested hash?
by bliako (Monsignor) on May 31, 2018 at 14:04 UTC

    Interesting problem. Here are some tangential thoughts:

    Your hash_filter() copies the corresponding scalar values at the leaf nodes of the source hash. Instead, it could take references to them, so that the result of the filter is live-linked to the source hash, if such a feature is desirable. Maybe something like: $output{$_} = \$source->{$_}; One scenario: you create an empty hash with a certain fixed structure, live-link-filter it once, and then whenever you populate it (assuming data of the same structure), results "miraculously" pop out at the other end of the live-linked filtered result, albeit overwritten on each subsequent use.

    Then each different data structure can have its own live-linked filter, eliminating the need to re-walk the source hash every time.
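
    A rough sketch of that live-linked variant (my naming; same $source/$filter shapes as the OP, untested beyond the happy path):

    use strict;
    use warnings;
    use Carp;

    # Like hash_filter(), but leaves are stored as references into the
    # source hash, so later writes to $source show through the output.
    sub hash_filter_live {
        my ( $source, $filter ) = @_;
        my %output;
        foreach ( keys %$filter ) {
            next unless exists $source->{$_};
            if ( ref $filter->{$_} eq 'HASH' ) {
                croak "bad filter: on '$_', expected HASH\n"
                    unless ref $source->{$_} eq 'HASH';
                $output{$_} = hash_filter_live( $source->{$_}, $filter->{$_} );
            }
            else {
                $output{$_} = \$source->{$_};    # live link, not a copy
            }
        }
        return \%output;
    }

    # my $live = hash_filter_live( $source, $filter );
    # $source->{f3} = 'new data';
    # print ${ $live->{f3} };    # prints 'new data' -- dereference to read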

    Secondly, in your code you may want to keep a cache storing filter results, keyed on the source and filter *keys* and the overall nested hash structure. This calls for a function which calculates a hash-code (a unique signature) of a hash-or-array based only on keys and nested structure, not values. It would be similar to your code, and recursive. However, I can see no time savings in calculating a hash-code every time and retrieving the filtered result from a cache, compared to just filtering every time. But it would be cool to have such a function (I can write one if anyone is interested), and even cooler if it were incorporated into a special Hash/Array module which would update the hash-code at every insert/delete of keys (definitely not my cup of tea).

    Edit: To sum up: the last suggestion is a way to see whether two nested data structures comprising hashes and arrays have THE SAME STRUCTURE. The challenge is to make this as efficient as possible, and definitely not to have to recalculate it every time it is required when the data structure has not changed.
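
    Since I offered: a rough sketch of such a hash-code function (my own naming; it looks only at keys and nesting, never at values):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Reduce a nested structure to a canonical string built from hash
    # keys and nesting only, then digest it. Two structures with the
    # same shape get the same signature regardless of their values.
    sub structure_signature {
        my ($data) = @_;
        my $canon;
        if ( ref $data eq 'HASH' ) {
            $canon = 'H{'
                . join( ',',
                    map { $_ . '=' . structure_signature( $data->{$_} ) }
                        sort keys %$data )
                . '}';
        }
        elsif ( ref $data eq 'ARRAY' ) {
            $canon = 'A['
                . join( ',', map { structure_signature($_) } @$data )
                . ']';
        }
        else {
            $canon = 'S';    # any scalar leaf
        }
        return md5_hex($canon);
    }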

      Interesting ideas. I especially like the first one.

      This got me thinking that it may be more performant to simply delete the values I don't need from the original hash instead of constructing a new one.

      For my current use case, this could be okay, but I could imagine future use cases where this would not be desirable...
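
      A sketch of what that destructive version might look like (hypothetical name hash_filter_delete; it prunes $source in place):

      use strict;
      use warnings;

      # Delete every key the filter doesn't mention, recursing where
      # the filter has a nested hash. Returns the (modified) source.
      sub hash_filter_delete {
          my ( $source, $filter ) = @_;
          foreach ( keys %$source ) {
              if ( !exists $filter->{$_} ) {
                  delete $source->{$_};
              }
              elsif (ref $filter->{$_} eq 'HASH'
                  && ref $source->{$_} eq 'HASH' )
              {
                  hash_filter_delete( $source->{$_}, $filter->{$_} );
              }
          }
          return $source;
      }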

      πάντων χρημάτων μέτρον έστιν άνθρωπος.

Re: A more elegant way to filter a nested hash?
by mr_ron (Chaplain) on May 31, 2018 at 19:22 UTC
Re: A more elegant way to filter a nested hash?
by pwagyi (Monk) on May 31, 2018 at 03:28 UTC

    You could improve it by writing an iterative version instead of a recursive one. :)

      You could improve it by writing an iterative version instead of a recursive one. :)

      Hi,

      What is improved with an iterative version? Elegance?

        The biggest saving is the call stack: in general, the overhead involved in moving into and out of subroutines is not negligible -- profile if you need to know the cost in your particular scenario. Iterative solutions with work queues can also be much more intelligible and may scale to your data more elegantly.

        However, from a true optimization perspective: if it runs and it's documented, your effort is likely better spent improving a different chunk of code than worrying about aesthetics.
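
        A sketch of the work-queue shape I mean (my naming, assuming the OP's $source/$filter conventions):

        use strict;
        use warnings;
        use Carp;

        # Iterative hash_filter: a queue of [ source, filter, output ]
        # triples stands in for the call stack.
        sub hash_filter_iter {
            my ( $source, $filter ) = @_;
            my %output;
            my @queue = ( [ $source, $filter, \%output ] );
            while ( my $job = shift @queue ) {
                my ( $src, $flt, $out ) = @$job;
                foreach ( keys %$flt ) {
                    next unless exists $src->{$_};
                    if ( ref $flt->{$_} eq 'HASH' ) {
                        croak "bad filter: on '$_', expected HASH\n"
                            unless ref $src->{$_} eq 'HASH';
                        $out->{$_} = {};
                        push @queue, [ $src->{$_}, $flt->{$_}, $out->{$_} ];
                    }
                    else {
                        $out->{$_} = $src->{$_};
                    }
                }
            }
            return \%output;
        }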


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: A more elegant way to filter a nested hash? Benchmarks!
by jimpudar (Pilgrim) on Jun 02, 2018 at 14:13 UTC

    The results are in!

    Thank you all very much for your suggestions.

    I tried my hand at writing an iterative version, but it didn't perform as well as I'd hoped. Can anyone suggest a better implementation?

    Not surprisingly, the version which just deletes keys from the source hash is by far the fastest.

    More surprising to me was that the recursive version is quite a lot faster than both the map/grep and slice-based solutions.

    I didn't spend much time trying to get mr_ron's version working. I may take a closer look at that one later this weekend.

    Let me know your thoughts!

    ok 1
    ok 2
    ok 3
    ok 4
    ok 5
    ok 6
    not ok 7
    #   Failed test at ./benchmark.pl line 209.
    #     Structures begin differing at:
    #          $got->{f5}{f7} = 'f8'
    #     $expected->{f5}{f7} = HASH(0x7fd932364b88)
    1..7
                    Rate iterative slice map_grep map_grep_2 recursive delete
    iterative   191571/s        --   -3%     -20%       -23%      -30%   -57%
    slice       198413/s        4%    --     -17%       -20%      -27%   -56%
    map_grep    240385/s       25%   21%       --        -3%      -12%   -46%
    map_grep_2  248139/s       30%   25%       3%         --       -9%   -45%
    recursive   271739/s       42%   37%      13%        10%        --   -39%
    delete      448430/s      134%  126%      87%        81%       65%     --
    # Looks like you failed 1 test of 7.

    Best,

    Jim

    πάντων χρημάτων μέτρον έστιν άνθρωπος.

      Not surprisingly, the version which just deletes keys from the source hash is by far the fastest.

      No surprise here, because the keys were deleted only once, during a single test run; in all the other ~1e6 runs, the "else" branch was never reached. I'm not suggesting the benchmarks be fixed, e.g. by deep-cloning $t for each run for every contestant, because frankly I don't see the point. The OP uses the word "remove" twice, but apart from this hash_filter_delete in the benchmarks, nothing in the whole thread removes anything; filtered shallow copies are constructed instead. Was that the intention? There is some inconsistency, at least, in that strings are duplicated but arrays are cloned. The useful payload is so tiny compared to the overhead of e.g. maintaining a queue (pushing, shifting, creating anonymous arrays), and the recursion depth so shallow, that an iterative approach will show no advantages.
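
      (If anyone does want to level the field, the deep-cloning I mean would look roughly like this, with $source, $filter and hash_filter_delete as used elsewhere in this thread; Storable is core:)

      use Storable qw(dclone);

      # Hand the destructive contestant a fresh copy on every iteration,
      # so each run really does delete keys.
      my $bench_delete = sub {
          my $copy = dclone($source);
          hash_filter_delete( $copy, $filter );
      };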

        Ah, I can't believe I missed that the keys were only being deleted once!

        hash_filter_delete is indeed the only function which modifies the source hash instead of constructing a new one. See this node for the inspiration.

        As for the tiny payload: the data I am actually using this hash_filter function on is of similar size and depth. I could try out some larger payloads with more levels of hash nesting to see if the iterative version does any better.
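
        Something like this quick generator should do for larger payloads (my own helper; depth and width are arbitrary knobs):

        use strict;
        use warnings;

        # Build a hash of the given depth and width: roughly width**depth leaves.
        sub make_nested {
            my ( $depth, $width ) = @_;
            return 'leaf' if $depth == 0;
            return {
                map { ( "k$_" => make_nested( $depth - 1, $width ) ) } 1 .. $width
            };
        }

        my $big = make_nested( 6, 5 );    # ~15k leaves, six levels deep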

        Thanks,

        Jim

        πάντων χρημάτων μέτρον έστιν άνθρωπος.

      Hi jimpudar, thanks for posting those results. Although the tests are probably fine for the original problem you posted, I had the same concerns as vr: I think the payload is too small for testing. So I started to do some testing myself.

      Just for the fun of it, I created some tests that use much bigger hash structures in memory (my aim was memory sizes of 1.2 to 1.8 GB of data). I also wanted some variance in how the hash is built, e.g. hashes with lots of values and fewer hashes vs. hashes with lots of hashes containing fewer values, and small filters vs. big filters.

      Here are the results of my tests:

      1. Hash depth = 10. Lots of values vs. lots of hashes. Generated in loops that run 17 times. ~300k hashes. Filter 70% of hashes and 88% of values of the original hash. 10 test cycles. Memory usage approx. 1.4 GB.

      Number of Hashes = 305798
      Number of Values = 4892785
      Number of Hashes Filtered = 268718
      Number of Values Filtered = 3435051
      Filtering 88% Hashes
      Filtering 70% Values
                 s/iter map_grep_2 map_grep slice recursive iterative
      map_grep_2   6.29         --      -1%   -8%      -10%      -16%
      map_grep     6.22         1%       --   -7%       -9%      -15%
      slice        5.81         8%       7%    --       -2%       -9%
      recursive    5.68        11%      10%    2%        --       -7%
      iterative    5.27        19%      18%   10%        8%        --

      iterative seems to perform very well under these circumstances. I ran this test multiple times; occasionally recursive or slice wins.

      2. Hash depth = 10. Lots of values vs. lots of hashes. Generated in loops that run 17 times. ~400k hashes. Filter 50% of hashes and 10% of values of the original hash. 10 test cycles. Memory usage approx. 1.7 GB.

      Number of Hashes = 405929
      Number of Values = 6494881
      Number of Hashes Filtered = 203883
      Number of Values Filtered = 660915
      Filtering 50% Hashes
      Filtering 10% Values
                 s/iter iterative map_grep map_grep_2 slice recursive
      iterative    1.67        --      -1%        -1%   -7%      -12%
      map_grep     1.65        1%       --        -0%   -6%      -11%
      map_grep_2   1.65        1%       0%         --   -6%      -10%
      slice        1.55        7%       6%         6%    --       -5%
      recursive    1.48       13%      12%        12%    5%        --

      3. Hash depth = 10. Lots of hashes vs. lots of values. Generated in loops that run 5 times. ~1.6M hashes. Filter 87% of hashes and 69% of values of the original hash. 10 test cycles. Memory usage approx. 2.4 GB.

      Number of Hashes = 1579338
      Number of Values = 6317357
      Number of Hashes Filtered = 1369612
      Number of Values Filtered = 4362177
      Filtering 87% Hashes
      Filtering 69% Values
                 s/iter map_grep_2 slice map_grep recursive iterative
      map_grep_2   10.7         --   -0%      -3%       -6%       -6%
      slice        10.7         0%    --      -3%       -6%       -6%
      map_grep     10.4         3%    3%       --       -3%       -3%
      recursive    10.1         6%    6%       3%        --       -0%
      iterative    10.0         7%    7%       3%        0%        --

      4. Hash depth = 10. Lots of hashes vs. lots of values. Generated in loops that run 5 times. ~1.2M hashes. Filter 40% of hashes and 10% of values of the original hash. 10 test cycles. Memory usage approx. 1.5 GB.

      Number of Hashes = 1226812
      Number of Values = 4907253
      Number of Hashes Filtered = 492351
      Number of Values Filtered = 509410
      Filtering 40% Hashes
      Filtering 10% Values
                 s/iter iterative slice map_grep map_grep_2 recursive
      iterative    2.56        --  -16%     -17%       -17%      -22%
      slice        2.15       19%    --      -1%        -1%       -7%
      map_grep     2.13       20%    1%       --        -0%       -6%
      map_grep_2   2.12       20%    1%       0%        --        -6%
      recursive    1.99       28%    8%       7%        6%        --

      The last test shows recursive as fastest; however, tests 2 and 4 are, as far as I am concerned, less interesting, since they run for a shorter time and only show differences of a couple of tenths of a second.

      After testing, I think the recursive function is still a good solution, although my tests show that it is actually iterative that won the most complex schemes. Some of the algorithms run slower or faster under different circumstances: I generated different hash structures and filter structures, and I think each of these differences may be more or less beneficial to the different algorithms. This I find important, because as far as I am concerned choosing the best algorithm should always be based on the biggest time loss.

      Eventually I have to conclude: what are we actually optimizing? Readable code vs. a win of one or two seconds for processing gigabytes of data?

      Sorry for not adding mr_ron's tests; I couldn't get them to work.

      I did optimize the slice method, though: it had an extra keys instruction that was not needed, but I also added the check for a valid filter:

      sub hash_slice {
          my $n;
          my @keysToGet = keys %{ $_[1] };
          @{ $n }{ @keysToGet } = @{ $_[0] }{ @keysToGet };
          foreach ( @keysToGet ) {
              if ( ref $_[1]->{$_} eq 'HASH' ) {
                  croak "bad filter: on '$_', expected HASH\n"
                      unless ( ref $_[0]->{$_} eq 'HASH' );
                  $n->{$_} = hash_slice( $_[0]->{$_}, $_[1]->{$_} );
              }
          }
          return $n;
      }

      Here's the code that I used:

      Note: I posted on June 6th, but then I found out that the code was actually not working correctly... so I had to fix the code and update the Benchmark tests and conclusions on the 7th. The code is the last version I used.

        Very nice, thanks for putting this together!

        I think we have proved that 99% of the time, the recursive solution is best, as it is the easiest to understand and maintain.

        Best,

        Jim

        πάντων χρημάτων μέτρον έστιν άνθρωπος.

[Perl 6] Re: A more elegant way to filter a nested hash?
by raiph (Deacon) on Jun 09, 2018 at 19:28 UTC
    With apologies if you're not interested in a P6 solution. I was, and I thought the one that is natural in P6 was worth writing up, including its downsides and complications.

    Given your data code exactly as written, one could write this parallel iterator in P6, and given correctly formed data it should work to any depth:

    sub infix:<landr> ($l, $r) { $l and $r };
    say $filter «landr» $source
    Run by the Rakudo Perl 6 compiler, it would display:
    {f10 => [important data], f3 => important data, f5 => {f6 => more important data, f7 => {f8 => important data}}}
    While I think it nicely addresses the elegance focus of your OP, other issues obviously arise.

    Before touching the heavier aspects, I'll get a couple of lighter ones out of the way. First, I didn't add any validation of the input; I haven't even considered it. Second, the above code constructs a new data structure rather than pruning the existing $source. Perhaps that's not desirable if the structure is extremely large. I'd expect there to be some P6 way to iterate over two data structures in parallel, mutating one of them en passant, but it may not be as elegant, and again I haven't considered it. The above at least illustrates one conceptually and syntactically elegant approach to the problem in P6, and I've decided to stop there because there are obviously heavier issues...

    The elephant in the room is of course that it's P6, not P5. The rest of this comment discusses the two most important elements of this latter aspect, namely plausible integration into a P5 environment, and performance.

    Plausible integration of the above code into a P5 environment

    While the technical / deployment aspects of integrating P6 into a P5 environment aren't trivial, the much bigger and thornier issues in most situations are social, not technical. But I'm not going to address social issues in this post. Either a reader is interested in mixing P6 into a P5 environment or they aren't; I'm going to assume they are, and that the only useful thing I can provide under this heading is an explanation of the two normal ways of doing so.

    One way to integrate the two is to add a use Inline::Perl6; statement to P5 code. (The linked module hasn't been updated since December 2016, but I do not think that reflects a lack of interest in maintenance on Stefan Seifert's part so much as that the documentation is still valid. Yes, P5's IP6 syntax is very primitive compared to P6's IP5 -- see the next paragraph, and perhaps consider encouraging Stefan to develop P5's IP6 to catch up with P6's IP5 sweetness. IP6 relies on the regularly updated Inline::Perl5 to do its thing, so bug fixes, performance, semantics, etc. should be basically the same.)

    The usual way to integrate P5 and P6 is instead to reverse the approach, adding a use Inline::Perl5; statement to P6 code. The underlying tech is the same. As far as I know, the only substantive difference is the highly polished syntactic sugar available when doing it this second way around.

    Rakudo hyper op performance

    Ignoring the many other issues with use of P6, it's plausible that any P6 solution is currently unacceptably slow compared to any P5 solution because most Rakudo operations are currently significantly slower than their P5 cousins.

    Rakudo is already fast enough for some production use cases for some users. And it has begun the long hard slog of reaching faster performance than P5 with some success for a handful of things. But I think it'll be years before its progress is sufficiently compelling to quiet the grinches. Also, happily, perl (5) is getting significantly faster at some of its basic operations each year and there's at least one plausibly serious performance oriented alternate P5. (We may well be about to witness an interesting long haul race between old and new Perls, Pythons, and Rubys throughout the 2020s.)

    In theory, use of hyperops (« and » or their ascii equivalents) in P6 will one day suddenly jump to become one or several orders of magnitude faster for some use cases. This is because they're semantically defined to be done in parallel and the compiler is architected to take advantage of that (though does not yet do so).

    While there may be an order of magnitude jump or so if the target hardware is multiple cores, it could well be several orders if the target is a GPU.

    I doubt the compiler would ever parallelize for your use case based purely on heuristics. These heuristics will presumably only consider attempting parallelization of "hot" code.

    (I say "attempting parallelization" because I'm expecting the heuristic approach to be speculative, backed by the same deoptimization mechanism that backs the new speculative static analysis that is just recently beginning to be applied -- see ‎How does deoptimization help us go faster? which is a link to a video that jumps right to the point where Jonathan Worthington delivers his presentation's punchline.)

    As I understand it, code would have to be repeated at least several hundred times before it would be considered hot. Your use case seems to be a one-time-per-run deal so speculative parallelization would not apply.

    Instead, I'm anticipating compiler directives, perhaps in the form of pragmas like use hyper <NUMA>; or use GPU; or similar, that direct Rakudo to unconditionally parallelize if the appropriate hardware is present.

    As I understand things, this parallelization could be attempted today but tuits are currently focused elsewhere and likely to remain so for the next few years. So this is likely waiting until someone with sufficient C chops and a fearless attitude jumps in and makes it happen.

      Very cool. I'll have to do some reading to figure out how this works.

      Thanks for the writeup!

      Jim

      πάντων χρημάτων μέτρον έστιν άνθρωπος.

      With apologies if you're not interested in a P6 solution.

      As long as you're writing code in languages other than the one the OP presented, would you mind writing up the Inline Haskell version as well? Nested zippers are interesting. Otherwise, I'd personally like to see the Inline Javascript version using the new ES6 destructuring syntax.

        I would suggest that people who know how to present a certain algorithm do so in the language they are fluent in. Since you appear to know what you're talking about, I suggest that you provide the Haskell and Inline JavaScript versions, so that we can all learn from them rather than getting a lesson in sarcasm.
Re: A more elegant way to filter a nested hash? ( data xpath, dpath, diver, walk, walker, visitor, jquery, pquery, json path, css csel, MarpaX::xPathLike)
by Anonymous Monk on May 31, 2018 at 08:51 UTC