Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

"exists $hash{key}" is slower than "$hash{key}"

by swl (Curate)
on Jan 06, 2020 at 00:23 UTC ( #11111015=perlmeditation: print w/replies, xml ) Need Help??

UPDATE 2020-01-10: Actually, it's not. See subthread starting at 11111117.

----

I decided to run some benchmarking on hash exists after some code profiling showed a reasonable amount of time spent on lines with next if exists $hash{$key}.

This is largely in the context of code structured like the (very contrived) example below which uses the common idiom of skipping slow code if it has already been done or is not needed based on a tracking hash.

my %done; my @data = (1..100); for (1..100) { push @data, int (rand() * 100); } for my $item (@data) { next if exists $done{$item}; # do something time consuming # ... $done{$item}++; }

The code below tries combinations of exists and value checking. Assignment to variables is used to avoid "Useless use of hash element in void context" warnings, and the assignment to globals is to get a sense of how much the timings are related to bookkeeping of lexicals. I could disable warnings but it's the relative timing differences that are useful here, not the absolute times.

use Benchmark qw {:all}; use 5.016; my %hash; # set up the hash for (1001..2000) { $hash{$_}++; } # two keys we use below our $key1 = 1001; our $key2 = 1002; # hash key 1 is SV $hash{$key1} = 1; # hash key 2 is RV $hash{$key2} = {1..10}; # assign to global as a baseline our $xx_global; # keys are short so the timing table is not too wide # char 1: e = exists check, # v = value check # chars 2,3: ck = constant key, # vk = variable key # chars 4,5: sv = key contains scalar value, # rv = key contains reference # char 6: l = assign to lexical, # g = assign to global # thus # ecksvl means "exists check # using constant key # containing a scalar value, # assigned to lexical" # (the value is clearly redundant for an exists check, # but is retained for completeness) my %checks = ( ecksvl => 'my $x = exists $hash{1001}', evksvl => 'my $x = exists $hash{$key1}', vcksvl => 'my $x = $hash{1001}', vvksvl => 'my $x = $hash{$key1}', evksvg => '$xx_global = exists $hash{$key1}', vvksvg => '$xx_global = $hash{$key1}', eckrvl => 'my $x = exists $hash{1002}', evkrvl => 'my $x = exists $hash{$key2}', vckrvl => 'my $x = $hash{1002}', vvkrvl => 'my $x = $hash{$key2}', evkrvg => '$xx_global = exists $hash{$key2}', vvkrvg => '$xx_global = $hash{$key2}', ); cmpthese ( -3, \%checks );

Code was run using Strawberry perl 5.28.0, and the results are given in the table below (see code for key explanation).

The main take home is that the value checks (v prefix) are all faster than the exists checks (e prefix). Assigning to global is faster, presumably because there is less bookkeeping involved, but it will be rare that one would use such a construct anyway.

Rate evksvl evkrvl ecksvl eckrvl evksvg evkrvg vvksvl vvk +rvl vckrvl vcksvl vvksvg vvkrvg evksvl 10733145/s -- -5% -7% -12% -15% -18% -29% - +31% -32% -33% -41% -48% evkrvl 11290643/s 5% -- -2% -7% -10% -14% -25% - +27% -28% -29% -38% -45% ecksvl 11570664/s 8% 2% -- -5% -8% -12% -23% - +25% -27% -27% -36% -44% eckrvl 12176232/s 13% 8% 5% -- -3% -8% -19% - +21% -23% -23% -33% -41% evksvg 12572221/s 17% 11% 9% 3% -- -5% -17% - +19% -20% -21% -31% -39% evkrvg 13168623/s 23% 17% 14% 8% 5% -- -13% - +15% -17% -17% -28% -36% vvksvl 15082826/s 41% 34% 30% 24% 20% 15% -- +-2% -4% -5% -17% -27% vvkrvl 15461840/s 44% 37% 34% 27% 23% 17% 3% + -- -2% -3% -15% -25% vckrvl 15777625/s 47% 40% 36% 30% 25% 20% 5% + 2% -- -1% -13% -23% vcksvl 15909705/s 48% 41% 38% 31% 27% 21% 5% + 3% 1% -- -13% -23% vvksvg 18207860/s 70% 61% 57% 50% 45% 38% 21% +18% 15% 14% -- -12% vvkrvg 20580512/s 92% 82% 78% 69% 64% 56% 36% +33% 30% 29% 13% --

So why is it that exists is slower than checking the value? My starting assumption was that exists should be faster, as getting a value requires checking that it exists first. However, looking that the source code, most of the hash key and value calls are passed through the same function, hv_common. So far as I can tell from reading the code, and based on my limited comprehension of the details, hv_common prioritises getting values over checking key existence and value assignment.

So does this all matter and should code that uses exists $hash{$key} be changed to use $hash{$key}? Given that even the slowest of the benchmark snippets is running more than 10,000,000 per second, it does not matter at all for most use cases. One would need to be running hundreds of millions of calls for such a change to start to make a meaningful difference, and some would quite reasonably argue that billions of calls are needed.

Maybe the perl source code could be optimised so exists is not slower, but whether this justifies any additional maintenance burden is not something I can answer.

Replies are listed 'Best First'.
Re: "exists $hash{key}" is slower than "$hash{key}"
by choroba (Archbishop) on Jan 06, 2020 at 01:01 UTC
    Unfortunately, you can't access the lexical hash variable from a string-evaled code in Benchmark, so the benchmarks measures nothing.
    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; use Benchmark qw{ cmpthese }; my %hl; @hl{1001 .. 2000} = (1) x 1000; our %hg = %hl; our ($el, $vl, $eg, $vg) = (0) x 4; die unless $hl{1001}; die unless $hg{1001}; cmpthese(-1, { exist_l => q{ ++$el if exists $hl{1001} }, value_l => q{ ++$vl if $hl{1001} }, exist_g => q{ ++$eg if exists $hg{1001} }, value_g => q{ ++$vg if $hg{1001} }, }); say join "\n", "el: $el", "vl: $vl", "eg: $eg", "vg: $vg";
    5.26.1 on Linux:
    Rate value_g exist_g value_l exist_l value_g 17625636/s -- -4% -47% -53% exist_g 18303545/s 4% -- -45% -51% value_l 33363781/s 89% 82% -- -11% exist_l 37417325/s 112% 104% 12% -- el: 0 vl: 0 eg: 23229990 vg: 23229990
    Blead gives smaller differences, but the order is the same.

    Update: Changing q{ to sub { changes the order randomly and makes the difference less than 10%.

    Update 2: Changing 1001 to 2001 in the benchmarked code (i.e. testing non-existent key) changes the differences to 20% and less, e.g.

    Rate value_g exist_g value_l exist_l value_g 31977080/s -- -2% -8% -19% exist_g 32733979/s 2% -- -6% -17% value_l 34737228/s 9% 6% -- -12% exist_l 39658833/s 24% 21% 14% --

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      If you want to benchmark code using lexical variables, but need to ameliorate the overhead of calling a sub, usually the easiest way is to wrap the code you're benchmarking in a loop that gets executed thousands of times...

      #!/usr/bin/perl use warnings; use strict; use Benchmark qw{ cmpthese }; my %h; @h{1001 .. 2000} = (1) x 1000; my ($e, $v) = (0) x 2; cmpthese(-1, { exist => sub { for (0..999_999) { ++$e if exists $h{1001} } }, value => sub { for (0..999_999) { ++$v if $h{1001} } }, }); __DATA__ Rate value exist value 18.3/s -- -5% exist 19.3/s 5% --

        Thanks for this. The interesting thing is that if I stringify the code to avoid the sub call then I can replicate the ordering from my original post.

        I added a false value to get a sense of the cost of the increment operation. value_true is value from your version, and is obviously faster as it does less work.

        use warnings; use strict; use Benchmark qw{ cmpthese }; my %h; @h{1001 .. 2000} = (1) x 1000; my ($e, $vt, $vf) = (0) x 3; $h{1002} = 0; cmpthese(-2, { exist => sub { for (0..999_999) { ++$e if exists $h{1001} } }, exist_str => 'for (0..999_999) { ++$e if exists $h{1001} }', value_true => sub { for (0..999_999) { ++$vt if $h{1001} } }, value_str => 'for (0..999_999) { ++$vt if $h{1001} }', value_false => sub { for (0..999_999) { ++$vf if $h{1002} } }, }); __DATA__ Rate exist value_true exist_str value_str val +ue_false exist 16.1/s -- -9% -23% -26% + -27% value_true 17.7/s 10% -- -15% -18% + -20% exist_str 20.9/s 29% 18% -- -4% + -6% value_str 21.7/s 34% 22% 4% -- + -2% value_false 22.1/s 37% 25% 6% 2% + --

      Thanks for this.

      I'm not sure why not being able to access the lexical matters. My objective is simply to isolate code as much as possible so the relative differences are down to the exists/value checks. That said, I used the globals because I hit an issue with incrementing package lexicals in the stringified code when it was run by the benchmark module, but probably just did something incorrect when setting those up.

      The effect of the non-existent key is interesting. I'll add it to my benchmarks.

      WRT sub calls, there seem to still be overheads that affect the timings (or my benchmarks need more work) - see my response to tobyink at 11111089.

Re: "exists $hash{key}" is slower than "$hash{key}"
by GrandFather (Sage) on Jan 06, 2020 at 05:57 UTC

    Even if real, when is a 50ns difference in an "exists" operation significant?

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

      <sarc weight="0.25"> Maybe when you do it 10 billion times . . .

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

        Boiler pressure in the steam engine driving your computer is going to make more difference than that! Adding a few megs or RAM or an SSD will completely mask any difference.

        Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: "exists $hash{key}" is slower than "$hash{key}"
by LanX (Archbishop) on Jan 06, 2020 at 00:42 UTC
    Hi

    Just a very wild guess (Disclaimer: I didn't check your benchmark for possible holes)

    It seems to me that exists is just implemented on top of retrieving the value.

    IOW the opcode for exists gets the value but throws it away, which makes it more expensive than a normal look-up.

    C:\WINDOWS\system32>perl -MO=Concise -E"exists $h{a}" 4 <@> leave[1 ref] vKP/REFC ->(end) 1 <0> enter ->2 2 <;> nextstate(main 74 -e:1) v:%,{,469764096 ->3 - <1> ex-exists vK/1 ->4 3 <+> multideref($h{"a"}) vK/EXISTS ->4 - <0> ex-gv s ->3 -e syntax OK C:\WINDOWS\system32>perl -MO=Concise -E"$h{a}" 4 <@> leave[1 ref] vKP/REFC ->(end) 1 <0> enter ->2 2 <;> nextstate(main 74 -e:1) v:%,{,469764096 ->3 - <1> ex-helem vK/2 ->4 3 <+> multideref($h{"a"}) vK ->4 - <0> ex-gv s ->3 -e syntax OK

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      If this is the case then it seems counter intuitive at first glance, and maybe is something that could be modified.

      One advantage of using exists $hash{$key} is that the hash values can then all be assigned undef, potentially saving memory. (Although one could also create a scalar ref and re-use it, e.g. my $flag = \1 - this is what is used in Hash::Ordered as a tombstone value.

        > If this is the case then it seems counter intuitive at first glance, and maybe is something that could be modified.

        Well - "if this is the case" - it's most probably for "historical reasons".

        I could imagine exists came later and that was the easiest way to implement it, and the pay off for any redesign is too small.

        You shouldn't forget that there is much more going on in the background which would need to be replicated, like autovivification and so on.

        > One advantage of using exists

        I'm not sure I can follow.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://11111015]
Approved by choroba
Front-paged by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2020-04-04 14:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    The most amusing oxymoron is:
















    Results (32 votes). Check out past polls.

    Notices?