http://qs321.pair.com?node_id=11113796


in reply to Re^2: Detecting whether UV fits into an NV
in thread Detecting whether UV fits into an NV

Rob:

I'm not experienced in using Inline::C, but I thought I'd take a look at it. My first guess is that the overhead in SvUV(ST(i)) is causing twiddle to be slower, so I tweaked the routine to explicitly pull arg into a UV as you did in double3:

$ diff pm_11113750_orig.pl pm_11113750_a.pl
26a27
>     UV arg;
30,31c31,33
<       IV neg_t = -(SvIV(ST(i)));
<       IV last_set = SvUV(ST(i)) & neg_t;
---
>       arg = SvUV(ST(i));
>       IV neg_t = -arg;
>       IV last_set = arg & neg_t;
42c44
<       if(!(SvUV(ST(i)) & invalid_bits)) count++;
---
>       if(!(arg & invalid_bits)) count++;

Doing that change alone made the routines roughly the same speed:

$ PERL5OPT= perl pm_11113750_orig.pl
Benchmark: timing 5 iterations of uv_fits_double3, uv_fits_double_bitfiddle...
uv_fits_double3: 7 wallclock secs ( 6.70 usr + 0.00 sys = 6.70 CPU) @ 0.75/s (n=5)
uv_fits_double_bitfiddle: 17 wallclock secs (17.36 usr + 0.00 sys = 17.36 CPU) @ 0.29/s (n=5)
46875 46875

$ PERL5OPT= perl pm_11113750_a.pl
Benchmark: timing 5 iterations of uv_fits_double3, uv_fits_double_bitfiddle...
uv_fits_double3: 7 wallclock secs ( 6.75 usr + 0.00 sys = 6.75 CPU) @ 0.74/s (n=5)
uv_fits_double_bitfiddle: 6 wallclock secs ( 6.67 usr + 0.00 sys = 6.67 CPU) @ 0.75/s (n=5)
46875 46875

Since the twiddle part doesn't have any loops, I would expect it to be significantly faster than a version *with* loops, depending on how clever the optimizer is. So another version I played with was to modify the routine to take two parameters and count the number of values that fit between those two values. That way, I could greatly reduce the impact of the SvUV() function at the cost of only being able to process sequences of numbers. That version confirmed my suspicion that the SvUV() function is slow enough that it swamps the difference in our algorithms.

int uv_fits_double3x(SV* the_min, SV* the_max) {
    dXSARGS;
    UV i_min = SvUV(the_min);
    UV i_max = SvUV(the_max);
    UV i;
    int count = 0;
    UV arg;

    for(i = i_min; i < i_max; i++) {
        arg = i;
        if(arg) {
            while(!(arg & 1)) arg >>= 1;
            if(arg < 9007199254740993) count++;
        }
    }
    return count;
}

int uv_fits_double_bitfiddle_3x(SV* the_min, SV* the_max) {
    dXSARGS;
    UV i_min = SvUV(the_min);
    UV i_max = SvUV(the_max);
    UV i;
    int count = 0;

    for(i = i_min; i < i_max; i++) {
        UV arg = i;
        IV neg = -arg;
        IV smallest_invalid = (arg & -arg)<<53;
        UV valid_bits = smallest_invalid-1;
        UV invalid_bits = ~valid_bits;
        if (! (arg & invalid_bits)) count++;
    }
    return count;
}

Both routines are much faster if they take only two arguments and iterate over the entire range themselves. Only then does the twiddle version show a significant speed advantage, at about 1.6 times as fast. Compared with the list-based versions in Test 1, double3x is about 1800 times faster, while bitfiddle_3x is about 2900 times faster:

$ PERL5OPT= perl pm_11113750_x.pl
Name "main::count4" used only once: possible typo at pm_11113750_x.pl line 147.
=== Test 1: roughly equivalent speed
Benchmark: timing 5 iterations of uv_fits_double3, uv_fits_double_bitfiddle...
uv_fits_double3: 28 wallclock secs (27.22 usr + 0.08 sys = 27.30 CPU) @ 0.18/s (n=5)
uv_fits_double_bitfiddle: 27 wallclock secs (26.72 usr + 0.00 sys = 26.72 CPU) @ 0.19/s (n=5)
33046878 33046878 (48000004)

=== TEST 2: twiddle has an advantage if we don't use SvUV()
Benchmark: timing 10000 iterations of bitfiddle_3x, double3x, empty_func, empty_loop...
bitfiddle_3x: 18 wallclock secs (18.39 usr + 0.00 sys = 18.39 CPU) @ 543.74/s (n=10000)
double3x: 30 wallclock secs (29.55 usr + 0.00 sys = 29.55 CPU) @ 338.44/s (n=10000)
empty_func: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
            (warning: too few iterations for a reliable count)
empty_loop: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
            (warning: too few iterations for a reliable count)
33046875 33046875 48000000 1416858175
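The matching counts in both tests suggest the two checks agree on the benchmark inputs. As a minimal standalone sketch of why, the two tests can be compared outside of XS. It assumes uint64_t in place of Perl's UV so that it compiles without the Perl headers; count_disagreements and self_check are hypothetical helpers, not part of the posted code. It also shows that, as posted, the loop version skips zero while the bit-fiddle version counts it. The benchmark ranges contain no zero, so the counts above are unaffected.

```c
#include <assert.h>
#include <stdint.h>

/* Loop form, as in uv_fits_double3: drop trailing zero bits, then
 * compare what is left against 2**53 + 1. Zero is skipped, as in the post. */
static int fits_loop(uint64_t arg) {
    if (arg == 0) return 0;
    while (!(arg & 1)) arg >>= 1;
    return arg < 9007199254740993ULL;
}

/* Bit-fiddle form, as in uv_fits_double_bitfiddle, kept in unsigned
 * arithmetic throughout so an over-long shift simply wraps to 0. */
static int fits_bitfiddle(uint64_t arg) {
    uint64_t lowest = arg & (0 - arg);          /* least significant set bit */
    uint64_t smallest_invalid = lowest << 53;   /* 0 when it wraps past bit 63 */
    uint64_t valid_bits = smallest_invalid - 1;
    return !(arg & ~valid_bits);
}

/* Hypothetical helper: count values in [lo, hi) on which the two forms differ. */
static uint64_t count_disagreements(uint64_t lo, uint64_t hi) {
    uint64_t n = 0;
    for (uint64_t v = lo; v < hi; v++)
        if (fits_loop(v) != fits_bitfiddle(v)) n++;
    return n;
}

static void self_check(void) {
    /* A window straddling 2**53, where the answer starts to vary. */
    assert(count_disagreements(9007199254740000ULL, 9007199254742000ULL) == 0);
    /* Zero is where the two forms differ. */
    assert(fits_loop(0) == 0 && fits_bitfiddle(0) == 1);
}
```

Calling self_check() from a small main() is enough to exercise it. For any nonzero value both forms reduce to the same question: whether the odd part of the number is below 2**53.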

Unfortunately, I've not been successful in making a useful version that's significantly faster in a real-world scenario (with call overhead and such). My current understanding of things (after playing with it for a few hours) is that the single-argument versions get swamped by the looping and function-calling overhead: the functions are just too simple for the difference between the algorithms to matter.

NOTES: As you can tell from the above, I haven't managed to measure the function-call (empty_func) or empty-loop (empty_loop) versions of the functions, so my benchmarking could be broken. I'm including the current code so people can point things out.

use warnings;
use Benchmark;

use Inline C => Config =>
    BUILD_NOISY => 1;

use Inline C => <<'EOC';

int uv_fits_double3(SV * x, ...) {
  dXSARGS;
  int i, count = 0;
  UV arg;

  for(i = 0; i < items; i++) {
    arg = SvUV(ST(i));
    if(arg) {
      while(!(arg & 1)) arg >>= 1;
      if(arg < 9007199254740993) count++;
    }
  }
  return count;
}

int uv_fits_double_bitfiddle(SV * t, ...) {
  dXSARGS;
  int i, count = 0;
  UV arg;

  for(i = 0; i < items; i++) {
    arg = SvUV(ST(i));
    IV smallest_invalid = (arg & -arg) << 53;
    UV valid_bits = smallest_invalid - 1;
    UV invalid_bits = ~valid_bits;
    if ( !(arg & invalid_bits)) count++;
  }
  return count;
}

int uv_fits_double3x(SV* the_min, SV* the_max) {
  dXSARGS;
  UV i_min = SvUV(the_min);
  UV i_max = SvUV(the_max);
  UV i;
  int count = 0;
  UV arg;

  for(i = i_min; i < i_max; i++) {
    arg = i;
    if(arg) {
      while(!(arg & 1)) arg >>= 1;
      if(arg < 9007199254740993) count++;
    }
  }
  return count;
}

int uv_fits_double_bitfiddle2(SV* the_min, SV* the_max) {
  dXSARGS;
  UV i_min = SvUV(the_min);
  UV i_max = SvUV(the_max);
  UV i;
  int count = 0;

  for(i = i_min; i < i_max; i++) {
    UV arg = i;
    IV neg = -arg;
    IV smallest_invalid = (arg & -arg)<<53;
    UV valid_bits = smallest_invalid-1;
    UV invalid_bits = ~valid_bits;
    if (! (arg & invalid_bits)) count++;
  }
  return count;
}

int uv_empty_func(SV* the_min, SV* the_max) {
  dXSARGS;
  UV i_min = SvUV(the_min);
  UV i_max = SvUV(the_max);
  return i_max - i_min;
}

int uv_empty_loop(SV* the_min, SV* the_max) {
  dXSARGS;
  UV i_min = SvUV(the_min);
  UV i_max = SvUV(the_max);
  UV i;
  int count = 0;

  for(i = i_min; i < i_max; i++) {
    UV arg = i; // boring
    count++;
  }
  return count * i_max - i_min;
}

EOC

@in2 = (
    [ 1844674407366955161, 1844674407378955161 ],
    [ 9007199248740992, 9007199260740992 ],
    [ 184467436737095, 184467448737095 ],
    [ 184463440737, 184475440737 ],
);
push @in, $_->[0] .. $_->[1] for @in2;

print "=== Test 1: roughly equivalent speed\n";
our ($count1, $count2);
($count1, $count2) = (0, 0);

timethese (5, {
    'uv_fits_double3' => '$count1 = uv_fits_double3(@in);',
    'uv_fits_double_bitfiddle' => '$count2 = uv_fits_double_bitfiddle(@in);',
});

print "$count1 $count2 (", scalar(@in), ")\n";
print "!!!! MISMATCH !!!!\n" if $count1 != $count2;

print "\n=== TEST 2: twiddle has an advantage if we don't use SvUV()\n";
($count1, $count2, $count3) = (0, 0, 0, 0);

timethese (10000, {
    'double3x' => '$count1 = uv_fits_double3x(@{$in2[0]})'
                 .' + uv_fits_double3x(@{$in2[1]})'
                 .' + uv_fits_double3x(@{$in2[2]})'
                 .' + uv_fits_double3x(@{$in2[3]})'
    ,
    'bitfiddle_3x'=>'$count2 = uv_fits_double_bitfiddle2(@{$in2[0]})'
                 .' + uv_fits_double_bitfiddle2(@{$in2[1]})'
                 .' + uv_fits_double_bitfiddle2(@{$in2[2]})'
                 .' + uv_fits_double_bitfiddle2(@{$in2[3]})'
    ,
    'empty_func'=>'$count3 = uv_empty_func(@{$in2[0]})'
                 .' + uv_empty_func(@{$in2[1]})'
                 .' + uv_empty_func(@{$in2[2]})'
                 .' + uv_empty_func(@{$in2[3]})'
    ,
    'empty_loop'=>'$count4 = uv_empty_loop(@{$in2[0]})'
                 .' + uv_empty_loop(@{$in2[1]})'
                 .' + uv_empty_loop(@{$in2[2]})'
                 .' + uv_empty_loop(@{$in2[3]})'
    ,
});

print "$count1 $count2 $count3 $count4\n";
print "!!!! MISMATCH !!!!\n" if $count1 != $count2;

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^4: Detecting whether UV fits into an NV
by syphilis (Archbishop) on Mar 05, 2020 at 01:38 UTC
    My first guess is that the overhead in SvUV(ST(i)) is causing twiddle to be slower

    Yes, I thought of replacing them with a variable, but decided there wouldn't be that much difference between looking at the value of an SV's IV slot and looking at the value of an IV.
    I guess for a few calls there's not much difference, but when you're making 36 million of them it's not hard to believe that things might add up - and I should have thought that through a little better. (Actually, a "lot better".)

    Fixing that alone makes uv_fits_double_bitfiddle almost twice as fast as uv_fits_double3 for me:
    Benchmark: timing 1 iterations of uv_fits_double3, uv_fits_double_bitfiddle...
    uv_fits_double3: 1 wallclock secs ( 0.45 usr + 0.00 sys = 0.45 CPU) @ 2.21/s (n=1)
                (warning: too few iterations for a reliable count)
    uv_fits_double_bitfiddle: 0 wallclock secs ( 0.25 usr + 0.00 sys = 0.25 CPU) @ 4.02/s (n=1)
                (warning: too few iterations for a reliable count)
    This is pretty much the type of approach whose existence I had wondered about.
    It had never been pointed out to me that iv & -iv would identify the least significant set bit, and I'm certainly not sharp enough to have ever realized it myself.
    This method is just brilliant ... and it's great that it turns out to be faster, too !!
    I'll certainly be using it (with due accreditation to you) unless further testing, contrary to my expectations, reveals some problem with it.
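
For readers who have not met the iv & -iv idiom, here is a minimal standalone sketch of it. It assumes uint64_t rather than Perl's IV/UV types so it builds without the Perl headers, and writes the negation as 0 - x to stay in unsigned arithmetic. The loop version and self_check are hypothetical helpers added only for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Least significant set bit of x, or 0 when x is 0.
 * Negating x inverts every bit above the lowest set bit and leaves that
 * bit and the zeros below it unchanged, so the AND keeps exactly one bit. */
static uint64_t lowest_set_bit(uint64_t x) {
    return x & (0 - x);
}

/* Straightforward loop version, for comparison only. */
static uint64_t lowest_set_bit_loop(uint64_t x) {
    uint64_t bit = 1;
    if (x == 0) return 0;
    while (!(x & bit)) bit <<= 1;
    return bit;
}

static void self_check(void) {
    assert(lowest_set_bit(12) == 4);   /* binary 1100   -> 0100   */
    assert(lowest_set_bit(40) == 8);   /* binary 101000 -> 001000 */
    assert(lowest_set_bit(1) == 1);
    assert(lowest_set_bit(0) == 0);
    for (uint64_t x = 1; x < 100000; x++)
        assert(lowest_set_bit(x) == lowest_set_bit_loop(x));
}
```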

    Cheers,
    Rob

      Rob:

      Nice! I'm glad you got a useful result.

      Regarding thinking of that trick: I didn't think of it either, but years ago, I read a document that had a lot of bit-fiddling tricks in it, and remembered that there was a trick for that, and my Google-fu let me dredge it up.

      ...roboticus


        I didn't think of it either

        It's also worth pointing out that once you've isolated that critical bit, it's still not exactly braindead straightforward as to how best to make use of that info.
        Think of an integer, find its least significant set bit, left-shift its value 53 places, subtract 1, flip all of the bits, and then & that result with the number you first thought of.
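
The recipe just described can be traced one step at a time. The following is a sketch only, assuming uint64_t in place of UV so that it compiles standalone; the fit_steps struct, explain_fit and self_check are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Intermediate values of the check, one field per step in the text. */
typedef struct {
    uint64_t lowest;        /* least significant set bit */
    uint64_t shifted;       /* that bit moved up 53 places (0 if it wraps) */
    uint64_t valid_bits;    /* shifted - 1 */
    uint64_t invalid_bits;  /* valid_bits with every bit flipped */
    int      fits;          /* 1 if n has no bit set in invalid_bits */
} fit_steps;

static fit_steps explain_fit(uint64_t n) {
    fit_steps s;
    s.lowest       = n & (0 - n);
    s.shifted      = s.lowest << 53;
    s.valid_bits   = s.shifted - 1;
    s.invalid_bits = ~s.valid_bits;
    s.fits         = !(n & s.invalid_bits);
    return s;
}

static void self_check(void) {
    /* 2**53 + 2: lowest set bit is 2, so bits 1 through 53 are usable. */
    fit_steps a = explain_fit(9007199254740994ULL);
    assert(a.lowest == 2);
    assert(a.shifted == 18014398509481984ULL);        /* 2**54 */
    assert(a.invalid_bits == 0xFFC0000000000000ULL);
    assert(a.fits == 1);

    /* 2**53 + 1: lowest set bit is 1, so bit 53 is already out of range. */
    fit_steps b = explain_fit(9007199254740993ULL);
    assert(b.lowest == 1);
    assert(b.invalid_bits == 0xFFE0000000000000ULL);
    assert(b.fits == 0);
}
```

When the lowest set bit sits at position 11 or higher, the shift wraps to 0 in unsigned arithmetic, valid_bits becomes all ones, and every such value is reported as fitting, which is correct because at most 53 significant bits remain.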

        I made a small modification to the sub so that it handled negative and positive IV/UV values.
        I also compounded the guts of the code into 2 lines. (It could be put into 1 line, but I didn't go that far.)
        Here's what I ended up with ... minus explanatory comments.
        int iv_fits_double(SV * t, ...) {
          dXSARGS;
          int i, count = 0;

          for(i = 0; i < items; i++) {
            IV arg = SvIV(ST(i));
            int sign = ( arg > 0 || SvUOK(ST(i)) ) ? 1 : -1;
            UV valid_bits = ((arg & -arg) << 53) - 1;
            if(!((arg * sign) & (~valid_bits))) count++;
          }

          return count;
        }
        Cheers,
        Rob