### Re^2: Detecting whether UV fits into an NV

by syphilis (Bishop)
 on Mar 04, 2020 at 07:48 UTC Need Help??

in reply to Re: Detecting whether UV fits into an NV
in thread Detecting whether UV fits into an NV

I was originally thinking of making a lookup table, but thought of this along the way.

Thanks for that - I still haven't worked my way through it properly.
It seems to work fine now that I've changed:
```long long smallest_invalid = last_set << 52;
to
long long smallest_invalid = last_set << 53;
Without that change, the following values (for example) wrongly reported "NOPE":
```1844674407366955264,
1844674407366955776,
1844674407366956288,
1844674407366956800,
1844674407366957312,
1844674407366957824,
1844674407366958336,
1844674407366958848,
1844674407366959360,
1844674407366959872,
1844674407366960384,
1844674407366960896,
1844674407366961408,
1844674407366961920,
1844674407366962432,
1844674407366962944,
1844674407366963456,
1844674407366963968,
1844674407366964480,
1844674407366964992
In the following XS adaptation it's taking over twice as long as uv_fits_double3.
```use warnings;
use Benchmark;

use Inline C => Config =>
BUILD_NOISY => 1;

use Inline C => <<'EOC';

int uv_fits_double3(SV * x, ...) {
dXSARGS;
int i, count = 0;
UV arg;

for(i = 0; i < items; i++) {
arg = SvUV(ST(i));
if(arg) {
while(!(arg & 1)) arg >>= 1;
if(arg < 9007199254740993) count++;
}
}
return count;
}

int uv_fits_double_bitfiddle(SV * t, ...) {
dXSARGS;
int i, count = 0;

for(i = 0; i < items; i++) {
/* First we need to identify the lowest bit set */
IV neg_t    = -(SvIV(ST(i)));
IV last_set = SvUV(ST(i)) & neg_t;

/* Shift it left 53 bits to get location of lowest invalid bit
+     *
* NOTE: If smallest invalid bit is far enough left, then this wil
+l    *
* turn into 0 making the invalid bits also 0, which happens to be
+ OK! */
IV smallest_invalid = last_set << 53;

UV valid_bits = smallest_invalid - 1;
UV invalid_bits = ~valid_bits;

/* It's a 'valid' number unless one of the upper-order invalid bit
+s are set */
if(!(SvUV(ST(i)) & invalid_bits)) count++;
}
return count;
}

EOC

@in = (1844674407366955161 .. 1844674407378955161);
# @in = (9007199248740992 .. 9007199260740992);
# @in = (184467436737095 .. 184467448737095);
# @in = (184463440737 .. 184475440737);

die "wrong size" unless @in == 12000001;

timethese (1, {
'uv_fits_double3'    =>       '\$count1 = uv_fits_double3(@in);',
'uv_fits_double_bitfiddle' => '\$count2 = uv_fits_double_bitfiddle(@in)
+;',
});

print "\$count1 \$count2\n";

die "Error in at least one of the XSubs" unless \$count1 == \$count2;
__END__
Benchmarking results on my Windows 7 machine with perl-5.30.0 are:
```Benchmark: timing 1 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3:  0 wallclock secs ( 0.44 usr +  0.00 sys =  0.44 CPU)
+ @  2.29/s (n=1)
(warning: too few iterations for a reliable count)
uv_fits_double_bitfiddle:  1 wallclock secs ( 0.94 usr +  0.00 sys =
+0.94 CPU) @  1.07/s (n=1)
(warning: too few iterations for a reliable count)
12000001 12000001
Can you see room for significant improvement in performance of uv_fits_double_bitfiddle ?

Thanks for another interesting approach !!

Cheers,
Rob

Replies are listed 'Best First'.
Re^3: Detecting whether UV fits into an NV
by roboticus (Chancellor) on Mar 04, 2020 at 23:33 UTC

Rob:

I'm not experienced in using Inline::C, but I thought I'd take a look at it. My first guess is that the overhead in SvUV(ST(i)) is causing twiddle to be slower, so I tweaked the routine to explicitly pull arg into a UV as you did in double3:

```\$ diff pm_11113750_orig.pl pm_11113750_a.pl
26a27
>     UV arg;
30,31c31,33
<     IV neg_t    = -(SvIV(ST(i)));
<     IV last_set = SvUV(ST(i)) & neg_t;
---
>     arg = SvUV(ST(i));
>     IV neg_t    = -arg;
>     IV last_set = arg & neg_t;
42c44
<     if(!(SvUV(ST(i)) & invalid_bits)) count++;
---
>     if(!(arg & invalid_bits)) count++;

Doing that change alone made the routines roughly the same speed:

```\$ PERL5OPT= perl pm_11113750_orig.pl
Benchmark: timing 5 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3:  7 wallclock secs ( 6.70 usr +  0.00 sys =  6.70 CPU)
+ @  0.75/s (n=5)
uv_fits_double_bitfiddle: 17 wallclock secs (17.36 usr +  0.00 sys = 1
+7.36 CPU) @  0.29/s (n=5)
46875 46875

\$ PERL5OPT= perl pm_11113750_a.pl
Benchmark: timing 5 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3:  7 wallclock secs ( 6.75 usr +  0.00 sys =  6.75 CPU)
+ @  0.74/s (n=5)
uv_fits_double_bitfiddle:  6 wallclock secs ( 6.67 usr +  0.00 sys =
+6.67 CPU) @  0.75/s (n=5)
46875 46875

Since the twiddle part doesn't have any loops, I would expect it to be significantly faster than a version *with* loops, depending on how clever the optimizer is. So another version I played with was to modify the routine to take two parameters and count the number of values that fit between those two values. That way, I could greatly reduce the impact of the SvUV() function at the cost of only being able to process sequences of numbers. That version confirmed my suspicion that the SvUV() function is slow enough that it swamps the difference in our algorithms.

```int uv_fits_double3x(SV* the_min, SV* the_max) {
dXSARGS;
UV i_min = SvUV(the_min);
UV i_max = SvUV(the_max);
UV i;
int count = 0;
UV arg;

for(i = i_min; i < i_max; i++) {
arg = i;
if(arg) {
while(!(arg & 1)) arg >>= 1;
if(arg < 9007199254740993) count++;
}
}
return count;
}

int uv_fits_double_bitfiddle_3x(SV* the_min, SV* the_max) {
dXSARGS;
UV i_min = SvUV(the_min);
UV i_max = SvUV(the_max);
UV i;
int count = 0;

for(i = i_min; i < i_max; i++) {
UV arg = i;
IV neg = -arg;
IV smallest_invalid = (arg & -arg)<<53;
UV valid_bits = smallest_invalid-1;
UV invalid_bits = ~valid_bits;
if (! (arg & invalid_bits)) count++;
}
return count;
}

Both routines are much faster if they only take two arguments and iterate over the entire range. Only then does the twiddle version show a significant speed advantage at about twice as fast. The double3x version is about 1800 times faster, while fiddle3x is about 2900 times faster:

```\$ PERL5OPT= perl pm_11113750_x.pl
Name "main::count4" used only once: possible typo at pm_11113750_x.pl
+line 147.
=== Test 1: roughly equivalent speed
Benchmark: timing 5 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3: 28 wallclock secs (27.22 usr +  0.08 sys = 27.30 CPU)
+ @  0.18/s (n=5)
uv_fits_double_bitfiddle: 27 wallclock secs (26.72 usr +  0.00 sys = 2
+6.72 CPU) @  0.19/s (n=5)
33046878 33046878 (48000004)

=== TEST 2: twiddle has an advantage if we don't use SvUV()
Benchmark: timing 10000 iterations of bitfiddle_3x, double3x, empty_fu
+nc, empty_loop...
bitfiddle_3x: 18 wallclock secs (18.39 usr +  0.00 sys = 18.39 CPU) @
+543.74/s (n=10000)
double3x: 30 wallclock secs (29.55 usr +  0.00 sys = 29.55 CPU) @ 33
+8.44/s (n=10000)
empty_func:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
(warning: too few iterations for a reliable count)
empty_loop:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
(warning: too few iterations for a reliable count)
33046875 33046875 48000000 1416858175

Unfortunately, I've not been successful in making a useful version that's faster in a real-world scenario. My current understanding of things (after playing with it for a few hours) is that the single argument versions get swamped out by the looping and function calling overhead. The functions are just too simple to be aHowever, I've not been successful in making a useful version that's significantly faster (with call overhead and such).

NOTES: As you can tell from the above, I've not been successful in measuring the function call (empty func) or empty loop versions of the functions, so my benchmarking could be broken, so I'm including the current code, so people can point things out.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

My first guess is that the overhead in SvUV(ST(i)) is causing twiddle to be slower

Yes, I thought of replacing them with a variable, but decided there wouldn't be that much difference between looking at the value of an SV's IV slot and looking at the value of an IV.
I guess for a few calls there's not much difference, but when you're making 36 million of them it's not hard to believe that things might add up - and I should have thought that through a little better. (Actually, a "lot better".)

Fixing that alone makes uv_fits_double_bitfiddle almost twice as fast as uv_fits_double3 for me:
```Benchmark: timing 1 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3:  1 wallclock secs ( 0.45 usr +  0.00 sys =  0.45 CPU)
+ @  2.21/s (n=1)
(warning: too few iterations for a reliable count)
uv_fits_double_bitfiddle:  0 wallclock secs ( 0.25 usr +  0.00 sys =
+0.25 CPU)@  4.02/s (n=1)
(warning: too few iterations for a reliable count)
This is pretty much the type of approach whose existence I had wondered about.
It had never been pointed out to me that iv & -iv would identify the least significant set bit, and I'm certainly not sharp enough to have ever realized it myself.
This method is just brilliant ... and it's great that it turns out to be faster, too !!
I'll certainly be using it (with due accreditation to you) unless further testing, contrary to my expectations, reveals some problem with it.

Cheers,
Rob

Rob:

Nice! I'm glad you got a usful result.

Regarding thinking of that trick: I didn't think of it either, but years ago, I read a document that had a lot of bit-fiddling tricks in it, and remembered that there was a trick for that, and my Google-fu let me dredge it up.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Create A New User
Node Status?
node history
Node Type: note [id://11113750]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (5)
As of 2021-04-16 10:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?

No recent polls found

Notices?