http://qs321.pair.com?node_id=11113490

in reply to Detecting whether UV fits into an NV

I poked around with it and came up with something for you to figure it out without the looping.

```\$ cat funky_bits.cc
// funky_bits.cpp
//
// experiment with odd bit handling?
//
// 20200227

#include <stdio.h>

// Some test cases
unsigned long long foo[] = {
0xcab312,
0x138,
9007199254740992ULL,
18446744073709551615ULL
};

// Needs a good name, and the type shenanigans should be cleaned up
bool duzzit_fit(unsigned long long t) {

// First we need to identify the lowest bit set
long long neg_t = -(long long)t;
long long last_set = t & neg_t;

// Shift it left 52 bits to get location of lowest invalid bit
// NOTE: If smallest invalid bit is far enough left, then this wil
+l turn into 0
// making the invalid bits also 0, which happens to be OK!
long long smallest_invalid = last_set << 52;

// Convert it to a bit mask for the smallest invalid bit and those
+ to the left.
unsigned long long valid_bits = smallest_invalid - 1;
unsigned long long invalid_bits = ~valid_bits;

// Uncomment for debugging
//printf( "               t:%16llx\n"
//        "        last_set:%16llx\n"
//        "           neg_t:%16llx\n"
//        "smallest_invalid:%16llx\n"
//        "      valid_bits:%16llx\n"
//        "    invalid_bits:%16llx\n",
//        t, last_set, neg_t, smallest_invalid,
//        valid_bits, invalid_bits);

// It's a 'valid' number unless one of the upper-order invalid bit
+s are set
return ! (t & invalid_bits);
}

int main(int, char **) {
for (int z=0; z<sizeof(foo)/sizeof(foo[0]); ++z) {
unsigned long long i = foo[z];
bool b = duzzit_fit(i);
printf("%lld: %s\n", i, b ? "OK" : "**NOPE**");
}
return 0;
}

\$ clang -o funky_bits funky_bits.cc

\$ ./funky_bits
13284114: OK
3735928559: OK
59774856944: OK
15302363377664: OK
312: OK
9007199254740992: OK
-1: **NOPE**

Cleaning it up, adding more test cases, further testing, debugging and simplification are all left as exercises for the reader... ;^D

I was originally thinking of making a lookup table, but thought of this along the way.

Update: fixed the abomination that was the first sentence.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: Detecting whether UV fits into an NV
by syphilis (Bishop) on Mar 04, 2020 at 07:48 UTC
I was originally thinking of making a lookup table, but thought of this along the way.

Thanks for that - I still haven't worked my way through it properly.
It seems to work fine now that I've changed:
```long long smallest_invalid = last_set << 52;
to
long long smallest_invalid = last_set << 53;
Without that change, the following values (for example) wrongly reported "NOPE":
```1844674407366955264,
1844674407366955776,
1844674407366956288,
1844674407366956800,
1844674407366957312,
1844674407366957824,
1844674407366958336,
1844674407366958848,
1844674407366959360,
1844674407366959872,
1844674407366960384,
1844674407366960896,
1844674407366961408,
1844674407366961920,
1844674407366962432,
1844674407366962944,
1844674407366963456,
1844674407366963968,
1844674407366964480,
1844674407366964992
In the following XS adaptation it's taking over twice as long as uv_fits_double3.
```use warnings;
use Benchmark;

use Inline C => Config =>
BUILD_NOISY => 1;

use Inline C => <<'EOC';

int uv_fits_double3(SV * x, ...) {
dXSARGS;
int i, count = 0;
UV arg;

for(i = 0; i < items; i++) {
arg = SvUV(ST(i));
if(arg) {
while(!(arg & 1)) arg >>= 1;
if(arg < 9007199254740993) count++;
}
}
return count;
}

int uv_fits_double_bitfiddle(SV * t, ...) {
dXSARGS;
int i, count = 0;

for(i = 0; i < items; i++) {
/* First we need to identify the lowest bit set */
IV neg_t    = -(SvIV(ST(i)));
IV last_set = SvUV(ST(i)) & neg_t;

/* Shift it left 53 bits to get location of lowest invalid bit
+     *
* NOTE: If smallest invalid bit is far enough left, then this wil
+l    *
* turn into 0 making the invalid bits also 0, which happens to be
+ OK! */
IV smallest_invalid = last_set << 53;

UV valid_bits = smallest_invalid - 1;
UV invalid_bits = ~valid_bits;

/* It's a 'valid' number unless one of the upper-order invalid bit
+s are set */
if(!(SvUV(ST(i)) & invalid_bits)) count++;
}
return count;
}

EOC

@in = (1844674407366955161 .. 1844674407378955161);
# @in = (9007199248740992 .. 9007199260740992);
# @in = (184467436737095 .. 184467448737095);
# @in = (184463440737 .. 184475440737);

die "wrong size" unless @in == 12000001;

timethese (1, {
'uv_fits_double3'    =>       '\$count1 = uv_fits_double3(@in);',
'uv_fits_double_bitfiddle' => '\$count2 = uv_fits_double_bitfiddle(@in)
+;',
});

print "\$count1 \$count2\n";

die "Error in at least one of the XSubs" unless \$count1 == \$count2;
__END__
Benchmarking results on my Windows 7 machine with perl-5.30.0 are:
```Benchmark: timing 1 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3:  0 wallclock secs ( 0.44 usr +  0.00 sys =  0.44 CPU)
+ @  2.29/s (n=1)
(warning: too few iterations for a reliable count)
uv_fits_double_bitfiddle:  1 wallclock secs ( 0.94 usr +  0.00 sys =
+0.94 CPU) @  1.07/s (n=1)
(warning: too few iterations for a reliable count)
12000001 12000001
Can you see room for significant improvement in performance of uv_fits_double_bitfiddle ?

Thanks for another interesting approach !!

Cheers,
Rob

Rob:

I'm not experienced in using Inline::C, but I thought I'd take a look at it. My first guess is that the overhead in SvUV(ST(i)) is causing twiddle to be slower, so I tweaked the routine to explicitly pull arg into a UV as you did in double3:

```\$ diff pm_11113750_orig.pl pm_11113750_a.pl
26a27
>     UV arg;
30,31c31,33
<     IV neg_t    = -(SvIV(ST(i)));
<     IV last_set = SvUV(ST(i)) & neg_t;
---
>     arg = SvUV(ST(i));
>     IV neg_t    = -arg;
>     IV last_set = arg & neg_t;
42c44
<     if(!(SvUV(ST(i)) & invalid_bits)) count++;
---
>     if(!(arg & invalid_bits)) count++;

Doing that change alone made the routines roughly the same speed:

```\$ PERL5OPT= perl pm_11113750_orig.pl
Benchmark: timing 5 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3:  7 wallclock secs ( 6.70 usr +  0.00 sys =  6.70 CPU)
+ @  0.75/s (n=5)
uv_fits_double_bitfiddle: 17 wallclock secs (17.36 usr +  0.00 sys = 1
+7.36 CPU) @  0.29/s (n=5)
46875 46875

\$ PERL5OPT= perl pm_11113750_a.pl
Benchmark: timing 5 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3:  7 wallclock secs ( 6.75 usr +  0.00 sys =  6.75 CPU)
+ @  0.74/s (n=5)
uv_fits_double_bitfiddle:  6 wallclock secs ( 6.67 usr +  0.00 sys =
+6.67 CPU) @  0.75/s (n=5)
46875 46875

Since the twiddle part doesn't have any loops, I would expect it to be significantly faster than a version *with* loops, depending on how clever the optimizer is. So another version I played with was to modify the routine to take two parameters and count the number of values that fit between those two values. That way, I could greatly reduce the impact of the SvUV() function at the cost of only being able to process sequences of numbers. That version confirmed my suspicion that the SvUV() function is slow enough that it swamps the difference in our algorithms.

```int uv_fits_double3x(SV* the_min, SV* the_max) {
dXSARGS;
UV i_min = SvUV(the_min);
UV i_max = SvUV(the_max);
UV i;
int count = 0;
UV arg;

for(i = i_min; i < i_max; i++) {
arg = i;
if(arg) {
while(!(arg & 1)) arg >>= 1;
if(arg < 9007199254740993) count++;
}
}
return count;
}

int uv_fits_double_bitfiddle_3x(SV* the_min, SV* the_max) {
dXSARGS;
UV i_min = SvUV(the_min);
UV i_max = SvUV(the_max);
UV i;
int count = 0;

for(i = i_min; i < i_max; i++) {
UV arg = i;
IV neg = -arg;
IV smallest_invalid = (arg & -arg)<<53;
UV valid_bits = smallest_invalid-1;
UV invalid_bits = ~valid_bits;
if (! (arg & invalid_bits)) count++;
}
return count;
}

Both routines are much faster if they only take two arguments and iterate over the entire range. Only then does the twiddle version show a significant speed advantage at about twice as fast. The double3x version is about 1800 times faster, while fiddle3x is about 2900 times faster:

```\$ PERL5OPT= perl pm_11113750_x.pl
Name "main::count4" used only once: possible typo at pm_11113750_x.pl
+line 147.
=== Test 1: roughly equivalent speed
Benchmark: timing 5 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3: 28 wallclock secs (27.22 usr +  0.08 sys = 27.30 CPU)
+ @  0.18/s (n=5)
uv_fits_double_bitfiddle: 27 wallclock secs (26.72 usr +  0.00 sys = 2
+6.72 CPU) @  0.19/s (n=5)
33046878 33046878 (48000004)

=== TEST 2: twiddle has an advantage if we don't use SvUV()
Benchmark: timing 10000 iterations of bitfiddle_3x, double3x, empty_fu
+nc, empty_loop...
bitfiddle_3x: 18 wallclock secs (18.39 usr +  0.00 sys = 18.39 CPU) @
+543.74/s (n=10000)
double3x: 30 wallclock secs (29.55 usr +  0.00 sys = 29.55 CPU) @ 33
+8.44/s (n=10000)
empty_func:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
(warning: too few iterations for a reliable count)
empty_loop:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
(warning: too few iterations for a reliable count)
33046875 33046875 48000000 1416858175

Unfortunately, I've not been successful in making a useful version that's faster in a real-world scenario. My current understanding of things (after playing with it for a few hours) is that the single argument versions get swamped out by the looping and function calling overhead. The functions are just too simple to be aHowever, I've not been successful in making a useful version that's significantly faster (with call overhead and such).

NOTES: As you can tell from the above, I've not been successful in measuring the function call (empty func) or empty loop versions of the functions, so my benchmarking could be broken, so I'm including the current code, so people can point things out.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

My first guess is that the overhead in SvUV(ST(i)) is causing twiddle to be slower

Yes, I thought of replacing them with a variable, but decided there wouldn't be that much difference between looking at the value of an SV's IV slot and looking at the value of an IV.
I guess for a few calls there's not much difference, but when you're making 36 million of them it's not hard to believe that things might add up - and I should have thought that through a little better. (Actually, a "lot better".)

Fixing that alone makes uv_fits_double_bitfiddle almost twice as fast as uv_fits_double3 for me:
```Benchmark: timing 1 iterations of uv_fits_double3, uv_fits_double_bitf
+iddle...
uv_fits_double3:  1 wallclock secs ( 0.45 usr +  0.00 sys =  0.45 CPU)
+ @  2.21/s (n=1)
(warning: too few iterations for a reliable count)
uv_fits_double_bitfiddle:  0 wallclock secs ( 0.25 usr +  0.00 sys =
+0.25 CPU)@  4.02/s (n=1)
(warning: too few iterations for a reliable count)