http://qs321.pair.com?node_id=1199987

eyepopslikeamosquito has asked for the wisdom of the Perl Monks concerning the following question:

In More Betterer Game of Life, I needed a popcount (aka Hamming weight) function to sum the one bits in a 64-bit value.

I started with the basic popcount1 below, scraped with little thought from the Hamming_weight wikipedia page.

I'd like to improve that, hence this node.

use strict; use warnings; use Benchmark qw(timethese); sub popcount1 { my $x = shift; my $count; for ($count = 0; $x; ++$count) { $x &= $x - 1 } return $count; } sub popcount2 { return sprintf('%b', shift) =~ tr/1//; } # Update: unpack('%b*', pack('J', shift)) is better (see Rosetta ref. +below) # This works but is slower: unpack('b*', pack('J', shift)) =~ tr/1//; sub popcount3 { return unpack('%32b*', pack('Q', shift)); } my $start = 2**32 - 42; my $end = $start + 1000000; print "sanity test for correctness\n"; for my $i (0 .. 256, $start .. $end) { my $one = popcount1($i); my $two = popcount2($i); my $three = popcount3($i); # print "$i: $one $two $three\n"; $one == $two or die; $one == $three or die; } timethese 50, { One => sub { popcount1($_) for $start .. $end }, Two => sub { popcount2($_) for $start .. $end }, Three => sub { popcount3($_) for $start .. $end }, };

Running the above program on my machine produced:

sanity test for correctness Benchmark: timing 50 iterations of One, Three, Two... One: 29 wallclock secs (28.41 usr + 0.00 sys = 28.41 CPU) @ 1.76/ +s (n=50) Three: 10 wallclock secs (10.00 usr + 0.00 sys = 10.00 CPU) @ 5.00/ +s (n=50) Two: 10 wallclock secs (10.03 usr + 0.00 sys = 10.03 CPU) @ 4.98/ +s (n=50)

Improvements welcome.

References

Update: Added some C/C++ information from Hamming weight (wikipedia).

Some C compilers provide intrinsic functions that provide bit counting facilities. For example, GCC (since version 3.4 in April 2004) includes a builtin function __builtin_popcount that will use a processor instruction if available or an efficient library implementation otherwise. LLVM-GCC has included this function since version 1.5 in June 2005.

In C++ STL, the bit-array data structure bitset has a count() method that counts the number of bits that are set. In C++20, a new header <bit> was added, containing functions std::popcount and std::has_single_bit, taking arguments of unsigned integer types.

I've used this for code that builds with both GCC and MSVC:

// Make GCC __builtin_popcountll available with MSVC // (used to calculate the Hamming weight efficiently) #ifdef _MSC_VER #include <intrin.h> #define __builtin_popcountll __popcnt64 #define __builtin_popcount __popcnt #endif

Updated: Added more references (thanks oiskuu)

Replies are listed 'Best First'.
Re: How to do popcount (aka Hamming weight) in Perl
by BrowserUk (Patriarch) on Sep 24, 2017 at 14:51 UTC

    This (labelled "six" ) is the fastest C/Perl method I found when I was looking a (long) while ago) that works on chips that don't have a popcnt instruction:

    #! perl -slw use strict; package SparseBitVector; use Config; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => 'Junk', CLEAN_AFTER_BUILD =>0; typedef unsigned __int64 U64; typedef __int64 I64; unsigned popcnt( SV *sv ) { U64 x = (U64) SvUVX( sv ); x -=( x >> 1 ) & 0x5555555555555555ULL; x = ( x & 0x3333333333333333ULL ) + ( ( x >> 2 ) & 0x3333333333333 +333ULL ); x = ( x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL; return (unsigned)( ( x * 0x0101010101010101ull ) >> 56 ); } END_C use Benchmark qw( cmpthese ); sub popcount1 { my $x = shift; my $count; for ($count = 0; $x; ++$count) { $x &= $x - 1 } return $count; } sub popcount2 { return sprintf('%b', shift) =~ tr/1//; } sub popcount3 { return unpack('%32b*', pack('Q', shift)); } sub popcount4 { return unpack( '%32b*', pack('Q', $_[0] ) ); } my $start = 2**32 - 42; my $end = $start + 1000000; print "sanity test for correctness\n"; for my $i (0 .. 256, $start .. $end) { my $two = popcount2($i); my $three = popcount3($i); my $four = popcount4( $i ); my $six = popcnt( $i ); $three == $two or print "$i: $two $three\n" and die; $four == $six and $three == $six or print "$i: 2:$two 3:$three 6:$s +ix\n" and die; } cmpthese -3, { # One => sub { popcount1($_) for $start .. $end }, Two => sub { popcount2($_) for $start .. $end }, Three => sub { popcount3($_) for $start .. $end }, Four => sub { popcount4($_) for $start .. $end }, Six => sub { popcnt( $_) for $start .. $end }, };

    Outputs:

    C:\test>1199987.pl sanity test for correctness Rate Two Three Four Six Two 1.02/s -- -4% -5% -58% Three 1.06/s 4% -- -1% -57% Four 1.08/s 5% 1% -- -56% Six 2.45/s 140% 131% 128% --

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

      If needed, Inline C code supporting 64 and 32 bits.

      use strict; use warnings; use Inline 'C' => Config => BUILD_NOISY => 1, CLEAN_AFTER_BUILD => 0; use Inline 'C' => <<'END_C'; #include <stdint.h> unsigned popcnt2( SV *sv ) { unsigned count = 0; #ifdef __LP64__ static const uint64_t m1 = UINT64_C(0x5555555555555555); static const uint64_t m2 = UINT64_C(0x3333333333333333); static const uint64_t m4 = UINT64_C(0x0f0f0f0f0f0f0f0f); static const uint64_t h01 = UINT64_C(0x0101010101010101); uint64_t x = (uint64_t) SvUVX( sv ); x = x - ((x >> 1) & m1); x = (x & m2) + ((x >> 2) & m2); x = (x + (x >> 4)) & m4; count += (unsigned) ((x * h01) >> 56); #else static const uint32_t m1 = UINT32_C(0x55555555); static const uint32_t m2 = UINT32_C(0x33333333); static const uint32_t m4 = UINT32_C(0x0f0f0f0f); static const uint32_t h01 = UINT32_C(0x01010101); uint32_t x = (uint32_t) SvUVX( sv ); x = x - ((x >> 1) & m1); x = (x & m2) + ((x >> 2) & m2); x = (x + (x >> 4)) & m4; count += (unsigned) ((x * h01) >> 24); #endif return count; } END_C use Benchmark qw( cmpthese ); my $start = 2 ** 32 - 42; my $end = $start + 1000000; cmpthese -3, { Six2 => sub { popcnt2($_) for $start .. $end }, };

      Regards, Mario

Re: How to do popcount (aka Hamming weight) in Perl
by Athanasius (Archbishop) on Sep 24, 2017 at 12:54 UTC

    Hello eyepopslikeamosquito,

    The CPAN module Bit::Fast has a function popcountl, implemented in C, which operates on 64-bit integers and is claimed to be fast.

    Disclaimer: I haven’t been able to test it because — although I have a 64-bit OS and a 64-bit Perl with USE_64_BIT_INT defined — my Perl was apparently built with LONGSIZE defined as 4, not 8, so (with use Config) $Config{longsize} == 4 and popcountl doesn’t get built. I confess I don’t understand the issues involved here — maybe just a configuration problem?

    Anyway, hope that’s of interest,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: How to do popcount (aka Hamming weight) in Perl
by marioroy (Prior) on Sep 24, 2017 at 16:12 UTC

    Update 2: Thanks Dana, the popcnt function moved to util.h in Math::Prime::Util v0.62.

    Update 1: Added links to popcount.cpp and popcnt. Dana replaced popcnt with mpu_popcount_string in Math::Prime::Util v0.62.

    Found in mce-sandbox/src/bits.h and used here, I received help by reading popcount.cpp from primesieve.org and util.c (popcnt) from Math::Prime::Util <= v0.61. Although the following is tailored for counting set bits inside a string, I'm sharing the code to show-case 64bits and 32bits support inside a function, determined by the __LP64__ pragma.

    #ifndef BITS_H #define BITS_H #include <stdint.h> typedef unsigned char byte_t; static const int popcnt_byte[256] = { 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5, 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6, 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6, 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7, 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6, 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7, 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7, 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8 }; static uint64_t popcount(const byte_t *bytearray, uint64_t size) { uint64_t asize, i, count = 0; if (bytearray == 0 || size == 0) return count; if (size > 8) { #ifdef __LP64__ static const uint64_t m1 = UINT64_C(0x5555555555555555); static const uint64_t m2 = UINT64_C(0x3333333333333333); static const uint64_t m4 = UINT64_C(0x0f0f0f0f0f0f0f0f); static const uint64_t h01 = UINT64_C(0x0101010101010101); const uint64_t *a = (uint64_t *) bytearray; asize = (size + 7) / 8 - 1; for (i = 0; i < asize; i++) { uint64_t b = a[i]; b = b - ((b >> 1) & m1); b = (b & m2) + ((b >> 2) & m2); b = (b + (b >> 4)) & m4; count += (b * h01) >> 56; } i = asize * 8; #else static const uint32_t m1 = UINT32_C(0x55555555); static const uint32_t m2 = UINT32_C(0x33333333); static const uint32_t m4 = UINT32_C(0x0f0f0f0f); static const uint32_t h01 = UINT32_C(0x01010101); const uint32_t *a = (uint32_t *) bytearray; asize = (size + 3) / 4 - 1; for (i = 0; i < asize; i++) { uint32_t b = a[i]; b = b - ((b >> 1) & m1); b = (b & m2) + ((b >> 2) & m2); b = (b + (b >> 4)) & m4; count += (b * h01) >> 24; } i = asize * 4; #endif } else i = 0; for (; i < size; i++) count += popcnt_byte[bytearray[i]]; return count; } #endif

    Regards, Mario

      Dana replaced popcnt with mpu_popcount_string in Math::Prime::Util v0.62.

      I moved the native popcnt to util.h. mpu_popcount_string is a new function to handle bigints or sufficiently magic input without using a bigint library. It's faster than getting Math::BigInt involved until the numbers get over 500 digits. It should be optimized, but it's not exactly the common case.

      It's also useful for making sure we don't get really slow for 64-bit numbers on a 32-bit Perl. Not as fast as forcing 64-bit code, but that's further narrowing down the space -- people on 64-bit machines who install a 32-bit Perl.

      Using Math::BigInt to handle a long digit string would be something like:

      use Math::BigInt; my $n = "16" x 1000; say 0 + (Math::BigInt->new("$n")->as_bin() =~ tr/1//);
      where only the last part is needed if it's already a bigint. Math::GMPz has Rmpz_popcount which is ridiculously fast if the input is already a Math::GMPz object.

Re: How to do popcount (aka Hamming weight) in Perl
by LanX (Saint) on Sep 24, 2017 at 10:37 UTC
    > Improvements welcome.

    Performance wise?

    I think a pre computed lookup table is the way I'd go.

    An array with 2**16 entries should be a good compromise between time and memory complexity.

    (Since Perl doesn't fit into a CPU's line cache)

    Sorry too busy to code it for you :)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re: How to do popcount (aka Hamming weight) in Perl
by LanX (Saint) on Sep 24, 2017 at 12:58 UTC
    > unpack('%32b*'

    so Perl has already a specialized command for this! (cool)

    but why do you need

    > pack('Q'

    then?

    If it's some kind of normalization, why do you need to benchmark it too instead of keeping the data in that format?

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re: How to do popcount (aka Hamming weight) in Perl
by danaj (Friar) on Sep 26, 2017 at 19:12 UTC

    There are a fair number of modules that do this. For instance, RosettaCode shows:

    use ntheory qw/hammingweight/; say hammingweight(1234567); use Math::GMPz qw/Rmpz_popcount/; say Rmpz_popcount(Math::GMPz->new(1234567)); use Math::BigInt; say 0 + (Math::BigInt->new(1234567)->as_bin() =~ tr/1//); use Bit::Vector; say Bit::Vector->new_Dec(64,1234567)->Norm;

    The speed of the underlying C code is important if you're calling the function from C where you can see the difference between a few cycles. Once you add Perl overhead it's somewhat less obvious. ntheory for example adds an input validation layer when called from Perl which means it croaks if you give it undef or "hello" or a strange object reference. It recognizes bigints and gives you the correct answer. Bit::Fast and BrowserUK's Inline::C code have no input validation -- it coerces the input into a C integer type and proceeds whether that makes any sense or not. There are many cases validation is just wasted time so it's a tradeoff. ntheory's C code is as fast or faster than BrowserUK's code on 64-bit machines (it uses the popcount asm if possible, otherwise identical C code), Bit::Fast's C code is either the same speed as ntheory, the same speed but with incorrect results, slightly slower, or infinitely slower due to a segfault or inability to compile, depending on the C compiler.

    Once we start calling from Perl, the differences in overhead become more apparent. Note that BrowserUK's benchmark as posted greatly favors his Inline::C function, as all the others get an added Perl subroutine call added to them, and Perl sub calls are not at all cheap compared to this function. Correcting that, and adding some modules, I get:

                    Rate bigint  GMPz bitvector Three Four  Two ntheory Inline raw_nt bitfast
    bigint    4.56e-02/s     --  -93%      -94%  -99% -99% -99%   -100%  -100%  -100%   -100%
    GMPz         0.667/s  1361%    --      -18%  -85% -85% -86%    -93%   -95%   -96%    -96%
    bitvector    0.811/s  1676%   22%        --  -81% -82% -83%    -92%   -93%   -95%    -95%
    Three         4.32/s  9363%  548%      433%    --  -4% -10%    -57%   -65%   -71%    -71%
    Four          4.50/s  9763%  575%      455%    4%   --  -6%    -56%   -64%   -70%    -70%
    Two           4.81/s 10434%  621%      493%   11%   7%   --    -53%   -61%   -68%    -68%
    ntheory       10.1/s 22087% 1419%     1149%  134% 125% 111%      --   -18%   -32%    -33%
    Inline        12.4/s 27108% 1763%     1432%  188% 176% 158%     23%     --   -17%    -18%
    raw_nt        15.0/s 32729% 2148%     1748%  247% 233% 212%     48%    21%     --     -1%
    bitfast       15.1/s 33012% 2167%     1764%  250% 236% 214%     49%    22%     1%      --
    

    Some of these functions have more features than others, e.g. Math::GMPz, bigint, and ntheory will handle bigints correctly. raw_nt is ntheory with a new trivial XS function added that just takes a UV and returns the result of popcnt -- no validation or input size checks, just like Bit::Fast or the Inline C function. Not surprisingly, it comes out the same speed on this machine (both use gcc with asm popcnt support). We are talking about an instruction with 1 cycle latency on newer processors.