http://qs321.pair.com?node_id=885217

albert has asked for the wisdom of the Perl Monks concerning the following question:

I have a series of strings containing 0s and 1s, and I am counting the 0/1 combinations between pairs of them via bitwise string operators. I am doing this as follows:
my $string1 = '01010110111000';
my $string2 = '11101101100010';

my ( $c00, $c01, $c10, $c11 ) = (
    ( $string1 | $string2 ) =~ tr[0][0],       # count 00
    ( ~$string1 & $string2 ) =~ tr[\1][\1],    # count 01
    ( $string1 & ~$string2 ) =~ tr[\1][\1],    # count 10
    ( $string1 & $string2 ) =~ tr[1][1],       # count 11
);
However, I have large numbers of these strings, and they are much longer than this example code. For storage reasons, they are stored packed as follows:
my $len           = length $unpacked_string;
my $packed_string = pack qq{b$len}, $unpacked_string;
In my application, prior to the counting above, I unpack the strings as follows:
my $unpacked_string = unpack qq{b$len}, $packed_string;
I was wondering if there is a way to do the 0/1 counts directly on the packed version, so that I can skip the need to unpack into strings.

Replies are listed 'Best First'.
Re: Bitwise operators on binary objects
by jwkrahn (Abbot) on Jan 31, 2011 at 10:17 UTC
    I was wondering if there is a way to do the 0/1 counts directly on the packed version

    Yes, unpack can do that:

    $ perl -le'
    my $string = q/01010110111000/;
    print $string =~ tr/1//;
    my $packed_string = pack q/b*/, $string;
    print unpack q/%32b*/, $packed_string;
    '
    7
    7
      Very nice.

      BTW in contrast to the normal unpack templates, which are (as any experienced perler knows) documented under pack, this feature is documented only under unpack:

      In addition to fields allowed in pack(), you may prefix a field with a %<number> to indicate that you want a <number>-bit checksum of the items instead of the items themselves.
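      For instance, a quick illustration (my own, not from the docs) of the checksum prefix on two different templates:

      my $packed = "\xF0\x0F";
      print unpack '%32b*', $packed;   # 8   -- number of set bits
      print unpack '%32C*', $packed;   # 255 -- sum of the byte values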
      This is a very neat trick. I turned it into code for my application and benchmarked it against my original approach, as well as against the suggested bit-vector approach via Bit::Vector::Overload.

      This unpack trick is significantly faster than my previous transliteration-based counting. I couldn't figure out the correct bitwise construct to count my '00', but I can derive that count from the expected total (one possible direct construct is sketched after the code below). The Bit::Vector approach is particularly slow, as I didn't see a way to preload my binary version of the data into the vectors. I'm not sure where the overhead is coming from, but I also doubt it would do better than the unpack approach.

               Rate bitvec string unpack
      bitvec  102/s     --   -97%   -99%
      string 3276/s  3126%     --   -63%
      unpack 8801/s  8565%   169%     --
      
      Code used in testing:
      use Bit::Vector::Overload;
      use Benchmark qw(cmpthese);
      use String::Random;

      # Generate random strings of 0/1
      my $foo    = new String::Random;
      my $length = 600;
      my @strings;
      my @bstrings;
      for ( my $i = 0 ; $i < 10 ; $i++ ) {
          my $string = $foo->randregex("[01]{$length}");
          push @strings,  $string;
          push @bstrings, pack qq{b$length}, $string;
      }

      cmpthese(
          -3,
          {
              'string' => sub {
                  for ( my $i = 0 ; $i < @strings ; $i++ ) {
                      for ( my $j = $i + 1 ; $j < @strings ; $j++ ) {
                          my $string1 = $strings[$i];
                          my $string2 = $strings[$j];
                          my ( $c01, $c10, $c11 ) = (
                              # ( $string1 | $string2 ) =~ tr[0][0],   # count 00: COUNT by math below
                              ( ~$string1 & $string2 ) =~ tr[\1][\1],  # count 01
                              ( $string1 & ~$string2 ) =~ tr[\1][\1],  # count 10
                              ( $string1 & $string2 ) =~ tr[1][1],     # count 11
                          );
                          my $c00 = $length - $c01 - $c10 - $c11;
                      }
                  }
              },
              'bitvec' => sub {
                  for ( my $i = 0 ; $i < @strings ; $i++ ) {
                      for ( my $j = $i + 1 ; $j < @strings ; $j++ ) {
                          my $string1 = $strings[$i];
                          my $string2 = $strings[$j];
                          my $vec1 = Bit::Vector->new_Bin( $length, $string1 );
                          my $vec2 = Bit::Vector->new_Bin( $length, $string2 );
                          my ( $v01, $v10, $v11 ) = (
                              # abs( ~$vec1 & ~$vec2 ),  # count 00: COUNT by math below
                              abs( ~$vec1 & $vec2 ),     # count 01
                              abs( $vec1 & ~$vec2 ),     # count 10
                              abs( $vec1 & $vec2 ),      # count 11
                          );
                          my $v00 = $length - $v01 - $v10 - $v11;
                      }
                  }
              },
              'unpack' => sub {
                  for ( my $i = 0 ; $i < @bstrings ; $i++ ) {
                      for ( my $j = $i + 1 ; $j < @bstrings ; $j++ ) {
                          my $bstring1 = $bstrings[$i];
                          my $bstring2 = $bstrings[$j];
                          my $p01 = unpack q/%32b*/, ~$bstring1 & $bstring2;
                          my $p10 = unpack q/%32b*/, $bstring1 & ~$bstring2;
                          my $p11 = unpack q/%32b*/, $bstring1 & $bstring2;
                          my $p00 = $length - $p01 - $p10 - $p11;
                      }
                  }
              },
          }
      );
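      For reference, here is one possible bitwise construct that counts the '00' pairs directly on the packed strings (an untested sketch; it assumes the bit length is an exact multiple of 8, otherwise the unused padding bits of the last byte get counted as well):

      # '00' positions are the zero bits of the OR, i.e. the set bits of
      # its complement.  Only valid when $length is a multiple of 8.
      my $p00 = unpack q/%32b*/, ~( $bstring1 | $bstring2 );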
Re: Bitwise operators on binary objects
by eyepopslikeamosquito (Archbishop) on Jan 31, 2011 at 09:25 UTC

      Just to elaborate (as it might not be immediately evident): use the ->Norm() method to count the number of bits that are set.
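      For example (a small sketch, assuming Bit::Vector is installed):

      use Bit::Vector;
      my $vec = Bit::Vector->new_Bin( 14, '01010110111000' );
      print $vec->Norm(), "\n";   # 7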

Re: Bitwise operators on binary objects
by Anonyrnous Monk (Hermit) on Jan 31, 2011 at 21:31 UTC

    Newer processors have a popcnt ("population count") instruction, which computes the number of 1-bits (for up to 64 bits) with a single fast CPU instruction.  In other words, if you want real speed at the cost of portability, you could hack a little assembly [1][2].

    A quick test shows that this is roughly 10 times as fast as the above-mentioned unpack "%32b*".

    #!/usr/bin/perl -wl
    use strict;
    use Inline C => Config => LDDLFLAGS => '-shared ../../../popcount.o';
    use Inline "C";
    use Benchmark "cmpthese";

    my $s = pack "C*", map { int rand 256 } 1 .. 2**18;   # 256k

    #print unpack q/%32b*/, $s;
    #print count_bits($s);

    cmpthese (-2, {
        unpack => sub { my $n = unpack q/%32b*/, $s; },
        popcnt => sub { my $n = count_bits($s); },
    });

    __END__
    __C__

    extern unsigned long popcount(char *, int64_t);   // the asm routine

    unsigned long count_bits (SV* bitstr) {           // XS wrapper
        STRLEN l;
        char *s = SvPV(bitstr, l);
        return popcount(s, l);
    }
               Rate unpack popcnt
    unpack   1252/s     --   -91%
    popcnt  13581/s   984%     --

    Tested with

    $ cat /proc/cpuinfo | grep name | head -1
    model name      : AMD Phenom(tm) 9600 Quad-Core Processor

    ___

    [1] The assembly code (nasm syntax; 64-bit; gcc register subroutine calling convention; compile with nasm -f elf64 popcount.asm):

    bits 64
    global popcount

    section .text

    ; prototype: unsigned long popcount(char*, int64_t);
    popcount:
            push    rbp
            mov     rbp, rsp
            mov     rcx, rsi
            shr     rcx, 3
            mov     rsi, rdi
            xor     rax, rax
    .cnt    popcnt  rbx, [rsi]
            add     rax, rbx
            add     rsi, 8
            loop    .cnt
            mov     rsp, rbp
            pop     rbp
            ret

    (Caveat: as this is just proof-of-concept code, the string length is assumed to be a multiple of 8 bytes/64 bits, so you might need to zero-pad it.)
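    For example, one way to zero-pad in Perl before handing the string to count_bits() (a sketch only; the appended NUL bytes contribute no 1-bits, so the count is unchanged):

    $s .= "\0" x ( ( 8 - length($s) % 8 ) % 8 );   # pad to a multiple of 8 bytes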

    [2] In theory, there is also the gcc __builtin_popcount function, but in practice it turns out to be less portable than a piece of assembly code.

      The x86 `loop` instruction is slow; never use it. See this Stack Overflow post about LOOP being slow: https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently

      Write your function in C, using GNU C __builtin_popcountll, and compile with -march=native (so the compiler can use the popcnt instruction if your CPU supports it).
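      A rough sketch of what that could look like with Inline::C, mirroring the script above (untested; assumes a GCC-compatible compiler, and passing the flag via the OPTIMIZE config is just one way to do it):

      #!/usr/bin/perl -wl
      use strict;
      use Inline C => Config => OPTIMIZE => '-O2 -march=native';
      use Inline "C";

      print count_bits( pack "b*", "01010110111000" );   # prints 7

      __END__
      __C__
      #include <string.h>

      /* Count set bits with the GCC builtin; with -march=native the
         compiler is free to emit the popcnt instruction.  Works for any
         length, so no zero-padding of the packed string is needed. */
      unsigned long count_bits (SV* bitstr) {
          STRLEN len;
          unsigned char *s = (unsigned char *)SvPV(bitstr, len);
          unsigned long total = 0;
          STRLEN i = 0;

          for ( ; i + 8 <= len; i += 8 ) {          /* whole 64-bit chunks */
              unsigned long long chunk;
              memcpy(&chunk, s + i, 8);
              total += __builtin_popcountll(chunk);
          }
          for ( ; i < len; i++ )                    /* leftover bytes */
              total += __builtin_popcountll(s[i]);
          return total;
      }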

      Or if you really want NASM, your function only works if used with a whole number of 64-bit chunks, so I changed the signature to int64_t*. The count is still a byte-count, though.

      ; Untested (check for off-by-one array indexing)
      ; Probably about 3x as fast as the above for large arrays
      ; on Intel Sandybridge-family.  Unrolling by 2 or so would help.

      ; prototype: unsigned long popcount(int64_t* rdi, int64_t rsi);
      ; requires byte count in rsi >= 8, and a multiple of 8.
      ALIGN 16
      global popcount
      popcount:
              add     rdi, rsi        ; rdi = end of buffer
              neg     rsi             ; negative index counts up to zero
              xor     rax, rax        ; do {
      .cnt    mov     rdx, [rdi+rsi]  ;   mov-load breaks the dependency
              popcnt  rdx, rdx        ;   on rdx from last iteration.
              add     rax, rdx
              add     rsi, 8          ;   index += 8
              jl      .cnt            ; } while (index < 0)
              ret

      popcnt on Intel Sandybridge-family has a false dependency on the destination. Using a separate load avoids creating a loop-carried dependency on the 3-cycle latency of popcnt.

      Using an indexed addressing mode saves an instruction inside the loop. Checking the count and using an unrolled loop would gain speed, especially on AMD Ryzen where popcnt has 4 per clock throughput (so this will bottleneck on the front-end instead of on 2 loads per clock).

      It's 4 fused-domain uops on Intel Sandybridge-family, though (since add / jl can macro-fuse), so it can run at about 1 per clock. It should be at least 3x faster than the old version, which bottlenecks on the popcnt false dependency on Sandybridge-family, or on the LOOP instruction front-end throughput. (Although the old version would actually run ok on AMD Bulldozer-family, where LOOP is fast and POPCNT has no false dependency).

      If you didn't know all this stuff, you should have written your function in C and let the compiler get it right. Or read Agner Fog's optimization guide, instruction tables, and microarch guide (http://agner.org/optimize/). Also, Clang would auto-vectorize for you with AVX2, using vpshufb to count faster (on Haswell and later) than you can with popcnt. By all means look at the C compiler output to make sure you got what you wanted, though.

      Peter Cordes (https://stackoverflow.com/users/224132/peter-cordes)

      Assembly is beyond me. This is a cool suggestion, but the processor on my machine isn't new enough to have this instruction. Plus, I'm fairly sure that I have other significant bottlenecks in my code, so I'm not sure I'd benefit nearly as much from moving part of it to assembly. Still, a very nice idea.