Page 1 of 1

PopCount timing comparisons

Posted: Thu Dec 03, 2015 9:01 pm
by thevinenator
I'm working on converting my chess program from square based to bitboards. So far, so good, already seeing a 40% speed up without doing any optimizing.

The bitboard code makes use of a ScanBit() method to return the next 'on' bit in a bitboard. My current implementation was the code offered on this page:

https://chessprogramming.wikispaces.com ... for+Magics

The methods to count bits and the bitscan seemed liked good candidates for optimization, in fact on the

https://chessprogramming.wikispaces.com/BitScan

page, the section "Index of LS1B by Popcount" hints that if you have a "fast" popcount routine, the bitscan can be improved.

My first step was to compare the published code for PopCount against the CPU hardware implementation.
I wrote some code that checks if the CPU supports popcount and if so, a global flag is set. I modified my PopCount() code as follows:

Code: Select all

int PopCount(U64 b) {
    if (G.POPCNT)
        return  (int)__popcnt64(b);
    int r;
    for (r = 0; b; r++, b &= (b - 1));
    return r;
    }
if the global POPCNT flag is set, then the intrinsic call is made.

I had read that there could be issues with this call, so I wanted to test whether there was any loss of accuracy between the two methods. The following code was used for the test:

Code: Select all

void testbitscan() {

    int i, iTot = 0;
    int iStart;
    U64 uT;

    iStart = GetTimeMs();

    for (i = 0; i < (MILLION * 100); i++)
        {
        uT = RandomU64();
        iTot += PopCount(uT);
        }

    printf_s("%lld\n", iTot);
    printf_s("ET: %lld\n", GetTimeMs() - iStart);

    }
NOTE: I'm using MS VC for development and have no plans at this time to have a Linux port, so all the code is MS centric.

I added code to accumulate the number of bits that were actually counted. I ran the test twice, once to allow the low-level intrinsic and the second to use the C-code.

The results are below. The first number is the total bits, which if divided by 100M, gives a value VERY close to 32, which is what I expected from the RNG. Note the two counts are identical, indicating both methods were equivalent (they better be). The ET is the elapsed time in milli-seconds.

Intrinsic __PopCount 3199999936 ET: 640
C-code 3199999936 ET: 3838

The intrinsic call is 6x faster.

My next step will be to modify the bitscan code to use the LSB technique with PopCount to speed up that method. More on that later.

Re: PopCount timing comparisons

Posted: Fri Dec 04, 2015 9:42 pm
by thevinenator
As a follow up to my original post, i'm adding a timing test for BitScan code.

I'm currently using the following code found on the CPW programming pages.

Code: Select all

int pop_1st_bit(uint64 *bb) {
  uint64 b = *bb ^ (*bb - 1);
  unsigned int fold = (unsigned) ((b & 0xffffffff) ^ (b >> 32));
  *bb &= (*bb - 1);
  return BitTable[(fold * 0x783a9b23) >> 26];
A test against 50M random numbers takes 3.83 seconds.

I then tried the following code:

Code: Select all

int GetNextBit1(U64 *bb)
    {
    unsigned long bit;
    _BitScanForward64(&bit, *bb);
    *bb &= (*bb - 1);
    return bit;
    }
It ran slightly faster at 2.36 seconds.

_BitScanForward64 is a MSVC call, but I think there is a GCC equivalent.

Circling back to my original thread, I referenced an entry on the CPW BitScan page:
Index of LS1B by Popcount
If we have a fast population-count instruction, we can count the trailing zeros of LS1B after subtracting one:

Code: Select all

// precondition bb != 0
int bitScanForward(U64 bb) {
   assert (bb != 0);
   return popCount( (bb & -bb) - 1 );
}
I wanted to try the MSVC intrinsic implementation of popcount, since its speed looks promising, but the above code seems to present a problem.

Code: Select all

return popCount( (bb & -bb) - 1 );
how can you take the negative of a unsigned integer?
What is this code suppose to do?
Does anyone have an implementation that uses popcount that works as a "pop" next bit routine (as shown in the beginning of this response) ?

Re: PopCount timing comparisons

Posted: Thu Dec 10, 2015 9:36 pm
by Gerd Isenberg
thevinenator wrote: I wanted to try the MSVC intrinsic implementation of popcount, since its speed looks promising, but the above code seems to present a problem.

Code: Select all

return popCount( (bb & -bb) - 1 );
how can you take the negative of a unsigned integer?
What is this code suppose to do?
Does anyone have an implementation that uses popcount that works as a "pop" next bit routine (as shown in the beginning of this response) ?
Minus operator of an unsigned int still results in its two's complement, despite still interpreted unsigned. It's the same as ones' complement plus one. To supress compiler warnings you may use (0 - bb) or even (1 + ~bb). The intersection of a value with its two's complement isolates the least significant one bit, applied in many bit twiddling apps
https://chessprogramming.wikispaces.com ... OneBitLS1B

Subtracting one leaves the trailing bits one, and their population equals the bitindex.
e.g bit 3 set:
0000 1000
0000 0001 minus one
0000 0111 leaves 3 trailing bits.