PopCount timing comparisons
Posted: Thu Dec 03, 2015 9:01 pm
I'm working on converting my chess program from square based to bitboards. So far, so good, already seeing a 40% speed up without doing any optimizing.
The bitboard code makes use of a ScanBit() method to return the next 'on' bit in a bitboard. My current implementation was the code offered on this page:
https://chessprogramming.wikispaces.com ... for+Magics
The methods to count bits and the bitscan seemed liked good candidates for optimization, in fact on the
https://chessprogramming.wikispaces.com/BitScan
page, the section "Index of LS1B by Popcount" hints that if you have a "fast" popcount routine, the bitscan can be improved.
My first step was to compare the published code for PopCount against the CPU hardware implementation.
I wrote some code that checks if the CPU supports popcount and if so, a global flag is set. I modified my PopCount() code as follows:
if the global POPCNT flag is set, then the intrinsic call is made.
I had read that there could be issues with this call, so I wanted to test whether there was any loss of accuracy between the two methods. The following code was used for the test:
NOTE: I'm using MS VC for development and have no plans at this time to have a Linux port, so all the code is MS centric.
I added code to accumulate the number of bits that were actually counted. I ran the test twice, once to allow the low-level intrinsic and the second to use the C-code.
The results are below. The first number is the total bits, which if divided by 100M, gives a value VERY close to 32, which is what I expected from the RNG. Note the two counts are identical, indicating both methods were equivalent (they better be). The ET is the elapsed time in milli-seconds.
Intrinsic __PopCount 3199999936 ET: 640
C-code 3199999936 ET: 3838
The intrinsic call is 6x faster.
My next step will be to modify the bitscan code to use the LSB technique with PopCount to speed up that method. More on that later.
The bitboard code makes use of a ScanBit() method to return the next 'on' bit in a bitboard. My current implementation was the code offered on this page:
https://chessprogramming.wikispaces.com ... for+Magics
The methods to count bits and the bitscan seemed liked good candidates for optimization, in fact on the
https://chessprogramming.wikispaces.com/BitScan
page, the section "Index of LS1B by Popcount" hints that if you have a "fast" popcount routine, the bitscan can be improved.
My first step was to compare the published code for PopCount against the CPU hardware implementation.
I wrote some code that checks if the CPU supports popcount and if so, a global flag is set. I modified my PopCount() code as follows:
Code: Select all
int PopCount(U64 b) {
if (G.POPCNT)
return (int)__popcnt64(b);
int r;
for (r = 0; b; r++, b &= (b - 1));
return r;
}
I had read that there could be issues with this call, so I wanted to test whether there was any loss of accuracy between the two methods. The following code was used for the test:
Code: Select all
void testbitscan() {
int i, iTot = 0;
int iStart;
U64 uT;
iStart = GetTimeMs();
for (i = 0; i < (MILLION * 100); i++)
{
uT = RandomU64();
iTot += PopCount(uT);
}
printf_s("%lld\n", iTot);
printf_s("ET: %lld\n", GetTimeMs() - iStart);
}
I added code to accumulate the number of bits that were actually counted. I ran the test twice, once to allow the low-level intrinsic and the second to use the C-code.
The results are below. The first number is the total bits, which if divided by 100M, gives a value VERY close to 32, which is what I expected from the RNG. Note the two counts are identical, indicating both methods were equivalent (they better be). The ET is the elapsed time in milli-seconds.
Intrinsic __PopCount 3199999936 ET: 640
C-code 3199999936 ET: 3838
The intrinsic call is 6x faster.
My next step will be to modify the bitscan code to use the LSB technique with PopCount to speed up that method. More on that later.