To shift or not to shift
-
- Posts: 68
- Joined: Tue Jun 02, 2015 11:02 pm
- Real Name: Vince
To shift or not to shift
Consider the following code...
void BitmapTest()
{
    U64 i, j, val;
    U64 PowerOf2[64];
    for (i = 0; i <= 63; i++)
        PowerOf2[i] = (1ULL << i);
    clock_t begin = clock();
    for (j = 1; j < 1000000000; j++)
        for (i = 0; i < 64; i++)
        {
            // try using shifts
            val = 1ULL << i;
            // try array
            // val = PowerOf2[i];
        }
    printf("%llu\n", val); // needed to keep the optimizer from eliminating the loops
}
Compiled with VS2013, x64, full optimization for speed.
The CPU is an Intel i5-2520 @ 2.50 GHz.
The line
val = 1ULL << i;
takes 31.2 seconds, while the table lookup
val = PowerOf2[i];
takes 11.9 seconds!
Wow.
I'm guessing the former is dependent on the CPU type, but I doubt it will beat the table lookup method.
Have others found the table lookup method superior?
I was testing this to see which method would be superior for creating masks to modify a single bit in a variable.
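Concretely, the use case looks something like this (a rough sketch; the helper names below are only illustrative, not from any engine):
Code: Select all
typedef unsigned long long U64;

static U64 PowerOf2[64];   /* filled once at startup, as in the test above */

/* shift version */
static void SetBitShift(U64 *bb, int sq) { *bb |=  (1ULL << sq); }
static void ClrBitShift(U64 *bb, int sq) { *bb &= ~(1ULL << sq); }

/* table version */
static void SetBitTable(U64 *bb, int sq) { *bb |=  PowerOf2[sq]; }
static void ClrBitTable(U64 *bb, int sq) { *bb &= ~PowerOf2[sq]; }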
"An Engine's strength flows from the Search. But beware, pruning, extensions, reductions; the dark side of the Search are they. Once you start down the dark path, it will dominate and consume you, as it has to so many developers before.” -- Yoda
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
thevinenator wrote: Consider the following code...
void BitmapTest()
{
    U64 i, j, val;
    U64 PowerOf2[64];
    for (i = 0; i <= 63; i++)
        PowerOf2[i] = (1ULL << i);
    clock_t begin = clock();
    for (j = 1; j < 1000000000; j++)
        for (i = 0; i < 64; i++)
        {
            // try using shifts
            val = 1ULL << i;
            // try array
            // val = PowerOf2[i];
        }
    printf("%llu\n", val); // needed to keep the optimizer from eliminating the loops
}
Compiled with VS2013, x64, full optimization for speed.
The CPU is an Intel i5-2520 @ 2.50 GHz.
The line
val = 1ULL << i;
takes 31.2 seconds, while the table lookup
val = PowerOf2[i];
takes 11.9 seconds!
Wow.
I'm guessing the former is dependent on the CPU type, but I doubt it will beat the table lookup method.
Have others found the table lookup method superior?
I was testing this to see which method would be superior for creating masks to modify a single bit in a variable.
Have you looked at the asm???
Hint: there is not a shift to be found using the code as-is (without the array lookup). You are NOT measuring what you think you are... In fact, you will discover that the inner loop is ALSO missing. You REALLY don't think the compiler realizes that shifting something 64 bits ends up with zero?
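One way to keep the work alive is to make every iteration feed a value the program actually prints, along these lines (a sketch, untested; the XOR accumulator and the timing print are additions here, and a clever compiler may still simplify it, so checking the asm remains mandatory):
Code: Select all
#include <stdio.h>
#include <time.h>

typedef unsigned long long U64;

void BitmapTest(void)
{
    U64 i, j, val, sum = 0;
    clock_t begin = clock();
    for (j = 1; j < 1000000000; j++)
        for (i = 0; i < 64; i++)
        {
            val = 1ULL << i;      /* or: val = PowerOf2[i]; */
            sum ^= val + j;       /* fold j in so iterations cannot be collapsed */
        }
    printf("%llu %.2f\n", sum, (double)(clock() - begin) / CLOCKS_PER_SEC);
}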
-
- Posts: 6
- Joined: Mon Aug 19, 2013 7:32 pm
- Real Name: Lasse Hansen
Re: To shift or not to shift
My experience is that a table lookup for SetBit() is always faster than shift, both with my perft and my engine.
I use ClearBit(sqr) = ~SetBit(sqr) together with this, instead of a separate table for clearing a bit.
It is easy to change to using shift later though.
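In outline it is just this (a sketch; the init loop here is assumed, only the macro names appear in the code further down):
Code: Select all
static U64 SetBitTbl[64];

void InitSetBitTbl(void)                  /* assumed: run once at startup */
{
    for (int i = 0; i < 64; i++)
        SetBitTbl[i] = 1ull << i;
}

#define SetBit(a)   (SetBitTbl[a])
#define ClearBit(a) (~SetBitTbl[a])       /* one table serves both set and clear */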
This is on an i7-2600K.
Regards, Lasse
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
I don't see how it can be faster. A shift is a one-cycle instruction. An array lookup requires the address computation and a cache probe. If you have a decent CPU and the value is in L1, they are going to be fairly close. If not in L1, the shift easily wins. But I don't see any way a shift can be slower than a memory/cache reference, even if it is an L1 hit. Shift = 1 micro-op. Memory access is at least two.
However, you do have to be careful and make sure your measurements are correct. The code posted above does nothing even remotely close to what the C code implies, for example. Drawing any conclusions from running that code will be based on incorrect assumptions and not knowing exactly what the compiler is doing. gcc -O -S is always your friend if you are looking at this kind of stuff.
IE try the -S with the array reference. There are NONE. Your printing "val" is only misleading you. The inner loop is thrown out, leaving val = PowerOf2[i]. But then i becomes a constant of 63, the last value through the loop (all other val computations are overwritten). So there are zero memory references for that code. You have to be VERY careful when writing code to benchmark something, lest the optimizer wreck your measurement as happens here. I'd ALWAYS use the shift for speed; it is always faster than a memory reference, so long as you don't do things that force multiple memory references, say one for the value to shift, and two or three to evaluate the expression for the shift amount. But then again, that's not particularly good programming either, and logically you would have to do the same thing for the array lookup.
-
- Posts: 6
- Joined: Mon Aug 19, 2013 7:32 pm
- Real Name: Lasse Hansen
Re: To shift or not to shift
hyatt wrote: I don't see how it can be faster. A shift is a one cycle instruction. An array lookup requires the address computation and a cache probe.
Well, it's faster for me (tested it once again). It might be that when using a variable shift the RCX register is clobbered, due to the use of CL, and the RCX content has to be restored later, when register pressure is high.
Lasse
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
That is still only an extra load from memory, same as the table lookup.
Are you sure you are measuring what you think you are? IE your test program was totally wiped out by the optimizer.
-
- Posts: 6
- Joined: Mon Aug 19, 2013 7:32 pm
- Real Name: Lasse Hansen
Re: To shift or not to shift
hyatt wrote: That is still only an extra load from memory, same as the table lookup. Are you sure you are measuring what you think you are? IE your test program was totally wiped out by the optimizer.
Yes, I tried both with my C engine Sillycon, and with my new C++ perft program.
For Sillycon the difference in NPS is around 0.7%, for perft around 2%.
Code: Select all
gcc chess.c -pipe -lm -lpthread -O3 -march=native -mtune=native -finline-limit=220
//#define SetBit(a) (SetBitTbl[a])
//#define ClrBit(a) (~SetBitTbl[a])
#define SetBit(a) (1ull<<(a))
#define ClrBit(a) (~(1ull<<(a)))
Code: Select all
g++ -O2 main.cpp
//inline U64 SetBit(int sqr) { return SetBitTbl[sqr]; }
//inline U64 ClrBit(int sqr) { return ~SetBitTbl[sqr]; }
inline U64 SetBit(int sqr) { return 1ull<<sqr; }
inline U64 ClrBit(int sqr) { return ~SetBit(sqr); }
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
Lasse Hansen wrote:
hyatt wrote: That is still only an extra load from memory, same as the table lookup. Are you sure you are measuring what you think you are? IE your test program was totally wiped out by the optimizer.
Yes, I tried both with my C engine Sillycon, and with my new C++ perft program.
For Sillycon the difference in NPS is around 0.7%, for perft around 2%.
Code: Select all
gcc chess.c -pipe -lm -lpthread -O3 -march=native -mtune=native -finline-limit=220
//#define SetBit(a) (SetBitTbl[a])
//#define ClrBit(a) (~SetBitTbl[a])
#define SetBit(a) (1ull<<(a))
#define ClrBit(a) (~(1ull<<(a)))
Lasse
Code: Select all
g++ -O2 main.cpp
//inline U64 SetBit(int sqr) { return SetBitTbl[sqr]; }
//inline U64 ClrBit(int sqr) { return ~SetBitTbl[sqr]; }
inline U64 SetBit(int sqr) { return 1ull<<sqr; }
inline U64 ClrBit(int sqr) { return ~SetBit(sqr); }
BTW if this is a 64 bit compile, you have a ton of registers the compiler can use. cl/cx/ecx/rcx should not be an issue.
-
- Posts: 6
- Joined: Mon Aug 19, 2013 7:32 pm
- Real Name: Lasse Hansen
Re: To shift or not to shift
hyatt wrote: BTW if this is a 64 bit compile, you have a ton of registers the compiler can use. cl/cx/ecx/rcx should not be an issue.
I use Ubuntu 14.04 64-bit, gcc and g++ version 4.8.4.
BTW, if using 1<<shift and not a table, the code to clear a bit may be:
Code: Select all
static inline U64 rol(U64 x, U64 r)
{
    asm ("rolq %%cl, %0" : "=r" (x) : "0" (x), "c" (r));
    return x;
}

inline U64 ClrBit(int sqr) { return rol(-2ull, sqr); }
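A quick sanity check that the rotate trick matches the plain mask (a throwaway test assuming the U64 typedef and the rol() above; not from the engine):
Code: Select all
#include <assert.h>

int main(void)
{
    for (int sqr = 0; sqr < 64; sqr++)
        assert(rol(-2ull, sqr) == ~(1ull << sqr));  /* rotating ~1 left by sqr clears exactly bit sqr */
    return 0;
}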
Lasse
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
If you use cl, you certainly clobber rcx.
But regardless, that is a one-cycle instruction. The best you can hope for from a table probe is one cycle also, assuming an L1 cache hit and assuming your L1 hit latency is one cycle. The shift will never vary, while the table probe won't always be an L1 hit. So I still don't see why it would be slower to use the shift. It would be interesting to compile the actual code both ways, using -S, to see what, if anything, is going on that looks odd. GCC has had some interesting optimization bugs over the years, so perhaps here...
Here is what I get from my gcc. If I use the 1ULL << n approach:
movl $1, %eax
movl %edi, %ecx
salq %cl, %rax
ret
and if I use the 64 entry array approach:
movslq %edi, %rdi
movq _bits@GOTPCREL(%rip), %rax
movq (%rax,%rdi,8), %rax
ret
That last movq can be 1 or more cycles depending on L1, giving 3 cycles or more. The first approach has no cache dependencies and should always execute in constant time of 3 cycles.
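For reference, C that produces asm like the above would be something along these lines (a sketch; the function names and the extern array are assumptions matching the _bits symbol, not the exact code compiled here):
Code: Select all
typedef unsigned long long U64;

extern U64 bits[64];                              /* the 64-entry table behind the _bits symbol */

U64 setbit_shift(int n) { return 1ULL << n; }     /* mov $1 / mov %edi,%ecx / salq %cl,%rax */
U64 setbit_table(int n) { return bits[n]; }       /* movslq / load table address / indexed movq */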