To shift or not to shift
-
- Posts: 68
- Joined: Tue Jun 02, 2015 11:02 pm
- Real Name: Vince
To shift or not to shift
Consider the following code...
void BitmapTest()
{
    U64 i, j, val;
    U64 PowerOf2[64];
    for (i = 0; i <= 63; i++)
        PowerOf2[i] = (1ULL << i);
    clock_t begin = clock();
    for (j = 1; j < 1000000000; j++)
        for (i = 0; i < 64; i++)
        {
            // try using shifts
            val = 1ULL << i;
            // try array
            // val = PowerOf2[i];
        }
    printf("%llu\n", val); // needed to keep the optimizer from eliminating the loops
}
Compiled with VS2013, x64, full optimization for speed.
The CPU is an Intel i5-2520 @ 2.50 GHz.
The line
val = 1ULL << i;
takes 31.2 seconds, while the table lookup
val = PowerOf2[i];
takes 11.9 seconds!
Wow.
I'm guessing the former is dependent on the CPU type, but I doubt it will beat the table lookup method.
Have others found the table lookup method superior?
I was testing this to see which method would be superior for creating masks to modify a single bit in a variable.
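Concretely, the use case looks something like this (a rough sketch; the helper names below are only illustrative, not from any engine):
Code: Select all
typedef unsigned long long U64;

static U64 PowerOf2[64];   /* filled once at startup, as in the test above */

/* shift version */
static void SetBitShift(U64 *bb, int sq) { *bb |=  (1ULL << sq); }
static void ClrBitShift(U64 *bb, int sq) { *bb &= ~(1ULL << sq); }

/* table version */
static void SetBitTable(U64 *bb, int sq) { *bb |=  PowerOf2[sq]; }
static void ClrBitTable(U64 *bb, int sq) { *bb &= ~PowerOf2[sq]; }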
"An Engine's strength flows from the Search. But beware, pruning, extensions, reductions; the dark side of the Search are they. Once you start down the dark path, it will dominate and consume you, as it has to so many developers before.” -- Yoda
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
thevinenator wrote: Consider the following code...
void BitmapTest()
{
    U64 i, j, val;
    U64 PowerOf2[64];
    for (i = 0; i <= 63; i++)
        PowerOf2[i] = (1ULL << i);
    clock_t begin = clock();
    for (j = 1; j < 1000000000; j++)
        for (i = 0; i < 64; i++)
        {
            // try using shifts
            val = 1ULL << i;
            // try array
            // val = PowerOf2[i];
        }
    printf("%llu\n", val); // needed to keep the optimizer from eliminating the loops
}
Compiled with VS2013, x64, full optimization for speed.
The CPU is an Intel i5-2520 @ 2.50 GHz.
The line
val = 1ULL << i;
takes 31.2 seconds, while the table lookup
val = PowerOf2[i];
takes 11.9 seconds!
Wow.
I'm guessing the former is dependent on the CPU type, but I doubt it will beat the table lookup method.
Have others found the table lookup method superior?
I was testing this to see which method would be superior for creating masks to modify a single bit in a variable.
Have you looked at the asm???
Hint: there is not a shift to be found using the code as-is (without the array lookup). You are NOT measuring what you think you are... In fact, you will discover that the inner loop is ALSO missing. You REALLY don't think the compiler realizes that shifting something 64 bits ends up with zero?
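One way to keep the work alive is to make every iteration feed a value the program actually prints, along these lines (a sketch, untested; the XOR accumulator and the timing print are additions here, and a clever compiler may still simplify it, so checking the asm remains mandatory):
Code: Select all
#include <stdio.h>
#include <time.h>

typedef unsigned long long U64;

void BitmapTest(void)
{
    U64 i, j, val, sum = 0;
    clock_t begin = clock();
    for (j = 1; j < 1000000000; j++)
        for (i = 0; i < 64; i++)
        {
            val = 1ULL << i;      /* or: val = PowerOf2[i]; */
            sum ^= val + j;       /* fold j in so iterations cannot be collapsed */
        }
    printf("%llu %.2f\n", sum, (double)(clock() - begin) / CLOCKS_PER_SEC);
}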
-
- Posts: 6
- Joined: Mon Aug 19, 2013 7:32 pm
- Real Name: Lasse Hansen
Re: To shift or not to shift
My experience is that a table lookup for SetBit() is always faster than shift, both with my perft and my engine.
I use ClearBit(sqr) = ~SetBit(sqr) together with this, instead of a separate table for clearing a bit.
It is easy to change to using shift later though.
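In outline it is just this (a sketch; the init loop here is assumed, only the macro names appear in the code further down):
Code: Select all
static U64 SetBitTbl[64];

void InitSetBitTbl(void)                  /* assumed: run once at startup */
{
    for (int i = 0; i < 64; i++)
        SetBitTbl[i] = 1ull << i;
}

#define SetBit(a)   (SetBitTbl[a])
#define ClearBit(a) (~SetBitTbl[a])       /* one table serves both set and clear */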
This is on an i7-2600K.
Regards, Lasse
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
I don't see how it can be faster. A shift is a one-cycle instruction. An array lookup requires the address computation and a cache probe. If you have a decent CPU and the value is in L1, they are going to be fairly close. If not in L1, the shift easily wins. But I don't see any way a shift can be slower than a memory/cache reference, even if it is an L1 hit. Shift = 1 micro-op. Memory access is at least two.
However, you do have to be careful and make sure your measurements are correct. The code posted above does nothing even remotely close to what the C code implies, for example. Drawing any conclusions from running that code will be based on incorrect assumptions and not knowing exactly what the compiler is doing. gcc -O -S is always your friend if you are looking at this kind of stuff.
IE try the -S with the array reference. There are NONE. Your printing "val" is only misleading you. The inner loop is thrown out, leaving val = PowerOf2[i]. But then i becomes a constant of 63, the last value through the loop (all other val computations are overwritten). So there are zero memory references for that code. You have to be VERY careful when writing code to benchmark something, lest the optimizer wreck your measurement as happens here. I'd ALWAYS use the shift for speed; it is always faster than a memory reference, so long as you don't do things that force multiple memory references, say one for the value to shift, and two or three to evaluate the expression for the shift amount. But then again, that's not particularly good programming either, and logically you would have to do the same thing for the array lookup.
-
- Posts: 6
- Joined: Mon Aug 19, 2013 7:32 pm
- Real Name: Lasse Hansen
Re: To shift or not to shift
hyatt wrote: I don't see how it can be faster. A shift is a one cycle instruction. An array lookup requires the address computation and a cache probe.
Well, it's faster for me (tested it once again). It might be that when using a variable shift the RCX register is clobbered, due to the use of CL, and the RCX content has to be restored later, when register pressure is high.
Lasse
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
That is still only an extra load from memory, same as the table lookup.
Are you sure you are measuring what you think you are? IE your test program was totally wiped out by the optimizer.
-
- Posts: 6
- Joined: Mon Aug 19, 2013 7:32 pm
- Real Name: Lasse Hansen
Re: To shift or not to shift
hyatt wrote: That is still only an extra load from memory, same as the table lookup. Are you sure you are measuring what you think you are? IE your test program was totally wiped out by the optimizer.
Yes, I tried both with my C engine Sillycon, and with my new C++ perft program.
For Sillycon the difference in NPS is around 0.7%, for perft around 2%.
Code: Select all
gcc chess.c -pipe -lm -lpthread -O3 -march=native -mtune=native -finline-limit=220
//#define SetBit(a) (SetBitTbl[a])
//#define ClrBit(a) (~SetBitTbl[a])
#define SetBit(a) (1ull<<(a))
#define ClrBit(a) (~(1ull<<(a)))
Code: Select all
g++ -O2 main.cpp
//inline U64 SetBit(int sqr) { return SetBitTbl[sqr]; }
//inline U64 ClrBit(int sqr) { return ~SetBitTbl[sqr]; }
inline U64 SetBit(int sqr) { return 1ull<<sqr; }
inline U64 ClrBit(int sqr) { return ~SetBit(sqr); }
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
Lasse Hansen wrote:
hyatt wrote: That is still only an extra load from memory, same as the table lookup. Are you sure you are measuring what you think you are? IE your test program was totally wiped out by the optimizer.
Yes, I tried both with my C engine Sillycon, and with my new C++ perft program.
For Sillycon the difference in NPS is around 0.7%, for perft around 2%.
Code: Select all
gcc chess.c -pipe -lm -lpthread -O3 -march=native -mtune=native -finline-limit=220
//#define SetBit(a) (SetBitTbl[a])
//#define ClrBit(a) (~SetBitTbl[a])
#define SetBit(a) (1ull<<(a))
#define ClrBit(a) (~(1ull<<(a)))
Lasse
Code: Select all
g++ -O2 main.cpp
//inline U64 SetBit(int sqr) { return SetBitTbl[sqr]; }
//inline U64 ClrBit(int sqr) { return ~SetBitTbl[sqr]; }
inline U64 SetBit(int sqr) { return 1ull<<sqr; }
inline U64 ClrBit(int sqr) { return ~SetBit(sqr); }
BTW if this is a 64 bit compile, you have a ton of registers the compiler can use. cl/cx/ecx/rcx should not be an issue.
-
- Posts: 6
- Joined: Mon Aug 19, 2013 7:32 pm
- Real Name: Lasse Hansen
Re: To shift or not to shift
hyatt wrote: BTW if this is a 64 bit compile, you have a ton of registers the compiler can use. cl/cx/ecx/rcx should not be an issue.
I use Ubuntu 14.04 64-bit, gcc and g++ version 4.8.4.
BTW, if using 1<<shift and not a table, the code to clear a bit may be:
Code: Select all
static inline U64 rol(U64 x, U64 r)
{
    asm ("rolq %%cl, %0" : "=r" (x) : "0" (x), "c" (r));
    return x;
}

inline U64 ClrBit(int sqr) { return rol(-2ull, sqr); }
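A quick sanity check that the rotate trick matches the plain mask (a throwaway test assuming the U64 typedef and the rol() above; not from the engine):
Code: Select all
#include <assert.h>

int main(void)
{
    for (int sqr = 0; sqr < 64; sqr++)
        assert(rol(-2ull, sqr) == ~(1ull << sqr));  /* rotating ~1 left by sqr clears exactly bit sqr */
    return 0;
}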
Lasse
-
- Posts: 1242
- Joined: Thu Jun 10, 2010 2:13 am
- Real Name: Bob Hyatt (Robert M. Hyatt)
- Location: University of Alabama at Birmingham
- Contact:
Re: To shift or not to shift
If you use cl, you certainly clobber rcx.
But regardless, that is a one-cycle instruction. The best you can hope for from a table probe is one cycle also, assuming an L1 cache hit and assuming your L1 hit latency is one cycle. The shift will never vary, while the table probe won't always be an L1 hit. So I still don't see why it would be slower to use the shift. It would be interesting to compile the actual code both ways, using -S, to see what, if anything, is going on that looks odd. GCC has had some interesting optimization bugs over the years, so perhaps here...
Here is what I get from my gcc. If I use the 1ULL << n approach:
movl $1, %eax
movl %edi, %ecx
salq %cl, %rax
ret
and if I use the 64 entry array approach:
movslq %edi, %rdi
movq _bits@GOTPCREL(%rip), %rax
movq (%rax,%rdi,8), %rax
ret
That last movq can be 1 or more cycles depending on L1, giving 3 cycles or more. The first approach has no cache dependencies and should always execute in constant time of 3 cycles.
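For reference, C that produces asm like the above would be something along these lines (a sketch; the function names and the extern array are assumptions matching the _bits symbol, not the exact code compiled here):
Code: Select all
typedef unsigned long long U64;

extern U64 bits[64];                              /* the 64-entry table behind the _bits symbol */

U64 setbit_shift(int n) { return 1ULL << n; }     /* mov $1 / mov %edi,%ecx / salq %cl,%rax */
U64 setbit_table(int n) { return bits[n]; }       /* movslq / load table address / indexed movq */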