Code:
#include <windows.h>

int CPUS_SIZE;  /* global holding the detected logical processor count */

int GetNumCPU()
{
    SYSTEM_INFO si;
    GetSystemInfo( &si );                       /* query Windows for system info */
    CPUS_SIZE = (int) si.dwNumberOfProcessors;  /* number of logical processors */
    return CPUS_SIZE;
}
I can understand having a separate single-CPU compile and a multi-core one, where the single-CPU build will not have threading code. But as soon as you get to 2 CPUs, the code should not differ for 4 or 8 CPUs. Unfortunately we can't see your source; maybe some day you will release it, and I hope Norm will do the same too. If you are building the executable with Intel processor-specific optimization, that could explain the performance difference when run on a different architecture.
Robert Houdart wrote: As you can see by comparing the size of the EXE files, the different versions of Houdini contain different code, or more exactly stated, a different number of copies of the same code.
The 1_CPU version doesn't contain any of the hash-locking, thread or split point management code that is required for multi-threading. This results in a more efficient compile, for the 32-bit version we're talking about approx. 5% speed gain.
The multi-core versions contain duplicates of all the evaluation and move playing code, with each copy of the code being hard-wired to an individual internal board representation. For example, the 8_CPU version has 8 copies of the "evaluate" routine, each copy working on its own board representation.
This gives a nice performance boost on Intel processors, but unfortunately, as I'm finding out, this architecture doesn't seem very well adapted to AMD Phenom II processors. I may have to go back to a more traditional architecture, sacrificing some speed on Intel to gain some speed on AMD.
Houdini does use the number of processors reported by the system to limit the number of threads actually in use - in other words, the 8_CPU version will never use more than 2 cores on a dual core computer.
Robert
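For illustration, the thread-count limiting Robert describes, combined with a routine like the GetNumCPU() posted at the top of the thread, might look roughly like this. It is only a sketch; MAX_THREADS and ThreadsToUse are hypothetical names, not Houdini's actual code.
Code:
// Sketch: never launch more search threads than the system reports,
// regardless of which executable (2_CPU, 4_CPU, 8_CPU) is running.
#include <windows.h>

static const int MAX_THREADS = 8;   // compile-time limit of this build (illustrative)

int GetNumCPU()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return (int) si.dwNumberOfProcessors;
}

int ThreadsToUse(int requested_threads)
{
    int n = requested_threads;
    if (n > MAX_THREADS)  n = MAX_THREADS;   // build limit
    if (n > GetNumCPU())  n = GetNumCPU();   // hardware limit: 8_CPU build on a dual core => 2
    if (n < 1)            n = 1;
    return n;
}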
Since this is a programming forum, let me try to explain. It's actually quite simple.
kingliveson wrote: I can understand having a separate single-CPU compile and a multi-core one, where the single-CPU build will not have threading code. But as soon as you get to 2 CPUs, the code should not differ for 4 or 8 CPUs. Unfortunately we can't see your source; maybe some day you will release it, and I hope Norm will do the same too. If you are building the executable with Intel processor-specific optimization, that could explain the performance difference when run on a different architecture.
That number is amazing to me. In Crafty, there is exactly one if-test done at each node in the parallel version. Using just one thread, that test becomes a 100% correctly predicted branch. As a result, I see about a 0.2% (at worst, usually less) improvement from compiling out the SMP code.
Robert Houdart wrote: As you can see by comparing the size of the EXE files, the different versions of Houdini contain different code, or more exactly stated, a different number of copies of the same code.
The 1_CPU version doesn't contain any of the hash-locking, thread or split point management code that is required for multi-threading. This results in a more efficient compile, for the 32-bit version we're talking about approx. 5% speed gain.
Ugh. Ignore my previous question. Never considered code would be written in that way. That's so ugly I would never have even considered that idea.
The multi-core versions contain duplicates of all the evaluation and move playing code, with each copy of the code being hard-wired to an individual internal board representation. For example, the 8_CPU version has 8 copies of the "evaluate" routine, each copy working on its own board representation.
The idea is actually bad for Intel as well. You now have 8 different evaluation blocks of instructions, each taking up a part of cache, where with a more traditional approach you would free up 7/8 of the cache this pollutes.
This gives a nice performance boost on Intel processors, but unfortunately, as I'm finding out, this architecture doesn't seem very well adapted to AMD Phenom II processors. I may have to go back to a more traditional architecture, sacrificing some speed on Intel to gain some speed on AMD.
Houdini does use the number of processors reported by the system to limit the number of threads actually in use - in other words, the 8_CPU version will never use more than 2 cores on a dual core computer.
Robert
Sentinel's answer contains part of the reason; there's also the overhead of split points, hash locking and thread management. Overall it's a 3 to 5% performance hit between the 1_CPU version and the 2_CPU version used with 1 thread.
hyatt wrote: What on earth are you doing to eat 5% of the total cpu time when you have SMP code compiled in but not using more than one thread???
Actually the implementation is very clean and rather elegant - C++ templates are the way to go.
hyatt wrote: Ugh. Ignore my previous question. Never considered code would be written in that way. That's so ugly I would never have even considered that idea.
On modern processors all the cores have a separate instruction cache (typically 32 kB or 64 kB).
hyatt wrote: The idea is actually bad for Intel as well. You now have 8 different evaluation blocks of instructions, each taking up a part of cache, where with a more traditional approach you would free up 7/8 of the cache this pollutes.
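To make the "hard-wired copies" idea concrete, here is a minimal C++ template sketch of the technique as described above. It is not Houdini's code; Board, boards[] and the evaluation body are invented for illustration.
Code:
// Rough sketch of the duplication being discussed - not Houdini's code.
struct Board {
    int material;                  // stand-in for a full board representation
};

static Board boards[8];            // one internal board per search thread

template <int N>
int evaluate()
{
    // The compiler emits a separate copy of this routine for each N,
    // with &boards[N] folded in as a compile-time constant - no board
    // pointer is passed as an argument.
    return boards[N].material;
}

// Eight explicit instantiations give eight hard-wired copies of the same
// code, which is what the growing EXE sizes reflect.
int evaluate_thread0() { return evaluate<0>(); }
int evaluate_thread7() { return evaluate<7>(); }
The trade-off debated in this thread is visible here: each instantiation avoids passing a board pointer, but the eight copies of identical instructions also compete for whatever cache levels the cores share.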
OK, point by point. There are no split points or thread management if you use only one CPU, so how can those add overhead when there are no threads to create/manage?
Robert Houdart wrote: Sentinel's answer contains part of the reason; there's also the overhead of split points, hash locking and thread management. Overall it's a 3 to 5% performance hit between the 1_CPU version and the 2_CPU version used with 1 thread.
hyatt wrote: What on earth are you doing to eat 5% of the total cpu time when you have SMP code compiled in but not using more than one thread???
If there wasn't, I wouldn't be presenting different versions for download.
I am not talking about "ugly with respect to coding style". I am talking about "ugly with respect to design and performance issues."
Robert Houdart wrote: Actually the implementation is very clean and rather elegant - C++ templates are the way to go.
hyatt wrote: Ugh. Ignore my previous question. Never considered code would be written in that way. That's so ugly I would never have even considered that idea.
If you could run entirely out of L1 you would be correct. But L1 is tiny, L2 is much larger, and newer i7s add a shared L3 on top - which means N copies of the same code in cache. From someone who has been doing parallel programming for a _long_ time: this really is not the best approach. The ideal is a general-purpose approach that works well on all architectures. I certainly don't see any performance/scaling issues with Crafty.
Robert Houdart wrote: On modern processors all the cores have a separate instruction cache (typically 32 kB or 64 kB).
hyatt wrote: The idea is actually bad for Intel as well. You now have 8 different evaluation blocks of instructions, each taking up a part of cache, where with a more traditional approach you would free up 7/8 of the cache this pollutes.
The fact is that, certainly up to 4-core Intel CPUs (Intel Core 2 Duo and Core i5), Houdini's approach appears to be the most efficient.
Robert
There's a double effect:
- At several locations in the code you need to test whether the engine is running multi-threaded or not.
- The split point and thread management code gets compiled in and breaks some optimization (e.g. register use for local variables); in other words, the surrounding code becomes less efficient.
hyatt wrote: OK, point by point. There are no split points or thread management if you use only one CPU, so how can those add overhead when there are no threads to create/manage?
I have coded both spinlock hashing and lockless hashing and can enable either option with a compiler switch.
hyatt wrote: Hash locking is an ugly idea anyway. The lockless hashing idea is much cleaner and avoids the tons of cache coherency traffic that comes from acquiring locks very frequently.
I don't necessarily agree with the "general-purpose approach". If making different versions for Intel and AMD makes an engine run faster on each platform, why not invest the extra effort?
hyatt wrote: The ideal is a general-purpose approach that works well on all architectures.
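For readers following the hashing debate, here is a minimal sketch of what "lockless hashing" versus "spinlock hashing" can look like behind a compile-time switch, as Robert describes. The entry layout, table size, USE_LOCKLESS macro and lock variable are all illustrative assumptions, not Houdini's or Crafty's actual code.
Code:
// Hedged sketch of the two hashing schemes being compared.
#include <atomic>
#include <cstdint>

struct HashEntry { std::uint64_t key; std::uint64_t data; };

static const std::size_t TABLE_SIZE = 1 << 20;           // illustrative size
static HashEntry table[TABLE_SIZE];
static std::atomic_flag table_lock = ATOMIC_FLAG_INIT;   // spinlock (sketch only)

#define USE_LOCKLESS 1      // hypothetical compile-time switch

void hash_store(std::uint64_t key, std::uint64_t data)
{
    HashEntry* e = &table[key & (TABLE_SIZE - 1)];
#if USE_LOCKLESS
    // Lockless: store key XOR data; a torn write by another thread simply
    // fails verification on probe instead of corrupting the search.
    e->key  = key ^ data;
    e->data = data;
#else
    // Spinlock: serialize access, no XOR and no local copy needed.
    while (table_lock.test_and_set(std::memory_order_acquire)) { }
    e->key  = key;
    e->data = data;
    table_lock.clear(std::memory_order_release);
#endif
}

bool hash_probe(std::uint64_t key, std::uint64_t* data)
{
#if USE_LOCKLESS
    // As noted in the thread, the entry must be copied locally before
    // the XOR check can be made.
    HashEntry e = table[key & (TABLE_SIZE - 1)];
    if ((e.key ^ e.data) == key) { *data = e.data; return true; }
    return false;
#else
    while (table_lock.test_and_set(std::memory_order_acquire)) { }
    HashEntry* e = &table[key & (TABLE_SIZE - 1)];
    bool hit = (e->key == key);
    if (hit) *data = e->data;
    table_lock.clear(std::memory_order_release);
    return hit;
#endif
}
The lockless branch pays with a local copy and two XORs per access, while the spinlock branch pays with lock traffic between cores; which side wins is exactly the measurement question discussed further down the thread.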
Sentinel wrote: Since this is a programming forum, let me try to explain. It's actually quite simple.
kingliveson wrote: I can understand having a separate single-CPU compile and a multi-core one, where the single-CPU build will not have threading code. But as soon as you get to 2 CPUs, the code should not differ for 4 or 8 CPUs. Unfortunately we can't see your source; maybe some day you will release it, and I hope Norm will do the same too. If you are building the executable with Intel processor-specific optimization, that could explain the performance difference when run on a different architecture.
Original Ippo/Robbo doesn't have any POSITION arguments (pointers) in its functions, since it's single-threaded and there is only one POSITION tree. Having one additional argument in each function slows down the whole engine (because of slower function calls) by around 5-10% in the 32-bit version (depending on the exact processor: Intel P4 is around 5%, Intel Core Duo and Q-series quads are closer to 10%, i5/i7 are again around 5%), and by 3-5% in the 64-bit version. On AMD processors the situation is a bit different: there is less than a 5% slowdown in 32-bit, and almost no slowdown in 64-bit.
Now when implementing SMP, one might give all the important functions hard-coded POSITION1, POSITION2, etc., instead of a POSITION pointer argument. This speeds things up when there are only 2 threads, but with 4 or more, the additional overhead (in the SMP implementation) starts to kick in.
Code:
_stacall:
pushl %ebp
movl %esp, %ebp
subl $16, %esp              # reserve 16 bytes of stack for locals
movl $1, -12(%ebp)          # first local  = 1 (stored as an immediate constant)
movl $2, -8(%ebp)           # second local = 2
movl $3, -4(%ebp)           # third local  = 3
movl -8(%ebp), %edx
movl -12(%ebp), %eax
addl %edx, %eax             # eax = first + second
addl -4(%ebp), %eax         # eax += third
leave
ret                         # return 1 + 2 + 3
Code:
_dyncall:
pushl %ebp
movl %esp, %ebp
subl $16, %esp              # reserve stack for a 3-int array and a counter
movl $0, -4(%ebp)           # i = 0
jmp L6
L7:
movl -4(%ebp), %eax         # eax = i (array index)
movl -4(%ebp), %edx
addl $1, %edx               # edx = i + 1
movl %edx, -16(%ebp,%eax,4) # arr[i] = i + 1, via indexed addressing
addl $1, -4(%ebp)           # i++
L6:
cmpl $2, -4(%ebp)
jle L7                      # loop while i <= 2
movl -16(%ebp), %edx        # arr[0]
movl -12(%ebp), %eax        # arr[1]
addl %eax, %edx
movl -8(%ebp), %eax         # arr[2]
leal (%edx,%eax), %eax      # return arr[0] + arr[1] + arr[2]
leave
ret
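To put Sentinel's explanation in C terms, the contrast between a POSITION pointer argument and hard-coded positions is roughly the following. This is a hedged sketch: the struct contents and function bodies are invented, only the POSITION1/POSITION2 naming follows the description above.
Code:
// Sketch only: invented bodies to show the shape of the approaches.
struct POSITION { int material; };

// Single-threaded style: one global tree, no pointer passed around.
POSITION Position;
int evaluate_global() { return Position.material; }

// Generic SMP style: every call carries an extra POSITION* argument,
// which is the 5-10% (32-bit) / 3-5% (64-bit) cost Sentinel measured.
int evaluate_ptr(POSITION* pos) { return pos->material; }

// Hard-coded SMP style (the Houdini-like approach discussed above):
// one copy of the routine per fixed position object.
POSITION Position1, Position2;
int evaluate_pos1() { return Position1.material; }
int evaluate_pos2() { return Position2.material; }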
Not in my implementation. At the bottom of the main search loop, after I complete searching each move, I simply ask "are there any idle processors available?", which is expressed like this (see the sketch at the end of this post):
Robert Houdart wrote: There's a double effect:
hyatt wrote: OK, point by point. There are no split points or thread management if you use only one CPU, so how can those add overhead when there are no threads to create/manage?
- At several locations in the code you need to test whether the engine is running multi-threaded or not
You almost certainly do this anyway, in registers. Not sure why it is a problem. Certainly would not be stored in memory with any good compiler.
- The split point and thread management code gets compiled in and breaks some optimization (e.g. register use for local variables); in other words: the surrounding code becomes less efficient.
I have coded both spinlock hashing and lockless hashing and can enable either option with a compiler switch.
hyatt wrote: Hash locking is an ugly idea anyway. The lockless hashing idea is much cleaner and avoids the tons of cache coherency traffic that comes from acquiring locks very frequently.
Lockless hashing does have a cost: in order to XOR the key with the data, one needs to make a local copy of the hash entry.
Because those are not the only two architectures around, and what they are doing today is not necessarily what will be done tomorrow. For example, for AMD, are you doing _anything_ about NUMA issues? If you have two chips, 1/2 of memory is slower and 1/2 is faster, and how you get stuff paged into memory is critical.
On the other hand, with a spinlock no copying nor XORing is required - and again this makes the surrounding code better optimized.
It may come as a surprise to you, but my testing of Houdini on Core 2 Duo, Pentium 4 and Core i5 shows that up to 4 threads the spinlock approach is slightly more efficient than the lockless hash.
I don't necessarily agree with the "general-purpose approach". If making different versions for Intel and AMD makes an engine run faster on each platform, why not invest the extra effort?
hyatt wrote: The ideal is a general-purpose approach that works well on all architectures.
Robert
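Finally, the single idle-processor test hyatt says is "expressed like this" might look roughly like the sketch below. It is not Crafty's actual source; smp_idle, SplitNode, Tree and MIN_SPLIT_DEPTH are illustrative names, and the #if SMP guard stands in for whatever switch is used to compile the SMP code out.
Code:
// Hedged sketch of the "one test per node" pattern discussed above.
#include <atomic>

#define SMP 1
#define MIN_SPLIT_DEPTH 4

struct Tree { int best_score; };

#if SMP
std::atomic<int> smp_idle(0);              // number of idle helper threads

// Stub: in a real engine this would hand the node's remaining moves
// to the idle threads and collect their results.
static int SplitNode(Tree* /*tree*/, int /*depth*/) { return 0; }
#endif

int search(Tree* tree, int depth, int alpha, int beta)
{
    int moves_searched = 0;
    // ... generate and search moves here, updating alpha / best_score ...
    moves_searched++;

#if SMP
    // The single extra branch per node: with one thread smp_idle stays 0,
    // so this is a perfectly predicted, never-taken branch (hence the
    // ~0.2% figure hyatt quotes when the SMP code is compiled in but unused).
    if (smp_idle.load() > 0 && depth >= MIN_SPLIT_DEPTH && moves_searched > 0) {
        if (SplitNode(tree, depth))
            return tree->best_score;
    }
#endif
    (void)alpha; (void)beta;
    return tree->best_score;
}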