Page 1 of 2

Multicore Detection

Posted: Fri Jun 11, 2010 11:34 pm
by kingliveson
Does Houdini need 1, 2, 4, 8 CPU versions? You can simply get the number of cores for each system with a few lines of code.

Code: Select all

int GetNumCPU() 
{
  SYSTEM_INFO si;
  GetSystemInfo( &si );
  CPUS_SIZE = si.dwNumberOfProcessors;
  return CPUS_SIZE;
}
Or perhaps there's good reason for having the multiple versions...I don't know.

Re: Multicore Detection

Posted: Sat Jun 12, 2010 10:43 am
by Robert Houdart
As you can see by comparing the size of the EXE files, the different versions of Houdini contain different code, or more exactly stated, a different number of copies of the same code.

The 1_CPU version doesn't contain any of the hash-locking, thread or split point management code that is required for multi-threading. This results in a more efficient compile, for the 32-bit version we're talking about approx. 5% speed gain.

The multi-core versions contain duplicates of all the evaluation and move playing code, with each copy of the code being hard-wired to an individual internal board representation. For example, the 8_CPU version has 8 copies of the "evaluate" routine, each copy working on its own board representation.

This gives a nice performance boost on Intel processors, but unfortunately, as I'm finding out, this architecture doesn't seem very well adapted to AMD Phenom II processors. I may have to go back to a more traditional architecture, sacrifying some speed on Intel for gaining some speed on AMD.

Houdini does use the number of processors reported by the system to limit the number of threads actually in use - in other words, the 8_CPU version will never use more than 2 cores on a dual core computer.

Robert

Re: Multicore Detection

Posted: Sat Jun 12, 2010 12:27 pm
by kingliveson
Robert Houdart wrote:As you can see by comparing the size of the EXE files, the different versions of Houdini contain different code, or more exactly stated, a different number of copies of the same code.

The 1_CPU version doesn't contain any of the hash-locking, thread or split point management code that is required for multi-threading. This results in a more efficient compile, for the 32-bit version we're talking about approx. 5% speed gain.

The multi-core versions contain duplicates of all the evaluation and move playing code, with each copy of the code being hard-wired to an individual internal board representation. For example, the 8_CPU version has 8 copies of the "evaluate" routine, each copy working on its own board representation.

This gives a nice performance boost on Intel processors, but unfortunately, as I'm finding out, this architecture doesn't seem very well adapted to AMD Phenom II processors. I may have to go back to a more traditional architecture, sacrifying some speed on Intel for gaining some speed on AMD.

Houdini does use the number of processors reported by the system to limit the number of threads actually in use - in other words, the 8_CPU version will never use more than 2 cores on a dual core computer.
Robert
I can understand having a separate single CPU compile then a multi-core. Where the single CPU will not have threading code. But soon as you get to 2 CPUs, the code should not differ for 4, or 8 CPUs. Unfortunately we can't see your source maybe some day you will release it and I hope Norm will do the same too. If you are building the executable with Intel- Processor Specific optimization that could explain performance difference when ran on different architecture.

Re: Multicore Detection

Posted: Sat Jun 12, 2010 5:22 pm
by Sentinel
kingliveson wrote:I can understand having a separate single CPU compile then a multi-core. Where the single CPU will not have threading code. But soon as you get to 2 CPUs, the code should not differ for 4, or 8 CPUs. Unfortunately we can't see your source maybe some day you will release it and I hope Norm will do the same too. If you are building the executable with Intel- Processor Specific optimization that could explain performance difference when ran on different architecture.
Since this is a programming forum, let me try to explain. It's actually quite simple.
Original Ippo/Robbo doesn't have any POSITION arguments (pointers) in functions, since it's single threaded and there is only one POSITION tree. Having one additional argument in each function, slows down whole engine (because of slower function calls) for around 5-10% in 32bit version (depending on exact processor, Intel P4 is around 5%, Intel Core Duos and Q series quads are closer to 10%, i5/i7 are again around 5%), and 3-5% on 64bit version. On AMD processors situation is a bit different and there is less than 5% slow down in 32 bit, and almost no slow down in 64bit.
Now when implementing SMP, one might have all the important functions with hard-coded POSITION1, POSITION2 etc., instead of POSITION pointer function argument. This speeds up things when having only 2 threads, but with 4 or more, the additional overhead (in SMP implementation) starts to kick in.

Re: Multicore Detection

Posted: Sat Jun 12, 2010 8:05 pm
by hyatt
Robert Houdart wrote:As you can see by comparing the size of the EXE files, the different versions of Houdini contain different code, or more exactly stated, a different number of copies of the same code.

The 1_CPU version doesn't contain any of the hash-locking, thread or split point management code that is required for multi-threading. This results in a more efficient compile, for the 32-bit version we're talking about approx. 5% speed gain.
That number is amazing to me. In Crafty, there is exactly one if-test done each node in the parallel version. Using just one thread, that test becomes 100% correct branch prediction. As a result, I see about 0.2% (at worst, usually less) improvement by compiling out the SMP code.

What on earth are you doing to eat 5% of the total cpu time when you have SMP code compiled in but not using more than one thread???

The multi-core versions contain duplicates of all the evaluation and move playing code, with each copy of the code being hard-wired to an individual internal board representation. For example, the 8_CPU version has 8 copies of the "evaluate" routine, each copy working on its own board representation.
Ugh. Ignore my previous question. Never considered code would be written in that way. That's so ugly I would never have even considered that idea.

This gives a nice performance boost on Intel processors, but unfortunately, as I'm finding out, this architecture doesn't seem very well adapted to AMD Phenom II processors. I may have to go back to a more traditional architecture, sacrifying some speed on Intel for gaining some speed on AMD.
The idea is actually bad for Intel as well. You now have 8 different evaluation blocks of instructions, each taking up a part of cache, where with a more traditional approach you would free up 7/8 of the cache this pollutes.

Houdini does use the number of processors reported by the system to limit the number of threads actually in use - in other words, the 8_CPU version will never use more than 2 cores on a dual core computer.

Robert

Re: Multicore Detection

Posted: Sat Jun 12, 2010 9:08 pm
by Robert Houdart
hyatt wrote:What on earth are you doing to eat 5% of the total cpu time when you have SMP code compiled in but not using more than one thread???
Sentinel's answer contains part of the reason, there's also the overhead of split-points, hash locking and thread management. Over-all 3 to 5% performance hit between the 1_CPU version and the 2_CPU version used at 1 thread.
If there wasn't, I wouldn't be presenting different versions for download.
hyatt wrote:Ugh. Ignore my previous question. Never considered code would be written in that way. That's so ugly I would never have even considered that idea.
Actually the implementation is very clean and rather elegant - C++ templates are the way to go.
hyatt wrote:The idea is actually bad for Intel as well. You now have 8 different evaluation blocks of instructions, each taking up a part of cache, where with a more traditional approach you would free up 7/8 of the cache this pollutes.
On modern processors all the cores have a separate instruction cache (typically 32 kB or 64 kB).
The fact is that certainly up to 4-core Intel CPU's (Intel Core 2 Duo and Core i5) Houdini's approach appears to be the most efficient.

Robert

Re: Multicore Detection

Posted: Sat Jun 12, 2010 11:15 pm
by hyatt
Robert Houdart wrote:
hyatt wrote:What on earth are you doing to eat 5% of the total cpu time when you have SMP code compiled in but not using more than one thread???
Sentinel's answer contains part of the reason, there's also the overhead of split-points, hash locking and thread management. Over-all 3 to 5% performance hit between the 1_CPU version and the 2_CPU version used at 1 thread.
If there wasn't, I wouldn't be presenting different versions for download.
OK, point by point. There are no split points or thread management if you use only one CPU, so how can those add overhead when there is no threads to create/manage?

Hash locking is an ugly idea anyway. The lockless hashing idea is much cleaner and avoids the tons of cache coherency that comes from acquiring locks very frequently.

It is pretty easy to compare SMP and non-SMP crafty to see how low this overhead can be. The only reason I offer a non-SMP compile option is to pick up that last 0.2% that some might want.
hyatt wrote:Ugh. Ignore my previous question. Never considered code would be written in that way. That's so ugly I would never have even considered that idea.
Actually the implementation is very clean and rather elegant - C++ templates are the way to go.
I am not talking about "ugly with respect to coding style". I am talking about "ugly with respect to design and performance issues."
hyatt wrote:The idea is actually bad for Intel as well. You now have 8 different evaluation blocks of instructions, each taking up a part of cache, where with a more traditional approach you would free up 7/8 of the cache this pollutes.
On modern processors all the cores have a separate instruction cache (typically 32 kB or 64 kB).
The fact is that certainly up to 4-core Intel CPU's (Intel Core 2 Duo and Core i5) Houdini's approach appears to be the most efficient.

Robert
If you can run out of L1 you would be correct. But L1 is tiny. L2 is much larger, and newer i7's can have a L3 if you want. Shared. Which means N copies of the same code in cache... From someone that has been doing parallel programming for a _long_ time, this really is not the best approach. The ideal is a general-purpose approach that works well on all architectures. I certainly don't see any performance/scaling issues with Crafty.

Re: Multicore Detection

Posted: Sun Jun 13, 2010 1:04 am
by Robert Houdart
hyatt wrote:OK, point by point. There are no split points or thread management if you use only one CPU, so how can those add overhead when there is no threads to create/manage?
There's a double effect:
- At several locations in the code you need to test whether the engine is running multi-threaded or not
- The split point and thread managament code gets compiled in and breaks some optimization (e.g. register use for local variables); in other words: the surrounding code becomes less efficient.
hyatt wrote:Hash locking is an ugly idea anyway. The lockless hashing idea is much cleaner and avoids the tons of cache coherency that comes from acquiring locks very frequently.
I have coded both spinlock hashing and lockless hashing and can enable either option with a compiler switch.
Lockless hashing does have a cost: in order to XOR the key with the data, one needs to make a local copy of the hash entry. On the other hand, with a spinlock no copying nor XORing is required - and again this makes the surrounding code better optimized.
It may come as a surprise to you, but my testing of Houdini on Core 2 Duo, Pentium 4 and Core i5 shows that up to 4 threads the spinlock approach is slightly more efficient than the lockless hash.
hyatt wrote:The ideal is a general-purpose approach that works well on all architectures.
I don't necessarily agree with the "general-purpose approach". If making different versions for Intel and AMD makes an engine run faster on each platform, why not invest the extra effort?

Robert

Re: Multicore Detection

Posted: Sun Jun 13, 2010 1:41 am
by kingliveson
Sentinel wrote:
kingliveson wrote:I can understand having a separate single CPU compile then a multi-core. Where the single CPU will not have threading code. But soon as you get to 2 CPUs, the code should not differ for 4, or 8 CPUs. Unfortunately we can't see your source maybe some day you will release it and I hope Norm will do the same too. If you are building the executable with Intel- Processor Specific optimization that could explain performance difference when ran on different architecture.
Since this is a programming forum, let me try to explain. It's actually quite simple.
Original Ippo/Robbo doesn't have any POSITION arguments (pointers) in functions, since it's single threaded and there is only one POSITION tree. Having one additional argument in each function, slows down whole engine (because of slower function calls) for around 5-10% in 32bit version (depending on exact processor, Intel P4 is around 5%, Intel Core Duos and Q series quads are closer to 10%, i5/i7 are again around 5%), and 3-5% on 64bit version. On AMD processors situation is a bit different and there is less than 5% slow down in 32 bit, and almost no slow down in 64bit.
Now when implementing SMP, one might have all the important functions with hard-coded POSITION1, POSITION2 etc., instead of POSITION pointer function argument. This speeds up things when having only 2 threads, but with 4 or more, the additional overhead (in SMP implementation) starts to kick in.

Code: Select all

_stacall:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$16, %esp
	movl	$1, -12(%ebp)
	movl	$2, -8(%ebp)
	movl	$3, -4(%ebp)
	movl	-8(%ebp), %edx
	movl	-12(%ebp), %eax
	addl	%edx, %eax
	addl	-4(%ebp), %eax
	leave
	ret

Code: Select all

_dyncall:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$16, %esp
	movl	$0, -4(%ebp)
	jmp	L6
L7:
	movl	-4(%ebp), %eax
	movl	-4(%ebp), %edx
	addl	$1, %edx
	movl	%edx, -16(%ebp,%eax,4)
	addl	$1, -4(%ebp)
L6:
	cmpl	$2, -4(%ebp)
	jle	L7
	movl	-16(%ebp), %edx
	movl	-12(%ebp), %eax
	addl	%eax, %edx
	movl	-8(%ebp), %eax
	leal	(%edx,%eax), %eax
	leave
	ret
To your point regarding hard-coding rather than function generating, yes It will give you a _minor_ boost in speed. The point am making is that threads are still in use. So why clone a function 2, 4, 8 times if overhead is negligible when a single function with multiple calls gets it done. It just seems inefficient to me. I don't see a 5% speed reduction at all. Codes above do exactly the same thing. For simplicity sake, say each line is a clock cycle. if both codes ran 1,000, 10,000, or 100,000 times on a 800 MHz Processor, do you really think the extra lines of code make a difference? I don't think there is need for multiple builds. Will continue to test his releases as they progress anyways.

Re: Multicore Detection

Posted: Sun Jun 13, 2010 5:02 am
by hyatt
Robert Houdart wrote:
hyatt wrote:OK, point by point. There are no split points or thread management if you use only one CPU, so how can those add overhead when there is no threads to create/manage?
There's a double effect:
- At several locations in the code you need to test whether the engine is running multi-threaded or not
Not in my implementation. At the bottom of the main search loop, after I complete searching each move, I simply ask "are there any idle processors available?" Which is expressed like this:

if (idle_processors) {

then do some parallel stuff here.

}

So once for each node searched (excepting q-search, so in reality about once every 5-10 nodes I ask that question. With no extra threads, this is always false since there are no idle threads, and the branch gets predicted correctly 100% of the time. Crafty executes maybe 5,000 instructions per node overall, so we make that 5002 for 10-20% of the nodes. It really is tiny. All my non-SMP code excludes is that test, which is just two instructions.

- The split point and thread managament code gets compiled in and breaks some optimization (e.g. register use for local variables); in other words: the surrounding code becomes less efficient.
hyatt wrote:Hash locking is an ugly idea anyway. The lockless hashing idea is much cleaner and avoids the tons of cache coherency that comes from acquiring locks very frequently.
I have coded both spinlock hashing and lockless hashing and can enable either option with a compiler switch.
Lockless hashing does have a cost: in order to XOR the key with the data, one needs to make a local copy of the hash entry.
You almost certainly do this anyway, in registers. Not sure why it is a problem. Certainly would not be stored in memory with any good compiler.
On the other hand, with a spinlock no copying nor XORing is required - and again this makes the surrounding code better optimized.
It may come as a surprise to you, but my testing of Houdini on Core 2 Duo, Pentium 4 and Core i5 shows that up to 4 threads the spinlock approach is slightly more efficient than the lockless hash.
hyatt wrote:The ideal is a general-purpose approach that works well on all architectures.
I don't necessarily agree with the "general-purpose approach". If making different versions for Intel and AMD makes an engine run faster on each platform, why not invest the extra effort?

Robert
Because those are not the only two architectures around. And what they are doing today is not necessarily what will be done tomorrow. For example, for AMD, are you doing _anything_ about NUMA issues. If you have two chips, 1/2 of memory is slower and 1/2 is faster, and how you get stuff paged into memory is critical.