Page 1 of 1
Houdini and 100% cpu utilization
Posted: Sun Oct 07, 2012 4:48 am
by kevinfat
I've noticed that Houdini achieves 100% cpu utilization across all my cores. And other engines such as Rybka and Stockfish don't. What is holding back Rybka and Stockfish from achieving 100% cpu utilization?
Re: Houdini and 100% cpu utilization
Posted: Mon Oct 08, 2012 9:12 pm
by syzygy
kevinfat wrote:I've noticed that Houdini achieves 100% cpu utilization across all my cores. And other engines such as Rybka and Stockfish don't. What is holding back Rybka and Stockfish from achieving 100% cpu utilization?
This is not necessarily a sign of something holding back Rybka and Stockfish. It could be that Houdini is using spinlocks in places where Rybka and Stockfish are using less cpu intensive synchronisation primitives. Achieving 100% cpu utilization on all cores is very easy: just spawn a large enough number of threads that do nothing but loop.
Re: Houdini and 100% cpu utilization
Posted: Mon Oct 08, 2012 9:14 pm
by BB+
My impression is that the issue is what the engine does when there is no "obvious" things to do (from tree-splitting).
There are a few options that I can see:
- Let the thread sleep for a brief time, like 0.01s. This gives the thread back to the OS, and so reduces cpu utilisation in cases where the engine cannot find any reasonable splittings. If there are still no good splittings after the brief sleep, then sleep again. The downside is that if a good splitting becomes available during the sleep, one would have to wait until the sleep is complete (there is no "wakeup" in the most basic functionality here, at least without some sort of signal handling --- in which case one might as well use the next option).
- Enter a "wait" state (pthread_cond_wait or WaitForSingleObject for example, I think) that will be awakened when a good splitting is found. This allows the OS to do another things with the thread, and should again reduce cpu utilisation in cases where the engine can't find splittings. Any wakeup should be quite fast (about the time of a context switch?).
- Enter a spinlock that will be awakened when a good splitting is found. This keeps cpu utilisation at 100%, though artificially. I'd guess it should be marginally faster in waking up the thread.
- Just search anyway, hoping that any hash effects will be beneficial if nothing else. This should make cpu utilisation be quite near 100%, with at least some benefit (however minor).
I think #2 and #4 are the most reasonable options. #1 and #3 seem to me to be inferior to #2, but maybe I am missing something. OTOH, #4 is a totally different
type of scheme, in that the engine tries to something, rather than do nothing (as in #1-3 -- those options are largely saying
how to do nothing).
Re: Houdini and 100% cpu utilization
Posted: Tue Oct 09, 2012 1:32 am
by hyatt
BB+ wrote:My impression is that the issue is what the engine does when there is no "obvious" things to do (from tree-splitting).
There are a few options that I can see:
- Let the thread sleep for a brief time, like 0.01s. This gives the thread back to the OS, and so reduces cpu utilisation in cases where the engine cannot find any reasonable splittings. If there are still no good splittings after the brief sleep, then sleep again. The downside is that if a good splitting becomes available during the sleep, one would have to wait until the sleep is complete (there is no "wakeup" in the most basic functionality here, at least without some sort of signal handling --- in which case one might as well use the next option).
- Enter a "wait" state (pthread_cond_wait or WaitForSingleObject for example, I think) that will be awakened when a good splitting is found. This allows the OS to do another things with the thread, and should again reduce cpu utilisation in cases where the engine can't find splittings. Any wakeup should be quite fast (about the time of a context switch?).
- Enter a spinlock that will be awakened when a good splitting is found. This keeps cpu utilisation at 100%, though artificially. I'd guess it should be marginally faster in waking up the thread.
- Just search anyway, hoping that any hash effects will be beneficial if nothing else. This should make cpu utilisation be quite near 100%, with at least some benefit (however minor).
I think #2 and #4 are the most reasonable options. #1 and #3 seem to me to be inferior to #2, but maybe I am missing something. OTOH, #4 is a totally different
type of scheme, in that the engine tries to something, rather than do nothing (as in #1-3 -- those options are largely saying
how to do nothing).
Couple of notes.
(1) as you mentioned, the time taken to exit a spin wait (not really a spin lock as nothing needs to be locked... in Crafty I just do something like this:
while (!work_pointer[thread]);
which hangs until a split has been done and work_pointer[thread] was set to point to the correct split block. Benefit is instant thread start-up when work is available. The only cost is the cost of the remote cache (where you set work_pointer[thread] to a pointer) forwarding the modified cache block to your local cache so that you can test the value for zero. We are talking nanoseconds. Downside is smoking a cpu while it is doing no useful work. Dissipates heat, and burns cpu cycles that could be given to something else. I do this because I don't run anything else during a chess game except Crafty, and I even have minimal daemons running (linux) to further reduce the demands that can cause excessive context switches even if not spinning in the above loop.
(2) using any sort of blocking mechanism is quite expensive. Blocking a process and then unblocking is a sure context switch - 2 times in fact. Time is microseconds to milliseconds. Not a lot of loss, but if you do a lot of splits, it adds up. Benefit is that the cpu is idle when the thread is waiting on work to do. Down-side is it takes much longer before the thread gets back to work.
Even worse is using a non-spinlock for locking resources. blocking locks are REALLY bad if you do a significant amount of locking (such as when splitting, unsplitting, selecting next move at a split point, etc... If you use blocking mutex-type locks, you really can't split very near the tips, which hurts load balancing.