
OpenCL: Distinguishing computation failure from TDR interrupt


Is there a (semi) reliable way to distinguish between an "Out of Resources" caused by TDR and an "Out of Resources" caused by other problems?

1)

If you can read

KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrDelay
ValueType : REG_DWORD
ValueData : Number of seconds to delay. 2 seconds is the default value.

through WMI, multiply it by

KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrLimitCount
ValueType : REG_DWORD
ValueData : Number of TDRs before crashing. The default value is 5.

which you can also read through WMI. With the defaults, the product is 10 seconds. You should also read

KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrLimitTime
ValueType : REG_DWORD
ValueData : Number of seconds before crashing. 60 seconds is the default value.

which should come back as 60 seconds from WMI.

On this example computer, that means a TDR fires after each 2-second delay, and up to 5 of them are tolerated within the 60-second window before the final crash. From the application, you can then check whether your last stopwatch reading exceeded those limits. If it did, the failure is probably a TDR. On top of these there is also a time limit for a thread to leave the driver,

KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrDdiDelay
ValueType : REG_DWORD
ValueData : Number of seconds to leave the driver. 5 seconds is the default value.

which is 5 seconds by default. Accessing an invalid memory segment should fail more quickly than that. You could perhaps raise these TDR time limits through WMI, up to some minutes, so the program can compute without crashing because of preemption starvation. But changing the registry can be dangerous: if you set the TDR time limit to 1 second or a fraction of it, for example, Windows may never boot without constant TDR crashes. Just reading those values is safer.
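The stopwatch check described above could be sketched roughly like this. It is only a heuristic, not a definitive test: `classifyFailure` and `FailureCause` are hypothetical names, and the 2-second default stands in for the TdrDelay value you would actually read from the registry.

```cpp
#include <chrono>

// Hypothetical result type: a guess, not a guaranteed diagnosis.
enum class FailureCause { ProbablyTdr, ProbablyOther };

// Guess the cause of an "Out of Resources" error from how long the kernel
// had been running when the error was reported.
// Assumption: tdrDelaySeconds mirrors the TdrDelay registry value
// (2 seconds by default); read the real value on the target machine.
FailureCause classifyFailure(std::chrono::duration<double> elapsed,
                             double tdrDelaySeconds = 2.0) {
    // A kernel that was killed at or past the TDR delay was probably reset
    // by the watchdog; an invalid memory access usually fails much sooner.
    return elapsed.count() >= tdrDelaySeconds ? FailureCause::ProbablyTdr
                                              : FailureCause::ProbablyOther;
}
```

In practice you would take a `std::chrono::steady_clock::now()` reading before enqueueing the kernel and another when the error is returned, then pass the difference to `classifyFailure`.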

2)

Separate the total work into much smaller parts. If the data is not separable, copy it once, then enqueue the long-running kernel as very short-ranged kernels n times, with some waiting between any two.

That way you can be sure TDR is eliminated. If this version runs but the long-running kernel doesn't, it is a TDR fault. If it is the opposite, it is a memory crash. It looks like this:

short running x 1024 times
long running
long running <---- fail? TDR! because memory would crash short ver. too!
long running

another try:

short running x 1024 times <---- fail? memory! because only 1ms per kernel
long running
long running
long running
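As a rough sketch of the splitting idea, the helper below cuts one large range into (offset, size) pieces that could be passed as `global_work_offset` and `global_work_size` to `clEnqueueNDRangeKernel`, one short launch per piece. `splitRange`, `Diagnosis` and `diagnose` are hypothetical names, and the decision table simply encodes the two outcomes shown above. If you also pass an explicit local work size, you would probably want the chunk size to be a multiple of it.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split `total` work items into (offset, size) pieces of at most `chunk`
// items each. Each piece is meant to become one short kernel launch.
std::vector<std::pair<std::size_t, std::size_t>> splitRange(std::size_t total,
                                                            std::size_t chunk) {
    std::vector<std::pair<std::size_t, std::size_t>> parts;
    if (chunk == 0) return parts;  // nothing sensible to do with a zero chunk
    for (std::size_t offset = 0; offset < total; offset += chunk) {
        parts.emplace_back(offset, std::min(chunk, total - offset));
    }
    return parts;
}

enum class Diagnosis { NoFailure, ProbablyTdr, ProbablyMemory };

// Interpret the experiment: short launches failing points at memory,
// only the long launch failing points at TDR.
Diagnosis diagnose(bool shortRunsOk, bool longRunOk) {
    if (!shortRunsOk) return Diagnosis::ProbablyMemory;
    if (!longRunOk) return Diagnosis::ProbablyTdr;
    return Diagnosis::NoFailure;
}
```

For example, `splitRange(10, 4)` yields the pieces (0, 4), (4, 4) and (8, 2).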

Alternately, can I at least reliably (in Java / through OpenCL API) determine that the GPU used for computation is also running the display?

1)

Use interoperability properties of both devices:

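// taken from Intel's site:
// (props, devNum and bytes are set up earlier and are not shown here)
std::vector<cl_device_id> devs(devNum);
// reading the info: fills devs with the devices that can share with the GL context
clGetGLContextInfoKHR(props, CL_DEVICES_FOR_GL_CONTEXT_KHR, bytes, devs.data(), NULL);

This gives the list of interoperable devices. Take the device ID from it so you can exclude that device if you don't want to use it.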
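Excluding the interoperable devices from the full device list could look something like the sketch below. `excludeInteropDevices` is a hypothetical helper, and the template parameter `DeviceId` stands in for `cl_device_id` so the example does not depend on the OpenCL headers.

```cpp
#include <algorithm>
#include <vector>

// Return every device from `all` that does not appear in `interop`.
// Assumption: DeviceId is comparable with ==, as cl_device_id handles are.
template <typename DeviceId>
std::vector<DeviceId> excludeInteropDevices(const std::vector<DeviceId>& all,
                                            const std::vector<DeviceId>& interop) {
    std::vector<DeviceId> computeOnly;
    for (const DeviceId& device : all) {
        // Keep the device only if the GL context cannot share with it.
        if (std::find(interop.begin(), interop.end(), device) == interop.end()) {
            computeOnly.push_back(device);
        }
    }
    return computeOnly;
}
```

Here `all` would come from `clGetDeviceIDs` and `interop` from the `clGetGLContextInfoKHR` query above.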

2)

Have another thread run some static-object drawing code in OpenGL or DirectX to keep one of the GPUs busy. Then, from a different thread, test all GPUs simultaneously with some trivial OpenCL kernels. Test:

  • OpenGL starts drawing something with a high triangle count at 60 fps.
  • Start the devices for OpenCL compute and get the average kernel executions per second (keps).
  • device 1: 30 keps
  • device 2: 40 keps
  • After a while, stop OpenGL and close its windows (if not already closed).
  • device 1: 75 keps -----> highest increase in percentage! --> display!!!
  • device 2: 41 keps ----> not as high increase but it can

You should not copy any data between devices while doing this, so that the CPU and RAM do not become the bottleneck.
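Picking the device with the highest percentage increase from those measurements could be sketched as follows. `likelyDisplayDevice` is a hypothetical helper, and the result is only a hint: a device that is shared with something else may also speed up.

```cpp
#include <cstddef>
#include <vector>

// Given kernel executions per second (keps) measured while OpenGL is drawing
// and again after it has stopped, return the index of the device with the
// largest relative increase, or -1 if the input is unusable.
int likelyDisplayDevice(const std::vector<double>& kepsWhileDrawing,
                        const std::vector<double>& kepsIdle) {
    if (kepsWhileDrawing.size() != kepsIdle.size()) return -1;
    int best = -1;
    double bestIncrease = 0.0;
    for (std::size_t i = 0; i < kepsIdle.size(); ++i) {
        if (kepsWhileDrawing[i] <= 0.0) continue;  // avoid dividing by zero
        // Relative increase, e.g. 30 -> 75 keps is 1.5 (150%).
        double increase = (kepsIdle[i] - kepsWhileDrawing[i]) / kepsWhileDrawing[i];
        if (best == -1 || increase > bestIncrease) {
            best = static_cast<int>(i);
            bestIncrease = increase;
        }
    }
    return best;
}
```

With the numbers from the list above (30 to 75 keps and 40 to 41 keps), the first device shows a 150% increase against 2.5% for the second, so index 0 is returned.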

3)

If the data is separable, you can use a divide-and-conquer algorithm that hands each GPU a piece of work only when that GPU is available. This leaves the display GPU more flexibility. It is a performance-aware solution, similar to the short-running version above, except that the scheduling is done across multiple GPUs.
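The scheduling idea can be illustrated with a small simulation in which each chunk goes to whichever device becomes free first. This is a sketch under assumptions, not a real scheduler: `distributeChunks` is a hypothetical name, and the per-chunk times are made-up inputs, whereas a real program would have one host thread per device pulling the next chunk when its command queue finishes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simulate handing out `chunkCount` equal chunks to devices, always giving
// the next chunk to the device that becomes free earliest.
// secondsPerChunk[i] is the assumed time device i needs for one chunk.
// Returns how many chunks each device ended up processing.
std::vector<int> distributeChunks(const std::vector<double>& secondsPerChunk,
                                  int chunkCount) {
    std::vector<int> assigned(secondsPerChunk.size(), 0);
    if (secondsPerChunk.empty()) return assigned;
    std::vector<double> freeAt(secondsPerChunk.size(), 0.0);
    for (int c = 0; c < chunkCount; ++c) {
        // min_element returns the first minimum, so ties go to the lowest index.
        std::size_t next = static_cast<std::size_t>(
            std::min_element(freeAt.begin(), freeAt.end()) - freeAt.begin());
        freeAt[next] += secondsPerChunk[next];
        ++assigned[next];
    }
    return assigned;
}
```

A device that is slowed down by display work simply asks for chunks less often, so it receives a smaller share without any explicit detection step.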

4)

I couldn't check this because I sold my second GPU, but you should try

CL_DEVICE_TYPE_DEFAULT

on your multi-GPU system to test whether it returns the display GPU or not. Shut down the PC, plug the monitor cable into the other card, and try again. Shut down, swap the cards' slots, and try again. Shut down, remove one of the cards so that only 1 GPU and 1 CPU are left, and try again. If all of these return only the display GPU, then it is probably marking the display GPU as the default.