How to interpret TensorFlow output?

About NUMA -- https://software.intel.com/en-us/articles/optimizing-applications-for-numa

Roughly speaking, on a dual-socket machine each CPU has its own memory and has to reach the other processor's memory through a slower QPI link. So each CPU+memory pair is a NUMA node.
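As a quick check (a minimal sketch, assuming Linux, where the kernel exposes the topology under /sys/devices/system/node), you can count the NUMA nodes yourself:

    import glob

    # Each nodeN directory corresponds to one NUMA node; its cpulist file
    # shows which CPUs belong to that node.
    nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))
    print("NUMA nodes:", len(nodes))
    for node in nodes:
        with open(node + "/cpulist") as f:
            print(node.rsplit("/", 1)[-1], "-> CPUs", f.read().strip())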

Potentially you could treat the two NUMA nodes as two different devices and structure your network to optimize for the different within-node/between-node bandwidths.

However, I don't think TensorFlow has enough wiring to do this right now, and the detection doesn't work either: I just tried on a machine with 2 NUMA nodes, and it still printed the same message and initialized to 1 NUMA node.

DMA = Direct Memory Access. You could potentially copy data from one GPU to another without going through the CPU (i.e., over NVLink). NVLink integration isn't there yet.

As for the error: TensorFlow tries to allocate close to the GPU's maximum memory, so it sounds like some of your GPU memory has already been allocated by something else and the allocation failed.

You can do something like the following to avoid allocating so much memory:

    import tensorflow as tf

    config = tf.ConfigProto(log_device_placement=True)
    config.gpu_options.per_process_gpu_memory_fraction = 0.3  # don't hog all vRAM
    config.operation_timeout_in_ms = 15000  # terminate on long hangs
    sess = tf.InteractiveSession("", config=config)
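If a fixed fraction is awkward to tune, an alternative sketch using the same TF 1.x config API is to let TensorFlow grow its allocation on demand rather than reserving a share up front:

    import tensorflow as tf

    # allow_growth starts with a small allocation and expands it as needed,
    # instead of grabbing nearly all vRAM at session creation.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)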


  • successfully opened CUDA library xxx locally means that the library was loaded, but it does not mean that it will be used.
  • successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero means that your kernel does not have NUMA support (see the sketch after this list). You can read more about NUMA in the Intel article linked above.
  • Found device 0 with properties: you have 1 GPU which you can use. It lists the properties of this GPU.
  • DMA is direct memory access. More information on Wikipedia.
  • failed to allocate 11.15G: the error clearly explains why this happened, but it is hard to tell why you need so much memory without looking at the code.
  • pool allocator messages are explained in this answer.
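For the NUMA message specifically, here is a sketch (assuming Linux and NVIDIA's PCI vendor ID 0x10de) that reproduces the sysfs read TensorFlow performs per GPU; a value of -1 is what triggers the "negative value (-1)" log line:

    import glob
    import os

    for dev in glob.glob("/sys/bus/pci/devices/*"):
        vendor_path = os.path.join(dev, "vendor")
        if not os.path.exists(vendor_path):
            continue
        with open(vendor_path) as f:
            if f.read().strip() != "0x10de":  # NVIDIA's PCI vendor ID
                continue
        with open(os.path.join(dev, "numa_node")) as f:
            value = f.read().strip()
        # -1 means the kernel reports no NUMA affinity for this device
        print(dev, "numa_node =", value)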