
MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen


The message "successful NUMA node read from SysFS had negative value (-1)" is printed by the code shown below, and it is only a warning, not a fatal error. The real error is the MemoryError raised in your File "model_new.py", line 85, in <module>. We would need to see more of your source to diagnose it. Try making your model smaller, or run it on a server with more RAM.


About the NUMA node warning:

https://github.com/tensorflow/tensorflow/blob/e4296aefff97e6edd3d7cee9a09b9dd77da4c034/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc#L855

    // Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
    // of SysFS. Returns -1 if it cannot...
    static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
    ...
      string filename =
          port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
      FILE *file = fopen(filename.c_str(), "r");
      if (file == nullptr) {
        LOG(ERROR) << "could not open file to read NUMA node: " << filename
                   << "\nYour kernel may have been built without NUMA support.";
        return kUnknownNumaNode;
      }
     ...
      if (port::safe_strto32(content, &value)) {
        if (value < 0) {  // See http://b/18228951 for details on this path.
          LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
                    << value << "), but there must be at least one NUMA node"
                                ", so returning NUMA node zero";
          fclose(file);
          return 0;
        }

TensorFlow was able to open the /sys/bus/pci/devices/%s/numa_node file, where %s is the PCI bus id of the GPU (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your PC is not multisocket; there is only a single CPU socket with an 8-core Xeon E5-2670 installed, so the value in this file should be 0 (a single NUMA node is numbered 0 in Linux), but the message says that the file contained -1!
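
The fallback in the excerpt above can be mirrored in a few lines of Python to see what happens to the file contents. This is only a sketch of the logic visible in the excerpt, not TensorFlow's actual code path: the function and constant names are made up for illustration, the value -1 for kUnknownNumaNode is inferred from the "Returns -1" comment, and the handling of unparsable content is an assumption because that part of the C++ function is elided.

```python
# Hypothetical helper mirroring the excerpted C++ logic; not part of TensorFlow.
UNKNOWN_NUMA_NODE = -1  # stands in for kUnknownNumaNode (value assumed)


def numa_node_from_sysfs_text(content):
    """Map the text of a numa_node file to the node number that would be used."""
    try:
        value = int(content.strip())
    except ValueError:
        # Assumption: the elided C++ code treats unparsable content as unknown.
        return UNKNOWN_NUMA_NODE
    if value < 0:
        # Same branch as the LOG(INFO) message: a negative value becomes node 0.
        return 0
    return value


if __name__ == "__main__":
    for text in ("-1\n", "0\n", "1\n"):
        print(repr(text), "->", numa_node_from_sysfs_text(text))
```

So a file containing -1 ends up being treated as node 0, which is why the message is harmless for the computation itself.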

So we know that sysfs is mounted at /sys, that the numa_node special file exists, and that CONFIG_NUMA is enabled in your Linux kernel config (check with zgrep NUMA /boot/config* /proc/config*). It is indeed enabled: CONFIG_NUMA=y is set in the deb of your x86_64 4.4.0-78-generic kernel.
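
The same check can be done from Python if that is more convenient. The sketch below assumes the usual locations /boot/config-<release> and /proc/config.gz, which not every distribution provides; the helper names are hypothetical.

```python
import gzip
import os
import platform


def config_numa_enabled(config_text):
    """Return True if the kernel config text contains the line CONFIG_NUMA=y."""
    return any(line.strip() == "CONFIG_NUMA=y" for line in config_text.splitlines())


def read_kernel_config():
    """Return the running kernel's config as text, or None if it is not exposed."""
    # Assumed locations; either may be missing on a given system.
    boot_path = "/boot/config-" + platform.release()
    if os.path.exists(boot_path):
        with open(boot_path, "r") as f:
            return f.read()
    if os.path.exists("/proc/config.gz"):
        with gzip.open("/proc/config.gz", "rt") as f:
            return f.read()
    return None


if __name__ == "__main__":
    text = read_kernel_config()
    if text is None:
        print("kernel config not found")
    else:
        print("CONFIG_NUMA enabled:", config_numa_enabled(text))
```

The exact-line comparison is deliberate, so that related options such as CONFIG_NUMA_BALANCING=y do not count as a match.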

The special file numa_node is documented in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is the ACPI of your PC wrong?)

    What:       /sys/bus/pci/devices/.../numa_node
    Date:       Oct 2014
    Contact:    Prarit Bhargava <prarit@redhat.com>
    Description:
            This file contains the NUMA node to which the PCI device is
            attached, or -1 if the node is unknown.  The initial value
            comes from an ACPI _PXM method or a similar firmware
            source.  If that is missing or incorrect, this file can be
            written to override the node.  In that case, please report
            a firmware bug to the system vendor.  Writing to this file
            taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
            reduces the supportability of your system.

There is a quick (kludge) workaround for this message: find the numa_node file of your GPU and, as root, run the following command after every boot, where NNNNN is the PCI id of your card (look it up in the lspci output and in the /sys/bus/pci/devices/ directory):

echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node

Or just echo it into every such file; this should be rather safe:

for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a "$a/numa_node"; done
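
Before writing to every device, it may be worth listing which PCI devices actually report -1. A minimal sketch, assuming the sysfs layout described above; the function name and its `root` parameter are made up here so that the scan can also be pointed at a test directory.

```python
from pathlib import Path


def devices_with_unknown_numa_node(root="/sys/bus/pci/devices"):
    """Return sorted names of PCI devices whose numa_node file holds a negative value."""
    unknown = []
    for dev in Path(root).iterdir():
        node_file = dev / "numa_node"
        try:
            value = int(node_file.read_text().strip())
        except (OSError, ValueError):
            # No numa_node file, unreadable, or unexpected content: skip the device.
            continue
        if value < 0:
            unknown.append(dev.name)
    return sorted(unknown)


if __name__ == "__main__":
    root = Path("/sys/bus/pci/devices")
    if root.is_dir():
        for name in devices_with_unknown_numa_node(root):
            print(name)
```

Reading these files does not require root; only the write in the workaround does.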

Also, your lshw output shows that this is not a physical PC but a Xen virtual guest. Something is going wrong between the Xen platform (ACPI) emulation and the Linux PCI bus NUMA-support code.