MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen
The message "successful NUMA node read from SysFS had negative value (-1)" is only a warning, not a fatal error. The real error is the MemoryError raised at File "model_new.py", line 85, in <module>. We would need more of your source code to diagnose it. In the meantime, try making your model smaller or running it on a server with more RAM.
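Before shrinking the model it can help to see how much RAM the machine actually has free. The following is a minimal sketch, assuming a Linux host that exposes /proc/meminfo; the helper name `meminfo_values` is hypothetical and introduced here only for illustration.

```python
def meminfo_values(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Size fields such as MemTotal and MemAvailable are reported in kB;
    a few fields (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines whose first token after the colon is a number.
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values
```

To use it, read the file and look at one field, for example `meminfo_values(open("/proc/meminfo").read()).get("MemAvailable")`. If that number is small compared with the size of the arrays built around line 85 of model_new.py, the MemoryError is simply the host running out of RAM.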
About NUMA node warning:
// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
...
  string filename =
      port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
  FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename
               << "\nYour kernel may have been built without NUMA support.";
    return kUnknownNumaNode;
  }
  ...
  if (port::safe_strto32(content, &value)) {
    if (value < 0) {  // See http://b/18228951 for details on this path.
      LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
                << value << "), but there must be at least one NUMA node"
                            ", so returning NUMA node zero";
      fclose(file);
      return 0;
    }
TensorFlow was able to open the file /sys/bus/pci/devices/%s/numa_node, where %s is the PCI bus ID of the GPU (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your PC is not multi-socket: it has a single CPU socket with an 8-core Xeon E5-2670 installed, so the value should be 0 (a single NUMA node is numbered 0 in Linux). The log message, however, says that the file contained -1!
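The logic of the quoted function can be mirrored in a few lines of Python to see what your system reports. This is only a sketch: `read_numa_node` and its `sysfs_root` parameter are hypothetical names introduced here and are not part of TensorFlow.

```python
import os


def read_numa_node(pci_bus_id, sysfs_root="/sys"):
    """Return (raw, effective) NUMA node values for one PCI device.

    raw is the integer stored in the numa_node file, or None if the file
    cannot be read or parsed. effective imitates the fallback in the quoted
    TensorFlow code: a negative value is replaced by node 0, and an
    unreadable file yields -1 (unknown).
    """
    path = os.path.join(sysfs_root, "bus", "pci", "devices",
                        pci_bus_id, "numa_node")
    try:
        with open(path) as f:
            raw = int(f.read().strip())
    except (OSError, ValueError):
        return None, -1
    return raw, max(raw, 0)
```

Called with the PCI ID of your card, the first element of the result would be expected to be -1 on a machine that shows the warning, and the second element 0, which is the node TensorFlow falls back to.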
So we know that sysfs is mounted at /sys, that the numa_node special file exists, and that CONFIG_NUMA is enabled in your Linux kernel config (check with zgrep NUMA /boot/config* /proc/config*). It is indeed enabled: CONFIG_NUMA=y is set in the deb package of your x86_64 4.4.0-78-generic kernel.
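The same check can be scripted. Below is a rough Python equivalent of the zgrep command, narrowed to the CONFIG_NUMA line only; the function name `find_config_numa` is hypothetical, and the default glob patterns are simply the ones from the command above.

```python
import glob
import gzip


def find_config_numa(patterns=("/boot/config*", "/proc/config*")):
    """Return {path: value} for every kernel config file that sets CONFIG_NUMA."""
    found = {}
    for pattern in patterns:
        for path in glob.glob(pattern):
            # /proc/config.gz is gzip-compressed; /boot/config-* is plain text.
            opener = gzip.open if path.endswith(".gz") else open
            try:
                with opener(path, "rt") as f:
                    for line in f:
                        if line.startswith("CONFIG_NUMA="):
                            found[path] = line.strip().split("=", 1)[1]
            except OSError:
                continue  # unreadable file: skip it
    return found
```

An empty result means no config file with a CONFIG_NUMA= line was found, which is not the same as NUMA being disabled: the config file may simply not be installed or exposed.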
The numa_node special file is documented in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (could the ACPI tables of your machine be wrong?):
What:		/sys/bus/pci/devices/.../numa_node
Date:		Oct 2014
Contact:	Prarit Bhargava <prarit@redhat.com>
Description:
		This file contains the NUMA node to which the PCI device is
		attached, or -1 if the node is unknown. The initial value
		comes from an ACPI _PXM method or a similar firmware
		source. If that is missing or incorrect, this file can be
		written to override the node. In that case, please report
		a firmware bug to the system vendor. Writing to this file
		taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
		reduces the supportability of your system.
There is a quick (kludge) workaround for this warning: find the numa_node file of your GPU and, as root, run the following command after every boot, where NNNNN is the PCI ID of your card (look it up in the lspci output and in the /sys/bus/pci/devices/ directory):
echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node
Or just write 0 into every such file; this should be fairly safe:
for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
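Before writing to every device it may be worth listing which ones actually report a negative node. The following read-only sketch does that; `devices_with_unknown_node` and its `sysfs_root` parameter are hypothetical names used for illustration.

```python
import glob
import os


def devices_with_unknown_node(sysfs_root="/sys"):
    """Return sorted PCI device IDs whose numa_node file holds a negative value."""
    pattern = os.path.join(sysfs_root, "bus", "pci", "devices", "*", "numa_node")
    unknown = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                value = int(f.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric file: ignore this device
        if value < 0:
            # The device ID is the name of the directory holding numa_node.
            unknown.append(os.path.basename(os.path.dirname(path)))
    return sorted(unknown)
```

The returned IDs are the directory names under /sys/bus/pci/devices/, so any of them can be substituted for NNNNN in the echo command above. The function only reads the files, so it does not taint the kernel.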
Also, your lshw output shows that this is not a physical PC but a Xen virtual guest. Something is going wrong between the Xen platform (ACPI) emulation and the Linux PCI bus NUMA support code.