
oom-killer kills java application in Docker - mismatch of memory usage reported


For the first question, it would help to see the exact JVM parameters you are running with.

As you note, there are several other memory regions besides heap, off-heap and metaspace; GC-related data structures are among them. If you want to cap the absolute memory used by the JVM, use -XX:MaxRAM, although there is a trade-off: you give up more granular control over the heap and the other regions. A common recommendation for containerized apps is:

-XX:MaxRAM=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
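For example, a container entrypoint could look roughly like the sketch below (the cgroup v1 path is assumed; on cgroup v2 the limit lives in /sys/fs/cgroup/memory.max, and /app/app.jar is just a placeholder):

#!/bin/sh
# Read the container's memory limit and cap the whole JVM at that value.
MEM_LIMIT=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
exec java -XX:MaxRAM="$MEM_LIMIT" -jar /app/app.jar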

Getting accurate usage measurements is not trivial. This thread from the Mechanical Sympathy list is relevant to the topic. I'll spare the copy-pasting, but the link lands on a comment from Gil Tene whose second paragraph is particularly relevant: memory reported is memory actually touched, not memory allocated. Gil suggests using -XX:+AlwaysPreTouch to "make sure all heap pages were actually touched (which would force physical memory to actually be allocated, which will make them show up in the used balance)". Related to this: your total_vm is 2.44 GB, and while that is not all in physical memory (as per *_rss), it shows that the process may be reserving far more memory, some of which may eventually be pulled into the RSS.
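To illustrate, a launch line along these lines would force the whole heap to be committed and touched at startup, so RSS reflects it immediately (the -Xmx value here is purely an example, not a recommendation):

# Touch every heap page at startup so RSS shows the full heap from the start.
java -Xmx350m -XX:+AlwaysPreTouch -jar /app/app.jar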

With the data available, I think the best pointer comes from the heap graph. Your app's workload clearly changes at ~18:20: there is more churn, implying more allocations and more GC work. The thread spike may not be an issue in itself, as you say, but it does add to the JVM's memory usage (those ~25 additional threads may need more than 25 MB, depending on your -Xss). The app's baseline is already near the container's limit, so it's plausible that the extra memory pressure pushes it dangerously close to OOM territory.
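If you want to check what each thread costs, the default stack size can be read from the JVM's final flags, for example:

# ThreadStackSize is reported in KB; e.g. 1024 means roughly 1 MB per thread.
java -XX:+PrintFlagsFinal -version | grep ThreadStackSize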

Moving to the second question (I'm not a Linux expert, so this is closer to speculation): in your cgroup stats the mismatch is in the RSS sizes. AFAIK, RSS accounting may include pages that are still in the swap cache, which could explain the mismatch. Looking at your logs:

memory: usage 491520kB, limit 491520kB, failcnt 28542

memory+swap: usage 578944kB, limit 983040kB, failcnt 0

Physical memory is indeed full, and you are swapping. My guess is that the same object churn that causes more frequent GC cycles also results in data being swapped out (which is where the accounting mismatch may happen). You don't provide I/O stats from before the oom-kill, but those would help confirm that the app is indeed swapping, and at what rate. Also, disabling swap for the container might help: it avoids spilling to swap, confines the churn to the JVM itself, and makes it easier to find the right -XX:MaxRAM or -Xmx.
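One way to do that (the limit values and image name are just placeholders) is to set the memory+swap limit equal to the memory limit when starting the container, which effectively disables swap for it:

# With --memory-swap equal to --memory, the container gets no swap at all.
docker run --memory=480m --memory-swap=480m my-java-app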

I hope that helps!


OK, this is a really late answer, more of an observation really. When I tried using -XX:MaxRAM, the OOM killer still kicked in. Also, how were the NMT readings for the Java process?
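For reference, a rough way to collect those NMT readings (replace <pid> with the JVM's process id):

# Start the JVM with Native Memory Tracking enabled...
java -XX:NativeMemoryTracking=summary -jar /app/app.jar
# ...then query the per-area breakdown while it runs.
jcmd <pid> VM.native_memory summary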

Do have a look at this article as well.