Rails query execution causes database spikes Rails query execution causes database spikes multithreading multithreading

Rails query execution causes database spikes


I would suggest a couple of things - two possible solutions, one testing/reproduction approach, and one suggestion for deeper metrics.

1) Possible quick solution: Spin-off that 1 minute job so it is non-blocking. See if the problem resolves from that. Try using Redis+Sidekiq which is pretty simple to get up and running (or something similar).

2) Second possible solution: Look for any full table locks or exclusive row locks being made to Postgres - see if you EVER make full table locks, and if so, find the offending statement and eliminate it.

3) Testing/replication: For testing, see if you can replicate this problem outside of production. I'd recommend jmeter as a very useful tool to simulate a lot of requests and requests of different types, and see if you can repro this in a controlled/staging context. Consistent replication is the key to resolving issues like this. Refer to your production server logs around the time that the issue occurs to generate your jmeter test requests that will hopefully help reproduce the issue.

If you can figure out a way to replicate it, then you can start tuning the simulation to see if removing or increasing/decreasing various requests eliminates the problem or changes the problem in some way..

4) Analytics: Install NewRelic or similar analytics gem to get a deeper insight into what's going on when that request comes in. You really want to get a solid picture as to whether the request is truly blocked in Postgres (by an exclusive row/table lock that is blocking your query) or whether you are backed up by a slow running query in the Puma execution queue, or somewhere inside Ruby there's an unfortunate wait state somehow.

You don't yet have enough information to solve this issue, so you really want to start exploring solutions by collecting data, alongside hypotheses of what's happening.

My general strategy for this kind of problem is (in this order):

  • Try some quick/easy/safe fixes (in the blind) and see if anything is resolved/changed.
  • Try to replicate in a non-production environment (really, really try to make this work).
  • Instrument the system to collect data to see what you can learn about the problem and anything related.