Gunicorn does not respond to more than 6 requests at a time
What you describe appears to be an indicator that you are running the Gunicorn server with the sync worker class while serving an I/O bound application. Can you share your Gunicorn configuration?
Is it possible that Google's platform has some kind of autoscaling feature (I'm not really familiar with their service) that's being triggered, while your Kubernetes configuration doesn't have one?
Generally speaking, increasing the number of cores for a single instance will only help if you also increase the number of workers spawned to handle incoming requests. Please see Gunicorn's design documentation, with special emphasis on the worker types section (and why sync workers are suboptimal for I/O bound applications); it's a good read and provides a more detailed explanation of this problem.
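To make that concrete, here's a minimal `gunicorn.conf.py` sketch along those lines. The worker count uses the `(2 × cores) + 1` rule of thumb from Gunicorn's docs; the bind address and worker class are illustrative assumptions you'd adjust for your own deployment:

```python
# gunicorn.conf.py - a minimal sketch, not a production-tuned config
import multiprocessing

# Rule of thumb from Gunicorn's docs: (2 x cores) + 1 workers
workers = multiprocessing.cpu_count() * 2 + 1

# An async worker class suits I/O bound applications better than "sync"
worker_class = "gevent"

bind = "0.0.0.0:8000"
```

Start the server with `gunicorn -c gunicorn.conf.py app:app` and it will pick up these settings.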
Just for fun, here's a small exercise to compare the two approaches:
```python
import time

def app(env, start_response):
    time.sleep(1)  # takes 1 second to process the request
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello World']
```
Running Gunicorn with 4 sync workers:

```shell
gunicorn --bind '127.0.0.1:9001' --workers 4 --worker-class sync --chdir app app:app
```
Let's trigger 8 requests at the same time:

```shell
ab -n 8 -c 8 "http://localhost:9001/"
```
```
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done

Server Software:        gunicorn/19.8.1
Server Hostname:        localhost
Server Port:            9001

Document Path:          /
Document Length:        11 bytes

Concurrency Level:      8
Time taken for tests:   2.007 seconds
Complete requests:      8
Failed requests:        0
Total transferred:      1096 bytes
HTML transferred:       88 bytes
Requests per second:    3.99 [#/sec] (mean)
Time per request:       2006.938 [ms] (mean)
Time per request:       250.867 [ms] (mean, across all concurrent requests)
Transfer rate:          0.53 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.2      1       1
Processing:  1003 1504 535.7   2005    2005
Waiting:     1002 1504 535.8   2005    2005
Total:       1003 1505 535.8   2006    2006

Percentage of the requests served within a certain time (ms)
  50%   2006
  66%   2006
  75%   2006
  80%   2006
  90%   2006
  95%   2006
  98%   2006
  99%   2006
 100%   2006 (longest request)
```
Around 2 seconds to complete the test. That's the behavior you saw in your tests: the first 4 requests kept your workers busy, and the second batch was queued until the first batch was processed.
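The arithmetic behind that 2-second figure: with sync workers, requests are served in batches of at most `workers` at a time, so the total time is roughly `ceil(requests / workers)` batches times the per-request latency:

```python
import math

workers = 4
requests = 8
seconds_per_request = 1

# Each sync worker handles one request at a time, so 8 requests
# are served in ceil(8 / 4) = 2 sequential batches.
batches = math.ceil(requests / workers)
total_seconds = batches * seconds_per_request
print(total_seconds)  # 2, matching the ~2.007s ApacheBench reported
```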
Same test, but let's tell Gunicorn to use an async worker:

```shell
gunicorn --bind '127.0.0.1:9001' --workers 4 --worker-class gevent --chdir app app:app
```
Same test as above:

```shell
ab -n 8 -c 8 "http://localhost:9001/"
```
```
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done

Server Software:        gunicorn/19.8.1
Server Hostname:        localhost
Server Port:            9001

Document Path:          /
Document Length:        11 bytes

Concurrency Level:      8
Time taken for tests:   1.005 seconds
Complete requests:      8
Failed requests:        0
Total transferred:      1096 bytes
HTML transferred:       88 bytes
Requests per second:    7.96 [#/sec] (mean)
Time per request:       1005.463 [ms] (mean)
Time per request:       125.683 [ms] (mean, across all concurrent requests)
Transfer rate:          1.06 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.4      1       2
Processing:  1002 1003   0.6   1003    1004
Waiting:     1001 1003   0.9   1003    1004
Total:       1002 1004   0.9   1004    1005

Percentage of the requests served within a certain time (ms)
  50%   1004
  66%   1005
  75%   1005
  80%   1005
  90%   1005
  95%   1005
  98%   1005
  99%   1005
 100%   1005 (longest request)
```
We actually doubled the application's throughput here: it only took ~1 second to reply to all the requests.
To understand what happened, Gevent has a great tutorial about its architecture, and this article has a more in-depth explanation of coroutines.
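Gevent's greenlets are coroutines under the hood: while one greenlet waits on I/O (or a monkey-patched `time.sleep`), the others get to run. Here's a rough stdlib analogue using `asyncio` rather than gevent itself, just to show the cooperative-scheduling effect on our sleeping handler:

```python
import asyncio
import time

async def handle(i):
    # Awaiting here yields control to the event loop, much like
    # gevent's patched time.sleep yields to other greenlets.
    await asyncio.sleep(0.1)
    return i

async def main():
    start = time.monotonic()
    # Launch all 8 "requests" concurrently on a single thread.
    results = await asyncio.gather(*(handle(i) for i in range(8)))
    return len(results), time.monotonic() - start

completed, elapsed = asyncio.run(main())
# All 8 sleeps of 0.1s overlap, so the total wall time is ~0.1s, not 0.8s.
print(completed, elapsed)
```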
I apologize in advance if I was way off on the actual cause of your problem (I do believe some additional information is missing from your initial comment for anyone to give a conclusive answer). If not to you, I hope this'll be helpful to someone else. :)
Also, do note that I've oversimplified things a lot (my example was a simple proof of concept); tweaking an HTTP server's configuration is mostly a trial-and-error exercise, and it all depends on the type of workload the application has and the hardware it sits on.