Persistent DB Connection in Django/WSGI application


worker daemon managing the connection

Your picture currently looks like:

user  ----------->  webserver  <--------[1]-->  3rd party DB

connection [1] is expensive.

You could solve this with:

user ---->  webserver  <--->  task queue[1]  <--->  worker daemon  <--[2]-> 3rd party DB

[1] task queue can be redis, celery or rabbitmq.
[2] worker daemon keeps connection open.

A worker daemon would make the connection to the 3rd party database and keep it open, so that individual requests would not have to pay the connection cost. The task queue provides the inter-process communication: it dispatches work to the daemon, which runs the queries against the 3rd party DB. The webserver should stay as light as possible in terms of processing and let the workers do the expensive tasks.
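A minimal sketch of this architecture, using the standard-library `multiprocessing` module in place of a real task queue (all names here, including `expensive_connect` and `dispatch`, are hypothetical stand-ins):

```python
import multiprocessing as mp

def expensive_connect():
    # Hypothetical stand-in for the costly 3rd-party connection setup
    return {'connected': True}

def worker(task_queue, result_queue):
    conn = expensive_connect()          # paid once, at daemon startup
    while True:
        query = task_queue.get()
        if query is None:               # sentinel: shut the daemon down
            break
        # Every request reuses the already-open connection
        result_queue.put('result of %r via open connection' % query)

def dispatch(queries):
    """What each web request does: enqueue work, await the answer."""
    tasks, results = mp.Queue(), mp.Queue()
    daemon = mp.Process(target=worker, args=(tasks, results))
    daemon.start()
    answers = []
    for q in queries:
        tasks.put(q)
        answers.append(results.get())
    tasks.put(None)
    daemon.join()
    return answers
```

In a real deployment the in-process queues would be replaced by redis, celery or rabbitmq, so that the daemon can outlive any single webserver process.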

preloading with apache + mod_wsgi

You can actually preload and have the expensive connection established before the first request. This is done with the WSGIImportScript configuration directive. I can't remember off the top of my head whether a preload + forking configuration means each request will already have the connection opened and shared; but since you already have most of the code, this would be an easy experiment.
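As a sketch (paths, process-group name and the `preload.py` script are all hypothetical), the directive is registered alongside the daemon process definition, and the named script runs once when the daemon process starts:

```apache
# Run preload.py at daemon startup, before the first request arrives;
# module-level code in preload.py would open the expensive connection.
WSGIDaemonProcess myapp processes=1 threads=15
WSGIImportScript /path/to/django/root/preload.py process-group=myapp application-group=%{GLOBAL}
```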

preloading with uwsgi

uwsgi supports preloading too. This is done with the import directive.
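A minimal sketch, assuming a hypothetical module `myapp.preload` whose module-level code opens the connection:

```ini
; uwsgi.ini -- module names are hypothetical
[uwsgi]
module = myapp.wsgi:application
; imported once at startup, before any request is served
import = myapp.preload
```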


As far as I can tell, you have ruled out most (all?) of the common solutions to this type of problem:

  • Store connection in a dictionary ... need N workers and can't guarantee which request goes to which worker
  • Store data in cache ... too much data
  • Store connection info in cache ... connection is not serialisable

As far as I can see there is really only one 'meta' solution to this: use @Gahbu's suggestion of a dictionary and guarantee that the requests for a given user go to the same worker, i.e. figure out a way to map the User object to a given worker the same way every time (maybe hash their name and MOD by the number of workers?).
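The hash-and-MOD mapping can be sketched as follows (the function name is hypothetical; note that Python's built-in `hash()` is randomized per process via `PYTHONHASHSEED`, so a stable digest is needed for the mapping to agree across processes):

```python
import hashlib

def worker_for(username, n_workers):
    # Deterministic mapping: the same user always lands on the same worker.
    # MD5 gives the same digest in every process, unlike built-in hash().
    digest = hashlib.md5(username.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_workers
```

A front-end proxy or custom allocator would compute `worker_for(user, N)` and route the request to that worker.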

This solution would not make the most of your N workers if the currently active Users all mapped to the same worker, but if all Users are equally likely to be active at the same time then the work should be equally spread. (If they are not all equally likely then the mapping may be able to account for that).

The two possible ways I can think of doing this would be:

1. Write a custom request allocator

I'm not really familiar with the apache/wsgi interfacing land but ... it might be possible to replace the component within your Apache server that dispatches the HTTP requests to the workers with some custom logic, such that it always dispatches to the same process.

2. Run a load-balancer/proxy in front of N single threaded workers

I'm not sure if you can use a ready to go package here or not, but the concept would be:

  • Run a proxy that implements this 'bind the User to an index' logic
  • Have the proxy then forward the requests to one of N copies of your Apache/wsgi webserver which each has a single worker.

NB: This second idea I came across here: https://github.com/benoitc/gunicorn/issues/183

Summary

For both options the implementation in your existing application is pretty simple. Your application just changes to use a dictionary for storing the persistent connection (creating one if there isn't one already). Testing a single instance is the same in dev as in production. In production, the instances themselves are none the wiser that they are always asked about the same users.

I like Option 2 here for the following reasons:

  • Maybe there is an existing server package that allows you to define this proxy trickiness
  • If not, creating a custom proxy application to sit in front of your current application might not be too hard (especially considering the restrictions you are (already) under when the request reaches the strangedb service)


Rather than having multiple worker processes, you can use the WSGIDaemonProcess directive to have multiple worker threads which all run in a single process. That way, all the threads can share the same DB connection mapping.

With something like this in your apache config...

```apache
# mydomain.com.conf
<VirtualHost *:80>
    ServerName mydomain.com
    ServerAdmin webmaster@mydomain.com

    <Directory />
        Require all granted
    </Directory>

    WSGIDaemonProcess myapp processes=1 threads=50 python-path=/path/to/django/root display-name=%{GROUP}
    WSGIProcessGroup myapp
    WSGIScriptAlias / /path/to/django/root/myapp/wsgi.py
</VirtualHost>
```

...you can then use something as simple as this in your Django app...

```python
# views.py
import thread
from django.http import HttpResponse

# A global variable to hold the connection mappings
DB_CONNECTIONS = {}

# Fake up this "strangedb" module
class strangedb(object):
    class connection(object):
        def query(self, *args):
            return 'Query results for %r' % args
    @classmethod
    def connect(cls, *args):
        return cls.connection()

# View for homepage
def home(request, username='bob'):
    # Remember thread ID
    thread_info = 'Thread ID = %r' % thread.get_ident()
    # Connect only if we're not already connected
    if username in DB_CONNECTIONS:
        strangedb_connection = DB_CONNECTIONS[username]
        db_info = 'We reused an existing connection for %r' % username
    else:
        strangedb_connection = strangedb.connect(username)
        DB_CONNECTIONS[username] = strangedb_connection
        db_info = 'We made a connection for %r' % username
    # Fake up some query
    results = strangedb_connection.query('SELECT * FROM my_table')
    # Fake up an HTTP response
    text = '%s\n%s\n%s\n' % (thread_info, db_info, results)
    return HttpResponse(text, content_type='text/plain')
```

...which, on the first hit, produces...

```
Thread ID = 140597557241600
We made a connection for 'bob'
Query results for 'SELECT * FROM my_table'
```

...and, on the second...

```
Thread ID = 140597145999104
We reused an existing connection for 'bob'
Query results for 'SELECT * FROM my_table'
```

Obviously, you'll need to add something to tear down the DB connections when they're no longer required, but it's tough to know the best way to do that without more info about how your app is supposed to work.

Update #1: Regarding I/O multiplexing vs multithreading

I've worked with threads twice in my life, and each time it was a nightmare. A lot of time was wasted debugging non-reproducible problems. I think an event-driven, non-blocking I/O architecture might be more solid.

A solution using I/O multiplexing might be better, but would be more complex, and would also require your "strangedb" library to support it, i.e. it would have to be able to handle EAGAIN/EWOULDBLOCK and have the capacity to retry the system call when necessary.
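To make that concrete, here is a small stand-alone illustration of what "handle EAGAIN/EWOULDBLOCK and retry" means, using a non-blocking socket pair in place of the "strangedb" link (the socket pair is purely a stand-in):

```python
import errno
import selectors
import socket

# A connected pair of sockets stands in for the "strangedb" connection
a, b = socket.socketpair()
a.setblocking(False)

def try_recv(sock):
    # Non-blocking read: returns data, or None on EAGAIN/EWOULDBLOCK
    try:
        return sock.recv(1024)
    except BlockingIOError as e:
        assert e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
        return None

first = try_recv(a)           # nothing to read yet -> EAGAIN -> None

b.send(b'row data')
sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ)
sel.select(timeout=1)         # wait until the socket is actually readable
second = try_recv(a)          # retry the call; now it succeeds
```

An event loop built on this pattern can multiplex many such connections in a single thread, but only if the library exposes its sockets rather than hiding them behind blocking calls.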

Multithreading in Python is far less dangerous than in most other languages, due to Python's GIL, which, in essence, makes all Python bytecode thread-safe.

In practice, threads only run concurrently when the underlying C code uses the Py_BEGIN_ALLOW_THREADS macro, which, together with its counterpart Py_END_ALLOW_THREADS, is typically wrapped around system calls and CPU-intensive operations.

The upside of this is that it's almost impossible to have a thread collision in Python code, although the downside is that it won't always make optimal use of multiple CPU cores on a single machine.
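A quick demonstration of that behaviour: blocking calls such as `time.sleep` release the GIL, so I/O-style waits in multiple threads overlap rather than run one after another.

```python
import threading
import time

def blocking_io():
    # time.sleep releases the GIL, like any blocking system call wrapped
    # in Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS
    time.sleep(0.2)

start = time.time()
threads = [threading.Thread(target=blocking_io) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# five 0.2s waits complete in roughly 0.2s of wall-clock time, not 1.0s
```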

The reason I suggest the above solution is that it's relatively simple, and would require minimal code changes, but there may be a better option if you could elaborate more on your "strangedb" library. It seems rather odd to have a DB which requires a separate network connection per concurrent user.

Update #2: Regarding multiprocessing vs multithreading

...the GIL limitations around threading seem to be a bit of an issue. Isn't this one of the reasons why the trend is to use separate processes instead?

That's quite possibly the main reason why Python's multiprocessing module exists, i.e. to provide concurrent execution of Python bytecode across multiple CPU cores, although there is an undocumented ThreadPool class in that module, which uses threads rather than processes.
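The ThreadPool class mirrors the familiar `multiprocessing.Pool` API but runs the work in threads rather than child processes; a minimal sketch (the `fake_io` task is a hypothetical stand-in for an I/O-bound job):

```python
from multiprocessing.pool import ThreadPool

def fake_io(n):
    # stand-in for an I/O-bound task
    return n * n

pool = ThreadPool(4)          # threads, not processes, despite the module name
squares = pool.map(fake_io, range(5))
pool.close()
pool.join()
```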

The "GIL limitations" would certainly be problematic in cases where you really need to exploit every single CPU cycle on every CPU core, e.g. if you were writing a computer game which had to render 60 frames per second in high-definition.

Most web-based services, however, are likely to spend most of their time waiting for something to happen, e.g. network I/O or disk I/O, which Python threads will allow to occur concurrently.

Ultimately, it's a trade-off between performance and maintainability, and given that hardware is usually much cheaper than a developer's time, favoring maintainability over performance is usually more cost-effective.

Frankly, the moment you decide to use a virtual machine language, such as Python, instead of a language which compiles into real machine code, such as C, you're already saying that you're prepared to sacrifice some performance in exchange for convenience.

See also The C10K problem for a comparison of techniques for scaling web-based services.