Why is iterating through a large Django QuerySet consuming massive amounts of memory?

Nate C was close, but not quite.

From the docs:

You can evaluate a QuerySet in the following ways:

  • Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:

    for e in Entry.objects.all():
        print(e.headline)

So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.

From my reading of the docs, iterator() does nothing more than bypass the QuerySet's internal caching mechanisms. I think it might make sense for it to do a one-by-one fetch, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.

Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes; one common pattern is sketched below.
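A common pattern from those snippets (a sketch, not the one true implementation; queryset_iterator and chunk_size are illustrative names, and it assumes an auto-incrementing integer primary key) walks the table in pk-ordered chunks, so only one chunk of objects is in memory at a time:

    def queryset_iterator(queryset, chunk_size=1000):
        """Yield queryset objects one at a time, fetching pk-ordered chunks."""
        pk = 0
        queryset = queryset.order_by('pk')
        while True:
            # Fetch the next chunk of rows strictly after the last pk we saw.
            chunk = list(queryset.filter(pk__gt=pk)[:chunk_size])
            if not chunk:
                break
            for obj in chunk:
                yield obj
            pk = chunk[-1].pk

Because each query filters on an indexed pk comparison rather than an OFFSET, later chunks are no slower to fetch than early ones.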


Might not be the fastest or most efficient, but as a ready-made solution why not use Django core's Paginator and Page objects, documented here:

https://docs.djangoproject.com/en/dev/topics/pagination/

Something like this:

    from django.core.paginator import Paginator
    from djangoapp.models import model

    paginator = Paginator(model.objects.all(), 1000)  # chunks of 1000, you can
                                                      # change this to desired chunk size
    for page in range(1, paginator.num_pages + 1):
        for row in paginator.page(page).object_list:
            # here you can do whatever you want with the row
            pass
        print("done processing page %s" % page)
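One caveat worth knowing: on newer Django versions, handing Paginator an unordered queryset raises an UnorderedObjectListWarning, since page contents are only deterministic with an explicit ordering. A minimal fix is to order by primary key:

    # Ordering makes the page boundaries stable across queries.
    paginator = Paginator(model.objects.order_by('pk'), 1000)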


Django's default behavior is to cache the whole result of the QuerySet when it evaluates the query. You can use the QuerySet's iterator method to avoid this caching:

    for event in Event.objects.all().iterator():
        print(event)

https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator

The iterator() method evaluates the queryset and then reads the results directly without doing caching at the QuerySet level. This method results in better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.
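If you're on Django 2.0 or later, iterator() also accepts a chunk_size argument, and on databases with server-side cursor support (PostgreSQL, Oracle) it streams rows rather than loading the whole result set. A minimal sketch, reusing the Event model from above:

    # Assumes Django 2.0+; chunk_size controls how many rows are pulled
    # from the database driver at a time (via a server-side cursor on
    # PostgreSQL/Oracle).
    for event in Event.objects.all().iterator(chunk_size=2000):
        print(event)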

Using iterator() reduces memory usage for me, but it is still higher than I expected. The paginator approach suggested by mpaf uses much less memory, but is 2-3x slower for my test case. Wrapping it in a generator keeps the calling code simple:

    from django.core.paginator import Paginator

    def chunked_iterator(queryset, chunk_size=10000):
        paginator = Paginator(queryset, chunk_size)
        for page in range(1, paginator.num_pages + 1):
            for obj in paginator.page(page).object_list:
                yield obj

    for event in chunked_iterator(Event.objects.all()):
        print(event)
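One likely reason for the speed difference: each paginator.page() call issues a separate LIMIT/OFFSET query, and OFFSET forces the database to scan past every skipped row, so later pages get progressively slower on big tables. The pk-chunking sketch earlier in this thread avoids that by filtering on an indexed primary key instead.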