Django uploads: Discard uploaded duplicates, use existing file (md5 based check)


Thanks to alTus's answer, I figured out that writing a custom storage class is the key, and it was easier than expected.

  • I skip the call to the superclass's _save method if the file is already there, and just return the name.
  • I override get_available_name, to avoid getting a suffix appended to the file name when a file with the same name already exists.

I don't know if this is the proper way of doing it, but it works fine so far.

Hope this is useful!

Here's the complete sample code:

import hashlib
import os

from django.core.files.storage import FileSystemStorage
from django.db import models


class MediaFileSystemStorage(FileSystemStorage):
    def get_available_name(self, name, max_length=None):
        if max_length and len(name) > max_length:
            raise(Exception("name's length is greater than max_length"))
        return name

    def _save(self, name, content):
        if self.exists(name):
            # if the file exists, do not call the superclass's _save method
            return name
        # if the file is new, DO call it
        return super(MediaFileSystemStorage, self)._save(name, content)


def media_file_name(instance, filename):
    h = instance.md5sum
    basename, ext = os.path.splitext(filename)
    return os.path.join('mediafiles', h[0:1], h[1:2], h + ext.lower())


class Media(models.Model):
    # use the custom storage class for the FileField
    orig_file = models.FileField(
        upload_to=media_file_name, storage=MediaFileSystemStorage())
    md5sum = models.CharField(max_length=36)
    # ...

    def save(self, *args, **kwargs):
        if not self.pk:  # file is new
            md5 = hashlib.md5()
            for chunk in self.orig_file.chunks():
                md5.update(chunk)
            self.md5sum = md5.hexdigest()
        super(Media, self).save(*args, **kwargs)
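To see the effect without a Django project, the same idea can be sketched with the standard library alone. This is only an illustration, not the storage class above: store_once and its root argument are hypothetical names, and the content is assumed to arrive as an iterable of byte chunks that is small enough to hold in memory.

```python
import hashlib
import os
import tempfile


def store_once(root, filename, chunks):
    """Write content under an md5-derived path unless it is already there.

    Returns (relative_path, written); written is False for a duplicate.
    Hypothetical helper, not part of Django.
    """
    chunks = list(chunks)  # kept in memory so we can hash first, then write
    md5 = hashlib.md5()
    for chunk in chunks:
        md5.update(chunk)
    h = md5.hexdigest()

    # same layout as media_file_name: mediafiles/<1st char>/<2nd char>/<hash><ext>
    ext = os.path.splitext(filename)[1].lower()
    rel_path = os.path.join('mediafiles', h[0:1], h[1:2], h + ext)
    full_path = os.path.join(root, rel_path)

    if os.path.exists(full_path):
        # duplicate upload: discard it and reuse the existing file
        return rel_path, False

    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
    return rel_path, True


root = tempfile.mkdtemp()
first = store_once(root, 'photo.JPG', [b'same ', b'bytes'])
second = store_once(root, 'copy.jpg', [b'same ', b'bytes'])
```

Both calls return the same relative path, but only the first one writes to disk. Note that the extension is part of the path, so identical bytes uploaded with different extensions would still be stored twice.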


AFAIK you can't easily implement this using the save/delete methods, because files are handled in a rather specific way.

But you could try something like this.

First, my simple md5 file hash function:

import hashlib


def md5_for_file(chunks):
    # feed the file to md5 chunk by chunk so it never has to fit in memory
    md5 = hashlib.md5()
    for data in chunks:
        md5.update(data)
    return md5.hexdigest()
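Outside Django there is no chunks() method, so here is a small sketch of how the function could be fed from any binary file-like object. read_in_chunks is a hypothetical helper added for this example, and the function is repeated so the snippet runs on its own.

```python
import hashlib
import io


def md5_for_file(chunks):
    md5 = hashlib.md5()
    for data in chunks:
        md5.update(data)
    return md5.hexdigest()


def read_in_chunks(fileobj, chunk_size=8192):
    """Yield successive byte blocks until read() returns b'' (hypothetical helper)."""
    return iter(lambda: fileobj.read(chunk_size), b'')


stream = io.BytesIO(b'hello world')
# a tiny chunk size just to show that chunking does not change the digest
digest = md5_for_file(read_in_chunks(stream, chunk_size=4))
```

The digest is the same as hashing the whole content in one call, whatever chunk size is used.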

Next, simple_upload_to is something like your media_file_name function. You should use it like this:

import os

from django.db import models


def simple_upload_to(field_name, path='files'):
    def upload_to(instance, filename):
        name = md5_for_file(getattr(instance, field_name).chunks())
        dot_pos = filename.rfind('.')
        # keep at most 10 characters of the extension, including the dot
        ext = filename[dot_pos:][:10].lower() if dot_pos > -1 else '.unknown'
        name += ext
        return os.path.join(path, name[:2], name)
    return upload_to


class Media(models.Model):
    # see info about storage below
    orig_file = models.FileField(upload_to=simple_upload_to('orig_file'), storage=MyCustomStorage())

Of course, it's just an example, so the path generation logic can vary.
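To make the resulting paths concrete, here is the name-building part of upload_to pulled out into a standalone function. build_name is a hypothetical name used only for this sketch, and the digest is a placeholder rather than a real hash.

```python
import os


def build_name(digest, filename, path='files'):
    """Same logic as upload_to above, with the digest passed in directly."""
    dot_pos = filename.rfind('.')
    # everything from the last dot on, capped at 10 characters, lowercased
    ext = filename[dot_pos:][:10].lower() if dot_pos > -1 else '.unknown'
    name = digest + ext
    # the first two characters of the hash become a subdirectory
    return os.path.join(path, name[:2], name)


digest = 'ab' + '0' * 30  # placeholder 32-character hex digest
with_ext = build_name(digest, 'Holiday.Photo.JPG')
without_ext = build_name(digest, 'README')
```

Here with_ext ends in the digest followed by .jpg inside files/ab, and without_ext gets the .unknown fallback because the original name has no dot.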

And the most important part:

from django.core.files.storage import FileSystemStorage


class MyCustomStorage(FileSystemStorage):
    # max_length is passed by Django 1.8 and later
    def get_available_name(self, name, max_length=None):
        # return the name unchanged instead of looking for a free one
        return name

    def _save(self, name, content):
        if self.exists(name):
            self.delete(name)
        return super(MyCustomStorage, self)._save(name, content)

As you can see, this custom storage deletes the existing file before saving and then saves the new one under the same name. So this is the place to implement your own logic if it is important NOT to delete (and thus replace) existing files.

You can find more about storages here: https://docs.djangoproject.com/en/1.5/ref/files/storage/


I had the same issue and found this SO question. As this is not too uncommon, I searched the web and found the following Python package, which seems to do exactly what you want:

https://pypi.python.org/pypi/django-hashedfilenamestorage

If SHA1 hashes are out of the question, I think a pull request to add MD5 hashing support would be a great idea.
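As a rough idea of how little separates the two, here is a sketch of a hash function with a selectable algorithm, built on hashlib.new from the standard library. It is not taken from that package and says nothing about its API; hash_for_chunks is a hypothetical name.

```python
import hashlib


def hash_for_chunks(chunks, algorithm='sha1'):
    """Hex digest of the chunks using any algorithm hashlib knows by name."""
    hasher = hashlib.new(algorithm)
    for data in chunks:
        hasher.update(data)
    return hasher.hexdigest()


sha1_name = hash_for_chunks([b'same ', b'bytes'])
md5_name = hash_for_chunks([b'same ', b'bytes'], algorithm='md5')
```

The SHA1 name is 40 hex characters long and the MD5 name 32, so any field or path length limit has to fit the algorithm chosen.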