What is the recommended way to delete a large number of items from DynamoDB? What is the recommended way to delete a large number of items from DynamoDB? database database

What is the recommended way to delete a large number of items from DynamoDB?


What I ideally want to do is call LogTable.DeleteItem(user_id) - Without supplying the range, and have it delete everything for me.

An understandable request indeed; I can imagine advanced operations like these might get added over time by the AWS team (they have a history of starting with a limited feature set first and evaluate extensions based on customer feedback), but here is what you should do to avoid the cost of a full scan at least:

  1. Use Query rather than Scan to retrieve all items for user_id - this works regardless of the combined hash/range primary key in use, because HashKeyValue and RangeKeyCondition are separate parameters in this API and the former only targets the Attribute value of the hash component of the composite primary key..

    • Please note that you''ll have to deal with the query API paging here as usual, see the ExclusiveStartKey parameter:

      Primary key of the item from which to continue an earlier query. An earlier query might provide this value as the LastEvaluatedKey if that query operation was interrupted before completing the query; either because of the result set size or the Limit parameter. The LastEvaluatedKey can be passed back in a new query request to continue the operation from that point.

  2. Loop over all returned items and either facilitate DeleteItem as usual

    • Update: Most likely BatchWriteItem is more appropriate for a use case like this (see below for details).

Update

As highlighted by ivant, the BatchWriteItem operation enables you to put or delete several items across multiple tables in a single API call [emphasis mine]:

To upload one item, you can use the PutItem API and to delete one item, you can use the DeleteItem API. However, when you want to upload or delete large amounts of data, such as uploading large amounts of data from Amazon Elastic MapReduce (EMR) or migrate data from another database in to Amazon DynamoDB, this API offers an efficient alternative.

Please note that this still has some relevant limitations, most notably:

  • Maximum operations in a single request — You can specify a total of up to 25 put or delete operations; however, the total request size cannot exceed 1 MB (the HTTP payload).

  • Not an atomic operation — Individual operations specified in a BatchWriteItem are atomic; however BatchWriteItem as a whole is a "best-effort" operation and not an atomic operation. That is, in a BatchWriteItem request, some operations might succeed and others might fail. [...]

Nevertheless this obviously offers a potentially significant gain for use cases like the one at hand.


According to the DynamoDB documentation you could just delete the full table.

See below:

"Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations"

If you wish to delete only a subset of your data, then you could make separate tables for each month, year or similar. This way you could remove "last month" and keep the rest of your data intact.

This is how you delete a table in Java using the AWS SDK:

DeleteTableRequest deleteTableRequest = new DeleteTableRequest()  .withTableName(tableName);DeleteTableResult result = client.deleteTable(deleteTableRequest);


If you want to delete items after some time, e.g. after a month, just use Time To Live option. It will not count write units.

In your case, I would add ttl when logs expire and leave those after a user is deleted. TTL would make sure logs are removed eventually.

When Time To Live is enabled on a table, a background job checks the TTL attribute of items to see if they are expired.

DynamoDB typically deletes expired items within 48 hours of expiration. The exact duration within which an item truly gets deleted after expiration is specific to the nature of the workload and the size of the table. Items that have expired and not been deleted will still show up in reads, queries, and scans. These items can still be updated and successful updates to change or remove the expiration attribute will be honored.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.htmlhttps://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html