DynamoDb Incremental Backups – Part Three

If you missed the first two parts of this series, check out Part One and Part Two.

In this post, I’ll go through the inner workings of our Lambda Function.

For our solution, the Lambda function doesn’t need to do much, and really shouldn’t. It needs to be reliable and self-healing.

All we want to do is take the event that is passed from the DynamoDb Stream and push it to the appropriate location in an S3 Bucket.

Check out the source code for the lambda function here: Github Repository

Let’s step through an example of what it will do with this event:

{
    "Records":[
        {
            "eventName":"INSERT",
            "eventVersion":"1.0",
            "eventSource":"aws:dynamodb",
            "dynamodb": {
                "NewImage":{
                    "range": {
                        "N": "1"
                    },
                    "id": {
                        "S": "record-1"
                    },
                    "val": {
                        "B": "aGVsbG8="
                    },
                    "map": {
                        "M": {
                            "prop": {
                                "B": "aGVsbG8="
                            }
                        }
                    },
                    "list": {
                        "L": [
                            {
                                "S": "string"
                            },
                            {
                                "B": "aGVsbG8="
                            }
                        ]
                    },
                    "bufferSet": {
                        "BS": [
                            "aGVsbG8="
                        ]
                    }
                },
                "SizeBytes":26,
                "StreamViewType":"NEW_AND_OLD_IMAGES",
                "SequenceNumber":"111",
                "Keys":{
                    "id": {
                        "S": "record-1"
                    }
                }
            },
            "eventID":"1",
            "eventSourceARN":"arn:aws:dynamodb:us-east-1:123456789012:table/fake",
            "awsRegion":"us-east-1"
        }
    ]
}

For each event that is passed to it, the function does the following (a rough sketch of this flow follows the list):

  1. Calculate the key (i.e. the filename) to be used for this event. If PlainTextKeyAsFilename is enabled, it will use the format “HASH | RANGE”; otherwise it will calculate an MD5 hash of the keys. MD5 is used to minimise hotspots in S3.
  2. Find the table name of the source event.
  3. If MultiTenancyColumn is enabled, find the value of the MultiTenancyColumn attribute and use it as part of the prefix in S3. For example, it can be used to separate client data in S3.
  4. Figure out what sort of event this is (PUT or REMOVE).
  5. Build up a request for S3 using the above information and send it over. The body of the request will be the NewImage property inside the event if it is a PUT; otherwise it will be empty.
  6. The prefix used in S3 will be: [process.env.BackupPrefix]/[TableName from 2]/[MultiTenancyId if enabled from 3]/[Key from 1]
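
To make that flow concrete, here is a minimal sketch of what such a handler could look like in Node.js. This is not the code from the repository linked above: the helper names are made up, error handling and batching details are glossed over, and the exact treatment of a REMOVE (an S3 delete marker versus an empty object) may differ from the real implementation.

const crypto = require('crypto');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

// Step 1: key (filename) for the event, either the plain "HASH | RANGE" text
// or an MD5 hash of it. (The real code orders the hash key before the range
// key; plain object key order is assumed here for brevity.)
function buildKey(keys) {
    const plain = Object.keys(keys)
        .map((name) => Object.values(keys[name])[0])
        .join(' | ');
    return process.env.PlainTextKeyAsFilename === 'true'
        ? plain
        : crypto.createHash('md5').update(plain).digest('hex');
}

// Step 2: the table name can be read from the event source ARN, e.g.
// arn:aws:dynamodb:us-east-1:123456789012:table/fake
function tableName(record) {
    return record.eventSourceARN.split('/')[1];
}

exports.handler = (event, context, callback) => {
    const requests = event.Records.map((record) => {
        const table = tableName(record);
        const key = buildKey(record.dynamodb.Keys);

        // Step 3: optional multi-tenancy prefix pulled from the configured attribute.
        const image = record.dynamodb.NewImage || record.dynamodb.OldImage || {};
        const tenantAttr = process.env.MultiTenancyColumn;
        const tenant = tenantAttr && image[tenantAttr]
            ? Object.values(image[tenantAttr])[0] + '/'
            : '';

        // Step 6: [BackupPrefix]/[TableName]/[MultiTenancyId if enabled]/[Key]
        const s3Key = `${process.env.BackupPrefix}/${table}/${tenant}${key}`;

        // Steps 4 and 5: INSERT/MODIFY become a PUT of NewImage; REMOVE is
        // treated here as an S3 delete, which leaves a delete marker on a
        // versioned bucket.
        if (record.eventName === 'REMOVE') {
            return s3.deleteObject({ Bucket: process.env.BackupBucket, Key: s3Key }).promise();
        }
        return s3.putObject({
            Bucket: process.env.BackupBucket,
            Key: s3Key,
            Body: JSON.stringify(record.dynamodb.NewImage),
        }).promise();
    });

    // Any failure fails the whole batch, so Lambda will retry these stream records.
    Promise.all(requests)
        .then(() => callback(null, `Processed ${event.Records.length} records`))
        .catch(callback);
};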

The following parameters can be passed to the function as environment variables:

  1. BackupPrefix – Prefix used for all backups
  2. BackupBucket* – Location of the versioned S3 bucket
  3. PlainTextKeyAsFilename – Whether to use the plain text key as the filename (beware of the hotspots this can create)
  4. MultiTenancyColumn – The attribute name which holds the multi-tenancy identifier

* required
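
As an illustration, the function could read and validate this configuration once, at module load time. The variable names follow the list above; the validation itself is an assumption, not necessarily what the repository code does.

// Read the configuration once, when the Lambda container starts.
const config = {
    backupPrefix: process.env.BackupPrefix || '',
    backupBucket: process.env.BackupBucket,               // required
    plainTextKeyAsFilename: process.env.PlainTextKeyAsFilename === 'true',
    multiTenancyColumn: process.env.MultiTenancyColumn,   // optional
};

// Fail fast on the one required setting rather than on the first S3 call.
if (!config.backupBucket) {
    throw new Error('The BackupBucket environment variable is required');
}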

What about the unhappy path?

If something goes wrong in the Lambda function, here’s what happens.

Let’s say, for example, that a PUT request to S3 times out.

Lambda is smart enough to keep retrying the same event in the DynamoDb stream until it succeeds.

This happens at one-minute intervals. CloudWatch monitors the function by default; when an error is detected, it is logged, which allows you to trigger alarms and further actions using SNS (e.g. email notifications to the team).
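
For example, an alarm along these lines could be created on the function’s Errors metric so that SNS fans the alert out to the team. The function name, topic ARN, and thresholds below are placeholders rather than part of our actual setup.

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

// Alarm whenever the backup function records one or more errors in a minute.
cloudwatch.putMetricAlarm({
    AlarmName: 'dynamodb-backup-lambda-errors',
    Namespace: 'AWS/Lambda',
    MetricName: 'Errors',
    Dimensions: [{ Name: 'FunctionName', Value: 'dynamodb-incremental-backup' }],
    Statistic: 'Sum',
    Period: 60,                 // one-minute windows, matching the retry cadence
    EvaluationPeriods: 1,
    Threshold: 1,
    ComparisonOperator: 'GreaterThanOrEqualToThreshold',
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:backup-alerts'],
}).promise()
    .then(() => console.log('Alarm created'))
    .catch(console.error);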

DynamoDb Streams: in a little more detail

Events in a DynamoDb Stream are distributed between shards.

Make up of a DynamoDB Stream – Shards

Similar to partitions for DynamoDB tables, the stream data is sharded. Shards can become quite complex because they can split and merge, but let’s not go into that here.

Within a shard, events have explicit sequence numbers (in other words, events are ordered within a shard), so if an event times out, processing cannot move on to the next event. Lambda will retry that event until it is successful: it retries immediately, then backs off to retrying every minute.

If the Lambda function is configured correctly, all events should be stored in the S3 bucket successfully. The only misconfigurations you can have are:

  1. Incorrect Bucket
  2. Permissions
  3. An incorrect multi-tenancy column specified (since this attribute is not part of the key and isn’t required to exist on every item, a misconfiguration here can go unnoticed, which makes it dangerous)

In our test runs, we have seen an S3 PUT timeout on only one occasion, and it fixed itself with a retry.

S3

As mentioned above, the S3 bucket should have versioning enabled if you want incremental backups. This allows you to roll back to any point in time.

PUT operation on S3 Versioned Bucket


DELETE on S3 Versioned Bucket

What we’ve created here is basically version control for our DynamoDb data: essentially an immutable store in S3 that lets us keep every piece of data that was ever in DynamoDb, very cost effectively.
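
To make that concrete, here is a rough sketch of how the versions of a single backed-up item could be inspected to find its state at a point in time, using the LastModified timestamp of each version. The bucket name, prefix, and date are placeholders, and the actual restore tooling is the subject of the next post.

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

const pointInTime = new Date('2016-06-01T00:00:00Z');

s3.listObjectVersions({
    Bucket: 'my-backup-bucket',
    Prefix: 'backups/fake/record-1',
}).promise()
    .then((result) => {
        // Every PUT and DELETE for this key is a version (or delete marker)
        // with a LastModified timestamp; pick the newest version at or
        // before the desired point in time.
        const candidate = result.Versions
            .filter((version) => version.LastModified <= pointInTime)
            .sort((a, b) => b.LastModified - a.LastModified)[0];
        console.log('Version to restore:', candidate);
    })
    .catch(console.error);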

NOTE: We found this to be cost effective, but I strongly advise everyone to double-check for their own scenario.

In the next post, we’ll delve into the restoring of this data.

2 thoughts on “DynamoDb Incremental Backups – Part Three”

  1. It seems like you are using the LastModified time of the S3-stored object version to find which version corresponds to a given PointInTime. But this value differs from the actual time the DB record was modified by the replication lag, which may be quite large in the case of Lambda failures and retries. Furthermore, if there is more than one DynamoDb stream shard, the shards may be processed by separate Lambdas concurrently, making it possible for two modifications made almost simultaneously in the DB to be backed up at different times, resulting in different LastModified values for the versions.
    Have you somehow addressed these issues?

    1. Hi Daniil,

      Thanks for the comment – I think you’ve hit the nail on the head with your concerns.
      To address your second concern first: two modifications made in the DB almost simultaneously are not guaranteed to be pushed to S3 in the same order, unless they are for the same item. The ordering of each item’s updates will be respected with Streams + Lambda. The table activity for one item will always be stored in one shard, and this is respected across a split/merge of shards as well.
      On your first question: the process of pushing the data from the DynamoDb Stream to S3 via Lambda will have a certain delay. From my experience, this delay varies between 100 ms and 1 second within the same region. In my scenario, running this in production for around 9 months, I’ve had one failure, very early on, which was immediately solved by an auto-retry by Lambda.
      Keep in mind that, while rare, it can happen: the restore tool’s reliance on S3’s LastModified is not ideal, but it is good enough for my scenario. You will need to evaluate it for your context.

      Abhaya
