DynamoDB: Monitoring Capacity and Throttling
When we create a table in DynamoDB, we provision capacity for the table, which defines the read and write throughput the table can sustain. When this capacity is exceeded, DynamoDB will throttle read and write requests.
One of the key challenges with DynamoDB is forecasting capacity units for tables, and AWS has attempted to automate this by introducing the AutoScaling feature.
AutoScaling has been written about at length (so I won’t cover it here); Yan Cui (aka burningmonk) has a great article on it in this blog post.
Essentially, DynamoDB’s AutoScaling tries to assist with capacity management by automatically scaling our RCUs and WCUs when certain triggers are hit. Unfortunately, it requires at least 5–15 minutes to trigger and provision capacity, so it is quite possible for applications and users to be throttled during peak periods.
Metrics to monitor
Firstly, the obvious metrics we should be monitoring:
- ProvisionedReadCapacityUnits: the number of provisioned read capacity units for a table or a global secondary index. This metric is updated every 5 minutes.
- ProvisionedWriteCapacityUnits: the number of provisioned write capacity units for a table or a global secondary index. This metric is updated every 5 minutes.
- ConsumedReadCapacityUnits: the number of read capacity units consumed over a specified time period, for a table or global secondary index. This metric is updated every minute.
- ConsumedWriteCapacityUnits: the number of write capacity units consumed over a specified time period, for a table or global secondary index. This metric is updated every minute.
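These metrics can be pulled from CloudWatch. Here is a minimal sketch using boto3’s get_metric_statistics; the table name my-table is a placeholder, and the helper simply builds the query parameters (SUM for consumed metrics, so the arithmetic later in this post applies directly; AVERAGE for provisioned metrics).

```python
from datetime import datetime, timedelta, timezone

def capacity_metric_query(table_name, metric_name, minutes=60, period=60):
    """Build GetMetricStatistics parameters for a DynamoDB table metric.

    Consumed* metrics are summed per period; Provisioned* metrics are
    effectively constant within a period, so Average is more appropriate.
    """
    now = datetime.now(timezone.utc)
    statistic = "Average" if metric_name.startswith("Provisioned") else "Sum"
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": period,
        "Statistics": [statistic],
    }

if __name__ == "__main__":
    import boto3  # assumes credentials are configured in the environment
    cloudwatch = boto3.client("cloudwatch")
    query = capacity_metric_query("my-table", "ConsumedWriteCapacityUnits")
    for point in cloudwatch.get_metric_statistics(**query)["Datapoints"]:
        print(point["Timestamp"], point.get("Sum"))
```

To monitor a GSI instead of the base table, add a GlobalSecondaryIndexName dimension alongside TableName.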
Most users watch the Consumed vs Provisioned capacity, similar to this:
Don’t forget throttling
Other metrics you should monitor are throttle events. The reason it is good to watch throttling events is that there are four layers which make potential throttling hard to see:
- Partitions
- In reality, DynamoDB divides the capacity of a table (in most cases, equally) across a number of partitions. There are many cases where a hot partition can be throttled even though you are well below the provisioned capacity at the table level.
- AWS SDKs
- The AWS SDKs handle transient errors for you. Retries are performed seamlessly, so at times your code isn’t even notified of throttling, as the SDK takes care of it. This is great, but it can be very useful to know when it happens.
- Burst Capacity
- When you are not fully utilizing a partition’s throughput, DynamoDB retains a portion of your unused capacity for later bursts of throughput usage. DynamoDB currently retains up to five minutes of unused read and write capacity. During an occasional burst of read or write activity, these extra capacity units can be consumed. This means you may not be throttled, even though you exceed your provisioned capacity.
- Global Secondary Indexes
- A GSI is written to asynchronously, via an internal queue. As writes are performed on the base table, the events are added to a queue for the GSIs. If that queue starts building up (in other words, the GSI starts falling behind), it can throttle writes to the base table as well.
The metrics you should also monitor closely:
- ThrottledRequests: the number of requests to DynamoDB that exceed the provisioned throughput limits on a table or index.
- ReadThrottleEvents: the number of read events that exceed the provisioned read capacity units for a table or a global secondary index.
- WriteThrottleEvents: the number of write events that exceed the provisioned write capacity units for a table or a global secondary index.
Ideally, these metrics should be at 0. Anything more than zero should get attention.
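The zero-tolerance rule above is easy to encode. A minimal sketch: pure-Python helpers evaluate the datapoints CloudWatch returns, and the boto3 call (with the placeholder table name my-table) fetches them.

```python
from datetime import datetime, timedelta, timezone

def total_throttled(datapoints):
    """Sum the SUM-statistic datapoints of a throttle metric."""
    return sum(p.get("Sum", 0) for p in datapoints)

def needs_attention(datapoints):
    """Anything above zero throttled requests deserves a look."""
    return total_throttled(datapoints) > 0

if __name__ == "__main__":
    import boto3  # assumes credentials are configured in the environment
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/DynamoDB",
        MetricName="ThrottledRequests",
        Dimensions=[{"Name": "TableName", "Value": "my-table"}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Sum"],
    )
    if needs_attention(resp["Datapoints"]):
        print("Throttling detected:", total_throttled(resp["Datapoints"]))
```

The same check works for ReadThrottleEvents and WriteThrottleEvents by swapping the MetricName.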
Whether you use simple CloudWatch alarms on your dashboard or SNS emails, I’ll leave that to you. Creating effective alarms for your capacity is critical.
Let’s take a simple example of a table with 10 WCUs. What triggers would we set in CloudWatch alarms for DynamoDB capacity?
Using the SUM statistic on the ConsumedWriteCapacityUnits metric allows you to calculate the total number of capacity units used over a set period of time.
For example, if we have provisioned 10 WCUs, and we want to trigger an alarm when 80% of the provisioned capacity is used within 1 minute:
10 writes per second * 60 = 600 writes per minute. 80% of 600 = 480 writes in a minute is the trigger point.
Additionally, we could change this to a 5-minute check.
600 writes per minute * 5 minutes = 3000 writes every 5 minutes. 80% of 3000 = 2400 writes in 5 minutes would be the trigger point.
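The arithmetic above generalizes to a small helper, and the resulting threshold can feed a CloudWatch alarm. A minimal sketch, assuming boto3 and a hypothetical table and alarm name:

```python
def alarm_threshold(wcu, utilization=0.8, period_seconds=60):
    """Writes-per-period trigger point: WCU * seconds in period * target utilization."""
    return wcu * period_seconds * utilization

# 10 WCUs at 80%: 480 writes per minute, 2400 writes per 5 minutes

if __name__ == "__main__":
    import boto3  # assumes credentials are configured in the environment
    boto3.client("cloudwatch").put_metric_alarm(
        AlarmName="my-table-wcu-80pct",  # hypothetical alarm name
        Namespace="AWS/DynamoDB",
        MetricName="ConsumedWriteCapacityUnits",
        Dimensions=[{"Name": "TableName", "Value": "my-table"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=alarm_threshold(10, 0.8, 300),
        ComparisonOperator="GreaterThanOrEqualToThreshold",
    )
```

Attaching an SNS topic via the AlarmActions parameter turns this into the email notification mentioned earlier.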
As mentioned earlier, I keep throttling alarms simple: anything above 0 on the ThrottledRequests metric requires my attention.
Keep in mind, we can monitor our table and GSI capacity in a similar fashion.
This blog post focuses only on capacity management. There are other very useful metrics, which I will cover in a follow-up post.