Database performance at any scale

How Amazon DynamoDB is designed to handle anything you can throw at it

Sponsored Feature The web changed more than just how we consume data; it altered how we process and serve it. The design of legacy commercial database engines that supported us in the mid-90s is showing signs of wear. And the key question for companies now is how best to scale up data processing to serve transaction volumes that regularly and rapidly fluctuate.

Amazon Web Services (AWS) released DynamoDB in 2012 to handle this problem. Cloud-based, fully managed database services removed the burden of installing, patching and managing database instances, but they still require some level of capacity planning to support peak workloads. Planning and provisioning for these peaks is complex, and it results in customers paying for allocated resources that are not always needed. Amazon DynamoDB was one of the first database services designed as a serverless database, which removes both the need for peak capacity planning and the cost of paying for allocated resources that are not always used.

A different database architecture

To deal with increasing demand for internet-scale volumes, AWS decided to offer a different architecture. First and foremost, DynamoDB was designed as a serverless database that requires zero overhead from an administrative perspective - customers simply create tables to use the database. Secondly, the serverless architecture allows them to scale in milliseconds when they need capacity on demand and only pay for the resources they consume by scaling to zero when resources are not needed. Lastly, AWS turned to NoSQL concepts when creating DynamoDB and used key-value stores, which differ from the relational model.

"Initial customer research showed that for some key relational databases, many of the API calls were for single-key look-ups on a primary key," explains Joe Idziorek, DynamoDB product lead at AWS. "We decided that if we designed for that access pattern, we could do things an order of magnitude better than existing technologies."

Both relational databases and DynamoDB store data in tables with primary keys, but the similarity ends there. The storage structure in a relational system, known as the schema, is rigid. Each table stores multiple records in their own row. Each record holds a set of predefined fields beginning with a primary key uniquely identifying that record.

A database of movie actors, for example, might include their name, date of birth, salary, and where they live. If you do not know where an actor lives, you will need to store a null value in that field.

But what if you decide, after entering 100 actors, that you want to store actor 101's pet's name? Then you would have to run a migration to update the schema, adding the field with a null value for all the actors that do not have that information. That is cumbersome and time consuming, and is not a decision that any organization would take lightly.

DynamoDB breaks the paradigm of storing records and fields. Instead, each movie actor would be an item in the table, containing a list of attributes. Each item has its own combination of attributes. The "pet's name" attribute would only appear in actor 101's attribute list. By using a key-value approach, DynamoDB is very flexible. You do not need to define a schema up front. That matters for modern applications, where DevOps teams update cloud-based apps every few days, or even faster.
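To make the item-and-attribute idea concrete, here is a minimal sketch that uses plain Python dictionaries as a stand-in for a table. It is not the DynamoDB API; the attribute names (actor_id, pet_name) and the second actor's details are invented purely for illustration.

```python
# Illustrative only: plain Python dicts standing in for DynamoDB items.
# The attribute names and the data for actor 101 are made up for this example.
actors = {}  # the "table", keyed by primary key


def put_item(table, item, key="actor_id"):
    """Store an item under its primary key; nothing beyond the key is enforced."""
    if key not in item:
        raise ValueError(f"item is missing primary key {key!r}")
    table[item[key]] = item


put_item(actors, {"actor_id": 50, "name": "Clint Eastwood"})
put_item(actors, {"actor_id": 101, "name": "Example Actor", "pet_name": "Rex"})

# Only item 101 carries pet_name; item 50 needs no null placeholder
# and no migration was required to introduce the new attribute.
has_pet = [actor_id for actor_id, item in actors.items() if "pet_name" in item]
```

The only thing every item shares is the primary key. Everything else is optional on a per-item basis, which is what lets an application add attributes without touching existing items.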

"Customers get really creative schema modelling within DynamoDB that matches their application and gives them the flexibility to quickly iterate," Idziorek says. "The ability to do that without having to change a table schema is one of the productivity features of a NoSQL database like DynamoDB that developers really like."

DynamoDB differs from a relational database in another important way: everything is stored in one table. Say you want to store each movie that an actor has been in. In a relational system, you would define a separate table and store each movie's information as a record with its own primary key. Then you would create a JOIN table.

If actor 50 (Clint Eastwood) was in movie 12 (Dirty Harry), then you would store both their primary keys (50 and 12) alongside each other in the JOIN table. You would do the same for Harry Guardino, John Vernon, and John Mitchum, who were also in that movie. That means you only have to store the record for Dirty Harry once. It is efficient from a storage perspective, and it also means that if you need to change the info for Dirty Harry, you'd only need to do so in the movie table.
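The relational layout described above can be sketched with Python's built-in sqlite3 module. The table and column names are assumptions made for this example, as is the ID 51 for Harry Guardino; only actor 50 and movie 12 come from the text.

```python
import sqlite3

# In-memory database; table names, column names and actor ID 51 are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE actor_movie (actor_id INTEGER, movie_id INTEGER);
    INSERT INTO actor VALUES (50, 'Clint Eastwood'), (51, 'Harry Guardino');
    INSERT INTO movie VALUES (12, 'Dirty Harry');
    INSERT INTO actor_movie VALUES (50, 12), (51, 12);
""")

# Finding the cast means joining three tables at query time.
cast = [row[0] for row in conn.execute("""
    SELECT actor.name
    FROM actor
    JOIN actor_movie ON actor_movie.actor_id = actor.id
    JOIN movie ON movie.id = actor_movie.movie_id
    WHERE movie.title = 'Dirty Harry'
    ORDER BY actor.id
""")]
```

The movie row exists exactly once, so an update touches one place, but every cast lookup pays for the joins.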

Solving for scalability

DynamoDB works differently. It would store the movie information as a nested attribute in the actor's item. That means Eastwood, Guardino, Vernon, and Mitchum would each contain Dirty Harry's full information, including perhaps year of creation, director, genre, and budget.
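A rough sketch of that denormalized layout, again using plain Python dictionaries rather than the DynamoDB API, shows both sides of the trade-off. The field names and the ID 51 for Guardino are assumptions for illustration.

```python
import copy

# One copy of the movie data per actor item (field names are illustrative).
dirty_harry = {"title": "Dirty Harry", "year": 1971, "director": "Don Siegel"}

actors = {
    50: {"name": "Clint Eastwood", "movies": [copy.deepcopy(dirty_harry)]},
    51: {"name": "Harry Guardino", "movies": [copy.deepcopy(dirty_harry)]},
}


def movies_for(actor_id):
    """A single key lookup returns the actor with all nested movie data; no join."""
    return [movie["title"] for movie in actors[actor_id]["movies"]]


def update_movie(title, **changes):
    """The cost of duplication: a change must be applied to every copy."""
    touched = 0
    for item in actors.values():
        for movie in item["movies"]:
            if movie["title"] == title:
                movie.update(changes)
                touched += 1
    return touched


titles = movies_for(50)
copies_updated = update_movie("Dirty Harry", genre="thriller")
```

Reads become a single lookup by key, while writes to shared information fan out to every item that holds a copy.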

"That is inefficient from a storage perspective, but there is a trade-off," says Idziorek. "It is worthwhile when you want to find all the actors in Dirty Harry, or all the movies Clint Eastwood has been in, but you'll need to run an SQL JOIN query. That's computationally expensive and slow, especially when a studio reboots Dirty Harry and everyone starts querying the database at once."

When the relational database was invented, storage was expensive; now it is cheap. "It comes down to what you're optimizing for because storage is no longer the bounded resource," Idziorek says. "In many cases storage costs are not the main concern with the duplication of information. Instead what we're trying to do is optimize for query performance."

Partitioning for performance

Optimizing for performance enables DynamoDB to deliver the same performance for the millionth customer as it did for the first, Idziorek explains. It also means that each customer gets the same throughput no matter how many queries they throw at the database. That is because AWS designed DynamoDB from the beginning to partition workloads.

"Relational database architectures retrieve data from storage and process it in a single instance in memory. They were not built to cut queries into multiple parts and work on each at the same time. As a database built for the cloud, DynamoDB changed that," he adds. "We designed DynamoDB with horizontal scaling which means the database isn't constrained by a workload that can only fit on one machine, but instead to partition them across many compute nodes."

The partitioning is transparent to the user, and the database handles it using a partition key. The key serves as the input to a mathematical hashing function that in turn determines which physical storage location will hold the data.
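The general technique, hashing a key to choose a partition, can be sketched in a few lines of Python. DynamoDB's actual hash function and partition layout are internal to the service, so the MD5 digest and the fixed partition count below are stand-ins chosen only for illustration.

```python
import hashlib

NUM_PARTITIONS = 4  # assumed for illustration; the real service manages this itself


def partition_for(partition_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key to a partition index via a hash (MD5 is a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and fold it into the partition range.
    return int.from_bytes(digest[:8], "big") % num_partitions


# The same key always lands on the same partition,
# while different keys spread across the available partitions.
placements = {key: partition_for(key) for key in ("actor#50", "actor#51", "actor#101")}
```

Because the mapping depends only on the key, any node can work out where an item lives without consulting a central index.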

DynamoDB uses this mechanism to store groups of items close together physically while also spreading large data sets out across lots of partitions. That means individual compute nodes can each work on different partitions in parallel to complete a query more quickly. The database also uses a second, optional sort key, which developers can combine with the partition key to make a composite key. The sort key makes it faster to sort items in a partition, further increasing performance.
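As a simplified model of a composite key, the sketch below groups items by partition key and keeps each group ordered by sort key, so a range query becomes a binary search plus a slice. The Table class and the key formats are hypothetical and are not how DynamoDB is implemented.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy table: items grouped by partition key, ordered by sort key within a group."""

    def __init__(self):
        # partition key -> list of (sort_key, item), kept sorted by sort key
        self._partitions = defaultdict(list)

    def put(self, pk, sk, item):
        rows = self._partitions[pk]
        keys = [key for key, _ in rows]  # rebuilt on each call for simplicity
        i = bisect.bisect_left(keys, sk)
        if i < len(rows) and rows[i][0] == sk:
            rows[i] = (sk, item)  # same composite key: overwrite
        else:
            rows.insert(i, (sk, item))

    def query(self, pk, sk_from, sk_to):
        """Return items in one partition whose sort key is in [sk_from, sk_to]."""
        rows = self._partitions.get(pk, [])
        keys = [key for key, _ in rows]
        lo = bisect.bisect_left(keys, sk_from)
        hi = bisect.bisect_right(keys, sk_to)
        return [item for _, item in rows[lo:hi]]


table = Table()
table.put("actor#50", 1992, "Unforgiven")
table.put("actor#50", 1971, "Dirty Harry")
table.put("actor#50", 1964, "A Fistful of Dollars")

early_films = table.query("actor#50", 1960, 1980)
```

Here the partition key identifies the actor and the sort key is the release year, so "films between 1960 and 1980" is answered from a single ordered partition.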

"One partition could serve an entire key range," explains Idziorek, "but partitions split automatically as the workload grows. AWS matches the partition count to handle the workload and customers can set their own threshold rules to limit their expenditure. They can use an on-demand mode that charges per request, removing any limits for high-value, critical workloads that might be unpredictable, such as retail. For more predictable workloads where they can absorb some attrition at high volume, they can opt for provisioned services with a compute ceiling."

Snap, creator of Snapchat, migrated to DynamoDB in 2018. The company pays for discounted capacity in advance, reducing its capacity while flexing compute power automatically as needed. It saved money and time, lopping more than 20 percent off the median latency of sending a Snapchat message.

AWS increased DynamoDB's performance by globally scaling it across multiple regions, of which there are 24 at this time. Administrators can set up a table in each region and write to each one using a local application, which is also referred to as "active-active" replication. Applications get the low-latency read capability in their local region along with fast writes. The managed database engine takes care of cross-regional reconciliation after the write on its own schedule using a 'last writer wins' algorithm.
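The idea behind "last writer wins" can be illustrated with a short Python sketch. How the managed service timestamps and reconciles writes is internal to AWS; the record shape, the integer timestamps and the tie-break on region name below are all assumptions made for this example.

```python
def last_writer_wins(versions):
    """Pick the version with the latest timestamp.

    Ties are broken by region name, which is an assumption of this sketch
    so that the result is deterministic.
    """
    return max(versions, key=lambda v: (v["timestamp"], v["region"]))


# Two regions accepted a write to the same item at nearly the same time.
versions = [
    {"region": "us-east-1", "timestamp": 1000, "value": {"status": "PENDING"}},
    {"region": "eu-west-1", "timestamp": 1005, "value": {"status": "SHIPPED"}},
]

winner = last_writer_wins(versions)
```

Every region that applies the same rule to the same set of versions converges on the same value, at the price of silently discarding the earlier write.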

Serverless operation

DynamoDB was the first serverless database service developed by AWS, which has since expanded the portfolio to include Amazon Aurora, Amazon Keyspaces, Amazon Quantum Ledger Database (QLDB) and Amazon Timestream. As previously mentioned, one of the key attractions of a serverless database is that it can reduce operational costs by only invoking the computing capacity that it needs on demand, enabling customers to pay only for the computing power they use rather than feeding coins into an idle virtual server.

Developers can also use AWS Lambda serverless functionality in combination with DynamoDB Streams to trigger events on demand. Streams is an optional service that captures modifications to DynamoDB tables in near-real time, time stamping them and saving them for 24 hours. Developers can use events captured in Streams to initiate Lambda events. You might use this, for example, to initiate an email notification when a nested order attribute in a customer item is updated to indicate that the order has shipped.
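A Lambda function for that scenario might look roughly like the sketch below. It assumes the stream is configured to include both old and new item images, and it simplifies the example to a top-level order_status attribute with a customer_id key; both names are hypothetical. Instead of sending an email, it simply returns the affected customer IDs.

```python
def handler(event, context):
    """Return the customer ids whose order status changed to SHIPPED in this batch.

    Assumes stream records carry OldImage and NewImage, and that items have a
    string key "customer_id" and a string attribute "order_status" (hypothetical).
    """
    shipped = []
    for record in event.get("Records", []):
        if record.get("eventName") != "MODIFY":
            continue  # ignore inserts and deletes
        change = record.get("dynamodb", {})
        old = change.get("OldImage", {}).get("order_status", {}).get("S")
        new = change.get("NewImage", {}).get("order_status", {}).get("S")
        if new == "SHIPPED" and old != "SHIPPED":
            # A real function would send the notification email at this point.
            shipped.append(change["Keys"]["customer_id"]["S"])
    return shipped


# A hand-built event in the shape of a DynamoDB Streams batch, for local testing.
sample_event = {
    "Records": [
        {
            "eventName": "MODIFY",
            "dynamodb": {
                "Keys": {"customer_id": {"S": "c-1"}},
                "OldImage": {"order_status": {"S": "PACKED"}},
                "NewImage": {"order_status": {"S": "SHIPPED"}},
            },
        },
        {
            "eventName": "INSERT",
            "dynamodb": {
                "Keys": {"customer_id": {"S": "c-2"}},
                "NewImage": {"order_status": {"S": "SHIPPED"}},
            },
        },
    ]
}

notified = handler(sample_event, None)
```

Comparing the old and new images means the function reacts only to the transition into the shipped state, not to every later edit of an already shipped order.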

As the global pace of digitization picks up, transaction volumes will continue to rise. In addition, the application modernization movement further magnifies the performance gap between traditional relational architectures and NoSQL. Consequently, Idziorek expects DynamoDB to become even more popular among companies racing to keep up.

"That is what's really unique about DynamoDB," he concludes. "It enables customers to future-proof their applications. They don't have to rearchitect them, no matter how successful they become."

Sponsored by AWS.
