How AWS created Aurora, a database built for the cloud

Rewriting the rules with logs

Sponsored

Relational databases power most of the world’s applications, but the relational model is also decades old, and the strain is beginning to show. Traditional RDBMS engines running on-premises in customers’ data centers increasingly struggle to scale to the demands of modern applications.

The increased pressure on traditional relational databases comes from a fundamental change in the applications we’re building. Today’s software has to serve an order of magnitude more users than it did when relational databases first appeared, while maintaining fast performance for a demanding audience that’s often distributed worldwide. It must support far greater numbers of connections from a user base that hammers the database with queries around the clock. Those queries can also spike based on factors ranging from seasonal fluctuations to short-term business events, not all of which are predictable.

In addition, the data explosion shows no sign of stopping. The tech analyst firm IDC estimates the amount of digital data worldwide will double in four years, growing at a compound annual growth rate (CAGR) of 22.9 percent through 2025. As Jeffrey Hojlo, an IDC program director, writes, “the deluge of data that every company is contending with, from connected products, assets, processes, and customers, is difficult to leverage without the applications in place that enable collation of information and rapid decision making.”

These drivers compelled Amazon to reinvent the RDBMS for a new generation of web-scale applications in the form of Aurora, a relational database built for the cloud. It’s built on open source, extending the functionality of MySQL and PostgreSQL to meet the demands of modern applications.

"Our customers kept asking for a modern database," explains Aditya Samant, senior database specialist solutions architect at AWS, adding that the old-guard databases came with license management and other strings attached. Customers wanted the ease and simplicity of open source with the scalability and performance of commercial engines. "That's how Aurora was born. It's a database that we built for the cloud."

The benefits of a cloud-native RDBMS

Aurora shares the same commercial benefits as Amazon's other managed services, in that it doesn't burden customers with swingeing vendor license fees. Instead, you pay as you go, based on the amount of compute and storage used. You don't even pay for backups as long as those files don't exceed the size of the original database.

The real benefit of the system lies in its cloud-native beginnings, though. Unlike other relational databases such as Oracle and SQL Server, Aurora was built for the cloud from the ground up. That allowed Amazon to craft code that took advantage of the cloud’s scalability and distributed operation. It enables customers to add more compute and storage capacity at will, and to spread instances around the world so that global users get consistent performance.

That cloud-native design shows up most clearly in the database's approach to the storage layer. Traditional RDBMS implementations have tried to reduce I/O bottlenecks and improve performance using hardware developments such as local NVMe drives. That improves performance, but it still leaves the database dependent on local storage hardware that could fail.

Amazon wanted to decouple compute and storage to improve reliability and reduce management overheads while also improving performance. To do that, it had to rethink the storage subsystem from the ground up.

Rewriting the rules with logs

Aurora's design teams focused on logs. Logs have always been part of relational databases, but only as a resilience measure. Relational systems work with data in pages, which they flush to disk periodically. The logs record the changes to that data so that the database can reconstruct a page if it's lost before it's flushed.

Using logs as a backstop while flushing pages to disk provides database durability but it's also hard on network and I/O performance. Writing a single logical database transaction as part of a page flush can involve multiple physical disk writes.

The AWS team flipped the script when designing Aurora by turning the logs into the primary storage mechanism. The database gathers logs into 4KB-sized pages, compared to Postgres's 8KB and MySQL's 16KB. These log pages are the only things that cross the network boundary; the data writes stay in memory. The AWS storage layer built to support Aurora reconstructs the data writes from the logs at its leisure.
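To make the idea concrete, here is a minimal Python sketch of a storage node that accepts only log records and builds data pages from them later. The names (`LogRecord`, `StorageNode`, `apply_pending`) are hypothetical and chosen for illustration; this is a toy model of the approach described above, not Aurora's actual implementation or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogRecord:
    lsn: int        # log sequence number, increasing with each change
    page_id: int    # page the change belongs to
    key: str
    value: str


@dataclass
class StorageNode:
    """Hypothetical storage node: receives log records, builds pages later."""
    pending: list = field(default_factory=list)   # records not yet applied
    pages: dict = field(default_factory=dict)     # page_id -> {key: value}
    applied_lsn: int = 0

    def append(self, record: LogRecord) -> None:
        # The log record is the only thing the database sends over the network.
        self.pending.append(record)

    def apply_pending(self) -> None:
        # Background work: replay records in LSN order to materialise pages.
        for rec in sorted(self.pending, key=lambda r: r.lsn):
            self.pages.setdefault(rec.page_id, {})[rec.key] = rec.value
            self.applied_lsn = max(self.applied_lsn, rec.lsn)
        self.pending.clear()

    def read_page(self, page_id: int) -> dict:
        # Bring the page up to date with every record received so far.
        self.apply_pending()
        return dict(self.pages.get(page_id, {}))


node = StorageNode()
node.append(LogRecord(lsn=1, page_id=7, key="balance", value="100"))
node.append(LogRecord(lsn=2, page_id=7, key="balance", value="120"))
print(node.read_page(7))  # {'balance': '120'}
```

The point of the sketch is that the writer never ships a page: it ships two small records, and the storage side does the page construction whenever it chooses to.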

This offers several benefits. If a database ever needs to recover corrupted data, it can do so in a fraction of the time because the storage layer has already done the heavy lifting, whereas a traditional RDBMS would have to replay the log files to catch up. It also lets customers offload storage-related administrative tasks such as backups and patches, so that the database can get on with serving queries.

That was a key benefit for Dow Jones, which moved a critical customer retention workload to Aurora. The news organization had a legacy system that had performance problems in spite of consuming $400,000 annually in DBA skills. Moving the workload to Aurora gave it 200 transactions per second and automated replication for disaster recovery, reducing its management costs dramatically while giving the company the performance it needed.

A cloud distributed database

The other thing a cloud-native architecture allowed the Aurora team to do is create a truly distributed relational system. AWS has 24 geographical regions, each with multiple availability zones. They contain dozens of data centers. Amazon has crafted the network links between these facilities to operate at single-digit millisecond latency. The company wanted Aurora to distribute its databases around these zones to make it more durable still, and so designed it to continue operating even if an entire zone goes down, and to recover quickly from larger outages.

Relational systems use replication to improve resilience. They either block transactions until the replicated writes all succeed (known as synchronous replication) or they let the replication complete at its own pace (known as asynchronous replication). Synchronous replication can hinder database performance if one of the replicated systems is slow to respond, while the asynchronous kind risks data loss if there's a failure before the replication completes.

Amazon combined the two using the concept of a quorum. Aurora writes its log files to six separate nodes across three availability zones in the AWS infrastructure, but only needs four of the writes to complete. That gives it the resilience it needs to keep running through a major failure. It can do this economically in part because it's only writing log files and offloading a lot of the data management to the storage layer.
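The arithmetic behind the quorum can be sketched in a few lines of Python. The six copies and the four-acknowledgement threshold come from the description above; the placement of exactly two copies in each of the three zones is an assumption made for this sketch, and the function names are hypothetical.

```python
REPLICAS = 6        # copies of each log write (from the article)
WRITE_QUORUM = 4    # acknowledgements needed to commit (from the article)
ZONES = 3           # availability zones (from the article)


def write_committed(acks: int) -> bool:
    """A write commits once enough storage nodes have acknowledged it."""
    return acks >= WRITE_QUORUM


def layout() -> dict:
    # Assumed placement: replica i lives in zone i % ZONES, two per zone.
    return {replica: replica % ZONES for replica in range(REPLICAS)}


def survives_zone_loss() -> bool:
    """Check that writes still commit after losing any single zone."""
    placement = layout()
    for lost_zone in range(ZONES):
        alive = [r for r, zone in placement.items() if zone != lost_zone]
        if not write_committed(len(alive)):
            return False
    return True


print(write_committed(4))      # True: four of six is enough
print(write_committed(3))      # False: the writer must keep waiting
print(survives_zone_loss())    # True: a lost zone still leaves four copies
```

Under these assumptions, losing a whole zone removes two copies and leaves exactly four, which is why a single-zone outage does not block writes, and why a slow node or two does not hold up a commit the way fully synchronous replication would.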

Nodes that didn't get the update can catch up later thanks to AWS's consensus protocol. This uses what it calls 'gossip', in which nodes check with each other to identify any holes in their data records.
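The article doesn't spell out how gossip works internally, so the following is only a simplified, single-process Python sketch of the general idea: each node works out which log sequence numbers it is missing and copies them from a peer that has them. The function names and the data layout are hypothetical.

```python
def missing_lsns(have: set, highest: int) -> set:
    """Return the log sequence numbers in 1..highest that a node lacks."""
    return set(range(1, highest + 1)) - have


def gossip_round(nodes: dict) -> None:
    """nodes maps node name -> {lsn: record}; each node fills holes from peers."""
    highest = max((max(log, default=0) for log in nodes.values()), default=0)
    for name, log in nodes.items():
        for lsn in sorted(missing_lsns(set(log), highest)):
            for peer, peer_log in nodes.items():
                if peer != name and lsn in peer_log:
                    log[lsn] = peer_log[lsn]   # copy the missing record
                    break


nodes = {
    "a": {1: "x", 2: "y", 3: "z"},
    "b": {1: "x", 3: "z"},        # missed record 2
    "c": {1: "x"},                # missed records 2 and 3
}
gossip_round(nodes)
print(nodes["b"])  # {1: 'x', 3: 'z', 2: 'y'}
print(nodes["c"])  # {1: 'x', 2: 'y', 3: 'z'}
```

A real system would exchange these messages over the network and in the background; the sketch only shows how comparing records between peers exposes and fills the gaps.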

AWS further protects Aurora against failures by dividing its data volumes into logical 10GB chunks, replicating them to create six physical copies, each spread across a large, distributed infrastructure. That allows it to repair any corrupted data quickly by replicating a 10GB data chunk across its high-speed internal network.
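As a rough illustration of why small chunks keep repairs cheap, the Python sketch below works out how many 10GB chunks a volume splits into and which chunk holds a given offset. The chunk size and copy count come from the text above; the function names and the 95GB example volume are made up for illustration.

```python
import math

SEGMENT_GB = 10   # logical chunk size (from the article)
COPIES = 6        # physical copies of each chunk (from the article)


def segment_count(volume_gb: float) -> int:
    """Number of logical chunks needed to hold the volume."""
    return math.ceil(volume_gb / SEGMENT_GB)


def physical_copies(volume_gb: float) -> int:
    """Total physical chunk copies stored for the volume."""
    return segment_count(volume_gb) * COPIES


def segment_for_offset(offset_gb: float) -> int:
    """Zero-based index of the chunk that contains a given offset."""
    return int(offset_gb // SEGMENT_GB)


print(segment_count(95))        # 10
print(physical_copies(95))      # 60
print(segment_for_offset(25))   # 2
```

In this model, damage at the 25GB mark touches only chunk 2, so a repair copies one 10GB chunk rather than the whole 95GB volume.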

This provides customers with highly available databases that can recover automatically from infrastructure failures due to natural or logical disasters, enabling their business operations to continue with minimal disruption. It’s something that would take a lot of heavy lifting for DBAs to organize using an on-premises database, but it happens invisibly in the AWS cloud, reducing the workload for DBAs and enabling them to concentrate on optimising database logic for more functionality.

Read replicas

Aurora also takes advantage of another performance enhancement in the underlying AWS infrastructure that makes more efficient use of read replicas. These read-only database instances help to improve database performance by read scaling, reducing the load on the primary Aurora instance. You can have up to 15 of them compared to MySQL's five, and they serve as automated failover targets with little impact on the primary.

Aurora Replicas consume the log files as they're sent to the quorum, comparing their contents to the records they already have in memory and updating their contents accordingly. If a read replica receives a request for a record that it doesn't have in memory, it grabs it from the same storage subsystem that the Aurora database writes to.
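The replica behaviour described above can be sketched as a small cache in Python: log records update only the pages the replica already holds in memory, and a miss falls through to the shared storage. The class and method names are hypothetical, and the model is deliberately simplified (one value per page, no ordering or concurrency concerns).

```python
class SharedStorage:
    """Stand-in for the storage subsystem shared by writer and replicas."""

    def __init__(self):
        self.pages = {}

    def apply(self, page_id, value):
        self.pages[page_id] = value

    def read(self, page_id):
        return self.pages[page_id]


class ReadReplica:
    def __init__(self, storage):
        self.storage = storage
        self.cache = {}   # pages currently held in memory

    def on_log_record(self, page_id, value):
        # Update the page only if it is already cached; otherwise ignore it.
        if page_id in self.cache:
            self.cache[page_id] = value

    def read(self, page_id):
        # Cache miss: fetch the page from the shared storage layer.
        if page_id not in self.cache:
            self.cache[page_id] = self.storage.read(page_id)
        return self.cache[page_id]


storage = SharedStorage()
storage.apply(1, "a")
storage.apply(2, "b")
replica = ReadReplica(storage)

print(replica.read(1))            # a   (fetched from storage, now cached)
for page_id, value in [(1, "a2"), (2, "b2")]:
    storage.apply(page_id, value)             # writer's change reaches storage
    replica.on_log_record(page_id, value)     # replica sees the same record
print(replica.read(1))            # a2  (cached page updated from the log)
print(replica.read(2))            # b2  (not cached, so read from storage)
```

Because the replica never writes its own copy of the data, adding one in this model costs only memory and a log stream, not another full set of storage.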

Customers can also configure Aurora Global Database to enable faster physical replication between Aurora clusters across different regions, not just availability zones. This gives you replication across up to five secondary regions.

Intuit’s commerce platform uses the Global Database option to offer low-latency, read-only access to data including pricing information across different regions. It achieves this using sub-second global replication, which also serves the company’s disaster recovery needs. Failover takes less than a minute.

Aurora's cloud-native capabilities enable it to deliver performance improvements over traditional RDBMS systems at scale. Amazon says that it processes over five times as many SELECT and UPDATE queries as MySQL when running the same benchmarks on the same hardware.

The Aurora team achieved these gains without sacrificing any traditional RDBMS must-haves, Samant says, explaining that Amazon added the innovations underneath the open source code.

"So, if you have written something in MySQL code, it's going to work for Aurora. If you're writing Postgres code it's going to work for Aurora," he says. "ACID contracts are 100 percent guaranteed, just like they are with your traditional relational databases."

All the hard work happens behind the scenes as part of AWS's managed database offering, shielding DBAs from the administrative heavy lifting. The other thing Aurora shields them from is provisioning issues, he says. "Decoupling the compute and the storage enables us to be elastic, scaling storage up and down as required," he explains. "When you create an Aurora cluster, we don't even ask you how much storage you need."

Evolving Aurora

AWS continues to evolve Aurora over time, making more use of native cloud features. One example is serverless functionality, which allows it to scale the compute layer along the same lines as storage. AWS announced support for Amazon Aurora Serverless v2 last year, which allows scaling compute resources without taking a database instance down.

The next generation serverless capability, which is available in public preview right now, eliminates the need to switch an instance to a different virtual machine if you need to increase your processor count to cope with a transaction spike. "We think most deployments will end up using Aurora Serverless v2," Samant predicts.

Looking back, it's easy to ask why early RDBMS developers didn't use the tricks in Aurora's toolbox. Using log files as a primary recording mechanism seems like a no-brainer because of the performance improvements. But then, the cloud infrastructure that makes this possible wasn't around back then. The underlying infrastructure to support this model would be expensive even in a modern data center, but the AWS cloud’s scalability makes it sustainable. It also brings some of the cloud's biggest benefits, such as scalability and high availability, to DBAs transparently and with little fuss.

Sponsored by AWS.
