All hail the return of the general-purpose database

Distributed SQL and cloud computing have made relational more relevant than ever, says MariaDB

Paid feature In a technology industry as complex as this one, it pays to keep things simple where you can. Databases are no exception. So why are cloud service providers offering multiple databases for different tasks when one used to be enough? This article looks at how traditional SQL relational database systems have evolved to serve as simpler general-purpose platforms for applications in the cloud.

First, let's address the fragmentation of database portfolios. Customer expectations have evolved to the point where the traditional RDBMS is inadequate for many, warns Andrew Oliver, senior director of product marketing for MariaDB Corporation.

"Modern applications are always on," he says. "For many companies, there's no room for an outage when the system must survive not just faults, but also the rollout of new software, services, and features."

That wasn't a problem until web-scale applications evolved. An online service that can't scale effectively will suffer from availability issues when it hits capacity. If your online storefront shows time-out errors instead of last-minute holiday gift options, that's a problem.

The limitations of RDBMS scaling

The traditional RDBMS wasn't built with this kind of massive scale in mind. Vendors tackled this problem by scaling their databases up. This typically meant housing them in big-iron servers with memory architectures carefully fitted to the software. To scale up you replace your server with an even bigger server.

Scaling up was an expensive, vendor-bound proposition. Over time, the industry evolved to increase capacity more readily by scaling out. First, it developed distributed systems. Then, the cloud made things simpler still. It promised elastic computing and storage capabilities to handle the kinds of volatile workload that came with large-scale web applications. It's more difficult to dynamically scale capacity in a RDBMS running on hardware you've already paid for.

Reliability is also an issue. For all its promises never to fail, the cloud sometimes does, render databases offline. Cloud service providers sell the idea of multiple availability zones to support redundancy, but the traditional client-server RDBMS wasn't built for that.

"In 2021, my database runs on someone else's server in someone else's data center connected to a lot of infrastructure, and I can neither control or predict its future configuration," says Oliver. "I need software that doesn't require that I trust the underlying infrastructure."

One way to support RDBMS scalability in the cloud is using read replicas, with one node for writes, and any number of nodes for reads. That enables applications to scale, at least for read queries.

However, the changing nature of today's high scale applications has created a need beyond read replicas. Whether you're updating your location in real time or giving your data unwittingly to an advertising network, there are many applications where writes outnumber or equal reads.

The move to polyglot persistence

Companies reacted to these challenges by moving away from relational models. This often took a long time. Amazon famously began moving away from its Oracle-based RDBMS after the database began straining under the weight of so many queries, but didn't turn off its last Oracle server until 2019. It developed databases based around key-value stores, columnar architectures, and document structures. NoSQL vendors like MongoDB and Couchbase have long offered relational alternatives to the market.

Their creators tailored these systems for applications with needs for resilience, scale, and flexibility. The underlying implication is that each database was the right tool for the right job, otherwise known in database circles as polyglot persistence.

The problem with this emerging fashion is that NoSQL databases often gave up features that were baked into the venerable general purpose relational model. One of them was support for ACID transactions.

ACID is the gold standard for consistency models. It ensures Atomicity (all operations in a transaction must succeed, or none do), Consistency (all transactions leave the database structurally sound), Isolation (transactions do not conflict), and Durability (all transactions are permanent).

Many NoSQL databases focus on BASE consistency, which is more forgiving. Basic Availability means that the database works most of the time. Soft state means that stores (such as replicas need not be write-consistent all the time), which relates to Eventual consistency (those stores can update lazily, perhaps at read time). While most NoSQL databases added some form of ACID transaction later, they still don't offer the same guarantee or level of performance for consistency that traditional RDBMSs provide.

Many applications need ACID's stricter standards, along with some other things that NoSQL databases often leave at the door, such as the ability to use joins for efficient data storage. These more demanding applications include many back-office functions such as finance, where you want to ensure that everyone's account is balanced all the time.

Companies often bridge the functional gap between these two databases by creating customer-facing NoSQL systems while letting the RDBMS reconcile the mission-critical or financial data at the back end. That creates some complex architectures, though, while increasing the maintenance and support overhead.

Multiple databases of different kinds mean two sets of developer skills, along with glue code to get each database talking to the others. The more of these systems you have in your infrastructure, the costlier the skills and glue become.

The rise of distributed SQL

How can companies avoid getting themselves into that sticky situation? "What we really need is a database that can handle scale, survive multiple faults across cloud availability zones, and provide real consistency," says Oliver. In short, a general-purpose platform that supports cloud-based models properly without abandoning the advantages of the RDBMS.

This is where distributed SQL comes in. This technology is effectively a hybrid of SQL and NoSQL technologies, providing the regular structure of SQL-based RDBMS while supporting the distributed operations that NoSQL systems are known for.

Distributed SQL partitions and replicates relational tables behind the scenes, enabling them to run on multiple read and write nodes in a scale-out environment. MariaDB Xpand, a distributed SQL database, handles this partitioning automatically using a hash of a shard key to create index partitions and assign them algorithmically to nodes.

The product rebalances partitions automatically to maintain performance across all the available nodes and avoid "hotspots." That rebalancing also supports the addition and removal (accidental or otherwise) of nodes, always keeping the data consistent and available.

To ensure adequate replication and resilience, each partition has a set number of copies so that if one node disappears, another other replica is always there. This gives customers at least two replicas of every partition of every table, each running on different availability zones.

Distributed SQL in MariaDB SkySQL

MariaDB SkySQL, the cloud-based version of the open-source database, added support for distributed SQL in May 2021. In addition to traditional row-based indexes, it provides columnar indexes, which make it possible to be more flexible in data lookups while using less space. Traditional row-based indexes require a dedicated composite index for each specific combination of columns that it wants to use in a query. Columnar indexes can be combined in any way to create a variety of indexes, enabling developers to run performant ad hoc queries and intraday operational analytics on a transactional database. 

SkySQL also handles semi-structured documents, as many NoSQL databases do. JSON, the lingua franca of key-value and document database stores, has been part of the SQL standard since SQL:2016. The cloud database can store a JSON document within a table. It can retrieve a JSON document from a row via a SQL projection (a subset of columns within a row).

MariaDB has also added other functions to position its cloud database service as a general-purpose product that can handle things transparently for users. These include a database proxy that handles node failures independently of the client. It provisions a single IP for the application to communicate with across all nodes. Instead of throwing failed transactions back to the application, the database will handle transaction replay automatically behind the scenes. That eliminates error handling for failed nodes on the application side.

The benefits of a generic data platform

A SQL database encompassing these properties brings several benefits, not least of which is the scalability and performance of distributed database operation. Developers don't have to write different code to support cloud operations. They can use the SQL language that they're familiar with to handle their distributed databases but can still get into JSON storage and manipulation if they need to.

Simplifying the architecture of a distributed database in this way helps to decrease development time, and therefore overall costs. It gives customers a single source of truth that supports ACID-grade consistency without having to glue together a group of databases serving different applications. It does all these things while delivering the benefits of a scale-out architecture in the cloud.

For many companies, distributed SQL is the sweet spot between the old and the new. It gives them a trusted, veteran relational model that has powered the largest organizations, along with the advantages of cloud-based environments. 

In a world of many choices, distributed SQL and cloud computing have made relational more relevant than ever.

Sponsored by MariaDB.

Biting the hand that feeds IT © 1998–2022