Fit for purpose: The case for the purpose-built database
One size does not fit all
Sponsored Remember when the only database in town was relational? Things have changed in 20 years. Today, the venerable old relational database management system (RDBMS) still presides, but the market is also filled with new database types designed for different kinds of jobs.
Database concepts predate the RDBMS, in the form of hierarchical and network databases, and even further back in the form of punched card collections. But it was really E. F. Codd's relational concept, published in 1970, that ushered in the era of modern database computing. His concept of tables and rows separated the logical data model neatly from the underlying physical storage, and led to a flurry of database engines from the mid-seventies onwards.
The RDBMS was perfectly adequate for most applications for decades, but the seeds of change had already been planted. Just a year before Codd published his first paper, the Stanford Research Institute and UCLA exchanged the first ever internet message. That would eventually change everything, creating an internet that would morph computing forever, increasing applications' scale and scope.
Over the last few decades, the Internet ushered in new data and speed requirements that made relational systems less appropriate for many applications. Today, many applications need to work with terabytes of data, supporting millions of global users. Traditional relational systems have problems coping with that scale while still maintaining performance.
Hyperscale operators were among the first to notice as relational products began to strain at the seams. Amazon, which had relied on Oracle's RDBMS in its early days, was one of them. Its relational databases began hitting their limits in 2004 as the ecommerce giant's transaction volumes ballooned, explains Edin Zulich, senior solutions architect (SA) manager and NoSQL database specialist at AWS.
"A closer look at the data access patterns and how those databases were used revealed that in most cases, we used a fairly straightforward key-value access pattern," he recalls. "This gave rise to the idea that maybe we can look into creating a more scalable database that would work well for these use cases."
The company began developing its own key-value database, Dynamo, in 2004 after Oracle started running out of steam. It posted a paper documenting its experiences in 2007. Then, it continued to refine the database internally before releasing Amazon DynamoDB in 2012, a key-value database service built using the same principles as Dynamo.
Focusing on a key-value structure enabled AWS to break away from the rigid structure on which relational tables are based. "Data has weight," Zulich says, adding that it's harder to scale out systems that are organized that way. Why take the performance hit of complex joins when you could just store the records that you need together?
"It's basically about organizing data in a way that's efficient for your read and write patterns," he explains. "If I'm storing shopping cart data, I'll always make sure that a given cart's data is in the same location. That way, I don't have to send a request to several different nodes and then put that shopping cart together. That is what would happen if you did it with a relational database."
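The co-location idea Zulich describes can be illustrated with a minimal Python sketch. It is not DynamoDB's implementation: the cluster size, the class, and the function names are all hypothetical, and a simple hash-modulo scheme stands in for whatever partitioning a real service uses. The point is only that hashing the cart ID sends every item in a cart to the same node, so a read touches one node.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size, chosen for illustration


def node_for(partition_key: str) -> int:
    """Map a partition key to a node index using a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


class KeyValueCluster:
    """Toy in-process stand-in for a partitioned key-value store."""

    def __init__(self) -> None:
        # One dictionary per node: partition key -> list of items.
        self.nodes = [defaultdict(list) for _ in range(NUM_NODES)]

    def put_item(self, cart_id: str, item: dict) -> int:
        """Store an item under its cart ID and return the node that holds it."""
        node = node_for(cart_id)
        self.nodes[node][cart_id].append(item)
        return node

    def get_cart(self, cart_id: str) -> list:
        """Read a whole cart from the single node that owns its key."""
        node = node_for(cart_id)
        return list(self.nodes[node].get(cart_id, []))
```

Because the node is derived from the cart ID alone, adding more items to a cart never spreads it across nodes, which is the property the quote relies on.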
Deploying the right database for the job
With a range of databases suited for different use cases, there's no reason for companies to use just one for a job, Zulich says. Developers can tease out transaction types that can benefit from different database engines, even in the same application. That is in part due to the changing nature of development.
"We can build services in a highly modular, decoupled manner. This goes hand-in-hand with the concept of microservices as a way to build applications as opposed to what we now call monolithic applications," he says.
The challenge then becomes tying those databases together to get a single, aggregated source of truth. One way to tackle that problem is by using event-driven architectures that replicate data between data stores, Zulich continues. In DynamoDB, developers can tap into a transaction log and replicate data in real time as changes occur.
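As a rough sketch of that event-driven pattern, the Python below models a source store that records every change in an ordered log and a replicator that replays the log into a second store. The classes and field names are hypothetical and this is not the DynamoDB API; it only shows the general shape of change-log replication under those assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in the hypothetical change log."""
    sequence: int
    key: str
    value: Any
    deleted: bool = False


class SourceStore:
    """Toy data store that logs every write as an ordered change event."""

    def __init__(self) -> None:
        self.data: dict = {}
        self.change_log: list = []

    def put(self, key: str, value: Any) -> None:
        self.data[key] = value
        self.change_log.append(ChangeEvent(len(self.change_log), key, value))

    def delete(self, key: str) -> None:
        self.data.pop(key, None)
        self.change_log.append(
            ChangeEvent(len(self.change_log), key, None, deleted=True)
        )


class Replicator:
    """Applies unseen change events to a target store, in log order."""

    def __init__(self, source: SourceStore) -> None:
        self.source = source
        self.target: dict = {}
        self.next_sequence = 0  # position of the first event not yet applied

    def poll(self) -> int:
        """Apply all pending events and return how many were applied."""
        pending = self.source.change_log[self.next_sequence:]
        for event in pending:
            if event.deleted:
                self.target.pop(event.key, None)
            else:
                self.target[event.key] = event.value
            self.next_sequence = event.sequence + 1
        return len(pending)
```

Tracking the last applied sequence number lets the replicator resume where it left off, so repeated polls only pick up new changes.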
AWS has also built support for DynamoDB, along with its Amazon RDS and Amazon Aurora relational sources, into the preview release of AWS Glue Elastic Views. This data integration service watches for data changes in these source data stores and then combines this into a single view using SQL queries that update a target database.
As cloud-based apps at scale have grown, so has demand for DynamoDB. Beyond transaction scalability, the database is geared to handle large numbers of connections. Anything that serves large numbers of users at the same time, from mobile banking apps through to customers like Snapchat, stands to benefit from making the change, Zulich explains. Key-value stores also serve applications with volatile loads, like retail, which might see large volume spikes around certain periods like holidays.
An explosion of tailored database types
Key-value databases are just one kind of alternative data store that has emerged to complement traditional relational engines. While relational schemas are great for structured tabular records with well-defined field types that don't change over time, they aren’t the best choice for the data types used in many modern applications. Applications ranging from geospatial to text mining have tested the limits of Codd's model.
At least as far back as the early nineties, RDBMS vendors like Sybase tried to support other data types with add-ons to the relational engine. Eventually, the idea of a database purpose-built for specialized use cases would gain traction, especially as people ventured into use cases beyond simple create-read-update-delete (CRUD).
This focus on specialized database models has opened up new possibilities for the database sector, with many companies launching NoSQL databases that forewent the relational model for different approaches. Amazon Web Services (AWS) has been among them, launching managed database engines for a variety of use cases. Here are some of the most common types:
Document databases (e.g. Amazon DocumentDB) – Useful for catalogs, content management systems, user profiles, personalization, and mobile.
These store their data in flat-file documents, typically in JSON format, which is the lingua franca for many web app developers when shunting data via APIs. Document databases are also far less rigid than relational models when it comes to updating schemas.
Graph databases (e.g. Amazon Neptune) – Useful for fraud detection, social networking, data lineage, and knowledge graphs.
Beloved by social networking companies everywhere, these databases work with data sets that feature large numbers of connections between their nodes. They document the connections between records (nodes) as edges, enabling them to capture relationships at scale. They're good for queries that span complex networks.
Time series databases (e.g. Amazon Timestream) – Useful for DevOps, application monitoring, industrial telemetry, and IoT applications.
These are tailored for streaming, timestamped data, which could arrive from sources including IoT sensors or high-volume financial applications. They will usually write that data in append-only format, and focus on queries referencing time intervals.
Ledger databases (e.g. Amazon Quantum Ledger Database) – Useful for finance, manufacturing, insurance, HR and payroll, retail, and supply chains.
These systems exhibit many of the characteristics of distributed ledger technology, ensuring accurate history through immutable transactions that can be verified, along with transparency and high scalability.
Wide-column databases (e.g. Amazon Keyspaces) – Useful for transaction logging, IoT applications, finance and other applications where high availability is critical.
Apache Cassandra is one of the best-known wide-column databases. This data model supports columnar databases, in which data is stored using families of columns rather than traditional relational database tables. Wide-column structures excel at storing large amounts of data, and also support distributed database structures where data is replicated between different servers for low latency and high availability.
Data in columns can be excluded on a per-record basis and easily added to database schemas when the need arises. They are suitable for very fast database queries, and don’t suffer from the same speed constraints that queries on non-indexed relational columns do.
Amazon Keyspaces is a managed database that supports Cassandra applications.
In-memory databases (e.g. Amazon ElastiCache) – Useful for low-latency applications that need to operate in real or near-real time, including real-time bidding, caching and leaderboards.
These purpose-built databases are slightly different to the others in that they aren't tied to any one database model. Their advantage comes in the form of sub-millisecond latency that they achieve by avoiding disk storage.
Using a database that's right for the job can yield results that aren't easily achievable with conventional relational models. One example here is Neptune, Amazon's graph database.
Marinus Analytics, a company that uses technology solutions to help law enforcement foil human trafficking, deployed the engine in 2019 to enhance its Traffic Jam software, which spots patterns in suspected trafficking activity.
Relational databases only allow queries one degree of separation away at a time, making it difficult to traverse chains of relationships spanning multiple entities (in this case, classified advertisements with a high likelihood of trafficking activity). With Neptune, Marinus Analytics was able to achieve a deeper querying ability and a faster access model that cut query times from minutes to seconds and enabled law enforcement to find new connections between criminal groups. The software helped identify over 3,800 human trafficking victims.
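To make the idea of traversing several degrees of separation concrete, here is a small Python sketch of a breadth-first search over an adjacency list. It is not Neptune's query language or the Traffic Jam software; the function name, the graph, and every node label in it are invented purely for illustration.

```python
from collections import deque

# Made-up example data: advertisements linked through shared contact details.
EXAMPLE_GRAPH = {
    "ad-1": ["phone-A"],
    "phone-A": ["ad-1", "ad-2"],
    "ad-2": ["phone-A", "email-B"],
    "email-B": ["ad-2", "ad-3"],
    "ad-3": ["email-B"],
}


def within_hops(graph: dict, start: str, max_hops: int) -> dict:
    """Return every node reachable from start in at most max_hops edges,
    mapped to its hop distance (breadth-first search)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]  # report only the connected nodes, not the start
    return distances
```

Calling `within_hops(EXAMPLE_GRAPH, "ad-1", 4)` follows the chain from the first advertisement through a shared phone number and email address to the other two advertisements in a single traversal, rather than one lookup per hop.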
Of course, the rise of nonrelational purpose-built databases doesn’t mean the relational database is going anywhere, as it’s the best-fit tool for many workloads, including those requiring transactional consistency. The rapid rise in popularity of open source relational databases like PostgreSQL is testament to the staying power of the RDBMS. AWS offers a cloud-native relational database in the Amazon Aurora service, which has MySQL and PostgreSQL-compatible editions, serverless capabilities, and a distributed storage system that give it commercial database bona fides with open source flexibility.
Redefining the database for different types of use cases is a little like creating cars for specific kinds of users. You wouldn't use an SUV for road racing, and you wouldn't take an F1 race car for a monthly grocery run. Thanks to ongoing evolutions in database technology, you can have both. And thanks to the cloud, you can provision them whenever you need them.
Sponsored by AWS