Open source databases: What are they and why do they matter?
We trawled through the licensing terms and spoke to the vendors so you don't have to
For developers, there is no debate. The future of the database is open source. A glance at the 2022 Stack Overflow survey of around 70,000 code-wranglers shows nearly all pros use one of the two leading open source RDBMSes, PostgreSQL (46.5 percent) or MySQL (45.7 percent), although they use other systems as well.
Oracle, which built a global software empire starting with an RDBMS, is only used by about 12 percent of developers, while Db2, the IBM data workhorse used by banks and global retailers, is only used by 2 percent.
There is no question that the leading edge is open source – the people who build new systems are making it so by their choice. The question is why open source databases are achieving dominance among devs.
Peter Zaitsev, CEO of database consultancy Percona, was an early employee of MySQL AB under the leadership of original open source database author Michael "Monty" Widenius. To Zaitsev, it is a question of economics in the startup scene of the early Noughties.
"If you look at Oracle and Db2, they can be very, very expensive systems. In the early 2000s, just after the dotcom era, the new generation of startups, starved of capital, needed but could not afford Oracle, Db2 or SQL Server," he says.
But in going with open source databases, this new generation of startups – Facebook, Uber, and Google among them – began to find they could adapt the system to their own needs, contributing to the open source code, while benefiting from development elsewhere in the community.
"This is permissionless innovation, and your ability to really customize and improve the software with the community is very important," Zaitsev says.
Fast-forward a decade, and this cohort of startups has – as well as attracting billions of users, drawing the attention of financial markets, and fascinating digital transformation gurus – begun to dominate the mindshare of web-native developers.
"The startup developer culture started to permeate through a whole ecosystem because people are looking at the approach Google, Airbnb or Uber might take," Zaitsev says.
"That gets mindshare of the database space. It's moved to open source databases. You will probably be hard-pressed to find some really cool system based on Oracle as a backend database. It might be very important but very boring in the belly of an enterprise and government agency. That is all well and good, but not what developers aspire to."
PostgreSQL and MySQL have more than 50 years of development between them, but a new generation of open source databases has appeared on the market and provoked intense debate around its approach to the open source model.
MariaDB, a fork of MySQL, and CockroachDB, a distributed RDBMS, adopt what they call a Business Source License (BSL). This is a "new alternative to closed source or open core licensing models," according to MariaDB's definition. The source code is always publicly available and non-production use of the code is always free, and the licensor can also make an Additional Use Grant allowing limited production use. Source code is guaranteed to become open source at a certain point in time.
Meanwhile, MongoDB, the popular document-based NoSQL database, offers the Server Side Public License (SSPL) v1.0 which requires that enhancements to MongoDB be released to the community. The restrictions mean a company cannot offer MongoDB as a managed service to other users.
Neither SSPL nor BSL meets all the criteria for open source software set by the Open Source Initiative (OSI).
Zaitsev says the approach of these firms is partly a result of investors gravitating towards the idea of open source.
The perception of a need to fend off enterprise wolves
"Money can create reality. That can be a danger for open source. There are regions of the open source movement that are kind of romantic and want to change the world. Now, commercial open source is about making money. Yeah, people think open source is good: that is something that we learned over 20 years. So now a lot of those new interests are trying to redefine what open source software is, which allows them to protect themselves, but still be kind of open source. But that is the opposite of open source, which should be like science – give it away, let everyone be able to build it and use it."
In this way, users of open source software have the freedom to end relationships with their suppliers and contract other companies to develop or support their systems while keeping ownership of the original code, avoiding so-called vendor lock-in, Zaitsev argued.
But the companies advocating these licenses do so to defend themselves against the dominant cloud hyperscalers who have phenomenal market reach and influence with developers. The database companies fear that if they adopted the Free and Open Source Software (FOSS) approach, the likes of AWS, Google Cloud, and Microsoft Azure would simply copy their code and commercialize it as a managed service, as AWS, for example, has with its fully managed Relational Database Service, which provides both MySQL and PostgreSQL systems in this model. They would starve the main companies behind the open source software of revenue, they argue.
Gartner veep Merv Adrian, a data management analyst, pointed out in a recent blog about commercial viability of open source databases that the revenue accrued by cloud service providers from their open source database-as-a-service (DBaaS) products may well exceed that of the independent vendors combined.
MongoDB, Cockroach Labs, and MariaDB argue the model they have adopted allows the important features of open source to survive in this commercial environment. In this article, we gave these firms the opportunity to expand on these views, while also offering a platform for companies that go with a more permissive approach to open source, including Yugabyte, DataStax, and PostgreSQL company EDB, to defend their position. Oracle turned down the opportunity to talk about MySQL.
Just as a counterpoint, we invited a relative newcomer to the analytics space, Exasol, to comment on why it chose a proprietary model.
We also wanted to discuss why open source matters in modern development, and why it might not. To kick off we turned to EDB, a commercial company that supports and contributes to PostgreSQL, as well as providing some proprietary tools and a DBaaS.
EDB: Community control and freedom
Arguably the oldest of the major open source databases, Postgres was first proposed by Michael Stonebraker and Lawrence Rowe of UC Berkeley in 1986 [PDF] as a successor to Ingres. One of its founding goals was extensibility for data types, operators, and access methods. Under the OSI-approved PostgreSQL License, the object-relational database has proved extensible in support of new data types such as graph and JSON.
As well as being the most popular database among developers in the 2022 Stack Overflow survey, it sits fourth in the DB-Engines ranking after steadily climbing for a decade.
Marc Linster, CTO at PostgreSQL consultancy and contributor EDB, says the system's strength lies not just in the permissive licensing, but in the diversity of contributions that attracted developers.
He made the distinction between "captive open source" projects in which the vast majority of contributions come from a single company, and a "true community open source" with contributions from software vendors, users, and other interested parties.
"If you're looking at Linux, or you look at PostgreSQL, you see that there is a vibrant community behind it. PostgreSQL is collaboratively developed by EDB – yes – but also VMware, Fujitsu, NTT, Microsoft, Amazon and so on. There's a whole slew of companies who are invested in this," he says.
"You have other open source software technically having an open source license, but they were really captive to a single company – let's say MongoDB, for example. If you're a community-driven process, there's a lot more innovation because in those dynamics inside the process, there is a lot of push and pull to get to innovation quickly. And there's also a lot of quality control both from a design perspective and from an execution perspective, because reviewing, integrating into the code, all that stuff happens in public and it's completely exposed.
"The community-driven process is also what protects open source from suddenly having artificial licenses like BSL or other licenses introduced because the community behind it would not accept that."
While the core PostgreSQL project is independent of EDB, the company sells its own proprietary software to help secure and manage the database.
Linster explained that these elements did not go to the open source project, because while some users might require them, the community as a whole does not.
"We do some things that the community would not. It would not say it needs password profiles like Oracle, for example, inside the database. The Postgres community is not likely to ever do, but customers often have needs that are very specific," he says.
Customers could have "everything that's in Postgres all the time" but also get features that would not be acceptable in the PostgreSQL community, such as Oracle compatibility, which EDB sees as a "standalone competitive advantage," Linster says.
DataStax: Nobody likes to walk into a trap
Apache Cassandra was first developed as a decentralized structured storage system by Facebook engineers Avinash Lakshman and Prashant Malik to support the social media company's inbox search feature. Facebook released Cassandra as an open source project on Google Code in July 2008 and by 2010 it was a top-level Apache project.
It is backed by commercial company DataStax, which offers a paid-for DBaaS, but the idea of an open source project remains vital to the company because developers want freedom and flexibility in the cloud, says Patrick McFadin, DataStax vice president of developer relations.
"It got really interesting because of the clouds. We got into this world where there are a lot of projects and companies that started doing this fake open source where you could say it worked on your laptop, but anywhere else you've got to pay.
"Nobody likes to walk into a trap. [Open source] is still really relevant for developers because they want to have that freedom of choice and the freedom to scale. If you look at all the major open source databases, including Cassandra, MySQL, and PostgreSQL, they are doing really well. There's an enterprise version of it, which you can pay for, and there's a managed service, which you can pay for. But, also, there is a free version."
McFadin added: "If you look at DataStax customers, they employ all three. We have really large customers that will buy our enterprise edition, they will use our managed service, and then they will have a lot of open source Cassandra all in the same environment. They get this psychological safety of knowing they can move around based on their current business. You can't do that without an open source database."
For example, Amazon has its own key-value database, DynamoDB, which it offers as a service.
"You can't run DynamoDB outside of Amazon, and it has a very certain API. So open source is a lot more than just software now; it's also the standards, the APIs or the query language, that sort of thing," McFadin says.
He says that DataStax would release any kind of extension to its cloud service to the open source community so that there is no divergence between the APIs on its paid-for Cassandra iterations and Apache Cassandra. This meant customers were free to walk away from any of its paid-for systems without a significant rewrite of their applications, he argued.
"Customers can and do walk away sometimes. The first thing I think is, 'Oh, that's too bad.' But that's not the right thing to say. It's more like, 'I'm glad you're still using Cassandra and I'll see you in the community' and that's the right answer," McFadin says.
Yugabyte: It's all about the DBaaS
YugabyteDB has been one of the distributed RDBMS vendors grabbing attention over the last couple of years. Others earning the dubious moniker of NewSQL include CockroachDB and to some extent MariaDB.
Yugabyte is sort of a double-decker database. It is inspired by Google Spanner underneath and compatible with PostgreSQL on top, with the aim of creating a highly available, scalable distributed database.
In October last year, the company raised $188 million in a Series C funding round, which valued the business at around $1.3 billion.
YugabyteDB is available under the Apache 2.0 license and runs on Kubernetes, VMs, and bare metal across private, public, and hybrid cloud environments.
Founded by former Facebook engineers Kannan Muthukkaruppan, Karthik Ranganathan, and Mikhail Bautin, the company came out of stealth in 2017 but did not adopt Apache 2.0 until 2019. The reason why exposes some revealing arguments about open source databases.
Speaking to The Register, Ranganathan, now CTO, says all the founders had wanted to be fully open source from the start, but investors convinced them otherwise.
"The investor community was pretty clear that open source is a dying business model. So we decided that fundamentally we're going to be an open source company, but because their investors say fully open source is a losing game, we went open core, which meant that our database would have 80 percent of its value open and 20 percent closed," he says.
However, after the software gained traction with early customers, the company changed its mind.
"What we realized – besides figuring out what we needed in the database – was that without the DBaaS side of it, this was like dead on arrival," Ranganathan says.
The revelation meant that being competitive on the DBaaS was as important as being competitive on the database, and customers wanted the database to be open source. "We told customers we were like PostgreSQL but distributed, and they said 'but PostgreSQL is incredibly open'," Ranganathan notes.
Although the vendors had feared that – if they went fully open source – AWS or other cloud providers would eat their lunch by providing their own DBaaS, that misses the point, Ranganathan says.
"If – as the primary vendor owning the project – your DBaaS offering is good enough, then it should be easy to make a case for people to use that managed service. AWS can only do it for AWS; we can do it on any cloud, including off the cloud, because we don't have a particular preference; we actually want it to be available everywhere. AWS is only successful when they own the entire infrastructure data, we are successful even if the customer wants to own it."
MariaDB: Open source, but for a commercial backbone
MariaDB was sharded out of MySQL, the open source relational database that dates from 1995. MySQL had been part of Sun Microsystems since 2008, but when Oracle bought Sun in 2010, MySQL co-founder Widenius forked the code to a new open source database, MariaDB.
MariaDB has adopted the Business Source License (BSL), which critics contend is not truly open source because it does not allow users or developers to do what they want with the code. Under BSL, the source code is always publicly available, non-production use of the code is always free, and the source code is guaranteed to become open eventually.
The licensing model only requires a commercial license by those who make production use of the software, which is typically indicative of an environment that is delivering significant value to a business, the company says.
Although BSL does not comply with OSI definitions, MariaDB does stick to the spirit of open source, while ensuring it becomes a sustainable company.
Michael Howard, MariaDB CEO, told The Reg: "Open source started as an ideology and a way of life. It's about free software and non-commercial business models. In today's world, where you have the largest and most powerful companies in the world reusing open source for their own purposes in commercialization efforts, the realization is that open source cannot stand on its own. Independent independent and dependent independent depends upon a strong commercial backbone."
While BSL was "a nod to open source," it also supported a strong commercialization platform for a given product. "It's very different than, let's say, Apache, BSD, or GPL (used by Linux), which have overhanging requirements and encumbrances to require people to take a contribution and give it back to the repository automatically to let anyone use it," Howard says.
In this way, it has prevented the world's largest cloud companies from offering commercial software as a service around the MariaDB database. "The business model in the hyperscalers' business has most certainly compelled new license models, BSL included. However, when it was first designed, it was more about the entrepreneur building up a virtuous business," he says.
While it has an arguably more restrictive license than other open source projects, MariaDB probably attracts more contributions than any open source database, including MySQL and PostgreSQL, which are both older, Howard says.
But unlike with proprietary software, developers, whether they work for users or for support companies, are free to view the source code, which gives users support options once their version of the database has moved to the more permissive license.
"Huge institutions look upon open sources as affording them optionality that they can work with the vendor, you know, like MariaDB, or if they have to, they can fix their own bugs. They can hire people to fix a bug and not have to be dependent on the commercial project. That's a profound distinction," he says.
CockroachDB: History in open source, with a commercial future
Cockroach Labs' database, CockroachDB, is a distributed RDBMS that is wire-compatible with PostgreSQL and backed by a key-value store, which is either RocksDB or a purpose-built derivative called Pebble.
In December 2021, the company hit a nominal $5 billion valuation with a $278 million Series F funding round. It counts Comcast, eBay, and Nubank among its customers.
Like MariaDB, CockroachDB leans heavily on the BSL. The source code is available, but users may not use CockroachDB as a service without an agreement with Cockroach Labs. Other core features are subject to the CCL, the Cockroach Community License, under which certain features are either paid for or free, but the code is available. A couple of years after release, the code moves over to the Apache 2.0 license.
Cockroach says BSL is not certified as an open source license, but most of the OSI criteria are met.
Despite the mix of open source and commercial models, Jim Walker, principal product manager, says the company was steeped in the open source movement. Earlier releases had been under the Apache license, while co-founders Spencer Kimball and Peter Mattis were behind the popular open source GIMP photo manipulation tool, which resulted from a college project.
But to Walker, open source is about more than the license alone. "I am not saying it's unimportant, absolutely not. If you're going to do a binary 'What is open source?' based purely on that one factor? OK, great, I understand open source or not. However, to me, open source is a lot more than that. There's a community of people that are involved in this. They love to contribute and talk about [computing problems]. Number one, it's about people.
"Number two, it's about code. It's about open source code repositories. If you are working on a PhD in distributed systems, you can look at the code repository for Cockroach Labs. Right now. It's completely open. Anybody can look there."
Critics point out that for companies employing the mixed model, their projects often end up dependent on a single vendor, which can govern the direction of the project regardless of the views of the wider community. Diversity in contributions is a healthy sign.
But in terms of contributions, Walker admits it is mostly all Cockroach Labs. So why not go the whole hog and protect its intellectual property by making the software proprietary?
"To me, that's a dead end. [Consider] the rate at which the world has evolved over the past three to five years because of open source. I'm not talking about just our software: the world has changed because the efficiencies and the scale that we can get out of our systems. We are absolutely a citizen and a part of that," Walker says.
He also argued that CockroachDB offers the freedom for users to make choices about their software in the future: they could either migrate the data and schema onto a free version of the database and get someone else to host it, or run it themselves.
But the pressure on developers is likely to prevent a move away from the paid-for managed service, Walker says.
"People just don't do that. Nobody wants to do that. Part of the value of CockroachDB is basically, I don't have to hire a bunch of people, I don't need as many resources to run the database. There's this emerging desire in the market for low-ops and eliminating the amount of work I have to do to make things run."
MongoDB: The risk of hyperscalers cutting us out
With its first release 13 years ago, MongoDB is sort of the granddaddy of the NoSQL movement. It is also the highest ranked among the databases that have eschewed the relational model, at least according to DB-Engines.
The company IPO'd in 2017 and is currently valued at something like $22 billion.
MongoDB offers a Server Side Public License for all versions released after October 16, 2018, while the Free Software Foundation's GNU AGPL v3.0 applies to MongoDB software released before that date.
The SSPL was introduced by MongoDB and requires that enhancements to MongoDB be released to the community. Where this is unacceptable for legal reasons, commercial licenses are available with MongoDB Enterprise Advanced.
Andrew Davidson, senior veep for product management, says: "Open source has been a critical part of MongoDB's success. There's no way to become a widely adopted developer solution in this day and age without having an open source first philosophy."
But over the years, the company began to perceive the risk of hyperscalers taking the product that MongoDB had invested in, and delivering it as a service, cutting MongoDB out of the mix, he says.
"The SSPL was very specifically envisioned to clarify an ambiguity that you could not deliver MongoDB as a managed cloud service, targeting the hyperscalers. The Community Edition, which is licensed with the SSPL, gives our users the freedom to run it wherever they would like and use it for any application that they would like. The only restriction on that is that they can't sell that as a managed cloud service," Davidson says.
But they can build their own software on top of MongoDB, and offer that as a service, he says.
In the years following MongoDB's early development, the company began to realize that the future was in managed services so put in the work to create Atlas, its DBaaS, which it has embellished with add-on features addressing analytics, for example.
One of the questions critics raise about mixed-model open source is that of future freedoms: do customers have the ability to walk away from the vendor once they have built the application? Can they replicate the database and host the application elsewhere?
In the case of MongoDB Atlas, the answer is "maybe" and "it depends."
"We've expanded Atlas beyond the OLTP database to deliver triggers, a data lake offering, data federation, API servers and app services and so on...Essentially, you can totally move that. If you're using any of those ancillary services, then you'd have to build your own way of experiencing those types of use cases were you to leave Atlas."
While open source advocates argue that projects benefit from a diversity of contributors, as PostgreSQL does, Davidson argued that contributors such as Amazon, Microsoft, and EDB have businesses with an interest in the success of the project.
MongoDB may be more of a single-vendor project, but this allows it to "obsess over developer experience and making an elegant integrated experience in a way that is a little bit different than kind of disjointed consortia model," Davidson says.
Whatever the interpretation of open source, it shares a core value across the board, which is trust in the code. "You need to be able to know that you can go in and with a microscope [and] inspect any level of how this thing is built.
"Sure, you may not actually do that, but knowing that countless others – PhDs, students, academics, and industry professionals – [are doing] that is so critical. We're talking about the heart of people's applications, which is the crux of their business. It's all about knowing that what you're building on is something that is so community validated."
A devil's advocate might also point out that banks around the world run on proprietary Oracle and Db2 and they seem to be fine. But these systems have years of engineering from very big companies in them. The open source model has allowed new databases to be reliably introduced without such backing.
Exasol: Proprietary goes where the community cannot
Finally, observers might be forgiven for thinking all modern databases are based in some way on open source software, but that's not quite the case. Germany's Exasol has developed an in-memory analytical system which is used by Dell and sports giant Adidas. It has long claimed leadership in the TPC price/performance benchmark.
CTO Mathias Golombek told The Register there were specifics to do with parallel hardware that meant the open source route would not work for their core product.
"Open source systems are very popular, especially with transactional databases," he says. "In the more dynamic data analytics space, Exasol is embracing open source software in various ways; however, our core database engine remains proprietary as, to date, the right developer community that has access to MPP hardware setups and is specialized on in-memory computing simply doesn't exist.
"To get the best of both worlds, and support open source as a fundamental concept in the software industry, Exasol has added dozens of open source projects by leveraging an agile community to help create various integration projects, such as data science – through our language containers – or data virtualization, through our virtual schemas.
"Over time, we have learned that it's less about open source versus closed source, but rather whether the software architecture is relevant for the data problems. Open source business models have failed enterprise software vendors who have not come to this realization." ®