Evolutionary database design

'Technically trivial'


Evolutionary Database Design (EDD, aka database refactoring) is a controversial topic in the software development world. It's a sore point among some DBAs because it promotes making changes to schemas that contain live customer data, sometimes with more than one application using the same database.

Given this controversy, and the resistance to iterative techniques among many database specialists, it seemed fair to pitch some of the most common criticisms to Scott Ambler, co-author, with Pramod Sadalage, of the book Refactoring Databases: Evolutionary Database Design.

Scott Ambler at Machu Picchu.

Matt Stephens, for Reg Developer: Hi, Scott. First off, could you tell us why you feel there's a need for an evolutionary approach to database design?

Scott: First and foremost, there's nothing special about databases. Just as evolutionary approaches to design work incredibly well for other aspects of software development, they also work well for databases. As Pramod and I show in Refactoring Databases, it's actually a technically trivial thing to evolve an existing relational database schema, even when hundreds of disparate systems access it.

Unfortunately, the traditional data community assumed that evolving database schemas is a hard thing to do and, as a result, never thought through how to do it. Worse yet, this assumption led to other questionable assumptions, such as the need to model the data aspects of a system in detail early in the lifecycle. The end result is that many data professionals have missed out on the fundamental ideas and techniques which application developers pretty much take for granted these days. They've got a lot of catching up to do.

Reg Developer: In a sense, the database schema is a reflection of the problem domain: so once designed, it generally shouldn't need to change very much – as long as it's been designed thoughtfully, of course, and grew out of collaboration between the customer, end users, business analysts and so on. Given this, is there really a need for EDD in a software development project?

Scott: I'll let the existing state of production databases speak to that. How often do you find databases with tables that have columns that are no longer being used, if they ever were? Or columns that are being used for several purposes because it was too hard to add new columns when they were required? Or tables with data quality problems? The Data Warehouse Institute (TDWI) estimated a few years ago that data quality problems result in a $611 billion (yes, billion) loss every year for US-based organizations. This tells me that, in practice, we're not very good at getting the design right up front and/or that the requirements do in fact change over time. It's incredibly clear that organizations need a safe way to fix existing database problems, and the approach presented in Refactoring Databases is the only viable one on the table. Traditional approaches to database design clearly aren't working out for us.

Reg Developer: Do you see database refactoring being applied with a fine-grained development approach such as TDD, so you'd start with a test, create a table, write another test, add a column, a constraint, and so on?

Scott: Absolutely, and in fact I have an article in the May/June 2007 issue of IEEE Software which addresses this specific topic.
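To make that rhythm concrete, here is a minimal sketch of test-first schema evolution using Python's built-in sqlite3 module. The customer table, its columns and the migration list are hypothetical, invented for illustration rather than taken from Scott's book; the point is the cycle of writing a test that expects a schema change, watching it fail, then adding the migration that makes it pass.

    # Test-first schema evolution: a hypothetical, minimal sketch.
    import sqlite3
    import unittest


    def apply_migrations(conn, migrations):
        """Apply each DDL statement in order; in a real project these would be
        versioned migration scripts kept under source control."""
        for ddl in migrations:
            conn.execute(ddl)


    MIGRATIONS = [
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
        # Added only after a later test demanded an email column:
        "ALTER TABLE customer ADD COLUMN email TEXT",
    ]


    class CustomerSchemaTest(unittest.TestCase):
        def setUp(self):
            self.conn = sqlite3.connect(":memory:")
            apply_migrations(self.conn, MIGRATIONS)

        def test_customer_table_has_email_column(self):
            # Written first; it fails until the ALTER TABLE migration above exists.
            columns = [row[1] for row in self.conn.execute("PRAGMA table_info(customer)")]
            self.assertIn("email", columns)

        def test_name_is_required(self):
            # The NOT NULL constraint was likewise driven out by this test.
            with self.assertRaises(sqlite3.IntegrityError):
                self.conn.execute("INSERT INTO customer (id, name) VALUES (1, NULL)")


    if __name__ == "__main__":
        unittest.main()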

Reg Developer: I spoke to a few DBAs and developers prior to this interview, and the main concern that cropped up over and over again was that you're putting a system into production, then making changes that require a schema update and/or data migration – and this may happen as often as every couple of weeks, depending on the release cycle.

Scott: This is why these people need to read Refactoring Databases. First, you develop and test your changes in a developers' sandbox, and only when you believe the changes are working do you promote them into higher-level sandboxes and, eventually, production. Developing directly against a live database is clearly an incredibly risky proposition, so don't do that.

Second, you do in fact need a viable strategy for evolving an existing production database, as the TDWI figures show. The strategy is to have a transition window during which both the original schema and the fixed schema exist in parallel in production and the database automatically keeps everything up to date. This window may be months or even years long, and the purpose is to give the development teams responsible for the accessing systems time to update their systems and deploy them into production.

Once the systems are updated, the old schema is removed. This is basically the deprecation strategy which Sun Microsystems uses to evolve the Java Development Kit (JDK), and there are hundreds of thousands of systems which use that. In the book, we provide solutions, including full source code, for over 60 refactorings which take this approach.
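As an illustration of that transition window, here is a rough sketch of a "Rename Column" refactoring expressed as SQL run through Python's sqlite3 module. The customer table and its fname/first_name columns are hypothetical, and trigger-based scaffolding is one possible way to keep the two columns in sync during the window; don't read it as the book's exact code.

    # Hypothetical "Rename Column" transition: customer.fname -> customer.first_name.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT);
        INSERT INTO customer (id, fname) VALUES (1, 'Ada'), (2, 'Grace');

        -- Step 1: introduce the new column alongside the old one.
        ALTER TABLE customer ADD COLUMN first_name TEXT;

        -- Step 2: migrate the existing data into the new column.
        UPDATE customer SET first_name = fname;

        -- Step 3: scaffolding triggers keep both columns in sync while old and
        -- new versions of the accessing systems coexist in production. The WHEN
        -- guards stop the two triggers from ping-ponging.
        CREATE TRIGGER sync_old_to_new
        AFTER UPDATE OF fname ON customer
        WHEN NEW.first_name IS NOT NEW.fname
        BEGIN
            UPDATE customer SET first_name = NEW.fname WHERE id = NEW.id;
        END;

        CREATE TRIGGER sync_new_to_old
        AFTER UPDATE OF first_name ON customer
        WHEN NEW.fname IS NOT NEW.first_name
        BEGIN
            UPDATE customer SET fname = NEW.first_name WHERE id = NEW.id;
        END;
    """)

    # An old application still writing to fname...
    conn.execute("UPDATE customer SET fname = 'Augusta Ada' WHERE id = 1")
    # ...is visible to a new application reading first_name.
    print(conn.execute("SELECT first_name FROM customer WHERE id = 1").fetchone())
    # -> ('Augusta Ada',)

    # Step 4, only after the transition window ends and every accessing system
    # has been updated and redeployed:
    #   DROP TRIGGER sync_old_to_new;
    #   DROP TRIGGER sync_new_to_old;
    #   ALTER TABLE customer DROP COLUMN fname;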

Reg Developer: A couple of issues spring to mind with this approach. First, there's the problem of duplicate data: at some point, the data from the old system will need to be migrated – merged in – with the data in the new system, which in the meantime has been accumulating new data. Some of the data in the old system may be stale.

Scott: In our book we show how to deal with this, so it isn't a problem. Each refactoring includes the code to implement it, including data migration code and the scaffolding code that keeps the data in sync, as well as how to remove the old schema afterwards.
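One way to handle the stale-data worry is to make the backfill merge rather than overwrite. The snippet below is a small, self-contained sketch using the same hypothetical customer.fname to customer.first_name rename as above: rows already maintained through the new column are left alone, and only untouched rows are filled from the legacy column.

    # Hypothetical merging backfill: prefer data already in the new column.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT, first_name TEXT);
        INSERT INTO customer (id, fname, first_name) VALUES
            (1, 'Ada', NULL),            -- only ever written via the old column
            (2, 'stale value', 'Grace'); -- already maintained via the new column

        -- Merge rather than overwrite: keep values accumulated through the new
        -- schema, fall back to the legacy column where nothing newer exists.
        UPDATE customer
           SET first_name = fname
         WHERE first_name IS NULL
           AND fname IS NOT NULL;
    """)

    print(conn.execute("SELECT id, first_name FROM customer").fetchall())
    # -> [(1, 'Ada'), (2, 'Grace')]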

Reg Developer: And the second issue: if you're deploying an updated schema with each new release, and each release takes place every couple of weeks or whatever, isn't that an awful lot of disparate schemas that pile up over time? Keeping track of which application – or which version of each application if it's a thick-client install – uses which version of the schema could end up being something of a maintenance nightmare.

Scott: In practice this doesn't seem to be much of a problem. If you're releasing new versions quickly then the changes are pretty small anyway. If one refactoring clobbers another – for example, you rename column X to Y and then to Z – you can easily rework your transition schemas to reflect this. It isn't perfect, but if you're smart you can reduce the complexity. In the book we also talk about taking an Agile Model Driven Development (AMDD) approach, where you do a bit of up-front modelling. This identifies the overall landscape pretty early in the project and reduces the risk of major refactorings.
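To picture what "reworking the transition schemas" might mean in the X-to-Y-to-Z case, here is a small hypothetical sketch in the same style as the earlier ones: rather than chaining three generations of scaffolding, the surviving column z is populated from whichever generation holds the freshest value and then synchronized straight against the original x for the remainder of the window. The account table and its x, y, z columns are invented for illustration.

    # Hypothetical collapse of two overlapping renames (x -> y -> z) into one
    # transition: z is backfilled from the newest available value and kept in
    # sync directly with x for systems that still write the oldest column.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE account (id INTEGER PRIMARY KEY, x TEXT, y TEXT, z TEXT);
        INSERT INTO account (id, x, y, z) VALUES
            (1, 'old value', NULL, NULL),   -- only ever written via x
            (2, 'stale', 'newer', NULL);    -- migrated to y in an earlier release

        -- Backfill z from the freshest generation available.
        UPDATE account SET z = COALESCE(z, y, x);

        -- Replace the x <-> y and y <-> z scaffolding with a single bridge.
        CREATE TRIGGER sync_x_to_z
        AFTER UPDATE OF x ON account
        WHEN NEW.z IS NOT NEW.x
        BEGIN
            UPDATE account SET z = NEW.x WHERE id = NEW.id;
        END;
    """)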

Existing database development tools could also do with significant improvement. Over the next few years, as EDD gains in popularity, we'll start to see better tools, both commercial and open source.

As I said earlier, technically database refactoring is trivial; the real challenges are all cultural. The data community decided years ago that EDD is difficult to do and now struggles to understand how it can be done. It's going to be a huge effort to get existing data professionals to recognize and then overcome many of the false assumptions which their "thought leadership" has made over the years. See this article.

Reg Developer: Some of the refactorings described in your book may have questionable applicability when applied to a live system. For example, "Introduce Common Format" involves applying a consistent format to all the data values in an existing table column, its motivation being to simplify external program code. Given the issues with refactoring a live schema, isn't this the sort of thing an OO data access layer is best suited to doing?

Scott: To implement this refactoring you would need to choose a format already supported by the column. For example, if there are seven different formats in use, choose one of them, because all of the accessing systems must already be able to handle it. The next step is to cleanse the existing values to put them all into the common format. You would also need to put code in place, likely a database trigger, to ensure that any updates to the column are made in the common format. You'd leave this trigger in place during the transition period, while the accessing systems are updated to write to the column only in the common format.
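Here is a minimal sketch of those steps, again via Python's sqlite3 module. The customer.country_code column and the choice of upper case as the common format are hypothetical stand-ins for the kind of cleanse-then-enforce sequence Scott describes, not code from the book.

    # Hypothetical "Introduce Common Format": normalise country codes to upper case.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, country_code TEXT);
        INSERT INTO customer (id, country_code) VALUES (1, 'us'), (2, 'US'), (3, 'Us');

        -- Step 1: cleanse the existing values into the chosen common format.
        UPDATE customer SET country_code = UPPER(country_code);

        -- Step 2: scaffolding trigger that keeps writes in the common format for
        -- the duration of the transition period. An analogous AFTER INSERT
        -- trigger would cover new rows.
        CREATE TRIGGER enforce_country_code_format
        AFTER UPDATE OF country_code ON customer
        WHEN NEW.country_code IS NOT UPPER(NEW.country_code)
        BEGIN
            UPDATE customer SET country_code = UPPER(NEW.country_code)
             WHERE id = NEW.id;
        END;
    """)

    # A system that has not been updated yet writes in its old format...
    conn.execute("UPDATE customer SET country_code = 'gb' WHERE id = 3")
    # ...but the stored value still ends up in the common format.
    print(conn.execute("SELECT country_code FROM customer WHERE id = 3").fetchone())
    # -> ('GB',)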

Encapsulating access to a database definitely helps because it reduces coupling to your DB schema; this is also something we talk about in the book. However, what happens if some systems don't go through your common access layer? What if you have some COBOL systems, some C++, some Java, and some APL systems all accessing the same database? Gut feel tells me that they're not going to use the same OO data access layer.

