The traditional wisdom for designing database schemas is to use a design tool (typically based on a UML or ER model) to construct an initial data model for one’s data. When one is satisfied with the result, the tool will automatically construct a collection of 3rd normal form relations for the model. Then, applications are coded against this relational schema. When business circumstances change (as they do frequently), one should run the tool again to produce a new data model and a new resulting collection of tables. The new schema is populated from the old schema, and the applications are altered to work on the new schema, using relational views whenever possible to ease the migration. In this way, the database remains in 3rd normal form, which represents a "good" schema, as defined by DBMS researchers. "In the wild," schemas often change once a quarter or more often, and the traditional wisdom is to repeat the above exercise for each alteration.
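As a rough illustration of using a relational view to ease such a migration, the sketch below moves a department name out of a wide table into its own table, then recreates the old table name as a view so that an unchanged read query still works. The table and column names (employee, employee_v2, department, dept_name) are hypothetical, and SQLite is used only because it ships with Python; this is one possible arrangement, not a prescribed procedure.

```python
import sqlite3


def demo():
    conn = sqlite3.connect(":memory:")

    # Old schema (hypothetical): one wide table that applications query directly.
    conn.execute(
        "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?)",
        [(1, "Ada", "Research"), (2, "Grace", "Systems")],
    )

    # A query embedded in some existing application; it is not edited below.
    old_query = "SELECT name, dept_name FROM employee ORDER BY id"
    before = conn.execute(old_query).fetchall()

    # New schema: departments are factored out, the new tables are populated
    # from the old one, and a view takes over the old table name.
    conn.executescript(
        """
        CREATE TABLE department (
            dept_id INTEGER PRIMARY KEY,
            dept_name TEXT UNIQUE
        );
        CREATE TABLE employee_v2 (
            id INTEGER PRIMARY KEY,
            name TEXT,
            dept_id INTEGER REFERENCES department(dept_id)
        );
        INSERT INTO department (dept_name)
            SELECT DISTINCT dept_name FROM employee;
        INSERT INTO employee_v2 (id, name, dept_id)
            SELECT e.id, e.name, d.dept_id
            FROM employee e JOIN department d ON d.dept_name = e.dept_name;
        DROP TABLE employee;
        CREATE VIEW employee AS
            SELECT e.id, e.name, d.dept_name
            FROM employee_v2 e JOIN department d ON d.dept_id = e.dept_id;
        """
    )

    after = conn.execute(old_query).fetchall()
    conn.close()
    return before, after
```

In this sketch only read queries are shielded. SQLite views are read-only, so applications that write to the old table name would still need changes (or INSTEAD OF triggers on the view).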
In a survey of 20 database administrators (DBAs) at three large companies in the Boston area, we found that this traditional wisdom is rarely, if ever, followed for large, multi-department applications. Instead, DBAs try very hard not to change the schema when business conditions change, preferring to "make things work" without schema changes. If they must change the schema, they work directly from the relational tables in place. With these tactics, the ER or UML model (if it ever existed) quickly diverges from reality. Moreover, over time, the actual semantics of the data tend to drift farther and farther from a 3rd normal form data model.
We term this divergence of reality from 3rd normal form principles "database decay." Over time, decay becomes worse and worse, leading first to rotted databases and ultimately to databases so decayed that they cannot be modified any further. Obviously, this is a very undesirable state of affairs.
In our opinion, decay stems from the multi-department organization of large implementations. Various pieces of the overall application are coded by different organizations, typically using ODBC or JDBC to specify the SQL in transactions. If one business unit needs to change the semantics of the database, it is exceedingly difficult to figure out what code from other departments needs to be changed and how extensive the required repairs are. As a result, DBAs tend to change the schema in whatever way minimizes application maintenance, rather than in the way that maximizes the cleanliness of the data. Of course, the result of optimizing for this different cost function is database decay and rot.
Seemingly, database decay is a fact of life, one that ultimately leaves databases impossible to modify. There are three strategies that can counter it.
The first one is to construct defensive schemas in the first place. Such schemas are more resilient to subsequent changes than ones produced using the traditional wisdom. We have developed a methodology for such schemas, which will be addressed in an upcoming paper.
The second tactic is to write defensive application code. For example, one should never use SELECT * FROM table_name, because applications that depend on its result tend to break when attributes are added or deleted downstream.
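A minimal sketch of why this matters, assuming a hypothetical employee table and Python's built-in sqlite3 module: the first function depends on SELECT * returning exactly two columns, while the second names the columns it needs. Adding a column breaks only the first.

```python
import sqlite3


def names_fragile(conn):
    # Depends on SELECT * returning exactly (id, name), in that order.
    rows = conn.execute("SELECT * FROM employee ORDER BY id")
    return [name for (_id, name) in rows]


def names_defensive(conn):
    # Names the columns it needs, so later additions do not affect it.
    rows = conn.execute("SELECT id, name FROM employee ORDER BY id")
    return [name for (_id, name) in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO employee (id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    before = (names_fragile(conn), names_defensive(conn))

    # A later schema change made on behalf of some other application.
    conn.execute("ALTER TABLE employee ADD COLUMN department TEXT")

    try:
        names_fragile(conn)
        fragile_broke = False
    except ValueError:
        # Each row now has three values, so unpacking into two names fails.
        fragile_broke = True

    after = names_defensive(conn)
    conn.close()
    return before, fragile_broke, after
```

An explicit column list does not protect against a column the query actually uses being dropped or renamed, but it does confine breakage to the code that genuinely depends on that column.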
Lastly, in our opinion, it is bad practice to let application groups code directly against an ODBC/JDBC interface. This practice is why DBAs do not know the impact of possible schema changes. Instead, we advocate requiring application groups to use a messaging interface to send higher-level commands to a DBMS. These messages are intercepted and turned into SQL in server-side code. Such an architecture localizes DBMS code that may need to be changed later on. Moreover, we have written a prototype system that can examine such code and determine if it needs to be changed as a result of schema evolution. In this way, we expect to lower the cost of schema changes, and perhaps slow down or arrest database decay. An upcoming paper details our prototype.
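The sketch below shows one way such a messaging layer might look; it is not the prototype mentioned above. The command names, the message shape (a dictionary with "command" and "args"), and the customer table are all hypothetical, and a real deployment would send messages over a network rather than call a function. The point is only that every SQL statement lives in one server-side registry, so a simple lookup can list the commands that a schema change might affect.

```python
import sqlite3

# Hypothetical server-side registry: the only place where SQL is written.
SQL_BY_COMMAND = {
    "add_customer": (
        "INSERT INTO customer (id, name, email) VALUES (:id, :name, :email)"
    ),
    "get_customer_email": "SELECT email FROM customer WHERE id = :id",
    "rename_customer": "UPDATE customer SET name = :name WHERE id = :id",
}


def handle_message(conn, message):
    """Translate one higher-level message into SQL and execute it."""
    sql = SQL_BY_COMMAND[message["command"]]
    return conn.execute(sql, message["args"]).fetchall()


def commands_mentioning(column):
    """Naive impact check: commands whose SQL text contains the column name.

    A plain substring search, used here only for illustration; it is far
    cruder than real analysis of the SQL would be.
    """
    return sorted(cmd for cmd, sql in SQL_BY_COMMAND.items() if column in sql)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    # Application code sends commands and never sees table or column names.
    handle_message(conn, {
        "command": "add_customer",
        "args": {"id": 1, "name": "Ada", "email": "ada@example.org"},
    })
    handle_message(conn, {
        "command": "rename_customer",
        "args": {"id": 1, "name": "Ada L."},
    })
    rows = handle_message(
        conn, {"command": "get_customer_email", "args": {"id": 1}}
    )
    conn.close()
    return rows
```

With this layout, a DBA considering a change to the email column could call commands_mentioning("email") to get a starting list of affected commands, instead of searching the source code of every department.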
We are looking for "in the wild" database projects that are dealing with schema evolution that would be amenable to trying our prototype system. If you are interested, please contact Michael Brodie at firstname.lastname@example.org.
Michael Stonebraker is an adjunct professor in the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory (CSAIL), and recipient of the 2014 ACM A.M. Turing Award. Michael Brodie is a Research Affiliate in CSAIL, while Dong Deng and Raul Castro Fernandez are post-doctoral researchers at CSAIL.