Making Sense of the Legacy Database, Introduction


For the past 3 years I spent a lot of time digging through relational databases. Most of these databases that I looked at could be considered legacy databases. In one case it was an Oracle database, in another a Sybase database. They are legacy systems in the sense that the technology with which they have been built has been around for quite a while. Heck, relational databases were conceived in the 1970’s (link opens PDF), so from the perspective of computer history the concept of the relational database is pretty old.

The databases I worked with can be considered ‘legacy’ in another way. At least one of them has been in existence, and thus has been under maintenance, for 15 years. So, in terms of a software life cycle this database is relatively old. It is a monolith that has been crafted and honed by generations of software developers, so to speak.

And there is a third way in which these databases could be considered ‘legacy’. It appears that for at least a couple of these systems, someone was able to make the documentation about the inner workings of the database disappear almost entirely. And not only that; over time the people who knew the database intimately moved away from the company. So what remains is a database that is largely undocumented, with very few people who can tell you the intricacies of its existence. Clearly this is an exceptional situation… or maybe not.

Now, both knowledge and joy can be gained from studying the database. It is reasonable to assume that the database is a reflection (a model, if you will) of the world as the company sees it. In it is stored knowledge about elements of the world outside and the relationships between these elements. As a way of gaining deep knowledge about how its builders classify and structure the world, studying the database is fine starting point.

Indeed, one of the goals I have when I study the database, is to get to know how knowledge is classified. I study in order to gain knowledge about the functional aspects and the meaning of the system. Some may regard the database as a technical thing. Those sentiments may be strengthened by the fact that knowing how to write SQL queries is usually considered to be a ‘technical’ skill, one that ‘functional’ testers stay away from. That is a very damaging misconception, withholding a fine research instrument from the hands of the budding tester. SQL is not anything anyone should ever be afraid of.

But gaining knowledge about functional aspects is just one object of the study. Another one is that mastering a database allows you to verify what information is stored and whether it is stored correctly or not. This, clearly, allows you to check (test) the functioning of the system. Mastering the database also allows you to manipulate data for the purpose of testing and to create new data sets. And finally, it allows you conjure up answers to questions such as “Show me the top 100 books that were sold last year to customers from the Netherlands above the age of 45” much more quickly than through any GUI. So it can be used for analytical purposes.

The question I would like to answer is how do we analyze the database. What methods do we employ to get to the core of knowledge that may be spread out over hundreds of tables. There are a few methods that make use of the typical features of a relational database. Other methods are more general in nature. In other blog posts I would like to dig deeper into these methods of investigation. But for now I would like to leave you with a quick overview and a description of the first method.


When you are going to analyze a database, you will be writing SQL queries. Many of them, probably. If you are going to place all of your queries in a single file, that will become chaos rather quickly.  Bad organization will be a really heavy dead weight during the whole of your investigation. I like to organize my queries in folders, in which case these folders represent broad classifications. For examples, I create a folder containing all queries relating to orders, another one relating to customers and another one relating to financial data. Usually, this division really helps you to quickly zoom in on a specific area.

Within a folder there may be many different files containing queries. Again I try to group queries by ‘functional’ area. So I may create a file containing queries relating to ‘cancelled orders’ or ‘orders that are waiting to be processed’, or ‘orders from returning customers’. I first tried to group queries in files based on the table that was queried (such as a single file containing all queries on the ‘order’ table) but for some reason a ‘functional’ classification is easier to comprehend.

One of integrated development tools I worked with – Oracle SQL Developer – has a plugin for Subversion, so it allows you to keep and maintain a repository and a version history of the files containing your queries and to share it easily with other team members. I learned the hard way that keeping your laboriously built up set of queries solely on a single internal hard disk is not such a good idea.

Like in other programming languages, in SQL it is possible to add comments to your queries. I found out that comments are an essential part of your investigation. In SQL queries comments are usually written with two hyphens (–) at the beginning. Queries can quickly become long and complex. If your files that contain the queries contain only SQL statements and nothing else, this will seriously hamper your investigation. You will need to read each query again and again in order to find out what it meant. Reading a SQL query that joins many tables can be really tough, if you want to find exactly what information you are pulling from the database. Also, queries may look very similar but may differ on a slight detail that you will only find when you read and conceptualize the whole thing. And last but not least, since writing queries takes time and mental effort, your do not want to write duplicate queries. The only way to prevent you from doing things many times over is to have an easy way to detect if you already ‘have a query for that’.

So comments are essential as a means of identification or indexing. I like to go over my files with queries from time to time, ‘grooming’ and improving the comments and the queries themselves.

So far for the organization of queries. Below are the other topics that I like to cover as methods for the investigation of the legacy database. I will get back to these in posts to come.

  • Structuring the query
  • Testing your queries
  • What to do with the data model (if it exists)
  • Searching for distinct values
  • Paying attention to numbers
  • Scanning data & data patterns
  • Using dates
  • Complex joins
  • Emptiness (null values)
  • Querying the data model
  • Comparing the query results with the application
  • Looking for names
  • Use of junction tables
  • Use of lookup tables
  • Database tools and integrated development environments