Do It Right First Time? Or, Fight The Fire Later?

When I was a fledgling engineer, the company I worked for hired a new Technical Director.  I remember it vividly because, in one of his first presentations to what was a very large engineering team, he made the statement “There’s never enough time to do it right, but there’s always enough time to do it over” (Jack Bergman).

At this particular time, the organisation was suffering from a constant fire-fighting way of working.  Work was constantly being shipped that was poorly engineered or downright buggy.  The next project always started with an intent to “do it right”, but problems from the delivery of the previous project rapidly eroded the time available, engineers slipped back into the “just ship it” attitude, and the vicious cycle began again.

Sadly, the Technical Director was never able to pull the organisation out of that Venus flytrap – the way of working had become a “habit” and no time was spent on coaching the team into the right mindset.

Winding forward a couple of decades, I find myself with an opportunity to demonstrate the benefits of “doing things right”.

A client project required a critical module to be fundamentally reworked.  This particular module had originally been ported from a previous system by an offshore company.  Unfortunately, the module involved a sophisticated interaction with hardware that the offshore company did not fully understand.  What they delivered “worked” but lacked any real code quality.  In fact, the module was written like a “prototype” despite being a safety function within a SIL2 system.  There were fragments of code that clumsily achieved the necessary function but were formulated in an obfuscated way.  The client had attempted to fix a bug in this module but had failed due to the difficulty of understanding how it worked.

We set to work to fix the module.  Firstly, we locked in the functional behaviour by creating a set of unit tests.  Fortunately, we had excellent domain knowledge for this module and were able to formulate a comprehensive suite of tests.  This was done one function and one use case at a time, in a backward variation of Test Driven Development (TDD).  Our strategy was to separate the actual hardware access from the processing element of the module (a good architectural principle).  We were then able to build stubs and “pretend” to be the hardware whilst running in a PC development environment.  These tests were applied to the code base and confirmed that we had adequately captured all of the existing behaviour (including the undesirable bugs).  This test harness was to be our stable ground.
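To make that concrete, here is a minimal sketch of the stubbing pattern, using invented names and an invented scaling rather than the client’s actual code: the processing logic depends only on a small hardware interface, so a PC build can substitute a stub and a unit test can pin down the existing behaviour.

```c
/* Minimal sketch of the stubbing pattern. All names (hw_if_t, module_scale,
 * stub_read_raw) and the scaling factor are illustrative assumptions. */
#include <assert.h>
#include <stdint.h>

/* Hardware access is hidden behind a small interface. */
typedef struct {
    uint16_t (*read_raw)(void);   /* reads a raw sensor value */
} hw_if_t;

/* Processing logic depends only on the interface, not on real registers. */
static int32_t module_scale(const hw_if_t *hw)
{
    uint16_t raw = hw->read_raw();
    return (int32_t)raw * 10;     /* placeholder for the real processing */
}

/* --- stub used in the PC build --- */
static uint16_t stub_value;
static uint16_t stub_read_raw(void) { return stub_value; }

/* Unit test that locks in the existing behaviour for a known input. */
static void test_scaling_of_known_raw_value(void)
{
    hw_if_t stub = { stub_read_raw };
    stub_value = 100;
    assert(module_scale(&stub) == 1000);
}

int main(void)
{
    test_scaling_of_known_raw_value();
    return 0;
}
```

Because the processing code only ever talks to the interface, the same logic runs unchanged on the target, where the interface is backed by the real register accesses.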

Continuing with the test-driven approach, we began to refactor the module.  Each single change resulted in the suite of tests being run to confirm that behaviour had not changed.  Occasionally, the refactoring required tests to be created or changed.  The module eventually became understandable and many of the bad code smells were removed.  We were even able to clearly link the safety requirements to specific code functions, making the assessment process smoother.
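The sketch below illustrates the kind of behaviour-preserving change involved, again with invented names and bit masks rather than the client’s code: an obfuscated status check is rewritten so that its intent is explicit, and a test confirms the two versions agree for every possible input before the old one is removed.

```c
/* Illustrative refactoring under test. The function names and bit masks
 * are assumptions made up for this sketch. */
#include <assert.h>
#include <stdint.h>

/* Before: the intent is hidden behind magic numbers. */
static int is_fault_before(uint16_t status)
{
    return ((status & 0x0C) && !(status & 0x30)) ? 1 : 0;
}

/* After: the same logic, with the meaning made explicit. */
#define STATUS_FAULT_BITS   0x0Cu   /* either fault flag set        */
#define STATUS_MASKED_BITS  0x30u   /* fault reporting is inhibited */

static int is_fault_after(uint16_t status)
{
    int fault_flag_set   = (status & STATUS_FAULT_BITS)  != 0;
    int reporting_masked = (status & STATUS_MASKED_BITS) != 0;
    return fault_flag_set && !reporting_masked;
}

int main(void)
{
    /* The behaviour is pinned: old and new agree for every input. */
    for (uint32_t s = 0; s <= 0xFFFFu; ++s) {
        assert(is_fault_before((uint16_t)s) == is_fault_after((uint16_t)s));
    }
    return 0;
}
```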

Having reached the point where the unit tests passed on our refactored code, we then formulated a set of integration tests.  These were specifically to check that the right hardware interactions occurred and that data/calls to and from this module functioned correctly.  This was a structured test inasmuch as we didn’t want to just plug it in and see if it worked; we wanted to ensure that it initialised correctly, set up its data, and that its processes got called in the right sequence.
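As a rough illustration of what such a check can look like, the sketch below uses hypothetical hardware set-up functions and a test double that records the order of calls; it is not the client’s harness, just the principle of asserting the sequence rather than hoping it happened.

```c
/* Sketch of a call-sequence check. hw_power_on/hw_configure/hw_start and
 * module_init are assumed names; the module under test would be linked
 * against doubles like these (or a logging hardware layer) for the check. */
#include <assert.h>
#include <string.h>

static char call_log[128];

static void record(const char *name)
{
    strcat(call_log, name);
    strcat(call_log, ";");
}

/* Test doubles standing in for the hardware layer. */
static void hw_power_on(void)  { record("power_on");  }
static void hw_configure(void) { record("configure"); }
static void hw_start(void)     { record("start");     }

/* Stand-in for the module's initialisation routine. */
static void module_init(void)
{
    hw_power_on();
    hw_configure();
    hw_start();
}

/* The test asserts the order of hardware interactions, not just "it ran". */
static void test_init_calls_hardware_in_order(void)
{
    call_log[0] = '\0';
    module_init();
    assert(strcmp(call_log, "power_on;configure;start;") == 0);
}

int main(void)
{
    test_init_calls_hardware_in_order();
    return 0;
}
```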

I’ve seen many occasions where integration testing is simply a case of running the software.  When (or if) it fails, the debug cycle begins.  However, that’s incredibly inefficient.  Just like with TDD, if you check each step and a failure occurs then you generally know where the fault lies and you don’t need to spend much time debugging.  Also, just because the software ‘runs’ doesn’t mean that there are no bugs.  Having a structured plan of integration tests provides a greater chance of finding them.

Fortunately, on our first integration test we found a bug.  The test was checking that initialisation occurred correctly, but one of the module’s public functions was being called at the same time.  This was quickly fixed, the unit tests rerun and integration continued.  These structured tests took less than half a day to complete, but at the end of it we were confident that the system operated correctly.
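For illustration only, the snippet below shows that class of defect and one way of guarding against it; the names and the guard are my own invention for the sketch, not the client’s actual fix.

```c
/* Hypothetical sketch of a public function that was reachable before
 * initialisation had completed. module_init/module_update are assumed names. */
#include <assert.h>
#include <stdbool.h>

static bool init_done   = false;
static int  last_output = 0;

void module_init(void)
{
    last_output = 0;      /* establish a safe default state       */
    init_done   = true;   /* only now may clients call the update */
}

int module_update(int input)
{
    /* Guard: calls arriving before initialisation get a defined, safe
     * result instead of operating on uninitialised data. */
    if (!init_done) {
        return 0;
    }
    last_output = input * 2;   /* placeholder for the real processing */
    return last_output;
}

int main(void)
{
    assert(module_update(5) == 0);    /* early call is rejected safely */
    module_init();
    assert(module_update(5) == 10);   /* normal behaviour after init   */
    return 0;
}
```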

Our last step was to perform a system test.  It’s important at this point to distinguish the different levels of testing.  Unit testing aims to remove the bugs associated with logical and functional behaviour.  Integration testing aims to remove the problems of interacting with other parts of the software/system, such as missing or malformed data being passed.  Finally, system testing checks the performance of the system, including whether it meets its timing constraints and whether it delivers the whole system behaviour.

I like to look at these as sieving the software for bugs – you are progressively removing problems.  If you skip one of these levels of testing (and the classic mistake is to leave it all to a system test) then you can expect to spend a very long time testing and debugging.  This is because debugging at system test is generally very slow and lacks the granularity of control needed to ensure an adequate test.

Following this approach meant the module was refactored, fixed and tested rapidly.  More importantly, other than the one integration test failure, the software worked perfectly on the final system – i.e. “first time” – and is still running without problems.

What this shows is that the investment in following best practice in testing really does pay off.  Making this a habit means that projects will get delivered on time and on budget more often.