To Optimise Or Not To Optimise …

Computers today are faster than at any time in history. For instance, the PC on which this blog is being typed is running a multitude of applications, and yet this concurrency barely registers as the characters peel their way across the screen. Developers bask in gigahertz of sheer power, rarely stopping to question whether the engine purring beneath their glistening code will ever misfire. Students of Computer Science, however, will probably have come across Big-O as they tussled with the performance of a bubble sort versus a quicksort, or the like.

By way of a simplified reminder, Big-O expresses the running time of an algorithm as a function of the size of its input. Values such as O(n^2) and O(n log(n)) [bubble sort and quicksort respectively] indicate how the number of operations grows as the number of elements increases: for bubble sort, with the square of n. Of course, astute developers will quickly select the algorithm closest to O(1) or, more realistically, some linear function at best. But the rest, with their rippling muscular arrays of processors, never measure, nor indeed need to care about, the time performance of their code.
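To make that concrete, here is a minimal bubble sort sketch (illustrative only, not production code): the nested loops are exactly what give it its O(n^2) character, since for n elements the inner comparison executes on the order of n^2 times.

/* Minimal bubble sort sketch: the nested loops mean the comparison
   count grows with the square of n, hence O(n^2). */
void bubbleSort( int values[], int n )
{
    int pass;
    int i;

    for ( pass = 0; pass < n - 1; pass++ )
    {
        for ( i = 0; i < n - 1 - pass; i++ )
        {
            if ( values[ i ] > values[ i + 1 ] )
            {
                int tmp = values[ i ];

                values[ i ]     = values[ i + 1 ];
                values[ i + 1 ] = tmp;
            }
        }
    }
}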

The implementation of such Big-O-driven algorithms is, for most, a purely academic exercise. However, the computing power that we've come to rely on once amounted to no more than an 8-bit trickle of flip-flops; clearly, the art and science of computing drove those wild improvements. But there is still a place in today's world where such teraflops of computing power are simply unavailable: the embedded world. For whatever reason (safety, environmental constraints or contractual obsolescence), the embedded world has demanded the use of smaller, less powerful CPUs.

Enter the process of optimisation: the skinny sibling of the Big-O family, and the big fat rain cloud hanging over the embedded developer.

Those same computer scientists will remember that when an algorithm requires many thousands of repeat runs, the time taken to perform a single run becomes vitally important. On an embedded processor running at just a few megahertz, performing an operation even a few hundred times means the single-run time must be optimal. In the embedded world, though, there is a much closer harmony between high-level code and CPU instructions. Most engineers will be familiar with how their compiler churns a language such as 'C' into a myriad of instruction sequences: what looks like a simple 'for loop and if test' in 'C' can generate many times the instructions on one compiler/processor combination when compared to another.
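As a hedged illustration (the function below is invented for this post, not taken from any client code), consider how little it takes to write such a loop, and how much room the compiler has in translating it: the array indexing, the comparison and the branch can each expand very differently across targets and optimisation settings.

/* A deceptively simple 'for loop and if test'. The instruction
   sequence this compiles to varies widely between compilers,
   optimisation settings and target CPUs. */
U16 countAboveThreshold( const U8 data_byte[], U16 length, U8 threshold )
{
    U16 count = 0u;
    U16 i;

    for ( i = 0u; i < length; i++ )
    {
        if ( data_byte[ i ] > threshold )
        {
            count++;
        }
    }

    return count;
}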

Let’s take a simple contrived example:

U8 sendDataToSPIBus( U8 data_byte[], U16 length,  U8 channel )
{
    U16 current_time;
    BOOLEAN success = TRUE;
    U16 i = 0;

    /* ... defensive programming checks removed for conciseness ... */
    /* ptr_hw is assumed to point at the memory-mapped SPI channel registers */

    while ( ( length > 0 ) && ( success == TRUE ))
    {
        length--;
        ptr_hw[ channel ].TransmitBuffer = ( U32 ) data_byte[ i ];
        i++;

        /* wait for the Transmit Complete flag to be set */
        current_time = getHardwareTimerValue();

        while ( ptr_hw[ channel ].TransmitComplete == 0ul )
        {
            if ( (getHardwareTimerValue() - current_time) > TIMEOUT_VALUE )
            {
                success = FALSE;
                break;
            }
        }
    }

    /* clear the Transmit Complete flag */
    ptr_hw[ channel ].TransmitCompleteClear = 0xffff;
    return success;
}

Actually, this fragment of code demonstrates an issue recently experienced by a client. The code essentially wrote data to an SD card over an SPI bus; however, once the volume of data to be written grew, the writes took an unacceptable amount of time. The answer seemed obvious: optimisation! An engineer quickly assessed the function, concluded it could be made faster, amended the code, and demonstrated an improvement in its run time.

The engineer in question decided to take the 'channel' aspect and place it in a 'case' statement (not the wisest of choices), and then accessed the hardware registers via hardcoded address values, removing the array-access instructions. His last act of optimisation was to read the current time directly from a hardware register rather than via getHardwareTimerValue(), and to amend the use of the array index 'i'. This yielded the following code:

case CHANNEL_0:

{
    while ( ( i < length ) && ( success == TRUE ) )
    {
        /* hardcoded register addresses replace the ptr_hw[ channel ]
           accesses: transmit buffer, hardware timer, transmit-complete flag */
        *(( volatile U32 * const ) 0xF0001000ul) = data_byte[ i ];
        i++;
        current_time = (U16)*(( volatile U32 * const ) 0xf0003000ul);
        while ( *(( volatile U32 * const ) 0xF03100F8ul ) == 0ul )
        {
            if ( ( ((U16)*(( volatile U32 * const ) 0xf0003000ul)) -
                   current_time ) > TIMEOUT_VALUE )
            {
                success = FALSE;
                break;
            }
        }
    }
}

Job done! Or was it?

In fact, this kind of micro-optimisation is not uncommon and, just as in the field of economics (arguably the study of the effects of optimisation), it is often detrimental. In this particular case, the change made very little difference to the problem of storing a large volume of data to the SD card. The reason was quite simple: the bigger effect occurred at the system level, where the scheduler in use on the project pre-empted the task calling the function. That task was of the lowest priority and spent much of its time waiting for higher-priority tasks to complete. The optimisation caused problems of its own, too: one of the changes (the 'case' statement for the channel) broke another caller of the SPI function, and the hardcoded register addresses destroyed portability, amongst other things.
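To illustrate the breakage (with a hypothetical caller: sendSensorReadings and CHANNEL_1 are invented for this sketch, the client's real code differed), consider a second user of the function driving a different channel. With only CHANNEL_0 carved out in the new 'case' statement, any other channel falls through the switch and the transfer is silently never performed:

/* Hypothetical caller, invented for illustration: after the
   optimisation, only CHANNEL_0 is handled by the 'case' statement,
   so this transfer on CHANNEL_1 silently does nothing. */
U8 sendSensorReadings( void )
{
    U8 readings[ 4 ] = { 0x12, 0x34, 0x56, 0x78 };

    return sendDataToSPIBus( readings, 4u, CHANNEL_1 );
}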

The message, therefore, is clear: optimisation is important, and as developers we must always look to write simple but efficient code. However, optimisation needs to be done in the context of the whole system, so that it makes the difference expected without introducing other complications.