Writing Efficient Code

Encourage Efficient Coding Without Affecting Coding Time

By Mike Katz
Senior Systems Engineer

Writing Efficient Code | Best Practices for Writing Efficient Code | Efficient Code Example

Most writers of embedded code think little about optimizing functions for the underlying microprocessor execution. Sure, modern code optimization technology is usually extremely good, but not all compilers are flawless. Sometimes they can produce some unexpected and undesirable results.

A lot of time can be spent by the programmer analyzing the machine language output of the compiler and hand optimizing the code, either by re-writing the ‘C’ code or coding directly in assembler. In rare instances this is necessary to meet the program requirements. Most of the time extreme and time consuming measures like this aren’t necessary.

We can train ourselves to be more aware of what is going on "under the hood" and therefore create more efficient code while only marginally increasing the coding time.

Most microprocessors use internal registers to perform all of data calculations and data manipulations that it is asked to perform. Sometimes the register set is extremely small, such as in the 80XXX architecture. Sometimes the register set is quite large, such as the Power PC architecture.

In most cases the greater the number of registers available (up to a point) the faster a program will run. With a large number of available data and address (or general purpose) registers the compiler can keep more data and pointers internally without having to perform reads and writes to/from memory.

No matter how we code, at the lowest level, the microprocessor has to load data into a register in order perform any kind of operation on it. If we can reduce the number of operations as well as the number of reads/writes to memory that the microprocessor has to perform, we can reduce the amount of time that our code takes to run.

In certain cases we can give the compiler a little help by understanding what we are trying to accomplish, what the microprocessor will actually need to do to accomplish what we have written and the limitations of the target microprocessor.

Naughty Subscripts

In order for the microprocessor to calculate the address of each member of each structure in an array of structures the following calculations need to be performed.

Destination Address = Array Base Address + ( Size of the Structure * Subscript ) + Structure Member Offset

In the following code snippet the microprocessor might perform these three calculations for each of the three assignments in the loop. If this loop were executed 10,000 times or in an interrupt handler this extra time could be critical. Some optimizers will reduce the number of calculations required but just because an optimizer is used does not mean that it will produce efficient (or desired) results.

#define ARRAY_SIZE 10

typedef struct
{
    I32U    i32uDummy1;
    I32U    *pi32uPointer;
    I32U    *pi16uPointer;
    I32U    *pi8uPointer;
    I32U    i32uDummy2;
} ST_MY_STRUCT;

ST_MY_STRUCT	astMyStruct[ ARRAY_SIZE ];

void Foo( void )
{   
    I32U            i32uCounter;
    I16U            i16uCounter;
    I8U             i8uCounter;
    I32U            i32uMyData;
    I32U            i16uMyData;
    I32U            i8uMyData;

    ST_MY_STRUCT    *pastMyStructPointer;
    I32U            *pai32uMyArrayPointer;
    I16U            *pai16uMyArrayPointer;
    I8U             *pai8uMyArrayPointer;

    /*
        Initialize the pointers in the structure
    */
    for ( i8uCounter = 0 ; i8uCounter < ARRAY_SIZE ; i8uCounter++ )
    {
        astMyStruct[ i8uCounter ].pi32uPointer = &i32uMyData;
        astMyStruct[ i8uCounter ].pi16uPointer = &i16uMyData;
        astMyStruct[ i8uCounter ].pi8uPointer =  &i8uMyData;
    }
}

We could recode the function like this to make it a little more efficient by insuring that the address of the structure in the array is calculated only once. The microprocessor would still need to add in the offset of the structure member but on most modern microprocessors it does this via indexed addressing and does not take up any more machine cycles.

void Foo( void )
{   
    I32U            i32uCounter;
    I16U            i16uCounter;
    I8U             i8uCounter;
    I32U            i32uMyData;
    I32U            i16uMyData;
    I32U            i8uMyData;

    ST_MY_STRUCT    *pastMyStructPointer;
    I32U            *pai32uMyArrayPointer;
    I16U            *pai16uMyArrayPointer;
    I8U             *pai8uMyArrayPointer;

    /*
        Initialize the pointers in the structure
    */
    for ( i8uCounter = 0 ; i8uCounter < ARRAY_SIZE ; i8uCounter++ )
    {
	   pastMyStructPointer = &astMyStruct[ i8uCounter ];

        pastMyStruct->pi32uPointer = &i32uMyData;
        pastMyStruct->pi16uPointer = &i16uMyData;
        pastMyStruct->pi8uPointer =  &i8uMyData;
    }
}

Since this function is performing an iterative process on the array we could recode the function to use “pointer arithmetic”. Since subscripts are not used and the pointer is only incremented we only have a single addition per iteration of the loop. This represents a significant reduction in the amount of operations that the microprocessor performs and therefore a reduction in execution time.

void Foo( void )
{   
    I32U            i32uCounter;
    I16U            i16uCounter;
    I8U             i8uCounter;
    I32U            i32uMyData;
    I32U            i16uMyData;
    I32U            i8uMyData;

    ST_MY_STRUCT    *pastMyStructPointer;
    I32U            *pai32uMyArrayPointer;
    I16U            *pai16uMyArrayPointer;
    I8U             *pai8uMyArrayPointer;

    pastMyStructPointer = astMyStruct;

    /*
        Initialize the pointers in the structure
    */
    for ( i8uCounter = 0 ; i8uCounter < ARRAY_SIZE ; i8uCounter++ )
    {
        pastMyStructPointer->pi32uPointer = &i32uMyData;
        pastMyStructPointer->pi16uPointer = &i16uMyData;
        pastMyStructPointer->pi8uPointer =  &i8uMyData;

        pastMyStructPointer++;
    }
}

This is an example of using pointer arithmetic to avoid time consuming address calculations when performing an interative process on an array. Again if this loop were executed 10,000 times or in an interrupt handler this additional time savings could be crucial.

 

Size of Operands

When doing math, using loop counters or even subscripts be aware of the native register size of your target microprocessor. An 8 bit microprocessor will take many more machine language instructions to increment a 32 bit counter than will a 32 bit microprocessor.

The examples below represent the number of machine language instructions required to increment an unsigned 32 bit value.

;
;  8 Bit microprocessor 32 bit increment (i32uValue++)
;  12 Instructions
;
	LDA	A,Value + 3
	ADD  A,#1			;(or INC A)
	STA  A,Value + 3
	LDA	A,Value + 2
	ADC	A,#0
	STA	A,Value + 2
	LDA  A,Value + 1
	ADC  A,#0
 	STA  A,Value + 1
	LDA	A,Value
	ADC	A,#0
	STA	A,Value

;
;	16 Bit microprocessor 32 bit increment
;	6 Instructions
;
	LD	d0,Value + 3
	ADD	d0,1
	LD	Value + 3,d0
	LD	d0,Value
	ADC  d0,0
	LD	Value, d0 
;
;	32 Bit Microprocessor 32 bit increment
;	3 Instructions
;
	LD.32	r0,Value
	INC.32	r0,1
	LD.32	Value,r0

As you can see, the size of the variables as well as the data size of the target microprocessor that you are using can make a significant impact on the amount of code executed.

Note: On some microprocessors doing a full size operation can still take longer to perform than a shorter one. For example, a 16 microprocessor may take longer to perform a 16 bit operation than an 8 bit operation even though the size of the register is 16 bits.

This is even more critical when doing more complex math. Most modern microprocessors have a multiply instruction but not normally at their full register size. A 16 bit microprocessor might be able to do an 8 bit by 8 bit multiply yielding a 16 bit result in one instruction but a 16 bit by 16 bit multiply yielding a 32 bit result may take several instructions or even a library call.

It helps to keep your math to the smallest word size that will meet your needs. This is especially important if the math is being done repeatedly.

 

Don’t do the Same Operation Repeatedly

If a function will be performing the identical operation on identical data every time it is called consider doing the operation once and putting the result in a static (or even a global if used by multiple modules).

For example:

typedef
{
   volatile ST_UART   *pstUart;
   I16U		   	  i16uBaudRate;
   I8U		       i8uWordSize;
   I8U		       i8uParity;
   I8U		       i8uStopBits;
} ST_UART_CONFIG;

extern ST_UART_CONFIG stUartConfig[];
extern I8U            *pBuffer;

void interrupt Uart1InterruptHandler( void )
{
   /*
    Process all characters in the fifo
   */
   While( 1 == pstUartConfig[0].pstUart->StatusRegister.RxDataAvailable )
   {
      *(pBuffer++) = pstUartConfig[0].pstUart->RxData;
   }
}

Every time this interrupt happens all of that time consuming array/structure address calculations will need to be performed. This can be made a little more efficient as follows:

void interrupt Uart1InterruptHandler( void )
{
   ST_UART volatile *pstUart = pstUartConfig[0].pstUart;
   /*
    Process all characters in the fifo
   */
   While( 1 == pstUart->StatusRegister.RxDataAvailable )
   {
      *(pBuffer++) = pstUart->StatusRegister.RxData;
   }
}

This is better but we are still doing the assignment every time the interrupt is called.
This is even better:

static volatile ST_UART_STATUS_REGISTER *pstStatusRegister;
static volatile I8U                     *pi8uDataRegister
/*
	Call this before enabling the UART interrupt
	It need to be run only once.
*/
Void	UartInterruptHandlerInit( void )
{
	pstStatusRegister = &pstUartConfig[0].pstUart->StatusRegister;
	pi8uDataRegister  = &pstUartConfig[0].pstUart->DataRegister
}

void interrupt Uart1InterruptHandler( void )
{
   /*
	Process all characters in the fifo
   */
   While( 1 == pstStatusRegister->RxDataAvailable )
   {
	*(pBuffer++) = *pi8uDataRegister;
   }
}

The UartInterruptHandlerInit() function is called only once during initialization. By doing this we have removed all of the address calculations from the interrupt handler at the expense of two pointers in RAM. Unless RAM is at an extreme premium this is a worth while trade off.

Let the Compiler Do the Work

When doing repeated calculations using constants let the compiler do the math at compile time rather than doing the calculation each time the function is called.

In the following example we are calculating how many seconds are left in the day based on the second count passed to the function.

I32U	SecondsLeftInToday( I32U i32uElapsedTime )
{
	I32U i32uSecondsPerMinute = 60;
	I32U i32uMinutesPerHour   = 60;
	I32U i32uHoursPerDay	= 24;
	I32U i32uSecondLeftInToday;

	i32uSecondLeftInToday = ( i32uSecondsPerMinute * i32uMinutesPerHour *
		i32uHoursPerDay ) - i32uElapsedTime;

	return i32uSecondsLeftInToday;
}

This function does 3 multiplies and one subtraction. The following function uses the compiler to do the multiplication leaving only the subtraction to be done at run time.

#define SECONDS_PER_MINUTE	60
#define MINUTES_PER_HOUR	60
#define HOURS_PER_DAY		24

I32U SecondsLeftInToday( I32U i32uElapsedtime )
{
	I32U i32uValue;

	i32uValue = ( SECONDS_PER_MINUTE * MINUTES_PER_HOUR *
		HOURS_PER_DAY )  i32uElapsedTimer;

Return i32uValue;
}

Since this is extremely simple math we could replace the function entirely with a macro. This would eliminate all of the overhead of a function call completely:

#define SECONDS_PER_MINUTE	60
#define MINUTES_PER_HOUR	60
#define HOURS_PER_DAY		24

#define SECONDS_LEFT_TODAY( i32uElapsedTime )		    \
		( ( SECONDS_PER_MINUTE * MINUTES_PER_HOUR * \
    		    HOURS_PER_DAY )  i32uElapsedTimer )

Writing Efficient Code | Best Practices for Writing Efficient Code | Efficient Code Example