
Testing how smart the Arduino compiler is

Today we are going to test the Arduino compiler by running a series of easy tests to see how smart it is when making optimizations in our code.

Often in the chat, I tell you that you should not obsess over the speed of your code, and that you should always prioritize cleanliness over speed. Well, today we are going to back up that advice a little.

Actually, when we talk about the “Arduino compiler”, we are really using a generic C++ compiler, usually GCC. So, how smart is a C++ compiler like GCC?

Well, at the current state of the art, very, very, very smart. Much smarter than you, than me, and even than the two of us put together.

This is not meant to be depressing. Current compilers are simply very advanced programs, developed over many years with the contributions and work of great minds, so it is only logical that they do an excellent job of optimizing code.

But in case we don’t believe it, let’s do a few basic examples. For the tests we are going to use an ESP32, but it would be the same with another type of processor. Also, the specific times we are going to obtain are irrelevant. What we want to see are the trends, and certain optimizations that we are going to see very clearly.

Loop with increment

Let’s start with a base example, executed on an ESP32.

void printResults(unsigned long long counter, unsigned long long rst, unsigned long long microseconds)
{
  Serial.print("benchmark ");
  Serial.print(counter);
  Serial.print(": ");
  Serial.print(rst);
  Serial.print(" ");
  Serial.print(microseconds);
  Serial.println("us");
}

void benchmark()
{
  auto start = micros();

  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum++;
  }
  
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, sum, microseconds);
}

void setup()
{
  Serial.begin(115200);
}

void loop()
{
  delay(5000);
  benchmark();
}

Basically, here we have,

  • A ‘printResults()’ function that shows the data on the screen
  • Another function, ‘benchmark()’, which performs the calculations and tests we want to run
  • Finally, a very simple setup function that initializes the port, and a very simple loop that launches the benchmark

In this first base case, we simply enter a loop 1,000,000 (1E6) times and increment a variable, to see how quickly it runs. If we run this, we will see something like the following result,

benchmark 1000000: 1000000 1us

And we might think, mmmm… how fast our processor is, running a loop of a million iterations in one microsecond. So let’s vary the number of times we run the loop. The results are as follows,

10^N    Result        us
3       1000          1
4       10000         1
5       100000        1
6       1000000       1
7       10000000      1
8       100000000     1
9       1000000000    1

Well, here we already see that something strange is happening. The test is always running in a time of one microsecond regardless of the number of times we enter the loop.

On the other hand, we are talking about a 240 MHz processor. Even accounting for each operation costing more than one clock cycle, it would be logical to expect run times in the range of one second somewhere between 1E8 and 1E9 iterations. Yet we are in the range of microseconds.

What is happening? The compiler is smart enough to know how to solve the operation without actually needing to run the loop. That is, when compiling the code, the generated binary does not have the loop part. The compiler has replaced it with the final result of the operation.
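
As a sketch of what happens (the function names here are ours, for illustration only), the optimizer effectively reduces the whole loop to its final value:

```cpp
#include <cstdint>

// The loop as written: increments sum once per iteration.
uint32_t sumWithLoop(uint32_t counter) {
    uint32_t sum = 0;
    for (uint32_t i = 0; i < counter; i++) {
        sum++;
    }
    return sum;
}

// What the optimizer effectively turns it into: the result, no loop at all.
uint32_t sumFolded(uint32_t counter) {
    return counter;
}
```

With optimizations enabled (e.g. -O2), both functions typically compile to essentially the same machine code.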

Loop with literals and constants

Let’s see what happens if instead of simply incrementing a variable, we add a literal.

unsigned long counter = 1000000;
unsigned long sum = 0;
for (size_t i = 0; i < counter; i++)
{
  sum += 3;
}

If we run the test, we get,

benchmark 1000000: 3000000 1us
benchmark 1000000000: 3000000 1us

That is, the compiler is also smart enough to know how to solve the loop without having to run it.

What if we change it for a constant variable?

const int value = 5;
unsigned long counter = 1000000;
unsigned long sum = 0;
for (size_t i = 0; i < counter; i++)
{
  sum += value;
}

It also knows how to solve this without executing the loop,

benchmark 1000000: 5000000 1us

What if it were a regular variable, not marked as constant?

int value = 10;
unsigned long counter = 1000000;
unsigned long sum = 0;
for (size_t i = 0; i < counter; i++)
{
  sum += value;
}

Indeed, the compiler is clever enough to know that this variable never changes in the code, even though we haven’t marked it as ‘const’,

benchmark 1000000: 10000000 1us

Wow, the compiler is so smart. What if we return the value from a function?

int givemeNumber()
{
  return 3;
}

void benchmark()
{
  auto start = micros();
  
  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum += givemeNumber();
  }
  
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, sum, microseconds);
}

Indeed, the compiler has not only been able to calculate the correct result without calling the function, but in the binary code, the loop and the function completely disappear!

benchmark 1000000: 3000000 1us

Loop with variable

Well, it is clear that this compiler is very smart. Let’s try to make its life more difficult. Instead of adding an unchanging number, let’s add a number that does change. For example, let’s add the loop’s own ‘i’,

void benchmark()
{
  auto start = micros();

  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum += i;
  }

  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, sum, microseconds);
}

We run our base case for 1E6 iterations and get,

benchmark 1000000: 1783293664 8393us

Okay, this time it has indeed entered the loop. We can see this easily because the time is no longer 1 microsecond, but 8393us. Again, we are not evaluating the speed of the ESP32; it is not important. The important thing is that this time the compiler has not been able to compute the result without entering the loop.

A little disappointing. If the compiler were a little smarter, it would know that the result of summing i from 0 to N-1 is N·(N-1)/2. But I suppose that is too much to expect from a poor little program (spoiler: we’ll see).

Let’s continue. Just like before, let’s change the number of times we enter the loop. The results we get are as follows.

10^N    Result         us
3       499500         1
4       49995000       1
5       704982704      835
6       1783293664     8393
7       -2014260032    83946
8       887459712      839515
9       3051657984     8395178

Here we see that in the case of 1E5 the time is 835us, and that it increases almost perfectly linearly with the number of loop iterations. That is, in these cases the program is entering the loop.

But for the cases 1E3 and 1E4, the time is again 1us. That is, the compiler has been able to calculate the result without needing to run the loop. Is it possible that the compiler does know the formula N·(N-1)/2? And in that case, why does it know how to apply it for 1E3 and 1E4, but not from 1E5 onwards?

Well, for that we have to notice that the calculated result has overflowed. That is, it has exceeded the maximum size of the sum variable, which is an unsigned long. In the case of an ESP32, it is 4 bytes.
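
We can verify this reasoning ourselves. A minimal sketch in plain C++ (the function name is ours): computing the closed form in 64 bits and then truncating to 32 bits reproduces exactly the wrapped values we measured.

```cpp
#include <cstdint>

// Sum of i for i = 0 .. n-1, via the closed form (n-1)*n/2, computed in
// 64 bits and then truncated to 32 bits. The truncation is exactly what
// overflowing a 4-byte unsigned long does: the result wraps modulo 2^32.
uint32_t wrappedSum(uint64_t n) {
    uint64_t exact = (n - 1) * n / 2;     // Gauss closed form, no overflow in 64 bits
    return static_cast<uint32_t>(exact);  // modulo 2^32
}
```

For n = 1E6 this returns 1783293664, the same wrapped value the loop on the ESP32 produced.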

In the following table, I show the results of the calculation being performed, indicating when it would overflow the different types of variables.

Maximum values: int / long: 2.15E+09 · uint / ulong: 4.29E+09 · long long: 9.22E+18 · ulong long: 1.84E+19

10^N    Result               Overflows
1       4.50000000000E+01    —
2       4.95000000000E+03    —
3       4.99500000000E+05    —
4       4.99950000000E+07    —
5       4.99995000000E+09    int / long, uint / ulong
6       4.99999500000E+11    int / long, uint / ulong
7       4.99999950000E+13    int / long, uint / ulong
8       4.99999995000E+15    int / long, uint / ulong
9       4.99999999500E+17    int / long, uint / ulong
10      4.99999999950E+19    all four types

We see that precisely, for 1E4 the result has not overflowed. But for 1E5 onwards, it overflows. That is, the compiler knows how to calculate the operation without entering the loop as long as the result variable does not overflow.

If the variable overflows, the compiler has no choice but to run the loop to execute the calculation. And rightly so, because this calculation is no longer as trivial as the equation we mentioned. For this reason, we see this jump between 1E4 and 1E5 iterations.

What if we switch to a long long type variable of 8 bytes? Here are the results,

10^N    Result               us
3       499500               1
4       49995000             1
5       4999950000           7554
6       499999500000         75554
7       49999995000000       755657
8       4999999950000000     7565389
9       499999999500000000   76533470

We observe that the times are much higher. Logical, because 64-bit operations cost much more effort on the ESP32, which is a 32-bit processor. But, as we have already said, we do not care much about the specific values.

What is important is that the results are always correct, and in no case does the result variable overflow. And we see that the times increase linearly… starting from 1E5. We still have the same jump at 1E3 and 1E4, where the compiler is able to remove the loop.

Why does the jump persist below 1E5 if the variable no longer overflows? I understand this is because the processor is 32-bit: even though we are now using 64-bit variables, the calculation requires a larger number of operations, and that prevents the compiler from eliminating the loop.
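
To get an intuition for the extra cost, here is a sketch (ours, simplified) of what a single 64-bit addition implies on a 32-bit CPU: two 32-bit additions plus carry propagation.

```cpp
#include <cstdint>

// One 64-bit addition emulated with 32-bit words: add the low halves,
// detect the carry, then add the high halves plus the carry.
uint64_t add64Via32(uint32_t aLo, uint32_t aHi, uint32_t bLo, uint32_t bHi) {
    uint32_t lo = aLo + bLo;
    uint32_t carry = (lo < aLo) ? 1u : 0u;  // wraparound of the low word means a carry
    uint32_t hi = aHi + bHi + carry;
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

Every 64-bit ‘sum += i’ in the loop costs the processor something like this, which is part of why the 64-bit version is so much slower.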

Loop with random addition

Let’s try to make it harder for the compiler. How can we make the compiler unable to “get rid of” the loop in any case? That is, to enter it always, regardless of whether it is 1, 10 times, or 1,000,000 times that the loop is executed?

Well, for example, we can add a random value. This guarantees that the program will be forced to enter the loop to compute the sum; it will not be able to precompute anything.

unsigned long counter = 1000000;
unsigned long sum = 0;
for (size_t i = 0; i < counter; i++)
{
  sum += random(10);
}

In reality, it doesn’t have to be a random number. Any non-deterministic variable, such as a digital input, or a value received by a communication, would be valid as well.

We run it and get,

10^N    Result    us
1       random    8
2       random    68
3       random    678
4       random    6809
5       random    68143
6       random    681428
7       random    6814308
8       random    68143113

Logically, the times are much higher due to the overhead of generating the random number. But we don’t care, we are interested in the fact that it has not been able to simplify anything. Thus, we see that from 1E1 to 1E8, the time increases linearly, and it is approximately 10 times higher than the previous one at each level.

Perfect! We have just made a compiler’s life difficult! (you must be so proud…) And, along the way, we have confirmed that everything we assumed and explained earlier is indeed true.

Not using the result

Let’s see another example. Let’s go back to the case where we add ‘i’ 1E6 times. But this time, we are not going to pass the result to the ‘printResults()’ function. That is, in this case we use the ‘sum’ variable inside the loop, but not afterwards.

void benchmark()
{
  auto start = micros();
  
  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum += i;
  }
  
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, 0, microseconds);  // we are not using sum
}

If we run it, we see that the time goes from 8400us to 0

benchmark 1000000: 0 0us

That is, the compiler has detected that we are performing a calculation whose result we never use. Consequently, it completely removes all the associated code: it has gotten rid of the loop and all the variables involved in it.
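
If we ever want to prevent this dead-code elimination, for example in a benchmark, a common technique (not used in the sketches so far; the ‘sink’ variable here is our invention) is to store the result in a ‘volatile’ variable, whose writes the compiler must treat as real side effects:

```cpp
#include <cstdint>

volatile uint32_t sink;  // writes here are side effects the compiler must keep

// Same loop as before, but storing the result into a volatile variable
// prevents the compiler from deleting the loop as dead code.
void benchmarkKept(uint32_t counter) {
    uint32_t sum = 0;
    for (uint32_t i = 0; i < counter; i++) {
        sum += i;
    }
    sink = sum;  // forces 'sum' to count as used
}
```

This keeps the measured work in place without printing anything.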

What if we add random() instead of adding ‘i’ again?,

void benchmark()
{
  auto start = micros();
  
  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum += random(10);
  }
  
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, 0, microseconds);
}

Here, it has not been able to remove the loop even though the result is not being used anywhere after the calculation.

benchmark 1000000: 0 681200us

The reason is that the random() function is complex enough that the compiler does not know what it does. We know that it basically returns a number, but the compiler cannot determine what the function does. So it plays it safe and does not remove it.

Decremental loop

While we’re at it, let’s try another case. It is a common belief among more experienced programmers that decremental loops (i--) are faster than incremental loops (i++).

This is based on the theory that the comparison != 0 is faster than the comparison < or <=, so we should be able to scrape off a few valuable microseconds. Let’s put it to the test.

void benchmark()
{
  auto start = micros();

  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = counter - 1; i != 0; i--)
  {
    sum += i;
  }

  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, sum, microseconds);
}

And, as expected, the result is exactly the same as when executed incrementally.

benchmark 1000000: 1783293664 8393us

This was probably true 50 years ago, with very old processors and compilers. And even if there really were a performance difference, the compiler is smart enough to make that substitution for you.

Moreover, as I often tell you, nowadays it is best to avoid these kinds of “tricks”. Trying to hand-optimize the code may prevent the compiler from applying some of its own optimizations, and may even worsen performance.