C Mythbusters: Counting up vs. down

A while ago, I stumbled upon a guide to writing optimized C code. I had read a few of the ideas before and discounted them as “something the optimizer should really do for you”. In this and the next few posts, I will take individual pieces of advice and fact-check whether rewriting your code as suggested actually helps or not.

Let’s start with a simple proposition: counting down to zero is cheaper than counting up. A related proposition, which this post will investigate as well, is that comparing with != (not equal) is faster than comparing with < (less than) in a loop.

This should be easy enough to code, benchmark, and inspect at the assembly level. Please scroll down to the end of the post for a full disclaimer and a discussion of the test environment.

Counting up:

#include <stdint.h>
int main() {
  for (int64_t i=0; i<100000000ll; ++i) {
    asm("nop");
  }
}

Counting down:

#include <stdint.h>
int main() {
  for (int64_t i=100000000ll; i!=0; --i) {
    asm("nop");
  }
}

As you can see, both programs execute the same number of nops, but the up implementation counts up and compares using less than, while the down implementation counts down and compares for inequality with zero.

Let’s run it and benchmark:

$ perf stat -r 100 -d ./count_up
...
0.440219748 seconds time elapsed ( +-  0.73% )

$ perf stat -r 100 -d ./count_down
...
0.440001691 seconds time elapsed ( +-  0.68% )

What do you know! The exact same timing. Does this mean counting up with a less-than comparison and counting down with a not-equal comparison are equally fast on this machine? Let’s look at the assembly:

Up:

000082e4 <main>:
    82e4:   e5 9f 20 18     ldr r2, [pc, #24]
    82e8:   e3 a0 30 00     mov r3, #0
    82ec:   e1 a0 00 00     nop
    82f0:   e2 83 30 01     add r3, r3, #1
    82f4:   e1 53 00 02     cmp r3, r2
    82f8:   1a ff ff fb     bne 82ec <main+0x8>
    ...
    8304:   05 f5 e1 00     .word   0x05f5e100

Down:

000082e4 <main>:
    82e4:   e5 9f 20 18     ldr r2, [pc, #24]
    82e8:   e3 a0 30 00     mov r3, #0
    82ec:   e1 a0 00 00     nop
    82f0:   e2 83 30 01     add r3, r3, #1
    82f4:   e1 53 00 02     cmp r3, r2
    82f8:   1a ff ff fb     bne 82ec <main+0x8>
    ...
    8304:   05 f5 e1 00     .word   0x05f5e100

Yep, it looks like I ran objdump -S on the same file twice; however, I did not. Both programs actually compile to the same assembly. Even cooler: with -O0 the two generate completely different assembly, but with optimization enabled they converge to the exact same code. Our benchmarking results therefore make a lot of sense.

Therefore, both propositions (counting down to zero is cheaper, and comparing with != is faster than <) are busted for cases where the compiler is free to make this optimization. One must not discount that compilers have advanced quite a bit in the ten years since the article I referenced earlier was written, but this is still good to know. I remember at least one occasion when someone suggested rewriting a somewhat hot loop this way, and this result tells us that – as long as the loop is changed the way it was changed here – it is probably not going to gain anything. Rewriting a loop construct like this is only sensible for the absolute hottest of loops in the first place, and even then counting down is not a bulletproof solution.

Before we look at the case where the compiler cannot optimize the loop, let’s quickly check the results on more common platforms.

Other platforms:

Intel® Xeon® CPU X7560 @ 2.27GHz, gcc 4.9:

0.089669005 seconds time elapsed ( +-  0.01% )
0.089674167 seconds time elapsed ( +-  0.01% )

Interestingly, on this machine gcc counts down, but again optimizes both programs to the exact same assembly. A guess as to why: mov can take the constant 0x5f5e100 as an immediate, and counting down to zero means the loop needs neither a second register to hold the limit nor a separate cmp instruction.

0000000000400400 <main>:
  400400:   b8 00 e1 f5 05          mov    $0x5f5e100,%eax
  400405:   0f 1f 00                nopl   (%rax)
  400408:   90                      nop
  400409:   48 83 e8 01             sub    $0x1,%rax
  40040d:   75 f9                   jne    400408 <main+0x8>
  40040f:   31 c0                   xor    %eax,%eax
  400411:   c3                      retq   

Intel® Atom™ CPU N2800 @ 1.86GHz, gcc 4.8.2:

0.113469278 seconds time elapsed ( +-  0.60% )
0.112894989 seconds time elapsed ( +-  0.23% )

Ditto: both programs again compile to the same countdown loop using sub.

0000000000400400 <main>:
  400400:   b8 00 e1 f5 05          mov    $0x5f5e100,%eax
  400405:   0f 1f 00                nopl   (%rax)
  400408:   90                      nop
  400409:   48 83 e8 01             sub    $0x1,%rax
  40040d:   75 f9                   jne    400408 <main+0x8>
  40040f:   31 c0                   xor    %eax,%eax
  400411:   c3                      retq   

Loops with a data-dependent body – from C to machine code

Wisely, Thomas Neumann suggested going beyond the question of whether writing your C nop loop in a “count up” or “count down” manner matters. This raises a much more important question that has so far been neglected in this write-up: what if the compiler simply cannot change the direction in which your loop counts, because you use the loop variable inside the loop’s body in a way that does not give the compiler that freedom?
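
To make this concrete, here is a hypothetical illustration of such a loop (my own sketch, not code from the post or the GitHub repository): the running value h depends both on the previous iteration and on the order in which the characters are visited, so the compiler is not free to simply flip the loop into a countdown the way it did for the nop loops above.

#include <stdint.h>

/* Order-sensitive loop body: h in iteration i depends on h from
 * iteration i-1 and on which character is visited, so running the
 * same body from n-1 down to 0 would compute a different result. */
int64_t polynomial_hash(const char *s, int64_t n) {
  int64_t h = 0;
  for (int64_t i = 0; i < n; ++i)
    h = h * 31 + s[i];
  return h;
}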

To generate a sensible benchmark for this deeper investigation, we examine the instructions required to count up versus down and stop gcc from optimizing one into the other. From the previous examples, we assume that there is in fact a difference in execution time between the instructions used to count up and those used to count down. If there were none, gcc would likely not go through the hassle of optimizing one into the other (and in different directions, too: on ARM, gcc converts both loops to counting up, while on the two x86 platforms examined it converts both to counting down).

In C, we go with Thomas’s suggestion (see code on GitHub, shortened here for readability) and implement the benchmarks along the lines of this C code snippet:

...
const int64_t l=100000000ll;
/* The "r"(i) input constraint forces the current value of i into a
 * register on every iteration, so gcc cannot change the loop direction. */
void insn_down() {
  for (int64_t i=l; i!=0; --i) asm volatile("nop"::"r"(i));
}
void insn_up_eq() {
  for (int64_t i=0; i!=l; ++i) asm volatile("nop"::"r"(i));
}
void insn_up_lt() {
  for (int64_t i=0; i<l; ++i) asm volatile("nop"::"r"(i));
}
...

Now we validate that gcc indeed generates the code we were looking for:

00000000004005f0 <insn_up_eq>:
  ...
  400600:   90                      nop
  400601:   48 83 c0 01             add    $0x1,%rax
  400605:   48 39 c7                cmp    %rax,%rdi
  400608:   75 f6                   jne    400600 <insn_up_eq+0x10>
  ...
00000000004005a0 <insn_down>:
  4005b0:   90                      nop
  4005b1:   48 83 ef 01             sub    $0x1,%rdi
  4005b5:   75 f9                   jne    4005b0 <insn_down+0x10>
  ...
00000000004005c0 <insn_up_lt>:
  4005d8:   90                      nop
  4005d9:   48 83 c0 01             add    $0x1,%rax
  4005dd:   48 39 c7                cmp    %rax,%rdi
  4005e0:   7f f6                   jg     4005d8 <insn_up_lt+0x18>

This looks exactly like what we were aiming for; however, it is x86. For ARM, this C code leads to terrible generated code, so for the purposes of this evaluation we will not look at ARM for now, as an ARM benchmark would require manually written (inline) assembly.
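
The measurements below were taken with perf on the full benchmark from GitHub. Purely as a sketch of what a minimal, self-contained driver for these functions could look like (the clock_gettime-based timing and this main function are my own assumptions, not the author’s harness), one could write:

/* Hypothetical driver for the loop functions above; link it together
 * with the file that defines them, e.g. gcc -O2 driver.c loops.c
 * (older glibc versions may additionally need -lrt). */
#include <stdio.h>
#include <time.h>

void insn_down(void);
void insn_up_eq(void);
void insn_up_lt(void);

/* Wall-clock time of one call to fn, in seconds. */
static double seconds(void (*fn)(void)) {
  struct timespec a, b;
  clock_gettime(CLOCK_MONOTONIC, &a);
  fn();
  clock_gettime(CLOCK_MONOTONIC, &b);
  return (double)(b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main(void) {
  printf("insn dn: %f s\n", seconds(insn_down));
  printf("insn up: %f s\n", seconds(insn_up_eq));
  printf("insn lt: %f s\n", seconds(insn_up_lt));
  return 0;
}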

Measuring this on the same Atom CPU used above, we get the following results:

count up:        0.114563958 seconds time elapsed ( +-  0.92% )
count dn:        0.112721510 seconds time elapsed ( +-  0.24% )
insn  up:        0.166884574 seconds time elapsed ( +-  0.28% )
insn  dn:        0.113022031 seconds time elapsed ( +-  0.51% )
insn  lt:        0.193557023 seconds time elapsed ( +-  0.31% )

Quite obviously, counting down is the better choice on this machine (the Intel Atom) at the machine-code level. This matches the fact that insn_down needs one instruction fewer, as seen above. Also, gcc makes the right decision and optimizes both ways of expressing the loop in C down to the faster variant.

Running the same benchmark on the Intel Xeon mentioned before, we see a completely different picture. First, the assembly:

00000000004005f0 <insn_up_eq>:
  ...
  400600:   90                      nop
  400601:   48 83 c0 01             add    $0x1,%rax
  400605:   48 39 c7                cmp    %rax,%rdi
  400608:   75 f6                   jne    400600 <insn_up_eq+0x10>
00000000004005a0 <insn_down>:
  4005b0:   90                      nop
  4005b1:   48 83 ef 01             sub    $0x1,%rdi
  4005b5:   75 f9                   jne    4005b0 <insn_down+0x10>
00000000004005c0 <insn_up_lt>:
  ...
  4005d8:   90                      nop
  4005d9:   48 83 c0 01             add    $0x1,%rax
  4005dd:   48 39 c7                cmp    %rax,%rdi
  4005e0:   7f f6                   jg     4005d8 <insn_up_lt+0x18>

While the assembly is again one instruction shorter for the countdown loop, the benchmark results paint a different picture:

count up:        0.089699257 seconds time elapsed ( +-  0.01% )
count dn:        0.089653781 seconds time elapsed ( +-  0.01% )
insn  up:        0.089669413 seconds time elapsed ( +-  0.02% )
insn  dn:        0.089669189 seconds time elapsed ( +-  0.02% )
insn  lt:        0.089663413 seconds time elapsed ( +-  0.02% )

All implementations are equally fast. Regardless of whether the loop is expressed in C as counting up or down, and regardless of whether the generated machine code counts up from zero or down to zero, the performance of the loop is exactly the same.

Summary:

In summary, we can say the following: if the compiler is free to choose the direction in which the loop counts, you don’t need to worry – it will do the right thing. If a data dependency stops your loop from being optimized in this fashion, the question becomes more interesting. On a fairly wimpy core like my Intel Atom, counting down and comparing for (in-)equality was faster. On a premium CPU like the Intel Xeon tested here, it did not matter at all: all implementations of the for loop were equally fast.

As a general takeaway, one should also remember that the difference between the two loop directions (if there is one on your architecture) only matters noticeably if next to no work is done in the loop’s body. For instance, if you access memory in the loop, that will most likely dominate the cost of executing the loop completely.
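
As a rough illustration of that point (again my own sketch, not code from the post), consider summing an array: the loads dominate, and whether the index runs up or down is lost in the noise.

#include <stdint.h>
#include <stddef.h>

/* Both versions are bound by memory accesses, not by the add/cmp
 * bookkeeping of the loop counter. */
int64_t sum_up(const int64_t *a, size_t n) {
  int64_t s = 0;
  for (size_t i = 0; i < n; ++i)
    s += a[i];
  return s;
}

int64_t sum_down(const int64_t *a, size_t n) {
  int64_t s = 0;
  for (size_t i = n; i != 0; --i)
    s += a[i - 1];
  return s;
}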

Disclaimer and test environment:

Unless otherwise noted, all measurements quoted here were taken with this setup:

Code on GitHub.
