Optimizing Immutable and Purity
written by Walter Bright
December 22, 2008
In earlier installments, I talked about immutable data and pure functions offering opportunities for better compiler optimization. I've been working on taking advantage of these in the D programming language compiler, and can now show what those optimizations look like.
To recap, immutable data is data that, once set, cannot be modified. It cannot be modified by the current thread, any other thread, or any other reference to that same data. As far as the program logic is concerned, it is engraved in stone. Immutability of data is also transitive, meaning that anything reachable through that immutable data by following pointers, arrays, or other references, is also immutable. Imagine the data in a program as a large graph with values at nodes and directed edges drawn by following pointers; then, immutable data forms subgraphs of that graph that have an interesting property. You can start from a mutable data node and get to immutable data; but once you’re in immutable-land, there’s no way to get back to a mutable node.
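To make the transitive part concrete, here is a minimal sketch in D (the struct and names are mine, not from the original examples) of what an immutable reference does and does not allow:

struct Node
{
    int value;
    Node* next;
}

int firstValue(immutable Node* list)
{
    // list.value = 1;       // error: cannot modify immutable data
    // list.next.value = 2;  // error: immutability is transitive; anything
    //                       //        reachable through list is immutable too
    return list.value;       // reading through the immutable reference is fine
}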
A pure function is one that has no side effects, and depends only upon its input parameters. That means it cannot read any mutable state that is not local to the pure function, it cannot write to any global state, and it cannot transitively modify any of its parameters (transitive meaning anything reachable through its parameters). A pure function’s only effect is its return value.
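A quick sketch of what that rules out (again, an illustration of mine rather than code from the article):

int total;                   // mutable global state

pure int square(int x)
{
    // total += x;           // error: a pure function may not write global state
    // return x * total;     // error: nor may it read mutable globals
    return x * x;            // result depends only on the parameter
}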
For these examples, I’ll use the D programming language compiler that is currently under development. I encourage you to try the equivalent code out in your favorite language and compiler, and compare. (One of the reasons I’m a compiler nerd is I think it’s fun to examine the assembler output of a compiler, like a motorhead likes to tear an engine apart.)
Let’s start with a pretty simple example:
int bar(int);

int foo(int i)
{
    return bar(i) + bar(i);
}
(On the face of it, this looks like simply bad code that could and should be changed by hand into the sensible 2 * bar(i). Let's not forget, however, that such code often results from earlier optimization passes that progressively inline and fold other functions and code blocks, exposing redundancy to later steps.)
Compile it and look at the assembler output with:
dmd -c -O test.d
obj2asm test.obj > log
and the assembler output for the function looks like this (I edited it a bit for readability and added comments):
push EAX          ; save argument i on stack
call bar          ; call bar(i)
push EAX          ; save return value of bar(i) on stack
sub  ESP,4        ; not really needed
mov  EAX,8[ESP]   ; reload argument i to pass to bar()
call bar          ; call bar(i)
add  ESP,4        ; undo previous unneeded op
mov  ECX,EAX      ; put in ECX result of second bar(i)
pop  EAX          ; put in EAX the result of the first bar(i) call
add  EAX,ECX      ; add the two results together
pop  ECX          ; clean off stack
ret               ; return to caller
Why is bar(i) called twice? Because there's no way for the compiler to know what bar(i) is doing: it may be incrementing a global counter or printing stuff to a log file. So it must assume the worst and call it twice.
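As a concrete illustration (hypothetical, not from the article), here is the kind of impure bar the compiler has to guard against:

int calls;                   // global counter

int bar(int i)               // not pure
{
    ++calls;                 // side effect the rest of the program can observe
    return i * 2;
}

Folding the two calls into one would leave calls incremented once instead of twice, changing the program's behavior. But what if bar is pure?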
pure int bar(int);

int foo(int i)
{
    return bar(i) + bar(i);
}
Now the assembler output is:
push EAX          ; save argument i on stack
call bar          ; call bar(i)
add  EAX,EAX      ; double result
pop  ECX          ; clean stack
ret               ; return to caller
bar(i) is called only once. Since it is a pure function and i didn't change, it must return the same value if called again, so the second call is unnecessary. Notice that the compiler did not have access to bar's body; the pure declaration was all it needed to optimize the second call away. At the definition of bar, the compiler will make sure that bar doth chastely fulfill the purity promise.
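For instance, if a definition of bar tried to sneak in a side effect, the compiler would reject it. A sketch (mine, not from the article) of the kind of violation it catches, shown commented out:

import std.stdio;

pure int bar(int i)
{
    // writefln("bar(%s)", i);  // error: the impure function writefln
    //                          //        cannot be called from pure bar
    return i + 1;
}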
And yes, if the optimizer were as good as my 1337 hand coding assembler skillz, I'd write it as:
call bar          ; call bar(i)
add  EAX,EAX      ; double result
ret               ; return to caller
but it’s hard to get every trick I know wired into the compiler. I’ve never yet met a compiler I couldn’t beat with carefully hand-crafted assembler.
Now for immutability:
int foo(int* a, int* b)
{
    int i = *a * *b;
    int j = *a * *b;
    return i + j;
}
and:
push EAX
mov  ECX,8[ESP]
mov  EAX,[ECX]    ; EAX = *a
mov  EDX,[ESP]
imul EAX,[EDX]    ; EAX *= *b
add  EAX,EAX
pop  ECX
ret  4
Wait a minute. Although (*a * *b) appears twice, the compiler only computed it once and doubled the result. This is called common subexpression elimination. So far, so good. Let's now rewrite our example as:
int bar();

int foo(int* a, int* b)
{
    int i = *a * *b;
    bar();
    int j = *a * *b;
    return i + j;
}
sub  ESP,0Ch
mov  ECX,010h[ESP]
push EBX
mov  8[ESP],EAX
mov  EDX,[ECX]
imul EDX,[EAX]
mov  0Ch[ESP],EDX
call bar
mov  EAX,014h[ESP]
mov  EAX,[EAX]
mov  EBX,8[ESP]
imul EAX,[EBX]
add  EAX,0Ch[ESP]
pop  EBX
add  ESP,0Ch
ret  4
That sure threw a monkey wrench into the results. (*a * *b) is no longer a common subexpression, and gets recomputed. This is because the optimizer has no idea what bar() does — it could modify whatever it is that a and b point to, so it must play it safe and recompute them. But if a and b pointed to immutable data, we’d know that bar() could not affect them:
int bar();

int foo(immutable int* a, immutable int* b)
{
    int i = *a * *b;
    bar();
    int j = *a * *b;
    return i + j;
}
sub  ESP,0Ch
mov  ECX,010h[ESP]
mov  EDX,[ECX]
push EBX
imul EDX,[EAX]
mov  0Ch[ESP],EDX
mov  8[ESP],EDX
call bar
mov  EAX,8[ESP]
mov  EBX,0Ch[ESP]
lea  EAX,[EBX][EAX]
pop  EBX
add  ESP,0Ch
ret  4
and so the second multiply instruction has been elided, because no matter what bar did, it cannot alter immutable data. But wait, what if a and b were not immutable, but bar was pure?
pure int bar();

int foo(int* a, int* b)
{
    int i = *a * *b;
    bar();
    int j = *a * *b;
    return i + j;
}
push EAX
mov  ECX,8[ESP]
mov  EAX,[ECX]
mov  EDX,[ESP]
imul EAX,[EDX]
add  EAX,EAX
pop  ECX
ret  4
In this case the optimizer outdid itself: it made the call to bar() vanish completely, and then nicely squished the two multiplications down to one. This behavior is correct because bar() has no side effects and its result is discarded, so the compiler can discard the entire call. By the very definition of purity, you won't ever be able to notice the absent call unless you are willing to place a precise thermal probe on the processor and detect the drop in the heat produced.
Now, on to yet another optimization allowed by the immutable/pure duo: code motion across pure calls. Let’s not discard the result of bar():
pure int bar();

int foo(int* a, int* b)
{
    int i = *a * *b;
    int k = bar();
    int j = k + *a * *b;
    return i + j;
}
sub  ESP,0Ch
mov  ECX,010h[ESP]
mov  EDX,[ECX]
push EBX
imul EDX,[EAX]
mov  0Ch[ESP],EDX
mov  8[ESP],EDX
call bar
mov  EBX,8[ESP]
lea  EAX,[EAX][EBX]
add  EAX,0Ch[ESP]
pop  EBX
add  ESP,0Ch
ret  4
We’re back to only one multiply. This example shows how pure functions affect the optimizations of the code that surrounds calls to them.
Despite all the gains in CPU technology, one rule that is fairly consistent is that fewer instructions execute faster. Purity and immutability have resulted in fewer instructions, so let’s do a little benchmark.
int bar(int) { return 1; }
pure int purebar(int) { return 1; }

int foo1(int i) { return bar(i) + bar(i); }
int foo2(int i) { return purebar(i) + purebar(i); }

int foo3(int* a, int* b)
{
    int i = *a * *b;
    int j = *a * *b;
    return i + j;
}

int foo4(int* a, int* b)
{
    int i = *a * *b;
    bar(1);
    int j = *a * *b;
    return i + j;
}

int foo5(immutable int* a, immutable int* b)
{
    int i = *a * *b;
    bar(1);
    int j = *a * *b;
    return i + j;
}

int foo6(int* a, int* b)
{
    int i = *a * *b;
    purebar(1);
    int j = *a * *b;
    return i + j;
}

int foo7(int* a, int* b)
{
    int i = *a * *b;
    int k = bar(1);
    int j = k + *a * *b;
    return i + j;
}

int foo8(int* a, int* b)
{
    int i = *a * *b;
    int k = purebar(1);
    int j = k + *a * *b;
    return i + j;
}

void main()
{
    int a, b;
    immutable int c = 7;
    immutable int d = 8;

    foreach (i; 0 .. 500_000_000)
    {
        // foo1(1);
        // foo2(1);
        // foo3(&a, &b);
        // foo4(&a, &b);
        // foo5(&c, &d);
        // foo6(&a, &b);
        // foo7(&a, &b);
        // foo8(&a, &b);
    }
}
We compile with full optimization, except that inlining is turned off so the optimizer will not take advantage of the trivial body of bar; a suitable build command is sketched below. Each call to foo is uncommented in turn, and the time for that program to run is measured.
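The article doesn't spell out the build command; with dmd it would be something along these lines, the point being that -inline is left off so bar remains an out-of-line call (the file name test.d is my assumption):

dmd -O test.d

The resulting times were: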
foo1    8.39
foo2    5.73
foo3    4.47
foo4    8.35
foo5    9.12  !!
foo6    4.46
foo7    8.33
foo8    7.68
Things turned out as expected, except that foo5 is slower than foo4, despite doing one less multiply. It turns out that, on the particular computer I used, a multiply is less expensive than caching a value on the stack. (This just goes to show what a tricky business optimization can be.) If more than a single multiply were saved, it would come out faster.
On a real program, how much faster is my code going to get, if I maximize use of pure functions and immutable data? I don’t know. I don’t have any experience with it on a large program. I doubt it will prove to be any barn-burner optimization, but compiler optimizers these days have gotten so good that there isn’t much gain left to be gotten, and every bit helps.
One nice thing about automated optimizations vs. hand-written ones is that the compiler, just like the terminator on a mission, will never, ever overlook applying them. That's in contrast to you and me, who to a first approximation are less pedantic and do have a life. So once you've given the compiler the immutable and pure hints, it will always and everywhere try to take advantage of them.
I also wouldn’t say these optimizations are the primary motivation for using immutability and purity, but they are a nice dividend.