Optimizing Immutable and Purity
written by Walter Bright
December 22, 2008
In earlier installments, I talked about immutable data and pure functions offering opportunities for better compiler optimization. I've been working on taking advantage of these in the D programming language compiler, and can now show what those optimizations look like.
To recap, immutable data is data that, once set, cannot be modified. It cannot be modified by the current thread, any other thread, or any other reference to that same data. As far as the program logic is concerned, it is engraved in stone. Immutability of data is also transitive, meaning that anything reachable through that immutable data by following pointers, arrays, or other references, is also immutable. Imagine the data in a program as a large graph with values at nodes and directed edges drawn by following pointers; then, immutable data forms subgraphs of that graph that have an interesting property. You can start from a mutable data node and get to immutable data; but once you’re in immutable-land, there’s no way to get back to a mutable node.
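To make the transitive part concrete, here is a minimal sketch in D (the struct and names are mine, not from the original examples) of what an immutable reference does and does not allow:

struct Node
{
    int value;
    Node* next;
}

int firstValue(immutable Node* list)
{
    // list.value = 1;       // error: cannot modify immutable data
    // list.next.value = 2;  // error: immutability is transitive; anything
    //                       //        reachable through list is immutable too
    return list.value;       // reading through the immutable reference is fine
}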
A pure function is one that has no side effects, and depends only upon its input parameters. That means it cannot read any mutable state that is not local to the pure function, it cannot write to any global state, and it cannot transitively modify any of its parameters (transitive meaning anything reachable through its parameters). A pure function’s only effect is its return value.
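A quick sketch of what that rules out (again, an illustration of mine rather than code from the article):

int total;                   // mutable global state

pure int square(int x)
{
    // total += x;           // error: a pure function may not write global state
    // return x * total;     // error: nor may it read mutable globals
    return x * x;            // result depends only on the parameter
}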
For these examples, I’ll use the D programming language compiler that is currently under development. I encourage you to try the equivalent code out in your favorite language and compiler, and compare. (One of the reasons I’m a compiler nerd is I think it’s fun to examine the assembler output of a compiler, like a motorhead likes to tear an engine apart.)
Let’s start with a pretty simple example:
int bar(int);

int foo(int i)
{
    return bar(i) + bar(i);
}
(On the face of it, this looks like simply bad code that could and should be changed by hand into the sensible 2 * bar(i). Let's not forget, however, that such code often results from earlier optimization passes that progressively inline and fold other functions and code blocks, exposing redundancy to later steps.)
Compile it and look at the assembler output with:
dmd -c -O test.d
obj2asm test.obj > log
and the assembler output for the function looks like this (I edited it a bit for readability and added comments):
push EAX          ; save argument i on stack
call bar          ; call bar(i)
push EAX          ; save return value of bar(i) on stack
sub  ESP,4        ; not really needed
mov  EAX,8[ESP]   ; reload argument i to pass to bar()
call bar          ; call bar(i)
add  ESP,4        ; undo previous unneeded op
mov  ECX,EAX      ; put in ECX result of second bar(i)
pop  EAX          ; put in EAX the result of the first bar(i) call
add  EAX,ECX      ; add the two results together
pop  ECX          ; clean off stack
ret               ; return to caller
Why is bar(i) called twice? Because there's no way for the compiler to know what bar(i) is doing: it may be incrementing a global counter or printing stuff to a log file. So it must assume the worst and call it twice.
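As a concrete illustration (hypothetical, not from the article), here is the kind of impure bar the compiler has to guard against:

int calls;                   // global counter

int bar(int i)               // not pure
{
    ++calls;                 // side effect the rest of the program can observe
    return i * 2;
}

Folding the two calls into one would leave calls incremented once instead of twice, changing the program's behavior. But what if bar is pure?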
pure int bar(int);

int foo(int i)
{
    return bar(i) + bar(i);
}
Now the assembler output is:
push EAX          ; save argument i on stack
call bar          ; call bar(i)
add  EAX,EAX      ; double result
pop  ECX          ; clean stack
ret               ; return to caller
bar(i) is called only once. Since it is a pure function and i didn't change, it must return the same value if called again, so the second call is unnecessary. Notice that the compiler did not have access to bar's body; the pure declaration was all it needed to optimize the second call away. At the definition of bar, the compiler will make sure that bar doth chastely fulfill the purity promise.
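For instance, if a definition of bar tried to sneak in a side effect, the compiler would reject it. A sketch (mine, not from the article) of the kind of violation it catches, shown commented out:

import std.stdio;

pure int bar(int i)
{
    // writefln("bar(%s)", i);  // error: the impure function writefln
    //                          //        cannot be called from pure bar
    return i + 1;
}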
And yes, if the optimizer were as good as my 1337 hand coding assembler skillz, I'd write it as:
call bar          ; call bar(i)
add  EAX,EAX      ; double result
ret               ; return to caller
but it’s hard to get every trick I know wired into the compiler. I’ve never yet met a compiler I couldn’t beat with carefully hand-crafted assembler.
Now for immutability:
int foo(int* a, int* b)
{
    int i = *a * *b;
    int j = *a * *b;
    return i + j;
}
and:
push EAX
mov  ECX,8[ESP]
mov  EAX,[ECX]    ; EAX = *a
mov  EDX,[ESP]
imul EAX,[EDX]    ; EAX *= *b
add  EAX,EAX
pop  ECX
ret  4
Wait a minute. Although (*a * *b) appears twice, the compiler only computed it once and doubled the result. This is called common subexpression elimination. So far, so good. Let's now rewrite our example as:
int bar();

int foo(int* a, int* b)
{
    int i = *a * *b;
    bar();
    int j = *a * *b;
    return i + j;
}
sub  ESP,0Ch
mov  ECX,010h[ESP]
push EBX
mov  8[ESP],EAX
mov  EDX,[ECX]
imul EDX,[EAX]
mov  0Ch[ESP],EDX
call bar
mov  EAX,014h[ESP]
mov  EAX,[EAX]
mov  EBX,8[ESP]
imul EAX,[EBX]
add  EAX,0Ch[ESP]
pop  EBX
add  ESP,0Ch
ret  4
That sure threw a monkey wrench into the results. (*a * *b) is no longer a common subexpression, and gets recomputed. This is because the optimizer has no idea what bar() does — it could modify whatever it is that a and b point to, so it must play it safe and recompute them. But if a and b pointed to immutable data, we’d know that bar() could not affect them:
int bar();

int foo(immutable int* a, immutable int* b)
{
    int i = *a * *b;
    bar();
    int j = *a * *b;
    return i + j;
}
sub  ESP,0Ch
mov  ECX,010h[ESP]
mov  EDX,[ECX]
push EBX
imul EDX,[EAX]
mov  0Ch[ESP],EDX
mov  8[ESP],EDX
call bar
mov  EAX,8[ESP]
mov  EBX,0Ch[ESP]
lea  EAX,[EBX][EAX]
pop  EBX
add  ESP,0Ch
ret  4
and so the second multiply instruction has been elided, because no matter what bar did, it cannot alter immutable data. But wait, what if a and b were not immutable, but bar was pure?
pure int bar();

int foo(int* a, int* b)
{
    int i = *a * *b;
    bar();
    int j = *a * *b;
    return i + j;
}
push EAX
mov  ECX,8[ESP]
mov  EAX,[ECX]
mov  EDX,[ESP]
imul EAX,[EDX]
add  EAX,EAX
pop  ECX
ret  4
In this case the optimizer outdid itself: it made the call to bar() vanish completely, and then nicely squished the two multiplications down to one. This behavior is correct because bar() has no side effects and its result is discarded, so the compiler can discard the entire call. By the very definition of purity, you won't ever be able to notice the absent call unless you are willing to place a precise thermal probe on the processor and detect the drop in the heat produced.
Now, on to yet another optimization allowed by the immutable/pure duo: code motion across pure calls. Let’s not discard the result of bar():
pure int bar();

int foo(int* a, int* b)
{
    int i = *a * *b;
    int k = bar();
    int j = k + *a * *b;
    return i + j;
}
sub  ESP,0Ch
mov  ECX,010h[ESP]
mov  EDX,[ECX]
push EBX
imul EDX,[EAX]
mov  0Ch[ESP],EDX
mov  8[ESP],EDX
call bar
mov  EBX,8[ESP]
lea  EAX,[EAX][EBX]
add  EAX,0Ch[ESP]
pop  EBX
add  ESP,0Ch
ret  4
We’re back to only one multiply. This example shows how pure functions affect the optimizations of the code that surrounds calls to them.
Despite all the gains in CPU technology, one rule that is fairly consistent is that fewer instructions execute faster. Purity and immutability have resulted in fewer instructions, so let’s do a little benchmark.
int bar(int) { return 1; }
pure int purebar(int) { return 1; }

int foo1(int i) { return bar(i) + bar(i); }
int foo2(int i) { return purebar(i) + purebar(i); }

int foo3(int* a, int* b)
{
    int i = *a * *b;
    int j = *a * *b;
    return i + j;
}

int foo4(int* a, int* b)
{
    int i = *a * *b;
    bar(1);
    int j = *a * *b;
    return i + j;
}

int foo5(immutable int* a, immutable int* b)
{
    int i = *a * *b;
    bar(1);
    int j = *a * *b;
    return i + j;
}

int foo6(int* a, int* b)
{
    int i = *a * *b;
    purebar(1);
    int j = *a * *b;
    return i + j;
}

int foo7(int* a, int* b)
{
    int i = *a * *b;
    int k = bar(1);
    int j = k + *a * *b;
    return i + j;
}

int foo8(int* a, int* b)
{
    int i = *a * *b;
    int k = purebar(1);
    int j = k + *a * *b;
    return i + j;
}

void main()
{
    int a, b;
    immutable int c = 7;
    immutable int d = 8;

    foreach (i; 0 .. 500_000_000)
    {
        // foo1(1);
        // foo2(1);
        // foo3(&a, &b);
        // foo4(&a, &b);
        // foo5(&c, &d);
        // foo6(&a, &b);
        // foo7(&a, &b);
        // foo8(&a, &b);
    }
}
We compile with full optimization, except that inlining is turned off so the optimizer will not take advantage of the trivial body of bar; a suitable build command is sketched below. Each call to foo is uncommented in turn, and the time for that program to run is measured.
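The article doesn't spell out the build command; with dmd it would be something along these lines, the point being that -inline is left off so bar remains an out-of-line call (the file name test.d is my assumption):

dmd -O test.d

The resulting times were: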
foo1    8.39
foo2    5.73
foo3    4.47
foo4    8.35
foo5    9.12  !!
foo6    4.46
foo7    8.33
foo8    7.68
Things turned out as expected, except that foo5 is slower than foo4, despite doing one less multiply. It turns out that, on the particular computer I used, a multiply is less expensive than caching a value on the stack. (This just goes to show what a tricky business optimization can be.) If more than a single multiply were saved, it would come out faster.
On a real program, how much faster is my code going to get, if I maximize use of pure functions and immutable data? I don’t know. I don’t have any experience with it on a large program. I doubt it will prove to be any barn-burner optimization, but compiler optimizers these days have gotten so good that there isn’t much gain left to be gotten, and every bit helps.
One nice thing about automated optimizations vs. hand-written ones is that the compiler, just like the terminator on a mission, will never, ever overlook applying them. That's in contrast to you and me, who to a first approximation are less pedantic and do have a life. So once you've given the compiler the immutable and pure hints, it will always and everywhere try to take advantage of them.
I also wouldn’t say these optimizations are the primary motivation for using immutability and purity, but they are a nice dividend.