c++ - alignment of automatic variables in DMC
- Laurentiu Pancescu (36/36) Feb 05 2002 Here's a test case:
- Walter (8/44) Feb 05 2002 Thanks for tracking this down! I'll definitely look into a fix. -Walter
- Walter (13/49) Feb 06 2002 Interestingly, this makes 3:1 difference in speed on my machine. The
- Laurentiu Pancescu (61/67) Feb 06 2002 The speed increase is about the same factor on my Athlon (exec time 14
- Walter (17/75) Feb 06 2002 The trouble is, if I align ESP, then the function can't access the passe...
- Jan Knepper (2/6) Feb 06 2002 Nevertheless sounds like something you would do anyways...
- Heinz Saathoff (9/13) Feb 07 2002 Maybe it's not necessary to adjust EPB or ESP when you know that at
- Walter (27/95) Feb 07 2002 I did some investigating. GCC does some fiddling so that each function
- Roland (4/8) Feb 07 2002 can i try too ?
- Laurentiu Pancescu (7/8) Feb 08 2002 or -o+speed).
- Walter (5/12) Feb 08 2002 days...
- Laurentiu Pancescu (8/10) Feb 08 2002 like
- Walter (16/26) Feb 08 2002 Here's what I get:
- Laurentiu Pancescu (8/36) Feb 08 2002 Strange... could you please try sending the attachment as MIME? I alway...
- Roland (19/28) Feb 07 2002 Why not ?
Here's a test case:

    /* test.c */
    #include <stdio.h>
    #include <time.h>

    int main( int argc, char *argv[] )
    {
        int i;
        double x, y, z;
        clock_t now;

        printf("i %p, x %p, y %p, z %p\n", &i, &x, &y, &z);
        now = clock();
        z = 0;
        for( i = 1; i < 200000000; i++ )
        {
            x = i - 1;
            y = x - 1;
            y = x * y;
            z += y;
        };
        printf("%g\n", z );
        printf("elapsed time: %g\n", (double)(clock() - now) / CLOCKS_PER_SEC);
        return 0;
    }

If compiled with -o+all, the double variables are aligned at a 4 byte boundary, while -o+space makes them aligned at 8 byte boundary, leading to a significantly better performance (just try it!). A workaround is to declare the int *after* the doubles, and still compile with -o+all. This trick doesn't work with BCC, because it thinks it knows better, and rearranges the order of variables on the stack, so you can't avoid performance loss for BCC 5.5.1, AFAIK.

GCC seems to align almost anything, including char[] vectors, at 8 or 16 byte boundaries, so it always provides best performance. If you have gcc, use "-O9 -funroll-loops -mcpu=pentiumpro", to compare the speed.

I think it would be very nice if DMC would get smarter about this (I use an AMD Athlon - are other x86 processors less sensitive about this?), but that's up to Walter, isn't it?

Laurentiu
Feb 05 2002
Thanks for tracking this down! I'll definitely look into a fix.

-Walter
Feb 05 2002
Interestingly, this makes 3:1 difference in speed on my machine. The problem, however, is it's not related to optimization. It's just the lay of how things wind up on the stack. The calling conventions specify a 4 byte aligned stack. I don't see at the moment how dynamically adjusting it to 8 bytes within a function is going to work.

-Walter
Feb 06 2002
The speed increase is about the same factor on my Athlon (exec time 14 seconds, as opposed to 4), and, since I saw that -o+space aligns auto variables at 8 bytes in 2 programs that I used for testing, I assumed it was no coincidence.

I'm not very sure what you mean by "dynamically adjusting the stack to 8 bytes", so I'm sorry if the following doesn't match the *real* meaning of your message.

GCC doesn't seem to do any special handling inside the stack frame code, so I guess it knows it starts with an aligned stack, and manages to keep that alignment somehow (maybe it adds unused bytes in every function call, so any called function also starts with an aligned stack?). Doing this might break compatibility with other people's ABI... I don't know exactly, but it doesn't sound like a good solution for DMC.

What I propose is to dynamically adjust the stack in each function, like in the following example, written in NASM (sorry, I'm pretty bad at MASM/TASM syntax):

    segment test public use32 class=CODE

    ; int test(int x)
    ; {
    ;     int t;
    ;     double a, b;
    ;     t = x + x;
    ;     return t;
    ; }

    global _test
    _test:
        push ebp              ; save EBP, since we use it for
        mov  ebp, esp         ; accessing local parameters
        and  esp, 0xFFFFFFF8  ; align the stack at 8 byte boundary
                              ; (ESP normally decreases, so this is okay)
        add  esp, -24         ; reserve space for local vars
                              ; (compiler rearranges vars: doubles first, then
                              ; the int, referring to an hypothetical push order):
                              ; - a [ESP + 16]
                              ; - b [ESP + 8]
                              ; - t [ESP + 0] (4 bytes needed, just alignment demo)
        mov  eax, [ebp + 8]   ; EAX <- local param 'x'
        add  eax, eax         ; calculate value for 'x + x'
        mov  [esp], eax       ; 't' <- EAX
        mov  esp, ebp         ; restore the value that ESP had, after EBP was
                              ; pushed, but *before* alignment
        pop  ebp              ; restore EBP (LEAVE also works, but like this is clearer)
        retn                  ; return value is in EAX, as normal

I hope that your news client won't ruin my nice NASM code formatting... :)

I think this approach is relatively inexpensive, and allows the compiler to do proper alignment for local variables, since it knows it always starts with an 8-byte aligned stack (not true for local parameters, if you're called from non-DMC code, but oh well!). Even more, DMC could do a normal stack frame for static functions, since they can only be called from the same module, and all functions ensure that the stack is 8 byte aligned before they call any other function.

What do you think?

Laurentiu
Feb 06 2002
The trouble is, if I align ESP, then the function can't access the passed parameters any more with a fixed ESP offset. What you're doing is accessing the parameters with EBP, and the locals with ESP. I'd thought of that, too, but it's a significant recoding of the code generator.

-Walter
Feb 06 2002
Walter wrote:
> The trouble is, if I align ESP, then the function can't access the passed parameters any more with a fixed ESP offset. What you're doing is accessing the parameters with EBP, and the locals with ESP. I'd thought of that, too, but it's a significant recoding of the code generator.
>
> -Walter

Nevertheless sounds like something you would do anyways...

Jan
Feb 06 2002
Walter wrote:
> The trouble is, if I align ESP, then the function can't access the passed parameters any more with a fixed ESP offset. What you're doing is accessing the parameters with EBP, and the locals with ESP. I'd thought of that, too, but it's a significant recoding of the code generator.
>
> -Walter

Maybe it's not necessary to adjust EBP or ESP when you know that at startup ESP is aligned to 8. The calling function must pass parameters aligned, call the function (now the stack is only 4 byte aligned), and the callee creates the stack frame by pushing the old EBP (now the stack is aligned to 8 again). Now make sure that every auto-var is aligned to 8. That's it! Or have I missed a point?

Regards,
Heinz
Feb 07 2002
I did some investigating. GCC does some fiddling so that each function starts out with an aligned stack. This option will be a bit clumsy for DMC, since I don't have control over the function calling conventions.

After spending several hours not being able to get it out of my mind <g>, I figured out a way to do it that has almost no impact on generated code. I can hide nearly all the stack adjustments in code that already adds/subtracts from ESP so that once the stack is 8 byte aligned, it stays that way.

Unfortunately, this doesn't work for parameters, i.e. if you call with (double x, int y, double z) they're not going to be aligned. It also doesn't work if some foreign code calls you with a misaligned stack. Oh well. I'll email you the fix so you can try it out (it happens with -o or -o+speed).
Feb 07 2002
Walter wrote:
> Unfortunately, this doesn't work for parameters, i.e. if you call with (double x, int y, double z) they're not going to be aligned. It also doesn't work if some foreign code calls you with a misaligned stack. Oh well. I'll email you the fix so you can try it out (it happens with -o or -o+speed).

Can I try it too? (Is it complicated? I use the IDDE!)

Roland
Feb 07 2002
"Walter" <walter digitalmars.com> wrote in message news:a3tdb7$i5i$1 digitaldaemon.com...
> I'll email you the fix so you can try it out (it happens with -o or -o+speed).

I must confess that I checked my email every 5 minutes in the last 2 days... :) Will this fix be available in the 8.27 release? I can hardly wait to look at the COD file that the new compiler will generate!

Laurentiu
Feb 08 2002
"Laurentiu Pancescu" <lpancescu fastmail.fm> wrote in message news:a412rr$oj7$1 digitaldaemon.com...
> "Walter" <walter digitalmars.com> wrote in message news:a3tdb7$i5i$1 digitaldaemon.com...
>> I'll email you the fix so you can try it out (it happens with -o or -o+speed).
>
> I must confess that I checked my email every 5 minutes in the last 2 days... :) Will this fix be available in the 8.27 release? I can hardly wait to look at the COD file that the new compiler will generate!

I emailed it to you, but your email server bounced it saying it didn't like attachments. Got an email address that can handle large attachments?
Feb 08 2002
"Walter" <walter digitalmars.com> wrote in message news:a419th$2knh$1 digitaldaemon.com...
> I emailed it to you, but your email server bounced it saying it didn't like attachments. Got an email address that can handle large attachments?

Fastmail claims there's no limit on the size of files I send, and that I have 100M quota... strange! Is it possible to put it somewhere (http or ftp), and just send me the link? Even scp is fine... :)

Thanks,
Laurentiu
Feb 08 2002
"Laurentiu Pancescu" <lpancescu fastmail.fm> wrote in message news:a41du4$2n6m$1 digitaldaemon.com...
> Fastmail claims there's no limit on the size of files I send, and that I have 100M quota... strange! Is it possible to put it somewhere (http or ftp), and just send me the link? Even scp is fine... :)

Here's what I get:

--------------------------------------------------------------------------------
This is the Postfix program at host fastmail.fm.

I'm sorry to have to inform you that the message returned below could not be delivered to one or more destinations.

For further assistance, please send mail to <postmaster>

If you do so, please include this problem report. You can delete your own text from the message returned below.

The Postfix program

<lpancescu fastmail.fm>: host localhost[127.0.0.1] said: 552 Uuencoded attachments not accepted
--------------------------------------------------------------------------------
Feb 08 2002
Strange... could you please try sending the attachment as MIME? I always used MIME when sending something from work to my fastmail account, and it worked (I think the largest attachment was about 500k).

Laurentiu
Feb 08 2002
Laurentiu Pancescu wrote:
> GCC doesn't seem to do any special handling inside the stack frame code, so I guess it knows it starts with an aligned stack, and manages to keep that alignment somehow (maybe it adds unused bytes in every function call, so any called function also starts with an aligned stack?). Doing this might break compatibility with other people's ABI... I don't know exactly, but it doesn't sound like a good solution for DMC.

Why not? If the stack starts aligned, just make sure it stays so. The compiler can help:

- for parameters: if the total size of the parameters is not a multiple of 4 (or 8), it can push some dummy bytes so that the stack stays aligned. Unaligned parameters can be slow to access, but at least the stack is aligned at function entry. For the Pascal calling convention, the compiler still has to remove the dummy bytes with add esp.
- for local data, it is the same.

We can imagine all parameters being aligned (push 7 dummy bytes and one significant byte for a char parameter).

The problem is compatibility with other modules linked with DMC. The optimizer can do so only for functions in the same module as the one currently compiled.

> What I propose is to dynamically adjust the stack in each function, like in the following example, written in NASM (sorry, I'm pretty bad at MASM/TASM syntax):

Seems to me like a "plaster on a wooden leg".

Roland
Feb 07 2002
"Roland" <rv ronetech.com> wrote in message news:3C624108.E6F7AE4B ronetech.com...
> Why not? If the stack starts aligned, just make sure it stays so. The compiler can help:
> - for parameters: if the total size of the parameters is not a multiple of 4 (or 8), it can push some dummy bytes so that the stack stays aligned. Unaligned parameters can be slow to access, but at least the stack is aligned at function entry. For the Pascal calling convention, the compiler still has to remove the dummy bytes with add esp.
> - for local data, it is the same.
> We can imagine all parameters being aligned (push 7 dummy bytes and one significant byte for a char parameter).
> The problem is compatibility with other modules linked with DMC. The optimizer can do so only for functions in the same module as the one currently compiled.

What you suggest appears to be what GCC does. What I do is slightly different. The called function, not the caller, figures out how many parameter bytes were pushed. Then, the size of the frame for the automatics is adjusted so the grand total works out to be a multiple of 8. The beauty of this is that most of the time no extra code is generated.

There are several special cases and complications with this, but I think I took care of all of them but the case of a varargs function. I've deferred fixing that for the moment.

-Walter
Feb 07 2002
Walter wrote:
> What I do is slightly different. The called function, not the caller, figures out how many parameter bytes were pushed. Then, the size of the frame for the automatics is adjusted so the grand total works out to be a multiple of 8. The beauty of this is that most of the time no extra code is generated.

Nice!

> There are several special cases and complications with this, but I think I took care of all of them but the case of a varargs function. I've deferred fixing that for the moment.

For varargs, just a warning in the manual is enough as far as I'm concerned.

Roland
Feb 07 2002
"Roland" <rv ronetech.com> wrote in message news:3C627B5E.1E0D5EFF ronetech.com...
> Walter wrote:
>> There are several special cases and complications with this, but I think I took care of all of them but the case of a varargs function. I've deferred fixing that for the moment.
>
> For varargs, just a warning in the manual is enough as far as I'm concerned.

The only varargs functions that mean anything anyway are printf and scanf, and they don't do heavy loops with doubles. So, while for completeness it should be fixed, as a practical matter it is irrelevant.
Feb 07 2002