Added
This commit is contained in:
184
doc/sparc/A
Normal file
184
doc/sparc/A
Normal file
@@ -0,0 +1,184 @@
|
||||
.so init
|
||||
.SH
|
||||
A. MEASUREMENTS
|
||||
.SH
|
||||
A.1. \*(OQThe bottom line\*(CQ
|
||||
.PP
|
||||
Although examples often are most illustrative, the cruel world out there is
|
||||
usually more interested in everyday performance figures. To satisfy those
|
||||
people too, we will present a series of measurements on our code expander
|
||||
taken from (close to) real life situations. These include measurements
|
||||
of compile and run times of different programs,
|
||||
compiled with different compilers.
|
||||
.SH
|
||||
A.2. Compile time measurements
|
||||
.PP
|
||||
Figure A.2.1 shows compile-time measurements for typical C code:
|
||||
the dhrystone benchmark\(dg
|
||||
.[ [
|
||||
dhrystone
|
||||
.]].
|
||||
.FS
|
||||
\(dg To be certain that we only tested the compiler and not the quality of
|
||||
the code in the library, we have added our own version of
|
||||
\fIstrcmp\fR and \fIstrcpy\fR and have not used the ones present in the
|
||||
library.
|
||||
.FE
|
||||
The numbers represent the duration of each separate pass of the compiler.
|
||||
The numbers at the end of each bar represent the total duration of the
|
||||
compilation process. As with all measurements in this chapter, the
|
||||
quoted time or duration is the sum of user and system time in seconds.
|
||||
.PS
|
||||
copy "pics/compile_bars"
|
||||
.PE
|
||||
.DS
|
||||
.IP cem: 6
|
||||
C to EM frontend
|
||||
.IP opt:
|
||||
EM peep-hole optimizer
|
||||
.IP be:
|
||||
EM to assembler backend
|
||||
.IP cpp:
|
||||
Sun's C preprocessor
|
||||
.IP ccom:
|
||||
Sun's C compiler
|
||||
.IP iropt:
|
||||
Sun's optimizer
|
||||
.IP cg:
|
||||
Sun's code generator
|
||||
.IP as:
|
||||
Sun's assembler
|
||||
.IP ld:
|
||||
Sun's linker
|
||||
.ce 1
|
||||
\fIFigure A.2.1: compile-time measurements.\fR
|
||||
.DE
|
||||
.sp
|
||||
.PP
|
||||
A close examination of the first two bars in fig A.2.1 shows that the maximum
|
||||
achievable compile-time
|
||||
gain compared to \fIcc\fR is about 50% for medium-sized
|
||||
programs.\(dd
|
||||
.FS
|
||||
\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
|
||||
.FE
|
||||
For small programs the gain will be less, due to the almost constant
|
||||
start-up time of each pass in the compilation process. Only a
|
||||
built-in assembler may increase this number up to
|
||||
180% in the ideal case that the optimizer, backend and assembler
|
||||
would run in zero time. Speed-ups of 5 to 10 times as mentioned in
|
||||
.[ [
|
||||
fast portable compilers
|
||||
.]]
|
||||
are therefore not possible on the Sun-4 family. This is also due to
|
||||
Sun's implementation of saving and restoring register windows. With
|
||||
the current implementation in which only a single window is saved
|
||||
or restored on a register-window overflow, it is very time consuming
|
||||
when programs have highly dynamic stack use
|
||||
due to procedure calls (as is often the case with compilers).
|
||||
.PP
|
||||
Although we are currently a little slower than \fIcc\fR, it is hard to
|
||||
blame this on our backend. Optimizing the backend so that it would run
|
||||
twice as fast would only reduce the total compilation process by
|
||||
a mere 14%.
|
||||
.PP
|
||||
Finally it is nice to see that our push/pop-optimization,
|
||||
initially designed to generate faster code, has also increased the
|
||||
compilation speed. (see also figures A.4.1 and A.4.2.)
|
||||
.SH
|
||||
A.3. Run time performance
|
||||
.PP
|
||||
Figure A.3.1 shows the run-time performance of different compilers.
|
||||
All results are normalized, where the best available compiler (Sun's
|
||||
compiler with full optimization) is represented by 1.0 on our scale.
|
||||
.PS
|
||||
copy "pics/run-time_bars"
|
||||
.PE
|
||||
.ce 1
|
||||
\fIFigure A.3.1: run time performance.\fR
|
||||
.sp 1
|
||||
.PP
|
||||
The fact that our compiler behaves rather poorly compared to Sun's
|
||||
compiler is due to the fact that the dhrystone benchmark uses
|
||||
relatively many subroutine calls; all of which have to be 'emulated'
|
||||
by our backend.
|
||||
.SH
|
||||
A.4. Overall performance
|
||||
.LP
|
||||
In the next two figures we will show the combined run and compile time
|
||||
performance of 'our' compiler (the ACK C frontend and our backend)
|
||||
compared to Sun's C compiler. Figure A.4.1 shows the results from
|
||||
measurements on the dhrystone benchmark.
|
||||
.G1
|
||||
frame invis left solid bot solid
|
||||
label left "run time" "(in \(*msec/dhrystone)"
|
||||
label bot "compile time (in sec)"
|
||||
coord x 0,21 y 0,610
|
||||
ticks left out from 0 to 600 by 200
|
||||
ticks bot out from 0 to 20 by 5
|
||||
"\(bu" at 3.5, 1000000/1700
|
||||
"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
|
||||
"\(bu" at 2.8, 1000000/8770
|
||||
"ack with opt" below at 2.8 + 0.1, 1000000/8770
|
||||
"\(bu" at 16.0, 1000000/10434
|
||||
"ack -O4" above at 16.0, 1000000/10434
|
||||
"\(bu" at 2.3, 1000000/7270
|
||||
"\fIcc\fR" above at 2.3, 1000000/7270
|
||||
"\(bu" at 9.0, 1000000/12500
|
||||
"\fIcc -O4\fR" above at 9.0, 1000000/12500
|
||||
"\(bu" at 5.9, 1000000/15250
|
||||
"\fIcc -O\fR" below at 5.9, 1000000/15250
|
||||
.G2
|
||||
.ce 1
|
||||
\fIFigure A.4.1: overall performance on dhrystones.
|
||||
.sp 1
|
||||
.LP
|
||||
Fortunately for us, dhrystones are not all there is. The following
|
||||
figure shows the same measurements as the previous one, except
|
||||
this time we took a benchmark that uses no subroutines: an implementation
|
||||
of Eratosthenes' sieve:
|
||||
.G1
|
||||
frame invis left solid bot solid
|
||||
label left "run time" "for one run" "(in sec)" left .6
|
||||
label bot "compile time (in sec)"
|
||||
coord x 0,11 y 0,21
|
||||
ticks bot out from 0 to 10 by 5
|
||||
ticks left out from 0 to 20 by 5
|
||||
"\(bu" at 2.5, 17.28
|
||||
"ack w/o opt" above at 2.5, 17.28
|
||||
"\(bu" at 1.6, 2.93
|
||||
"ack with opt" above at 1.6, 2.93
|
||||
"\(bu" at 9.4, 2.26
|
||||
"ack -O4" above at 9.4, 2.26
|
||||
"\(bu" at 1.5, 7.43
|
||||
"\fIcc\fR" above at 1.5, 7.43
|
||||
"\(bu" at 2.7, 2.02
|
||||
"\fIcc -O4\fR" ljust at 1.9, 1.2
|
||||
"\(bu" at 2.6, 2.10
|
||||
"\fIcc -O\fR" ljust at 3.1,2.5
|
||||
.G2
|
||||
.ce 1
|
||||
\fIFigure A.4.2: overall performance on Eratosthenes' sieve.
|
||||
.sp 1
|
||||
.PP
|
||||
Although the above figures speak for themselves, a small comment
|
||||
may be in place. At first it is clear that our compiler is neither
|
||||
faster than \fIcc\fR, nor produces faster code than \fIcc -O4\fR. It should
|
||||
also be noted however, that we do produce better code than \fIcc\fR
|
||||
at only a very small additional cost.
|
||||
It is also worth noticing that push-pop optimization
|
||||
increases run-time speed as well as compile speed.
|
||||
The first seems rather obvious,
|
||||
since optimized code is
|
||||
faster code, but the increase in compile speed may come as a surprise.
|
||||
The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
|
||||
amount of generated code, which in general
|
||||
depends on the efficiency of the code.
|
||||
Push-pop optimization removes a lot of useless instructions which
|
||||
would otherwise
|
||||
have found their way through to the assembler and the loader.
|
||||
Useless instructions inserted in an early stage in the compilation
|
||||
process will slow down every following stage, so elimination of useless
|
||||
instructions in an early stage, even when it requires a little computational
|
||||
overhead, can often be beneficial to the overall compilation speed.
|
||||
.bp
|
||||
Reference in New Issue
Block a user