
I'm trying to gauge how much of a performance gain I would get by converting Lisp (SBCL) code to assembly language. The time function in SBCL keeps giving counterintuitive results even for trivial operations.

On the one hand, running this implementation

(defun test ()
  (declaim (optimize (speed 3) (safety 0) (debug 0)))
  (setf my-array (make-array 10))
  (setf (aref my-array 2) 4)
  (time (progn
          (aref my-array 1)
          (aref my-array 3))))

reports roughly between 2,200 and 3,000 clock cycles, which in and of itself seems excessive for just a pair of straightforward lookups in a small array.

On the other hand, running this variant under similar conditions

(defun test ()
  (declaim (optimize (speed 3) (safety 0) (debug 0)))
  (setf my-array (make-array 10))
  (setf (aref my-array 2) 4)
  (time (progn
          (aref my-array 1)
          (aref my-array 3)))
  (time (progn
          (aref my-array 1)
          (aref my-array 3)))) ;just duplicated previous line

results in each call to time reporting between 1,200 and 2,000 clock cycles. So the behaviour is odd in two ways: still too many cycles for trivial operations, and a noticeable difference between calling time once versus twice.

In both versions I placed time inside the function body to avoid the overhead of loading the function and preparing for execution. I'm also omitting outliers, which would otherwise make the numbers look even more volatile. The ranges vary between Linux (Ubuntu 22.04) and Windows, but the issue is essentially the same.

Obviously this method doesn't look promising for making decisions about much more complex code. Am I missing something, or is the built-in time function just unreliable for performance measurement?

asked Aug 30 at 0:49
  • I'm surprised the compiler isn't optimizing out the aref calls entirely, since the result isn't used and it has no side effects. Commented Aug 31 at 22:52
  • Have you tried making my-array a local variable and declaring the type? General arrays are very flexible, so aref has to call to a function that does the hard work. Commented Aug 31 at 22:55
  • SBCL compiles to native CPU instructions, so what you see are CPU effects such as caching and cache misses. Also it's not obvious to me what my-array is: while the convention for special variables is *earmuffs*, it is not enforced, so my-array may be special and might miss out on optimization possibilities. Commented Sep 1 at 10:24
  • @Barmar I had ruled out declaring the type upon learning that SBCL's default compilation policy is more conservative than that of other Lisp compilers (SBCL user manual, section 4.2.1, first paragraph). I don't recall reading about alternative compilation settings, though. Commented Sep 2 at 13:11

1 Answer


The first rule of benchmarking is to measure code which is not hopeless. To do this you both need to write code you think is not hopeless, and listen to what the compiler tells you about it to ensure it really is not (this is particularly important for SBCL).

Here is a version of your test function which is not hopeless:

(defun test ()
  (declare (optimize (speed 3) (safety 0) (debug 0)))
  (let ((my-array (make-array 10)))
    (setf (aref my-array 2) 4)
    (values (aref my-array 1)
            (aref my-array 3))))

This is at least legal, and does not ignore the return values. The SBCL compiler will also compile it with no notes.
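
If you want something you can check mechanically as well as by eye: COMPILE's second and third return values say whether any warnings were signalled while compiling the function (standard CL; note that SBCL's optimization notes are not warnings, so you still have to read those in the compiler output):

;; For the version above both flags should be NIL; for the original
;; definition, with its free MY-ARRAY, they very likely won't be.
(multiple-value-bind (fn warnings-p failure-p)
    (compile 'test)
  (declare (ignore fn))
  (list warnings-p failure-p))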

Note that measuring the performance of this will measure the allocation of the array as well as the two or three array lookups. The function will also return completely undefined values, because you have not initialised the array.
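
(If you want the returned values to be defined, initialise the array when you make it, as the timing code further down does:)

;; Every element starts as 0 rather than being unspecified.
(make-array 10 :initial-element 0)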

The second rule of benchmarking is to measure code that is in some way equivalent to the code you're comparing it with. In your case you are thinking about converting it to assembler, and you want to measure array references. OK, so unless you're planning on writing half of Common Lisp in assembler (there's a famous law...) this probably means you want to measure performance for arrays with fixed, known element types and arrays which are, in CL terms, simple. You also want to measure array reference performance, not array creation. At least, you wish to disentangle those things.

OK, so let's do that. I'll assume the array is a one-dimensional simple array of fixnums.

(deftype array-index ()
  `(integer 0 (,array-dimension-limit)))

(defun test-2 (a n m)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (simple-array fixnum (*)) a)
           (type array-index n m))
  (values (aref a n)
          (aref a m)))

(Don't worry about the array-index type definition: it just saves work in the type declarations for the function.)
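
(If you want to see what the type means, it behaves as you'd expect under TYPEP; the upper bound is exclusive:)

(typep 3 'array-index)                     ; => T
(typep array-dimension-limit 'array-index) ; => NIL, the bound is exclusive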

The zeroth rule of benchmarking is to measure many iterations of something. Here is a little macro which does that:

(defun time-calls (f n report-every)
  (let ((start (get-internal-real-time)))
    (dotimes (i n)
      (when (and report-every (zerop (mod i report-every)))
        (format *debug-io* "~&~10D of ~D~%" (1+ i) n))
      (funcall f))
    (float (/ (- (get-internal-real-time) start)
              internal-time-units-per-second
              n))))

(defmacro timing ((&optional (n 1) report-every) &body forms)
  "Time a number of iterations of some forms.
The value of this form is the time elapsed per iteration, in seconds,
as a single float.
(timing () ...) times one iteration. (timing (n) ...) times n
iterations. (timing (n m) ...) times n iterations printing a progress
report to *DEBUG-IO* every m.
The forms aren't compiled: you need to use this form in compiled code
for that."
  `(time-calls (lambda ()
                 ,@forms)
               ,n
               ,report-every))
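
Before using it in anger, a quick sanity check at the REPL doesn't hurt (the form timed here is arbitrary; on SBCL's default setup the lambda gets compiled anyway, so the caveat in the docstring mostly matters on other implementations):

;; Seconds per iteration for a million iterations of a cheap form.
(timing (1000000) (+ 1 2))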

So now we can write a wrapper function which will time test-2:

(defun time-test-2 (iters)
  (let ((a (make-array 10 :element-type 'fixnum :initial-element 0)))
    (declare (type (simple-array fixnum (*)) a))
    ;; Time per reference
    (/ (timing (iters)
         (test-2 a 1 2))
       2)))

OK, so we can now time this. You want to time a lot of iterations: for code like this on a modern machine a billion iterations is reasonable. You want it to take a few seconds to run.

> (time-test-2 1000000000)
4.318865e-9

This is on an M1 clocked at (I think) 3.2GHz, so it works out to about 14 clock cycles per reference.
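
For reference, the arithmetic behind that figure, assuming the 3.2GHz clock:

;; seconds per reference * cycles per second = cycles per reference
(* 4.318865e-9 3.2e9)   ; => ~13.8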

Ah, but.

The fourth rule of benchmarking is to be careful that you are measuring what you think you are measuring. This has at least two parts:

  • 4.1 think about what you're measuring
  • 4.2 try to work out whether the compiler is second-guessing you because your code is too simple and it has optimized it away.

Well in this case, no, we're not measuring what we want to measure here: we're measuring at least

  • a function call into test-2,
  • perhaps a bounds check (we might want to measure this),
  • and an array reference.

Let's try and get rid of the first two: the first because it's just not what we want to measure, and the second because we want to compare things to assembler, which likely is not doing bounds checks.

So we can write this:

(defun test-3 (a n m)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (simple-array fixnum (10)) a)
           (type fixnum n m))
  (let ((s 0))
    (declare (type fixnum s))
    ;; time per iteration so divide by iterations
    (values (/ (timing (m)
                 (dotimes (i n)
                   (incf s (aref a 0))))
               n)
            ;; Return S as well to try and avoid it being optimized away
            s)))

When compiling this you have to check the first rule, because it does generate a compiler note. But it's OK: the note is about the final division, which happens exactly once.

For this:

> (test-3 (make-array 10 :element-type 'fixnum :initial-element 1)
          10000
          1000000)
9.602013e-10
10000000000

OK, this is about three clock cycles per reference.

... Except no, it's not: we're measuring the loop and increment as well. So we can measure just the loop:

(defun test-4 (v n m)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type fixnum v n m))
  (let ((s 0))
    (declare (type fixnum s))
    ;; time per iteration so divide by iterations
    (values
     (/ (timing (m)
          (dotimes (i n)
            (incf s v)))
        n)
     s)))

Measuring this was somewhat hard: I got very variable results, depending, I think, on whether a GC happened during the run. After cranking things up a bit I got

> (test-4 0
          100000
          1000000)
3.4967895e-10
0

So this is a little over 1 clock cycle per loop iteration perhaps. That implies an array reference is perhaps about 2 clock cycles in the simple case.
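
To spell out that arithmetic (same 3.2GHz assumption; the inputs are the two measurements above):

;; loop + increment + AREF:  9.602013e-10 s  => ~3.1 cycles
;; loop + increment only:    3.4967895e-10 s => ~1.1 cycles
;; the difference is the AREF itself:
(* (- 9.602013e-10 3.4967895e-10) 3.2e9)   ; => ~2 cycles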

Of course, you can also just check: if the code is simple enough you can just see what the compiler does. The code for the functions above is pretty big, not least because of timing and the implicit function it creates. But if you write a simple function:

(defun test-5 (a n)
  (declare (optimize (speed 3) (safety 0))
           (type (simple-array fixnum (10)) a)
           (type array-index n))
  (aref a n))

Then

> (disassemble 'test-5)
; disassembly for test-5
; Size: 24 bytes. Origin: #x7008571648 ; test-5
; 48: 49090B8B add TMP, R0, R1, LSL #2
; 4C: 2A1140F8 ldr R0, [TMP, #1]
; 50: FB031AAA mov CSP, CFP
; 54: BF0300F1 cmp NULL, #0
; 58: 5A7B40A9 ldp CFP, LR, [CFP]
; 5C: C0035FD6 ret
nil

The add TMP, R0, R1, LSL #2 is obviously computing the index (the array is initially in R0 I think, and n is in R1), and then ldr R0, [TMP, #1] is smashing R0 with the array element. The rest of the function is just housekeeping. You can also see that with (safety 0) there are not, in fact, any bounds checks in this code.
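
If you want to watch the bounds check come back, a cheap experiment (a sketch: test-5-safe is just a name I'm using for the comparison, and I haven't shown its output) is to compile the same function at a higher safety level and disassemble that too; the safe version should contain an index check before the load:

;; Hypothetical TEST-5-SAFE: identical to TEST-5 except for SAFETY 1,
;; so the compiler is expected to keep the index-in-bounds check.
(defun test-5-safe (a n)
  (declare (optimize (speed 3) (safety 1))
           (type (simple-array fixnum (10)) a)
           (type array-index n))
  (aref a n))

;; Compare with the disassembly above.
(disassemble 'test-5-safe)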


As a note, it's interesting to see how completely general, safe, one-dimensional array access performs. Given this:

(defun test-general (a i1 i2 &key (repeat 1000) (tries 1000))
  (declare (optimize speed))
  (declare (type (array * (*)) a)
           (type array-index i1 i2)
           (type fixnum repeat tries))
  (setf (aref a i1) repeat)
  (values
   (/ (timing (tries)
        (dotimes (i repeat)
          (setf (aref a i2) (aref a i1))))
      repeat)
   (aref a i2)))

Then there are copious compiler notes. Experimentally a read and write seem to take about 15 cycles. Declaring the array (array t (*)) reduces it to 7, and (simple-array t (*)) drops it to about 2. (simple-array * (*)) is just as bad as a completely general vector.
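
For completeness, a sketch of how it might be called (the counts are just picked for illustration; MAKE-ARRAY with no :ELEMENT-TYPE returns a (SIMPLE-ARRAY T (*)), but inside TEST-GENERAL it is only known as (ARRAY * (*)), so the accesses go through the general path):

;; Returns seconds per read+write pair, and the element copied.
(test-general (make-array 10 :initial-element 0) 1 2
              :repeat 10000 :tries 10000)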

answered Sep 2 at 10:14

1 Comment

Thank you. I need to internalize a few details, but the approach developed in this answer is very educative.
