Measuring performance of Lisp functions

Consider the following exercise (Ex. 1.5 in “Paradigms of Artificial Intelligence Programming” by Peter Norvig):

Write a function to compute the dot product of two sequences of numbers, represented as lists. The dot product is computed by multiplying corresponding elements and then adding up the resulting products. Example:
(dot-product '(10 20) '(3 4)) => 110

Here are four implementations (the first one is mine; the other three are the solutions to the exercise in the book):

1. My solution: Recursion with accumulator — tail call optimized (TCO)

(defun dot-product (lst1 lst2 &optional (acc 0))
  "Computes the dot product of two sequences, represented as lists."
  (if (not lst1)
      acc
      (let ((x (first lst1))
	    (y (first lst2))
	    (lst11 (rest lst1))
	    (lst22 (rest lst2)))
	(dot-product lst11 lst22 (+ acc (* x y))))))
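On implementations that eliminate self tail calls (SBCL does under its default compilation policy; the Common Lisp standard does not require it), this runs in constant stack space. A quick smoke test:

```lisp
;; With tail-call elimination, this call returns instead of
;; exhausting the control stack.  Summing 1*1 over 100,000 pairs
;; gives 100,000.
(dot-product (make-list 100000 :initial-element 1)
             (make-list 100000 :initial-element 1))
;; => 100000
```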

2. Solution 1 in PAIP: apply and mapcar

(defun dot-product1 (lst1 lst2)
  (apply #'+ (mapcar #'* lst1 lst2)))
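The mapcar call builds an intermediate list of all the products, which apply then sums; that intermediate list is where the consing in the timings below comes from. Step by step, for the example from the exercise:

```lisp
;; MAPCAR walks both lists in parallel, consing a fresh list of
;; products; APPLY then spreads that list into one call to +.
(mapcar #'* '(10 20) '(3 4)) ; => (30 80)
(apply #'+ '(30 80))         ; => 110
```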

3. Solution 2 in PAIP: recursive without TCO

(defun dot-product2 (lst1 lst2)
  (if (or (null lst1) (null lst2))
      0
      (+ (* (first lst1) (first lst2))
	 (dot-product2 (rest lst1) (rest lst2)))))

4. Solution 3 in PAIP: iteration and indexing

(defun dot-product3 (lst1 lst2)
  (let ((sum 0))
    (dotimes (i (length lst1))
      (incf sum (* (elt lst1 i) (elt lst2 i))))
    sum))

Performance

We test the solutions like this:

Set up some test data:

CL-USER> (defparameter *a* (make-list 100000 :initial-element 1))

dot-product: Use the time macro to measure the performance of the function dot-product:

CL-USER> (time (dot-product *a* *a*))
Evaluation took:
  0.002 seconds of real time
  0.001302 seconds of total run time (0.001301 user, 0.000001 system)
  50.00% CPU
  3,450,092 processor cycles
  0 bytes consed

100000

The function does not allocate any new memory, uses 50% CPU on average, and finishes in 0.001302 seconds of run time for a 100,000-element list.

Similarly, we test the other three functions:

dot-product1:

CL-USER> (time (dot-product1 *a* *a*))
Evaluation took:
  0.003 seconds of real time
  0.002625 seconds of total run time (0.002611 user, 0.000014 system)
  100.00% CPU
  6,967,544 processor cycles
  3,205,632 bytes consed

100000

dot-product2:

CL-USER> (time (dot-product2 *a* *a*))
Control stack guard page temporarily disabled: proceed with caution
...

This one aborted because it ran out of stack space before it could complete.

dot-product3:

CL-USER> (time (dot-product3 *a* *a*))
Evaluation took:
  58.350 seconds of real time
  58.347480 seconds of total run time (58.330376 user, 0.017104 system)
  99.99% CPU
  155,215,653,812 processor cycles
  0 bytes consed

100000

So it seems my solution is twice as fast as the fastest one in PAIP and has the advantage of not allocating any extra memory, whereas that PAIP solution conses nearly 3 MB (about 32 bytes per element). My solution is roughly 44,800 times faster than the PAIP solution that does not allocate extra memory. Admittedly, the PAIP solutions are not about performance but about showing how to write code in Lisp, and the author has not yet introduced tail-call optimization at that point in the book; so I am not criticizing PAIP at all.

Well, dot-product1 looks very concise and elegant, doesn’t it? What if we try to run it with a larger list?

CL-USER> (defparameter *a* (make-list 1000000 :initial-element 1))
*A*

And now, alas, it also exhausts the stack space:

CL-USER> (time (dot-product1 *a* *a*))
Control stack guard page temporarily disabled: proceed with caution
; Evaluation aborted on #<SB-KERNEL::CONTROL-STACK-EXHAUSTED {100341FE93}>.

Presumably this is because apply spreads the 1,000,000 list elements into individual arguments, all of which have to be pushed onto the control stack. Perhaps we can tweak this a bit so that the arguments are not pushed onto the stack:

(defun dot-product4 (lst1 lst2)
  (reduce #'+ (mapcar #'* lst1 lst2) :initial-value 0))
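The idea is that reduce folds the list two elements at a time instead of spreading it into one giant call, so neither the control stack nor the argument-count ceiling (the standard constant call-arguments-limit, guaranteed only to be at least 50) comes into play:

```lisp
;; REDUCE consumes the list pairwise; the number of arguments per
;; call to + stays constant no matter how long the list is.
(reduce #'+ '(30 80) :initial-value 0) ; => 110
```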

And now this function does run:

CL-USER> (time (dot-product4 *a* *a*))
Evaluation took:
  0.212 seconds of real time
  0.210735 seconds of total run time (0.182024 user, 0.028711 system)
  [ Run times consist of 0.143 seconds GC time, and 0.068 seconds non-GC time. ]
  99.53% CPU
  563,737,032 processor cycles
  47,974,560 bytes consed
  
1000000

Hmm.. it seems to allocate even more bytes per element than before (nearly 48 bytes per element), and there is now some GC overhead. What about my function?

CL-USER> (time (dot-product *a* *a*)) ; *a* is a list with a million elements
Evaluation took:
  0.014 seconds of real time
  0.013474 seconds of total run time (0.013462 user, 0.000012 system)
  92.86% CPU
  36,021,059 processor cycles
  0 bytes consed
  
1000000

Still jolly good!

Update

Svante posted another (more efficient, very readable) solution in the comments below:

(defun dot-product5 (lst1 lst2)
  (loop for element1 in lst1
        for element2 in lst2
        sum (* element1 element2)))

This code uses two parallel for clauses (stepping down both lists in lockstep, not nested) and an implicit, hidden local variable to hold the sum. How does it perform?
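Roughly, the loop form behaves like the following do loop (a hand-written sketch, not the actual macroexpansion):

```lisp
;; A hand-written equivalent of the LOOP above: two variables stepped
;; in parallel, plus the accumulator that LOOP normally hides.
(defun dot-product5-sketch (lst1 lst2)
  (do ((l1 lst1 (rest l1))
       (l2 lst2 (rest l2))
       (sum 0))
      ((or (endp l1) (endp l2)) sum)
    (incf sum (* (first l1) (first l2)))))
```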

CL-USER> (time (dot-product5 *a* *a*))
Evaluation took:
  0.007 seconds of real time
  0.007470 seconds of total run time (0.007470 user, 0.000000 system)
  100.00% CPU
  19,859,546 processor cycles
  0 bytes consed
  
1000000

It performs very well indeed; better than my recursive solution.

Summary
CPU, time, and memory for a list of 100,000 elements:

Solution       CPU %    Time (s)    Processor cycles/element   Bytes consed
dot-product    50.0%    0.001302    34.5                       0
dot-product1   100.0%   0.002625    69.7                       3,205,632
dot-product2   (does not run to completion)
dot-product3   99.99%   58.35       1,552,157                  0

CPU, time, and memory for a list of 1,000,000 elements:

Solution       CPU %    Time (s)    Processor cycles/element   Bytes consed
dot-product    100.0%   0.010221    27.2                       0
dot-product4   99.21%   0.1259      336                        48M
dot-product5   100.0%   0.0079      21.2                       0

Comments

  • Bogus guy  On May 16, 2012 at 4:29 am

    The main trouble is that, unlike Scheme, the Common Lisp standard doesn’t
    guarantee proper tail calls, so your solution is implementation specific.

    • parijatmishra  On May 16, 2012 at 11:01 am

      True. These tests were done on SBCL on a Mac. Perhaps other implementations would not do TCO. I remember vaguely that CMUCL did not.

  • Svante  On May 16, 2012 at 5:09 am

    First off, I think that the following is pretty idiomatic:

    (defun dot-product-4 (list-1 list-2)
      (loop :for element-1 :in list-1
            :for element-2 :in list-2
            :sum (* element-1 element-2)))

    It also runs about 10% faster than yours.

    Note that tail call optimization is not a guaranteed feature of Common
    Lisp implementations.

    Solution 1 of PAIP is OK. It is not the most efficient one, but it will scale linearly. Solutions 2 and 3 have obvious problems: solution 2 blows the stack, and solution 3 is a Shlemiel the painter (using elt in a loop over a list).

    I have also found that the most flexible utility
    functions are often not compiled to the most efficient code. It seems
    that current compilers are reluctant to apply too aggressive inlining
    and other optimizations. This might get better with higher speed
    optimization declarations, but the flexibility of these functions
    often precludes too much loss of generality.

    Another example is the concatenation of strings. For the occasional
    concatenation, use (concatenate ‘string …), but in tight inner
    loops, it is better to use with-output-to-string or even preallocate
    the target string and loop.

    • parijatmishra  On May 16, 2012 at 11:23 am

      Svante: thank you for your comments. Yes, your solution is both very readable and very efficient. I’ve updated my post to include your solution.

      As for your comment about “Solution 1”:
      | Solution 1 of PAIP is OK. It is not the most efficient one, but it will scale linearly.
      Umm.. it too blows up at a million elements long list. Presumably because apply requires the arguments to be pushed on to the stack.

  • maf  On May 16, 2012 at 1:16 pm

    Simply write (reduce #'+ …) instead of (apply #'+ …). It’s a classic mistake to hit lambda-parameters-limit by using apply too broadly.

    It is true that tail call optimization is not a guaranteed feature of Common
    Lisp implementations, but it is a feature of all CL implementations that are in general use: CMUCL, SBCL, CCL, Allegro, LispWorks, CLISP, ECL.
