Abstraction vs. Performance

With most applications these days, there’s no need to worry about performance. Modern computers are amazingly fast, fast enough that most programmers never need to worry about optimization beyond making sure they don’t use overly naive algorithms (or in most cases, not even that: just buy another server for your cool web 2.0 site). However, in some cases you do need to spend some time optimizing. Sometimes you don’t have the luxury of controlling what hardware your application runs on.

When you are forced to care about low level performance, one question arises: Isn’t abstraction bad for performance? Conventional wisdom is that in order to squeeze good performance out of processors, you’ll need to write low level code. Conventional wisdom is mostly true too, but if you carefully employ certain abstractions that don’t inhibit compiler optimization you can get very close to peak machine performance with high level syntax.

There is a movement in the C++ community called generic programming which focuses on this sort of careful use of abstractions. A few other languages also support this style of programming, most notably Haskell and D, but C++ is the most used language in this particular area.


I’m going to show a small example :) Image processing is a task that often involves applying a formula to million of pixels. Especially if you want to do real time processing of video streams, it’s not something you can just easily slap together in PHP or Ruby. Just as an example, assume you’d want to calculate an out image based on taking two images, calculating some silly formula like out = x² + c + y + y³. where x,y are floating point variables representing each pixel and c is some constant.

Implemented in SSE2 SIMD intrinsics it might look something like this:

for(int y=0;y<out.height;y++) {
      __m128 *imgP = (__m128*)img[y];            
      __m128 *img2P = (__m128*)img2[y]);
      __m128 *outP = (__m128*)out[y];
      __m128 iterP = _mm_set_ps1((float)constant);  
      for(int x=0;x<out.width;x+=4) {        
        __m128 m1 = _mm_mul_ps(*imgP, *imgP); 
        __m128 m2 = _mm_add_ps(iterP,*img2P); 
        __m128 m3 = _mm_add_ps(m1,m2);         
        __m128 m4 = _mm_mul_ps(*img2P,*img2P);
        __m128 m5 = _mm_mul_ps(m4,*img2P);     
        *outP = _mm_add_ps(m3,m5);             

Now, that was quite unreadable. It matlab or octave it might have been written like

out = x .* x .+ constant .+ y .+ y.*y.*y;

using the elementwise version of the arithmetic operations. Wouldn’t it be nice if we could express it with similar syntax in C++:

Var x(img);
Var y(img2);
out = x*x + constant + y + y*y*y;

And in fact, we can. Using operator overloading and a technique called expression templates, we can get quite good performance out of readable syntax. Here is a comparison of the two implementations, with the horizontal scale as clock cycles per pixel:


If we take a peek at what the compiler did by looking at the generated assembly we see something surprising: The high level, abstracted, generic, portable code and low level code generated almost exactly the same code. Can you guess which one is which?

L42:                                 L71:
  movl -60(%ebp), %eax                 movl -68(%ebp), %eax
  movl -4(%eax,%edi,4), %esi           movl -4(%eax,%edi,4), %esi
  movl -68(%ebp), %edx                 movl -64(%ebp), %ecx
  movl -4(%edx,%edi,4), %ecx           movl -4(%ecx,%edi,4), %edx
  movl -64(%ebp), %eax                 movl -52(%ebp), %eax
  movl -4(%eax,%edi,4), %edx           movl -4(%eax,%edi,4), %ecx
  movl  $4, %eax                       movl  $4, %eax
  .align  4,0x90                       .align  4,0x90
L43:                                 L72:
  movaps -16(%edx,%eax,4), %xmm3       movaps -16(%esi,%eax,4), %xmm2
  movaps %xmm3, %xmm2                  mulps  %xmm2, %xmm2
  mulps	 %xmm3, %xmm2                  movaps  %xmm0, %xmm1
  mulps	 %xmm3, %xmm2                  addps  -16(%edx,%eax,4), %xmm1
  movaps -16(%ecx,%eax,4), %xmm1       addps   %xmm1, %xmm2
  mulps	 %xmm1, %xmm1                  movaps -16(%edx,%eax,4), %xmm1
  addps	 %xmm0, %xmm1                  mulps   %xmm1, %xmm1
  addps	 %xmm3, %xmm1                  mulps  -16(%edx,%eax,4), %xmm1
  addps	 %xmm2, %xmm1                  addps   %xmm1, %xmm2
  movaps %xmm1, -16(%esi,%eax,4)       movaps  %xmm2, -16(%ecx,%eax,4)
  addl	 $4, %eax                      addl    $4, %eax
  cmpl	 $516, %eax                    cmpl    $516, %eax
  jne L43                              jne  L72
  incl   %edi                          incl    %edi
  cmpl	 $513, %edi                    cmpl    $513, %edi
  jne L42                              jne  L71

(gcc-4.2 -S -march=core2 -msse2 -O3)

The one on the left is actually the expression template version, while the right is the one generated from the SSE intrinsics. In this case gcc has actually even done a slightly better job on the high level code, eliminating a few redundant memory accesses present in the intrinsics code.


An accidental side effect of templates was a turing-complete pure functional code generation language in C++[1]. The basic idea behind the technique is to use that as a code generator to build optimal code from nice syntax.

Expression templates

Normally operator overloading in C++ results in many temporaries and poor performance. The basic idea behind expression templates is to encode abstract syntax tree of the expression in the type of the variable during compilation, and defer evaluation until later.

The trick is that the expression is parsed at compile time, and stored as nested template
arguments of an “expression type”. This allows the compiler to do optimization on the whole expression. It is not a new technique, in fact it was invented in 1994 by T. Veldhuizen[4].

So, if we wanted to implement something like the above, how would it be done? Lets give it a try. First we would need to be able to represent a pixel from an image.

template<typename T> struct Var {
  Var(const Image<float> &img): m_img(img) {}
  const T eval() const { return line[x]; }
   void set_y(const int _y) { line = img[y]; }
   void set_x(const int _x) { x = _x; }
  const Image<T> &m_img;
  int x;
  T * line;

We also need to be able to represent a constant value across the image:

template struct Constant {
  Constant(T c): value(c){}
  const T eval() const { return value; }
  void set_y(const int _y) {}
  void set_x(const int _x) {}

Now that we can define variables and constants, it is time to do something with them. We can define a binary operator, taking a left and right expression (ExprT1, ExprT2) and a function to apply (F) returning a value of RetType.

template <typename RetType, 
          class ExprT1, class ExprT2, 
          template<typename T> class F>
struct BinOp {
   ExprT1 m_expr1;
   ExprT2 m_expr2;
   BinOp (ExprT1 expr1, ExprT2 expr2) : 
     m_expr1(expr1), m_expr2(expr2)  {}
   const RetType eval() const 
      return F<RetType>::apply(m_expr1.eval(), 
   void set_y(const int y)
   void set_x(const int x)

Together this forms an AST, evaluated at compile time. Overloaded operators return the right type templates. The only piece missing from this is the actual arithmetic operations to be passed as the fourth template argument (F) of the BinOp struct.

template<typename T> struct Add {
	static T apply(const T &a, const T &b) { 
          return a + b; 
template<typename T> struct Mul {
	static T apply(const T &a, const T &b) { 
          return a * b; 

Types get quite convoluted however: a x b + 5.0f is represented by BinOp<float, BinOp<float, Var<float>, Var<float>, Mul>, Constant<float>, Add>.

The interesting thing is that when the compiler generates code for eval() on the expression type representing the expression above, what will happen? Somewhat simplified, this is what takes place:

eval() {
   return Add::apply(Multiplication.eval(),

which when you expand the evaluation another level becomes

eval() {
   return Add(Mul::apply(var_a.eval(),

which expands to

eval() {
   return Add::apply(Mul::apply(line_a[x],

now we’re starting to get down to the real data. The apply calls will be replaced by arithmetic and we get

eval() {
   return line_a[x] * line_b[x] + 5.0f;

which in turn will be inlined into the caller. In the end, the all the overhead simply vanishes in a puff of function inlining, for which our trusted compiler is relied upon.

For clarity, I have omitted the details of how SIMD is implemented, but if you just imagine that the the type T is a SIMD word, it is fairly close. The main difference is that the evaluation is done in chunks of N words, where N depends on the element type and SIMD architecture.


C++ is an incredibly convoluted language. It is by far the largest and most complex of widely used programming languages. It has an undecidable grammar, unreadable error messages, takes ages to compile and because the language is so large everybody knows and uses a different subset of the language than everyone else. So why use it at all?

Techniques like this are part of the reason: You can write very powerful libraries, combining abstraction and performance. This is something very few languages can offer. As compilers have matured, these types of design techniques have started to appear in many C++ libraries. For example Blitz++[2], MTL[3] use these and other techniques to build high performance C++ libraries that rival handtuned Fortran code. For some more inspiration into these and similar techniques Alexandrescu’s “Modern C++ Design” is a good introduction.[5]

Unfortunately templates weren’t designed to do this, and the syntax shows it. Anyone familiar with lisp macros would just laugh at the hoops C++ forces you to hop though to get it to act as a code generator. But often we live in the real world, where there are many constraints other than code beauty.


[1] – T. Veldhuizen. C++ templates are Turing complete. 2003.
[2] – Blitz++.
[3] – Matrix Template Library.
[4] – T. Veldhuizen, “Expression Templates,” C++ Report, Vol. 7 No. 5 June 1995, pp. 26-31.
[5] – Andrei Alexandrescu. Modern C++ Design: Generic Programming and Design Patterns Applied. Addison-Wesley.

Tags: ,

11 Responses to “Abstraction vs. Performance”

  1. Just for grins, did you do a comparison of the compile times? 😉

    (Yes, I know we’re changing runtime into compile time cost. I’d just like to not pay that price all the time…)

  2. henrik says:

    A valid question! The small test case I illustrated above doesn’t take long to compile. But for a much larger example of this prototype – a file of about 2000 lines and more advanced expressions I get the following compile times:

    $ time g++ -O3 -mssse3 -I. expr_tmpl1.cpp

    real 0m4.261s
    user 0m3.911s
    sys 0m0.240s

    Unfortunately there are no completely free lunches.. complexity has to go somewhere. :)

  3. Brian Cain says:

    Speaking of generics and image processing, consider GIL (now part of Boost). — http://www.boost.org/libs/gil/doc/

    Interesting video, considers performance and abstraction tradeoffs: http://stlab.adobe.com/gil/presentation/index.htm

  4. tbp says:

    I understand you’re trying to make a point about (expression) templates, but i’m sorry you get an instant 0 credibility for providing such an horrendous and non-sensical starting example. What is that article about? Performance? Please.

    typedef float v4sf __attribute__((vector_size(16)));
    const v4sf * const __restrict src1 = (const v4sf*) img[0];
    const v4sf * const __restrict src2 = (const v4sf*) img1[0];
    v4sf * const __restrict dst = (v4sf*) out[0];
    const v4sf iter = { constant, constant, constant, constant };
    for (size_t i=0; i<out.width*out.height; ++i) {
    v4sf x = src1[i], y = src2[i];
    dst[i] = x*x + iter + y + y*y*y;
    Nothing to write home about, but at least not a farce.

  5. henrik says:

    @tbp: The reason the starting example looks the way it does is simply because that was the exact code style implemented in an existing codebase.

    I believe motivating factor was that using SSE intrinsics directly had the advantage that you can influence both instruction scheduling and selection: It is much closer to the generated assembly.

    But yes, if you just look at it in isolation it could definitely be improved, like you’ve done. :)

  6. tbp says:

    *grumpy mumbling* Ok, ok.

    intrinsics give you no more codegen control than any other C/C++ statement. And i’ve abused gcc’s vector builtin operators simply because that was shorter to type; but it’s all the same.
    What was getting on my nerves wasn’t intrinsics verbosity (that, i’m used to), but liberal use of aliasing pointers and gratuitous looping. Anything would be faster than such… ok, i’m getting carried away again 😉

  7. henrik says:

    I’m not going to defend the original code too much, as I pretty much agree with your points :) I’ll restrict myself to mentioning that it was targeted for MSVC++ 2003, which, as far as I know, doesn’t have C99 restrict at all (Fortunately MS added it as __restrict in 2005 and later).

  8. tbp says:

    Still, dereferencing pointers willy-nilly is a bad idea, even more so when they’re not restricted; and a load-compute-store sequence goes a long way alleviating most aliasing troubles.

  9. Hi. I like the way you write. Will you post some more articles?

  10. Hamilcar Barca says:

    Do you have some timings comparison in debug ?
    Using a class to abstract SIMD intrinsics add a big overhead in debug, i know some game engine that are no more usable in debug because of that.
    I suspect using expression template will make the situation worse.

  11. henrik says:

    Debug makes it, much, much slower (roughly an order of magnitude last time I measured). So you’re quite correct.

Leave a Reply