Archive for the ‘programming’ Category

A programmer’s time

Saturday, July 18th, 2009

A programmer’s time in front of a computer is to a large extent spent manipulating text. While mundane compared to the actual knowledge of implementing systems, it’s the basic tradecraft of programming. So we care about text editors. A lot. Programmers have feuded over which editor is best since the beginning of time (that is, the 1950s or so).

The difference, in my mind, between a good programmer’s editor and a great one is thousands of small tweaks. In a great editor, you can quickly and easily automate repetitive tasks, and there are commands for hundreds of things you may not do every day – but once you learn enough of them, you end up saving a lot of time. This is a large part of why editors like Emacs and Vim, which have been around for decades, are still used and loved: they have been around long enough for people to write all these clever shortcuts that make a programmer’s job easier. Replicating those decades of tweaks is not an easy task.

I discovered Qt Creator a few weeks ago. It definitely holds a lot of promise, especially considering its age – it has only been out in public for a few months, though the Trolltech people have obviously been working on it internally for longer than that. As an IDE it’s absolutely fantastic, with some of the feel of the Java and C# IDEs that everybody else has been getting jealous over for the last decade. As an editor, however, there’s work to be done.

One fairly common task in C++ is that you have a header file, like the one below, and you want to rename the class Foo to something else.

#ifndef FOO_H
#define FOO_H
 
class Foo {
    void foo();
};
 
#endif // FOO_H

Say we want to rename everything from foo to bar. Easily done: just replace foo with bar globally. This is what we’ll get in most editors:

#ifndef bar_H
#define bar_H
 
class bar {
    void bar();
};
 
#endif // bar_H

You’ll see that the editor did exactly what we requested: it replaced foo with bar. Unfortunately, that means some poor schmuck (that is, me) has to go back and fix the capitalization, because all good programmers care about consistency and code style – not to mention that bar and Bar are two different identifiers in almost all programming languages.

If you do the replace in Emacs, however, it will by default preserve the case of the replaced text:

#ifndef BAR_H
#define BAR_H
 
class Bar {
    void bar();
};
 
#endif // BAR_H

It saves me maybe 30 seconds of typing each time, but over time it adds up. More importantly, I don’t feel like the editor is interrupting me. I can continue with what I wanted to accomplish, without breaking my flow with trivial details such as the case of a few words.

So, what does this have to do with Qt Creator? I took a look at its source code, and found it rather pleasant to fix and hack around in.

So I created a patch which implements this case-preserving replace in Qt Creator. There are hundreds of other small tweaks that would take the editor in Creator from good to great; hopefully I can keep helping by implementing at least a few more.
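For the curious, the heuristic behind case-preserving replace is straightforward. Here is a minimal sketch in C++ – purely illustrative, not the actual Qt Creator patch, and the function name is made up:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Apply the case pattern of the matched text to the replacement:
// ALL CAPS stays ALL CAPS, Capitalized stays Capitalized,
// anything else is left as typed.
std::string preserveCase(const std::string &matched,
                         const std::string &replacement)
{
    bool allUpper = !matched.empty();
    for (char c : matched)
        if (std::islower(static_cast<unsigned char>(c)))
            allUpper = false;
    bool firstUpper = !matched.empty() &&
                      std::isupper(static_cast<unsigned char>(matched[0]));

    std::string out = replacement;
    if (allUpper) {
        for (char &c : out)
            c = std::toupper(static_cast<unsigned char>(c));
    } else if (firstUpper && !out.empty()) {
        out[0] = std::toupper(static_cast<unsigned char>(out[0]));
    }
    return out;
}
```

With this, replacing foo with bar turns FOO into BAR and Foo into Bar – exactly the behavior shown above.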

Abstraction vs. Performance

Friday, January 16th, 2009

With most applications these days, there’s no need to worry about performance. Modern computers are amazingly fast, fast enough that most programmers never need to worry about optimization beyond making sure they don’t use overly naive algorithms (or in most cases, not even that: just buy another server for your cool web 2.0 site). However, in some cases you do need to spend some time optimizing. Sometimes you don’t have the luxury of controlling what hardware your application runs on.

When you are forced to care about low level performance, one question arises: Isn’t abstraction bad for performance? Conventional wisdom is that in order to squeeze good performance out of processors, you’ll need to write low level code. Conventional wisdom is mostly true too, but if you carefully employ certain abstractions that don’t inhibit compiler optimization you can get very close to peak machine performance with high level syntax.

There is a movement in the C++ community called generic programming which focuses on this sort of careful use of abstractions. A few other languages also support this style of programming, most notably Haskell and D, but C++ is the most used language in this particular area.

Example

I’m going to show a small example :) Image processing often involves applying a formula to millions of pixels. Especially if you want to do real-time processing of video streams, it’s not something you can just slap together in PHP or Ruby. As an example, assume you want to compute an output image from two input images using some silly formula like out = x² + c + y + y³, where x and y are floating point values representing corresponding pixels from each image and c is some constant.
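As a scalar reference point, the formula is trivial to write as a plain loop. This is a sketch; the buffer layout and names are my assumptions, not code from the post:

```cpp
#include <cassert>
#include <cstddef>

// Scalar baseline: out = x*x + c + y + y*y*y, one pixel at a time.
// img, img2 and out are assumed to be contiguous float buffers.
void applyFormula(const float *img, const float *img2, float *out,
                  std::size_t pixels, float c)
{
    for (std::size_t i = 0; i < pixels; ++i) {
        float x = img[i];
        float y = img2[i];
        out[i] = x * x + c + y + y * y * y;
    }
}
```

The SIMD version computes exactly the same thing, four pixels at a time.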

Implemented in SSE2 SIMD intrinsics it might look something like this:

for (int y = 0; y < out.height; y++) {
    __m128 *imgP  = (__m128 *)img[y];
    __m128 *img2P = (__m128 *)img2[y];
    __m128 *outP  = (__m128 *)out[y];
    __m128 constP = _mm_set_ps1((float)constant);
    for (int x = 0; x < out.width; x += 4) {
        __m128 m1 = _mm_mul_ps(*imgP, *imgP);   // x*x
        __m128 m2 = _mm_add_ps(constP, *img2P); // c + y
        __m128 m3 = _mm_add_ps(m1, m2);         // x*x + c + y
        __m128 m4 = _mm_mul_ps(*img2P, *img2P); // y*y
        __m128 m5 = _mm_mul_ps(m4, *img2P);     // y*y*y
        *outP = _mm_add_ps(m3, m5);             // x*x + c + y + y*y*y
        outP++;
        imgP++;
        img2P++;
    }
}

Now, that was quite unreadable. In Matlab or Octave it might have been written as

out = x .* x + constant + y + y .* y .* y;

using the elementwise version of the arithmetic operations. Wouldn’t it be nice if we could express it with similar syntax in C++:

Var x(img);
Var y(img2);
 
out = x*x + constant + y + y*y*y;

And in fact, we can. Using operator overloading and a technique called expression templates, we can get quite good performance out of readable syntax. Here is a comparison of the two implementations, with the horizontal scale as clock cycles per pixel:

[Figure: performance comparison of the two implementations; horizontal scale in clock cycles per pixel]

If we take a peek at what the compiler did by looking at the generated assembly, we see something surprising: the high-level, abstracted, generic, portable version and the low-level intrinsics version produce almost exactly the same machine code. Can you guess which one is which?

L42:                                 L71:
  movl -60(%ebp), %eax                 movl -68(%ebp), %eax
  movl -4(%eax,%edi,4), %esi           movl -4(%eax,%edi,4), %esi
  movl -68(%ebp), %edx                 movl -64(%ebp), %ecx
  movl -4(%edx,%edi,4), %ecx           movl -4(%ecx,%edi,4), %edx
  movl -64(%ebp), %eax                 movl -52(%ebp), %eax
  movl -4(%eax,%edi,4), %edx           movl -4(%eax,%edi,4), %ecx
  movl  $4, %eax                       movl  $4, %eax
  .align  4,0x90                       .align  4,0x90
L43:                                 L72:
  movaps -16(%edx,%eax,4), %xmm3       movaps -16(%esi,%eax,4), %xmm2
  movaps %xmm3, %xmm2                  mulps  %xmm2, %xmm2
  mulps	 %xmm3, %xmm2                  movaps  %xmm0, %xmm1
  mulps	 %xmm3, %xmm2                  addps  -16(%edx,%eax,4), %xmm1
  movaps -16(%ecx,%eax,4), %xmm1       addps   %xmm1, %xmm2
  mulps	 %xmm1, %xmm1                  movaps -16(%edx,%eax,4), %xmm1
  addps	 %xmm0, %xmm1                  mulps   %xmm1, %xmm1
  addps	 %xmm3, %xmm1                  mulps  -16(%edx,%eax,4), %xmm1
  addps	 %xmm2, %xmm1                  addps   %xmm1, %xmm2
  movaps %xmm1, -16(%esi,%eax,4)       movaps  %xmm2, -16(%ecx,%eax,4)
  addl	 $4, %eax                      addl    $4, %eax
  cmpl	 $516, %eax                    cmpl    $516, %eax
  jne L43                              jne  L72
  incl   %edi                          incl    %edi
  cmpl	 $513, %edi                    cmpl    $513, %edi
  jne L42                              jne  L71

(gcc-4.2 -S -march=core2 -msse2 -O3)

The one on the left is actually the expression template version, while the right is the one generated from the SSE intrinsics. In this case gcc has actually even done a slightly better job on the high level code, eliminating a few redundant memory accesses present in the intrinsics code.

How?

An accidental side effect of templates was a Turing-complete, purely functional code-generation language embedded in C++[1]. The basic idea behind the technique is to use it as a code generator to build optimal code from nice syntax.

Expression templates

Normally, operator overloading in C++ results in many temporaries and poor performance. The basic idea behind expression templates is to encode the abstract syntax tree of the expression in the type of the variable during compilation, and to defer evaluation until later.

The trick is that the expression is parsed at compile time and stored as nested template arguments of an “expression type”. This allows the compiler to optimize the whole expression. It is not a new technique; it was invented in 1994 by T. Veldhuizen[4].

So, if we wanted to implement something like the above, how would it be done? Let’s give it a try. First, we need to be able to represent a pixel from an image.

template<typename T> struct Var {
  Var(const Image<T> &img) : m_img(img) {}
 
  const T eval() const { return line[x]; }
  void set_y(const int _y) { line = m_img[_y]; }
  void set_x(const int _x) { x = _x; }
 
  const Image<T> &m_img;
  int x;
  const T *line;
};
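The Image type itself is never shown in the post; Var only assumes that indexing with a row number yields a pointer to that row’s pixels. A minimal sketch of such a type (the row-major layout is my assumption):

```cpp
#include <cassert>
#include <vector>

// Minimal row-major image; operator[] returns a pointer to row y,
// which is the interface Var::set_y relies on.
template<typename T> struct Image {
  Image(int w, int h) : width(w), height(h), data(w * h) {}

  T *operator[](int y) { return &data[y * width]; }
  const T *operator[](int y) const { return &data[y * width]; }

  int width, height;
  std::vector<T> data;
};
```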

We also need to be able to represent a constant value across the image:

template<typename T> struct Constant {
  Constant(T c) : value(c) {}
 
  const T eval() const { return value; }
  void set_y(const int) {}
  void set_x(const int) {}
 
  T value;
};

Now that we can define variables and constants, it is time to do something with them. We can define a binary operator, taking a left and right expression (ExprT1, ExprT2) and a function to apply (F) returning a value of RetType.

template <typename RetType,
          class ExprT1, class ExprT2,
          template<typename T> class F>
struct BinOp {
  ExprT1 m_expr1;
  ExprT2 m_expr2;
 
  BinOp(ExprT1 expr1, ExprT2 expr2) :
    m_expr1(expr1), m_expr2(expr2) {}
 
  const RetType eval() const
  {
    return F<RetType>::apply(m_expr1.eval(),
                             m_expr2.eval());
  }
 
  void set_y(const int y)
  {
    m_expr1.set_y(y);
    m_expr2.set_y(y);
  }
  void set_x(const int x)
  {
    m_expr1.set_x(x);
    m_expr2.set_x(x);
  }
};

Together these pieces form an AST, encoded in the type system at compile time. Overloaded operators return the right template types. The only piece still missing is the actual arithmetic, passed as the fourth template argument (F) of the BinOp struct.

template<typename T> struct Add {
  static T apply(const T &a, const T &b) {
    return a + b;
  }
};
 
template<typename T> struct Mul {
  static T apply(const T &a, const T &b) {
    return a * b;
  }
};

Types get quite convoluted, however: a * b + 5.0f is represented by BinOp<float, BinOp<float, Var<float>, Var<float>, Mul>, Constant<float>, Add>.
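The operator overloads that assemble these types are not shown in the post. Stripped of the image machinery (no set_x/set_y, Constant operands only), a self-contained sketch of how they might look:

```cpp
#include <cassert>

// Leaf node: a constant value.
template<typename T> struct Constant {
  Constant(T c) : value(c) {}
  T eval() const { return value; }
  T value;
};

// Functors passed as the F template argument.
template<typename T> struct Add {
  static T apply(const T &a, const T &b) { return a + b; }
};
template<typename T> struct Mul {
  static T apply(const T &a, const T &b) { return a * b; }
};

// Interior node: a binary operation over two subexpressions.
template<typename RetType, class ExprT1, class ExprT2,
         template<typename> class F>
struct BinOp {
  ExprT1 m_expr1;
  ExprT2 m_expr2;
  BinOp(ExprT1 e1, ExprT2 e2) : m_expr1(e1), m_expr2(e2) {}
  RetType eval() const {
    return F<RetType>::apply(m_expr1.eval(), m_expr2.eval());
  }
};

// The overloads just wrap their operands in the matching BinOp type;
// a real library would also accept raw scalars and Var operands.
template<class E1, class E2>
BinOp<float, E1, E2, Add> operator+(E1 a, E2 b) {
  return BinOp<float, E1, E2, Add>(a, b);
}
template<class E1, class E2>
BinOp<float, E1, E2, Mul> operator*(E1 a, E2 b) {
  return BinOp<float, E1, E2, Mul>(a, b);
}
```

Writing a * b + c with Constant operands now builds the nested BinOp type automatically; no arithmetic happens until eval() is called.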

The interesting thing is what happens when the compiler generates code for eval() on the type representing the expression above. Somewhat simplified, this is what takes place:

eval() {
   return Add::apply(Multiplication.eval(),
                     Constant.eval());
}

which when you expand the evaluation another level becomes

eval() {
   return Add::apply(Mul::apply(var_a.eval(),
                                var_b.eval()),
                     constant_5.eval());
}

which expands to

eval() {
   return Add::apply(Mul::apply(line_a[x],
                                line_b[x]),
                     5.0f);
}

Now we’re starting to get down to the real data. The apply calls are replaced by plain arithmetic, and we get

eval() {
   return line_a[x] * line_b[x] + 5.0f;
}

which in turn will be inlined into the caller. In the end, all the overhead simply vanishes in a puff of function inlining, for which we rely on our trusty compiler.

For clarity, I have omitted the details of how SIMD is implemented, but if you imagine that the type T is a SIMD word, it is fairly close. The main difference is that evaluation proceeds in chunks of N elements, where N depends on the element type and the SIMD architecture.
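One piece the post leaves out is the loop that finally drives evaluation: assigning an expression to an output image. A self-contained sketch of what it might look like, using a trivial stand-in expression (the Image and assign names are mine, not from the post):

```cpp
#include <cassert>
#include <vector>

// Minimal row-major image for the sketch.
template<typename T> struct Image {
  Image(int w, int h) : width(w), height(h), data(w * h) {}
  T *operator[](int y) { return &data[y * width]; }
  int width, height;
  std::vector<T> data;
};

// Trivial expression standing in for a full expression tree;
// any expression type with eval/set_x/set_y would work here.
template<typename T> struct ConstantExpr {
  ConstantExpr(T c) : value(c) {}
  T eval() const { return value; }
  void set_y(int) {}
  void set_x(int) {}
  T value;
};

// The driver: position the expression at each pixel and store
// eval()'s result. In the SIMD variant, x advances in chunks of N.
template<typename T, class Expr>
void assign(Image<T> &out, Expr expr)
{
  for (int y = 0; y < out.height; ++y) {
    expr.set_y(y);
    T *line = out[y];
    for (int x = 0; x < out.width; ++x) {
      expr.set_x(x);
      line[x] = expr.eval();
    }
  }
}
```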

Why?

C++ is an incredibly convoluted language, by far the largest and most complex of the widely used programming languages. It has an undecidable grammar, unreadable error messages and glacial compile times, and because the language is so large, everybody knows and uses a different subset of it than everyone else. So why use it at all?

Techniques like this are part of the reason: you can write very powerful libraries combining abstraction and performance, something very few languages can offer. As compilers have matured, these design techniques have started to appear in many C++ libraries. For example, Blitz++[2] and MTL[3] use these and other techniques to build high-performance C++ libraries that rival hand-tuned Fortran code. For more inspiration on these and similar techniques, Alexandrescu’s “Modern C++ Design” is a good introduction[5].

Unfortunately, templates weren’t designed to do this, and the syntax shows it. Anyone familiar with Lisp macros will just laugh at the hoops C++ forces you to jump through to get it to act as a code generator. But we live in the real world, where there are many constraints other than code beauty.

References

[1] – T. Veldhuizen. C++ templates are Turing complete. 2003.
[2] – Blitz++.
[3] – Matrix Template Library.
[4] – T. Veldhuizen, “Expression Templates,” C++ Report, Vol. 7 No. 5 June 1995, pp. 26-31.
[5] – Andrei Alexandrescu. Modern C++ Design: Generic Programming and Design Patterns Applied. Addison-Wesley.