On comparing languages, C++ and Go

October 5th, 2013

So, I recently stumbled upon Karan Misra’s post comparing Go and C++ on a neat little business card raytracer which made the rounds a few days ago.

Performance is a tricky matter. Novice programmers have a tendency to first over-optimize everything, then sometime in their career hear Knuth’s “premature optimization is the root of all evil” and then deride everybody who thinks about performance.

C++ is not my favorite language, but it is the one I spend the most time using. I lot of my working days are spent writing the kind of software where performance matters. That isn’t the case for the large majority of programs or programmers, but there are certain kinds of software (typically embedded and real time) where it’s worth the time to spend a few hours thinking about how to make it run fast. Your web site is probably not it, even if you were just featured on engadget. A LTE modem which should transfer 300Mbit while consuming milliwatts is is a better candidate.

Rob Pike is a brilliant programmer and person, and if you need convincing of that, well – just read his Wikipedia page. But I found his post on why C++ programmers haven’t flocked to go quite a bit off the mark. I certainly don’t use C++ instead of go because I like 200 pages of template error messages. But it offers me something I haven’t found in any other language: expressive enough so not feel completely stuck in the 70s but still enough control that I can predict what the machine will do. Go is not a good answer for those cases with its mandatory GC and lack of access to machine primitives.

Because when you actually do code for performance, in those small bits of code in inner loops where it’s warranted to do so, your priorities change. The language you code in ends up being… less relevant, abstractions fade away and you try to divine communication directly with the hardware that will be running your code. The thinking goes from “how do I express this idea clearly in code?” towards how do I pull and poke at the processor so that it executes this with a full pipeline?”. There’s you, the processor and the compiler in some kind of symbiosis and the various transformations that each do matter more than syntax.

So, after four paragraphs of rambling, let me try to stumble back to where I started: a raytracer. I took a look at it to see what you could do if you wrote it like you write software where performance matters. I didn’t want to spend all day doing it, so I stuck to modifying one single function which I reimplemented using G++ vector extensions and Intels AVX instruction set.

It had the below impact on run time – from top to bottom: Go, Original C++ and my optimized version.

My compiler flags were: c++ -std=c++11 -O2 -g -Wall -pthread -ffast-math -mtune=native -march=native (gcc version 4.6.3)

Prior to my optimization (as reported by perf stat):

       8863,934376 task-clock                #    3,934 CPUs utilized          
             1 213 context-switches          #    0,137 K/sec                  
                 7 cpu-migrations            #    0,001 K/sec                  
               535 page-faults               #    0,060 K/sec                  
    22 063 102 197 cycles                    #    2,489 GHz                     [83,27%]
    16 064 668 982 stalled-cycles-frontend   #   72,81% frontend cycles idle    [83,28%]
     5 227 501 506 stalled-cycles-backend    #   23,69% backend  cycles idle    [66,79%]
    21 652 209 811 instructions              #    0,98  insns per cycle        
                                             #    0,74  stalled cycles per insn [83,38%]
     1 979 364 705 branches                  #  223,305 M/sec                   [83,36%]
        55 751 528 branch-misses             #    2,82% of all branches         [83,34%]

       2,253349546 seconds time elapsed

and afterwards:

       4056,958385 task-clock                #    3,900 CPUs utilized          
               603 context-switches          #    0,149 K/sec                  
                 7 cpu-migrations            #    0,002 K/sec                  
               538 page-faults               #    0,133 K/sec                  
    10 091 407 696 cycles                    #    2,487 GHz                     [83,25%]
     6 897 321 723 stalled-cycles-frontend   #   68,35% frontend cycles idle    [82,97%]
     2 478 649 915 stalled-cycles-backend    #   24,56% backend  cycles idle    [66,89%]
    10 626 263 820 instructions              #    1,05  insns per cycle        
                                             #    0,65  stalled cycles per insn [83,49%]
       896 560 476 branches                  #  220,993 M/sec                   [83,49%]
        53 713 339 branch-misses             #    5,99% of all branches         [83,52%]

       1,040250808 seconds time elapsed

It goes from 22 billion to 10 billion cycles, from 2.2 seconds to 1 second on my not-very-fast-at-all Core i3-2100T. The go version (1.2rc1) takes 5.0 seconds on the same hardware. So the decently optimized c++ version is 2.2x faster than a decently optimized go version. But if you’re willing to really talk to the processor in one single function you can gain an additional 2.2x. And this is before we’ve seriously started to structure the raytracing kernel for performance, I would not be surprised if this was highly optimized production code we’d see another 2-3x. That is the kind of expressiveness that matters, for those few programs where performance matters.

The results for the other image sizes were really quite similar:

Code is on github.

If you’re interested in this kind of silly optimizing-for-favorite-language-until-it-bleeds, you might find Debian’s Computer Language Benchmarks Game fun. And yes, before some eagle eyed commenter notices – it doesn’t work if you have a number of objects not divisible by the number of elements in your SIMD words. But this is a toy and I didn’t want to clutter the code.

Go is a good language, but as Rob found – it offers more to Python and Ruby programmers who gain a good performance gain at little lost expressiveness. But I don’t see it replacing C or C++ as the tool of choice for writing core infrastructure. Unfortunately, because C++ needs replacing. Personally, I’m hoping for Rust.

Future readers might be interested in checking out the Hacker news and /r/programming posts for more detail and discussion.

A programmer’s time

July 18th, 2009

A programmer’s time in front of a computer is to a large extent spent manipulating text. While mundane compared to the actual knowledge of implementing systems, it’s the basic trade craft of programming. So we care about text editors. A lot. There’s been feuds among programmers which editor is the best since the beginning of time (that is, 1950′s, or so).

The difference, in my mind, between a good programmer’s editor and a great one is thousands of small little tweaks. In a great editor, you can quickly and easily automate repetitive tasks and there are commands for handling hundreds of things that may not be commands that you do every time – but once you learn enough of them you’ll end up saving a lot of time. This is a lot of the reason why editors like emacs and vim which have been around for decades are still used and loved: They’ve been around long enough so that people have written all these clever shortcuts which makes a programmer’s job easier. Replicating these decades of tweaks is not an easy task.

I discovered Qt Creator a few weeks ago. It definitely holds a lot of promise, especially considering its age – it’s only been out and public for a few months; though obviously the Trolltech people have been working on it internally for longer than that. As an IDE, it’s absolutely fantastic. It has some of the feel of the Java or C# IDE’s that everybody else has been getting jealous over for the last decade. As an editor however, there’s work to be done.

One fairly common task in C++ is that you have a header file, like the one below, and you want to rename the class Foo to something else.

#ifndef FOO_H
#define FOO_H
 
class Foo {
    void foo();
}
 
#endif // FOO_H

Say we want to rename everything from foo to bar. Easily done, just replace foo with bar globally. This is what we’ll get in most editors:

#ifndef bar_H
#define bar_H
 
class bar {
    void bar();
}
 
#endif // bar_H

You’ll see that the editor did exactly what we requested. Replaced foo with bar. Now, unfortunately, that means that some poor schmuch (that is, me) will have to go back and fix the capitalization, because all good programmers care about consistency and code style and not to mention that bar and Bar are two different things in almost all programming languages.

If you do the replace command in emacs however, by default it’ll preserve the case of the replaced text:

#ifndef BAR_H
#define BAR_H
 
class Bar {
    void bar();
}
 
#endif // BAR_H

It saves me maybe 30 seconds of typing each time, but over time it adds up. More importantly, I don’t feel like the editor is interrupting me. I can continue with what I wanted to accomplish, without breaking my flow with trivial details such as the case of a few words.

So, what does this have to do with Qt Creator that I mentioned earlier? I took a look at the source code of it, and found it rather pleasant to fix and hack around in.

So I created a patch which implements this case preserving replace in Qt Creator. There are hundreds of other small tweaks that would make the editor in Creator go from a good to a great editor, hopefully I can continue helping by implementing at least a few more.

not quite gotten it, have they?

January 19th, 2009

vs2008

Microsoft hasn’t really gotten this whole Internet thing yet, have they? From the VS 2008 installer.

Abstraction vs. Performance

January 16th, 2009

With most applications these days, there’s no need to worry about performance. Modern computers are amazingly fast, fast enough that most programmers never need to worry about optimization beyond making sure they don’t use overly naive algorithms (or in most cases, not even that: just buy another server for your cool web 2.0 site). However, in some cases you do need to spend some time optimizing. Sometimes you don’t have the luxury of controlling what hardware your application runs on.

When you are forced to care about low level performance, one question arises: Isn’t abstraction bad for performance? Conventional wisdom is that in order to squeeze good performance out of processors, you’ll need to write low level code. Conventional wisdom is mostly true too, but if you carefully employ certain abstractions that don’t inhibit compiler optimization you can get very close to peak machine performance with high level syntax.

There is a movement in the C++ community called generic programming which focuses on this sort of careful use of abstractions. A few other languages also support this style of programming, most notably Haskell and D, but C++ is the most used language in this particular area.

Example

I’m going to show a small example :) Image processing is a task that often involves applying a formula to million of pixels. Especially if you want to do real time processing of video streams, it’s not something you can just easily slap together in PHP or Ruby. Just as an example, assume you’d want to calculate an out image based on taking two images, calculating some silly formula like out = x² + c + y + y³. where x,y are floating point variables representing each pixel and c is some constant.

Implemented in SSE2 SIMD intrinsics it might look something like this:

for(int y=0;y<out.height;y++) {
      __m128 *imgP = (__m128*)img[y];            
      __m128 *img2P = (__m128*)img2[y]);
      __m128 *outP = (__m128*)out[y];
      __m128 iterP = _mm_set_ps1((float)constant);  
      for(int x=0;x<out.width;x+=4) {        
        __m128 m1 = _mm_mul_ps(*imgP, *imgP); 
        __m128 m2 = _mm_add_ps(iterP,*img2P); 
        __m128 m3 = _mm_add_ps(m1,m2);         
        __m128 m4 = _mm_mul_ps(*img2P,*img2P);
        __m128 m5 = _mm_mul_ps(m4,*img2P);     
        *outP = _mm_add_ps(m3,m5);             
        *outP++;
        *imgP++;
        *img2P++;
    }
}

Now, that was quite unreadable. It matlab or octave it might have been written like

out = x .* x .+ constant .+ y .+ y.*y.*y;

using the elementwise version of the arithmetic operations. Wouldn’t it be nice if we could express it with similar syntax in C++:

Var x(img);
Var y(img2);
 
out = x*x + constant + y + y*y*y;

And in fact, we can. Using operator overloading and a technique called expression templates, we can get quite good performance out of readable syntax. Here is a comparison of the two implementations, with the horizontal scale as clock cycles per pixel:

performance_comparison

If we take a peek at what the compiler did by looking at the generated assembly we see something surprising: The high level, abstracted, generic, portable code and low level code generated almost exactly the same code. Can you guess which one is which?

L42:                                 L71:
  movl -60(%ebp), %eax                 movl -68(%ebp), %eax
  movl -4(%eax,%edi,4), %esi           movl -4(%eax,%edi,4), %esi
  movl -68(%ebp), %edx                 movl -64(%ebp), %ecx
  movl -4(%edx,%edi,4), %ecx           movl -4(%ecx,%edi,4), %edx
  movl -64(%ebp), %eax                 movl -52(%ebp), %eax
  movl -4(%eax,%edi,4), %edx           movl -4(%eax,%edi,4), %ecx
  movl  $4, %eax                       movl  $4, %eax
  .align  4,0x90                       .align  4,0x90
L43:                                 L72:
  movaps -16(%edx,%eax,4), %xmm3       movaps -16(%esi,%eax,4), %xmm2
  movaps %xmm3, %xmm2                  mulps  %xmm2, %xmm2
  mulps	 %xmm3, %xmm2                  movaps  %xmm0, %xmm1
  mulps	 %xmm3, %xmm2                  addps  -16(%edx,%eax,4), %xmm1
  movaps -16(%ecx,%eax,4), %xmm1       addps   %xmm1, %xmm2
  mulps	 %xmm1, %xmm1                  movaps -16(%edx,%eax,4), %xmm1
  addps	 %xmm0, %xmm1                  mulps   %xmm1, %xmm1
  addps	 %xmm3, %xmm1                  mulps  -16(%edx,%eax,4), %xmm1
  addps	 %xmm2, %xmm1                  addps   %xmm1, %xmm2
  movaps %xmm1, -16(%esi,%eax,4)       movaps  %xmm2, -16(%ecx,%eax,4)
  addl	 $4, %eax                      addl    $4, %eax
  cmpl	 $516, %eax                    cmpl    $516, %eax
  jne L43                              jne  L72
  incl   %edi                          incl    %edi
  cmpl	 $513, %edi                    cmpl    $513, %edi
  jne L42                              jne  L71

(gcc-4.2 -S -march=core2 -msse2 -O3)

The one on the left is actually the expression template version, while the right is the one generated from the SSE intrinsics. In this case gcc has actually even done a slightly better job on the high level code, eliminating a few redundant memory accesses present in the intrinsics code.

How?

An accidental side effect of templates was a turing-complete pure functional code generation language in C++[1]. The basic idea behind the technique is to use that as a code generator to build optimal code from nice syntax.

Expression templates

Normally operator overloading in C++ results in many temporaries and poor performance. The basic idea behind expression templates is to encode abstract syntax tree of the expression in the type of the variable during compilation, and defer evaluation until later.

The trick is that the expression is parsed at compile time, and stored as nested template
arguments of an “expression type”. This allows the compiler to do optimization on the whole expression. It is not a new technique, in fact it was invented in 1994 by T. Veldhuizen[4].

So, if we wanted to implement something like the above, how would it be done? Lets give it a try. First we would need to be able to represent a pixel from an image.

template<typename T> struct Var {
  Var(const Image<float> &img): m_img(img) {}
 
  const T eval() const { return line[x]; }
   void set_y(const int _y) { line = img[y]; }
   void set_x(const int _x) { x = _x; }
 
  const Image<T> &m_img;
  int x;
  T * line;
};

We also need to be able to represent a constant value across the image:

template struct Constant {
  Constant(T c): value(c){}
 
  const T eval() const { return value; }
  void set_y(const int _y) {}
  void set_x(const int _x) {}
};

Now that we can define variables and constants, it is time to do something with them. We can define a binary operator, taking a left and right expression (ExprT1, ExprT2) and a function to apply (F) returning a value of RetType.

template <typename RetType, 
          class ExprT1, class ExprT2, 
          template<typename T> class F>
struct BinOp {
   ExprT1 m_expr1;
   ExprT2 m_expr2;
 
   BinOp (ExprT1 expr1, ExprT2 expr2) : 
     m_expr1(expr1), m_expr2(expr2)  {}
 
   const RetType eval() const 
   { 
      return F<RetType>::apply(m_expr1.eval(), 
		               m_expr2.eval()); 
   }    
 
   void set_y(const int y)
   {
      m_expr1.set_y(y);
      m_expr2.set_y(y);
   } 
   void set_x(const int x)
   {
      m_expr1.set_x(x);
      m_expr2.set_x(x);
   }
};

Together this forms an AST, evaluated at compile time. Overloaded operators return the right type templates. The only piece missing from this is the actual arithmetic operations to be passed as the fourth template argument (F) of the BinOp struct.

template<typename T> struct Add {
	static T apply(const T &a, const T &b) { 
          return a + b; 
       }
};
 
template<typename T> struct Mul {
	static T apply(const T &a, const T &b) { 
          return a * b; 
       }
};

Types get quite convoluted however: a x b + 5.0f is represented by BinOp<float, BinOp<float, Var<float>, Var<float>, Mul>, Constant<float>, Add>.

The interesting thing is that when the compiler generates code for eval() on the expression type representing the expression above, what will happen? Somewhat simplified, this is what takes place:

eval() {
   return Add::apply(Multiplication.eval(),
                     Constant.eval());
}

which when you expand the evaluation another level becomes

eval() {
   return Add(Mul::apply(var_a.eval(),
                         var_b.eval()),
              constant_5.eval());
}

which expands to

eval() {
   return Add::apply(Mul::apply(line_a[x],
                                line_b[x]),
                     5.0f);
}

now we’re starting to get down to the real data. The apply calls will be replaced by arithmetic and we get

eval() {
   return line_a[x] * line_b[x] + 5.0f;
}

which in turn will be inlined into the caller. In the end, the all the overhead simply vanishes in a puff of function inlining, for which our trusted compiler is relied upon.

For clarity, I have omitted the details of how SIMD is implemented, but if you just imagine that the the type T is a SIMD word, it is fairly close. The main difference is that the evaluation is done in chunks of N words, where N depends on the element type and SIMD architecture.

Why?

C++ is an incredibly convoluted language. It is by far the largest and most complex of widely used programming languages. It has an undecidable grammar, unreadable error messages, takes ages to compile and because the language is so large everybody knows and uses a different subset of the language than everyone else. So why use it at all?

Techniques like this are part of the reason: You can write very powerful libraries, combining abstraction and performance. This is something very few languages can offer. As compilers have matured, these types of design techniques have started to appear in many C++ libraries. For example Blitz++[2], MTL[3] use these and other techniques to build high performance C++ libraries that rival handtuned Fortran code. For some more inspiration into these and similar techniques Alexandrescu’s “Modern C++ Design” is a good introduction.[5]

Unfortunately templates weren’t designed to do this, and the syntax shows it. Anyone familiar with lisp macros would just laugh at the hoops C++ forces you to hop though to get it to act as a code generator. But often we live in the real world, where there are many constraints other than code beauty.

References

[1] – T. Veldhuizen. C++ templates are Turing complete. 2003.
[2] – Blitz++.
[3] – Matrix Template Library.
[4] – T. Veldhuizen, “Expression Templates,” C++ Report, Vol. 7 No. 5 June 1995, pp. 26-31.
[5] – Andrei Alexandrescu. Modern C++ Design: Generic Programming and Design Patterns Applied. Addison-Wesley.

Hello world!

December 30th, 2008

Well. I guess I’ll give this another try.

int main()
{
  printf("Hello world!\n");
}