So, I recently stumbled upon Karan Misra’s post comparing Go and C++ on a neat little business card raytracer which made the rounds a few days ago.
Performance is a tricky matter. Novice programmers have a tendency to first over-optimize everything, then sometime in their career hear Knuth’s “premature optimization is the root of all evil” and then deride everybody who thinks about performance.
C++ is not my favorite language, but it is the one I spend the most time using. A lot of my working days are spent writing the kind of software where performance matters. That isn’t the case for the large majority of programs or programmers, but there are certain kinds of software (typically embedded and real time) where it’s worth spending a few hours thinking about how to make it run fast. Your web site is probably not it, even if you were just featured on Engadget. An LTE modem which should transfer 300 Mbit/s while consuming milliwatts is a better candidate.
Rob Pike is a brilliant programmer and person, and if you need convincing of that, well – just read his Wikipedia page. But I found his post on why C++ programmers haven’t flocked to Go quite a bit off the mark. I certainly don’t use C++ instead of Go because I like 200 pages of template error messages. But it offers me something I haven’t found in any other language: it is expressive enough that I don’t feel completely stuck in the 70s, but still gives me enough control that I can predict what the machine will do. Go, with its mandatory GC and lack of access to machine primitives, is not a good answer for those cases.
Because when you actually do code for performance, in those small bits of code in inner loops where it’s warranted to do so, your priorities change. The language you code in ends up being… less relevant; abstractions fade away and you try to divine communication directly with the hardware that will be running your code. The thinking goes from “how do I express this idea clearly in code?” towards “how do I pull and poke at the processor so that it executes this with a full pipeline?”. There’s you, the processor and the compiler in some kind of symbiosis, and the various transformations that each of them performs matter more than syntax.
So, after four paragraphs of rambling, let me try to stumble back to where I started: a raytracer. I took a look at it to see what you could do if you wrote it like you write software where performance matters. I didn’t want to spend all day doing it, so I stuck to modifying one single function, which I reimplemented using G++ vector extensions and Intel’s AVX instruction set.
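To give a feel for what the vector extensions look like, here is a minimal sketch. It is not the code from my change: the struct-of-arrays layout `Spheres8`, the helper `splat8` and the function `discriminant8` are hypothetical names made up for illustration, and I assume all spheres share one radius. It computes the ray-sphere discriminant for eight spheres at once, which the compiler can map onto 256-bit AVX registers when built with -mavx or -march=native on an AVX machine.

```cpp
#include <cassert>

// Eight packed floats via the GCC/Clang vector extension.
// With AVX enabled, one v8f fits in a single 256-bit register.
typedef float v8f __attribute__((vector_size(32)));

// Broadcast one scalar into all eight lanes.
static inline v8f splat8(float x) {
    v8f v = {x, x, x, x, x, x, x, x};
    return v;
}

// Hypothetical structure-of-arrays layout: lane i holds the centre of sphere i.
struct Spheres8 {
    v8f cx, cy, cz;
};

// Ray with origin (ox, oy, oz) and unit direction (dx, dy, dz).
// Returns b*b - c per lane, where p = origin - centre, b = p.d and
// c = p.p - radius*radius. A positive lane means the ray's line
// intersects that sphere.
static inline v8f discriminant8(const Spheres8& s,
                                float ox, float oy, float oz,
                                float dx, float dy, float dz,
                                float radius) {
    assert(radius > 0.0f);
    const v8f px = splat8(ox) - s.cx;
    const v8f py = splat8(oy) - s.cy;
    const v8f pz = splat8(oz) - s.cz;
    const v8f b = px * splat8(dx) + py * splat8(dy) + pz * splat8(dz);
    const v8f c = px * px + py * py + pz * pz - splat8(radius * radius);
    return b * b - c;
}
```

The point of this style is that the arithmetic reads almost like the scalar version, while each operator works on eight lanes at a time. Whether the compiler actually emits the AVX instructions you hoped for is something you still have to check in the disassembly.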
My change had the following impact on run time – from top to bottom: Go, original C++ and my optimized version.
My compiler flags were: c++ -std=c++11 -O2 -g -Wall -pthread -ffast-math -mtune=native -march=native (gcc version 4.6.3)
Prior to my optimization (as reported by perf stat):
   8863,934376 task-clock                #    3,934 CPUs utilized
         1 213 context-switches          #    0,137 K/sec
             7 cpu-migrations            #    0,001 K/sec
           535 page-faults               #    0,060 K/sec
22 063 102 197 cycles                    #    2,489 GHz                      [83,27%]
16 064 668 982 stalled-cycles-frontend   #   72,81% frontend cycles idle     [83,28%]
 5 227 501 506 stalled-cycles-backend    #   23,69% backend cycles idle      [66,79%]
21 652 209 811 instructions              #    0,98  insns per cycle
                                         #    0,74  stalled cycles per insn  [83,38%]
 1 979 364 705 branches                  #  223,305 M/sec                    [83,36%]
    55 751 528 branch-misses             #    2,82% of all branches          [83,34%]

   2,253349546 seconds time elapsed
After my optimization:
   4056,958385 task-clock                #    3,900 CPUs utilized
           603 context-switches          #    0,149 K/sec
             7 cpu-migrations            #    0,002 K/sec
           538 page-faults               #    0,133 K/sec
10 091 407 696 cycles                    #    2,487 GHz                      [83,25%]
 6 897 321 723 stalled-cycles-frontend   #   68,35% frontend cycles idle     [82,97%]
 2 478 649 915 stalled-cycles-backend    #   24,56% backend cycles idle      [66,89%]
10 626 263 820 instructions              #    1,05  insns per cycle
                                         #    0,65  stalled cycles per insn  [83,49%]
   896 560 476 branches                  #  220,993 M/sec                    [83,49%]
    53 713 339 branch-misses             #    5,99% of all branches          [83,52%]

   1,040250808 seconds time elapsed
It goes from 22 billion to 10 billion cycles, and from 2.25 seconds to 1.04 seconds, on my not-very-fast-at-all Core i3-2100T. The Go version (1.2rc1) takes 5.0 seconds on the same hardware. So the decently optimized C++ version is 2.2x faster than a decently optimized Go version. But if you’re willing to really talk to the processor in one single function, you can gain an additional 2.2x. And this is before we’ve seriously started to structure the raytracing kernel for performance; I would not be surprised if highly optimized production code gained another 2-3x. That is the kind of expressiveness that matters, for those few programs where performance matters.
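For the curious, the ratios can be read straight off the perf stat figures above. This is just rounded arithmetic on those numbers, nothing more:

```latex
\begin{align*}
\text{Go vs. original C++:} \quad & \frac{5.0\ \text{s}}{2.25\ \text{s}} \approx 2.2 \\
\text{original vs. optimized C++:} \quad & \frac{2.25\ \text{s}}{1.04\ \text{s}} \approx 2.2 \\
\text{Go vs. optimized C++:} \quad & \frac{5.0\ \text{s}}{1.04\ \text{s}} \approx 4.8 \\
\text{cycles:} \quad & \frac{22.06 \times 10^{9}}{10.09 \times 10^{9}} \approx 2.2 \\
\text{instructions:} \quad & \frac{21.65 \times 10^{9}}{10.63 \times 10^{9}} \approx 2.0 \\
\text{instructions per cycle:} \quad & \frac{1.05}{0.98} \approx 1.07
\end{align*}
```

Since cycles = instructions / IPC, the cycle ratio is roughly 2.0 × 1.07 ≈ 2.2. In other words, nearly all of the gain comes from executing about half as many instructions, and only a little from keeping the pipeline busier.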
The results for the other image sizes were really quite similar:
Code is on GitHub.
If you’re interested in this kind of silly optimizing-for-favorite-language-until-it-bleeds, you might find Debian’s Computer Language Benchmarks Game fun. And yes, before some eagle-eyed commenter notices – my version doesn’t work if the number of objects is not divisible by the number of elements in your SIMD words. But this is a toy and I didn’t want to clutter the code.
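One common way around the divisibility limitation, which I did not implement, is to pad the object arrays up to a multiple of the SIMD width with dummy objects that can never be hit. A minimal sketch, assuming eight float lanes as with AVX; `kLanes`, `round_up_to_lanes` and `pad_to_lanes` are hypothetical names for illustration:

```cpp
#include <cstddef>
#include <vector>

// Number of float lanes per SIMD word (8 for 256-bit AVX).
const std::size_t kLanes = 8;

// Smallest multiple of kLanes that is >= n.
inline std::size_t round_up_to_lanes(std::size_t n) {
    return (n + kLanes - 1) / kLanes * kLanes;
}

// Pads one coordinate array so its length is a multiple of kLanes.
// The extra slots are filled with a sentinel value, e.g. a coordinate
// far outside the scene so a padded object is never intersected.
inline std::vector<float> pad_to_lanes(std::vector<float> xs, float sentinel) {
    xs.resize(round_up_to_lanes(xs.size()), sentinel);
    return xs;
}
```

For example, `pad_to_lanes(centres_x, 1e6f)` would turn nine x-coordinates into sixteen. The sentinel should be far outside the scene but not so large that squaring it overflows a float, especially with -ffast-math, where infinities and NaNs are not handled reliably. The alternative is a scalar loop for the leftover objects, which avoids wasted lanes at the cost of a second code path.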
Go is a good language, but as Rob found – it offers more to Python and Ruby programmers, who get a good performance gain at little cost in expressiveness. But I don’t see it replacing C or C++ as the tool of choice for writing core infrastructure. Unfortunately, because C++ needs replacing. Personally, I’m hoping for Rust.