After reading “SIMD at Insomniac Games – How we do the shuffle”, I realised I hadn’t done anything with my new Raspberry Pi 2 yet. The first thing I decided to do was figure out whether its NEON unit was any good, so I chose a really simple example, calculating a ton of dot products on 4-element vectors, and wrote a bunch of different implementations of it.

The results are surprising. Sure enough, I did manage (using inline asm) to get a NEON version going faster than everything else, but what shocked me is that I couldn’t get clang to output vector code that was faster than the scalar version. I tried a ton of different methods (pragma unroll, manual unroll, inlining on and off) and yet the scalar implementation was always faster.
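For reference, a scalar baseline for this workload might look like the sketch below. The function name `dot4_scalar` and the memory layout (vectors packed back to back, four floats each) are assumptions for illustration, not taken from the test file.

```c
#include <stdint.h>

/* Hypothetical scalar baseline: computes `length` dot products.
 * x and y each hold 4 * length floats, packed as consecutive
 * 4-element vectors; o receives one float per pair of vectors. */
void dot4_scalar(uint32_t length, float *o, const float *x, const float *y) {
  for (uint32_t i = 0; i < length; i++) {
    const float *a = x + 4 * i;
    const float *b = y + 4 * i;
    o[i] = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
  }
}
```

Something of this shape is what the compiler gets to schedule freely, which may be part of why it holds up so well against the vector versions.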

It seems that at least part of the slowdown is that all the vector versions use the vldn load instructions, whereas the scalar ones (and my fastest hand-written version) use vldmia to load their data.
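To make the comparison concrete, here is a minimal sketch of the kind of intrinsics-based vector version that would typically end up with `vld1` loads. It is not one of the implementations from the test file: the name `dot4_intrinsics` is made up, and the non-NEON fallback branch exists only so the sketch builds on other machines.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical vector version: one 4-element dot product per iteration. */
void dot4_intrinsics(uint32_t length, float *o, const float *x, const float *y) {
  for (uint32_t i = 0; i < length; i++) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Load both vectors and multiply lane by lane. */
    float32x4_t p = vmulq_f32(vld1q_f32(x + 4 * i), vld1q_f32(y + 4 * i));
    /* Pairwise add: [p0 + p1, p2 + p3], then fold the two halves together. */
    float32x2_t s = vpadd_f32(vget_low_f32(p), vget_high_f32(p));
    s = vpadd_f32(s, s);
    o[i] = vget_lane_f32(s, 0);
#else
    /* Portable fallback using the same summation order as the NEON path. */
    const float *a = x + 4 * i;
    const float *b = y + 4 * i;
    o[i] = (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
#endif
  }
}
```

Which load instruction the compiler actually picks depends on the compiler version and flags, so it is worth checking the generated assembly rather than assuming.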

My best attempt was:

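void ssvaimu8(
  uint32_t length,
  float* o,
  float* x,
  float* y) {
  // Handles 8 dot products per iteration, so length is assumed to be a
  // non-zero multiple of 8.
  float* oEnd = o + length;
  __asm__ __volatile__(
    "label4:\n\t"
    // Load 8 vectors from x into q0-q7 and 8 from y into q8-q15.
    "vldmia %P1!, {q0-q7}\n\t"
    "vldmia %P2!, {q8-q15}\n\t"
    "pld [%P1]\n\t"
    "pld [%P2]\n\t"
    // Element-wise products.
    "vmul.f32 q0, q0, q8\n\t"
    "vmul.f32 q1, q1, q9\n\t"
    "vmul.f32 q2, q2, q10\n\t"
    "vmul.f32 q3, q3, q11\n\t"
    "vmul.f32 q4, q4, q12\n\t"
    "vmul.f32 q5, q5, q13\n\t"
    "vmul.f32 q6, q6, q14\n\t"
    "vmul.f32 q7, q7, q15\n\t"
    // Pairwise adds reduce the first four products to d0-d1.
    "vpadd.f32 d0, d0, d1\n\t"
    "vpadd.f32 d2, d2, d3\n\t"
    "vpadd.f32 d4, d4, d5\n\t"
    "vpadd.f32 d6, d6, d7\n\t"
    "vpadd.f32 d0, d0, d2\n\t"
    "vpadd.f32 d1, d4, d6\n\t"
    // Same again for the last four products, ending up in d2-d3.
    "vpadd.f32 d8, d8, d9\n\t"
    "vpadd.f32 d10, d10, d11\n\t"
    "vpadd.f32 d12, d12, d13\n\t"
    "vpadd.f32 d14, d14, d15\n\t"
    "vpadd.f32 d2, d8, d10\n\t"
    "vpadd.f32 d3, d12, d14\n\t"
    // Store the 8 results (s0-s7 alias d0-d3) and loop.
    "vstmia %P0!, {s0-s7}\n\t"
    "cmp %P0, %P3\n\t"
    "pld [%P0]\n\t"
    "blt label4\n\t"
    // The label and the operand/clobber lists are a best guess: o, x and y
    // are read-write because of the '!' writeback.
    : "+r" (o),
      "+r" (x),
      "+r" (y)
    : "r" (oEnd)
    : "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",
      "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15",
      "cc", "memory");
}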

I do wonder if I could hide more of the cost of the loads/stores by double buffering the workload. That’s the only other thing I can think to try!
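As a rough illustration of what double buffering could mean here, the sketch below fetches the next pair of vectors into a second buffer before computing on the current pair. This is only a structural sketch in plain C with made-up names (`dot4_double_buffered`, `dot4`). A real version would presumably alternate between two sets of NEON registers in the asm loop, and nothing here has been measured.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: dot product of two 4-element vectors. */
static float dot4(const float a[4], const float b[4]) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Hypothetical double-buffered loop: loads for iteration i + 1 are
 * issued before the arithmetic for iteration i. */
void dot4_double_buffered(uint32_t length, float *o,
                          const float *x, const float *y) {
  float bx[2][4], by[2][4];
  if (length == 0) return;

  /* Prime the first buffer. */
  memcpy(bx[0], x, sizeof bx[0]);
  memcpy(by[0], y, sizeof by[0]);

  for (uint32_t i = 0; i < length; i++) {
    uint32_t cur = i & 1;
    uint32_t next = cur ^ 1;
    /* Fetch the next pair of vectors into the idle buffer. */
    if (i + 1 < length) {
      memcpy(bx[next], x + 4 * (i + 1), sizeof bx[next]);
      memcpy(by[next], y + 4 * (i + 1), sizeof by[next]);
    }
    /* Compute on the buffer that was filled last iteration. */
    o[i] = dot4(bx[cur], by[cur]);
  }
}
```

In C the compiler is free to reorder these loads anyway, so the ordering only really becomes something you control once it is written by hand in asm.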

If you want to check out the rest of the code yourself, or run the examples, here is the test file.