

SpriteBatch: possible performance issue under x64


I ported the DirectXTk SpriteBatch to SharpDX last month and recently found odd behavior in x64 builds that could potentially affect the DirectXTk version as well.
It seems that calling Map and filling the vertex buffer directly is much slower in x64 than in x86. But if you write all the vertices to a temporary buffer first, and then copy that buffer into the locked region returned by Map, performance is similar to x86 (I saw roughly a 4-5x difference, from 400 FPS down to 90 FPS, when drawing 5000 sprites).
I suspect the DirectXTk SpriteBatch has the same problem, as the code is very similar.
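
The workaround described above can be sketched roughly as follows. This is only an illustration in C++, not the SharpDX or DirectXTk code: `SpriteVertex`, `Sprite`, `BuildVertices` and `CopyToMapped` are hypothetical names, and the `mapped` pointer stands in for the pointer returned by Map. The vertices are built in ordinary memory first and then moved into the mapped region with one sequential copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical vertex layout for illustration; not the actual SpriteBatch vertex type.
struct SpriteVertex {
    float x, y, z;
    float r, g, b, a;
    float u, v;
};

// Hypothetical sprite description (position and size only).
struct Sprite {
    float x, y, width, height;
};

constexpr std::size_t kVerticesPerSprite = 4;

// Builds the four corner vertices of every sprite in ordinary (cached) memory.
std::vector<SpriteVertex> BuildVertices(const std::vector<Sprite>& sprites) {
    std::vector<SpriteVertex> staging;
    staging.reserve(sprites.size() * kVerticesPerSprite);
    for (const Sprite& s : sprites) {
        for (std::size_t corner = 0; corner < kVerticesPerSprite; ++corner) {
            const float cx = (corner & 1) ? 1.0f : 0.0f;  // 0, 1, 0, 1
            const float cy = (corner & 2) ? 1.0f : 0.0f;  // 0, 0, 1, 1
            const SpriteVertex vertex{s.x + cx * s.width, s.y + cy * s.height, 0.0f,
                                      1.0f, 1.0f, 1.0f, 1.0f,  // opaque white
                                      cx, cy};
            staging.push_back(vertex);
        }
    }
    return staging;
}

// Copies the staged vertices into `mapped` with a single sequential memcpy.
// `mapped` is an assumption standing in for the pointer returned by Map;
// `mappedBytes` is the size of that region. Returns the vertex count copied.
std::size_t CopyToMapped(const std::vector<SpriteVertex>& staging,
                         void* mapped, std::size_t mappedBytes) {
    const std::size_t bytes = staging.size() * sizeof(SpriteVertex);
    assert(bytes <= mappedBytes);
    if (bytes == 0) {
        return 0;
    }
    std::memcpy(mapped, staging.data(), bytes);
    return staging.size();
}
```

In a real renderer the destination would be the mapped vertex buffer rather than a plain array, and the staging vector would normally be reused across frames to avoid reallocating it.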

The code change in SharpDX is here:
Closed Feb 6, 2013 at 2:33 AM by ShawnHargreaves
I cannot repro this issue. I made a test that draws several thousand sprites in a tight loop. Results from release builds on my two machines show that x64 is noticeably faster, which is exactly what I'd expect given the more efficient FPU instruction set:
Laptop (Intel HD 3000)
    x86 = 0.24
    x64 = 0.18

Desktop (NVidia Quadro FX 580)
    x86 = 0.13
    x64 = 0.11
I can only think that your version ran into write-combining cache behavior because it did not write the output buffer in ascending order. Perhaps your C# port of the algorithm didn't preserve the output order of the original DirectXMath buffer-populating code, or perhaps the .NET codegen reordered these writes for some reason?
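
To illustrate what "writing the output buffer in ascending order" means, here is a minimal C++ sketch. It is not the DirectXTk implementation: `QuadVertex`, `Rect` and `FillAscending` are hypothetical names, and `mapped` stands in for the pointer returned by Map. Each vertex is assembled in a local variable and then stored, so the destination is only ever written, front to back, and never read. Note that a compiler may still reorder the individual stores within a vertex, so inspecting the generated code is the only way to confirm the actual order.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal vertex; the real SpriteBatch vertex carries more fields.
struct QuadVertex {
    float x, y;
    float u, v;
};

// Hypothetical input rectangle.
struct Rect {
    float x, y, width, height;
};

// Fills `mapped` front to back: each vertex is stored at a higher address
// than the previous one, and nothing is read back from `mapped`.
// `mapped` is an assumption standing in for the pointer returned by Map;
// `capacity` is measured in vertices. Returns the number of vertices written.
std::size_t FillAscending(QuadVertex* mapped, std::size_t capacity,
                          const Rect* rects, std::size_t rectCount) {
    assert(rectCount * 4 <= capacity);
    QuadVertex* out = mapped;
    for (std::size_t i = 0; i < rectCount; ++i) {
        const Rect& r = rects[i];
        // Assemble the quad in local (cached) memory first.
        const QuadVertex corners[4] = {
            {r.x,           r.y,            0.0f, 0.0f},
            {r.x + r.width, r.y,            1.0f, 0.0f},
            {r.x,           r.y + r.height, 0.0f, 1.0f},
            {r.x + r.width, r.y + r.height, 1.0f, 1.0f},
        };
        for (const QuadVertex& corner : corners) {
            *out++ = corner;  // write-only store, ascending addresses
        }
    }
    return static_cast<std::size_t>(out - mapped);
}
```

Patterns to avoid on write-combined memory would be things like `mapped[i].x += offset` (which reads the destination) or filling the buffer back to front.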


walbourn wrote Nov 16, 2012 at 1:42 AM

What OS and driver/vendor was this issue found on?

alexandre_mutel wrote Dec 18, 2012 at 3:06 PM

If I remember correctly, I had this issue on a Win7 x64 machine with an NVidia GTX580 and the latest WHQL drivers, as well as on a Win8 x64 machine with an ATI 6400M and the latest drivers.

I did not try to set up a reproducible program using DirectXTk, so I'm just speculating about this issue, as my port was an almost one-to-one translation of the DirectXTk code.

walbourn wrote Dec 18, 2012 at 7:02 PM

It sounds like the difference between write-combined and write-through cache behavior. Thanks for the heads-up. We can see about adding some perf tests to our internal suite.