libera/#sbcl - IRC Chatlog
13:37:44
hayley
Is it likely that I am digging myself into a hole if my next thought on speeding up sweeping in my GC is to use SIMD? I don't think there has been much use of, e.g., intrinsics in the runtime before, which makes me nervous. And perhaps it is a sign of a bad algorithm if scalar code performs poorly.
14:56:26
mfiano
hayley: Have you tried running the machine code through llvm-mca? Or uiCA, which should be more accurate.
14:58:55
mfiano
I found it very useful for debugging why things are slow. In my recent use, I could see that the SIMD version actually uses more instructions, and executes at fewer instructions per cycle than the scalar version.
14:59:24
mfiano
assembly I got targeting skylake-avx512, but analyzed under the zen2 cost model: https://gist.github.com/mfiano/3377a7e7804f279eaa9478f88062e858
14:59:48
mfiano
Basically, resource [8], the Zen2FPU0 (first Zen2 floating point unit) is hit extremely hard here. The tightest bottleneck determines the speed, so one resource getting hit really hard, much harder than the others, is bad.
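The bottleneck reasoning above can be sketched roughly as follows. This is a minimal illustration, not how llvm-mca computes anything: the resource names other than Zen2FPU0 and all pressure numbers are made up for the example, and the model assumes each resource can do at most one cycle of work per cycle.

```python
# Hypothetical per-iteration resource pressure: cycles each resource is kept
# busy by one loop iteration. Numbers are invented, not taken from the gist.
example_pressure = {
    "ALU0": 1.0,
    "ALU1": 1.0,
    "AGU0": 0.5,
    "Zen2FPU0": 4.0,  # the one resource hit much harder than the rest
    "FPU1": 1.0,
}


def bottleneck(pressure):
    """Return (resource, cycles) for the most heavily loaded resource."""
    name = max(pressure, key=pressure.get)
    return name, pressure[name]


def min_cycles_per_iteration(pressure):
    """Lower bound on cycles per iteration under this simple model.

    A resource cannot do more than one cycle of work per cycle, so the
    busiest resource bounds the loop from below, however idle the others are.
    """
    return max(pressure.values())


if __name__ == "__main__":
    name, cycles = bottleneck(example_pressure)
    print(f"bottleneck: {name}, at least {cycles} cycles per iteration")
```

Under these assumed numbers, spreading work off the busiest unit would help more than shaving instructions elsewhere, since the other resources are already mostly idle.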