libera/#sbcl - IRC Chatlog
14:56:26
mfiano
hayley: Have you tried running the machine code through llvm-mca? Or uiCA, which should be more accurate.
14:58:55
mfiano
I found it very useful for debugging why things are slow. In my recent use, I could see that the SIMD version actually uses more instructions, and executes at fewer instructions per cycle than the scalar version.
14:59:24
mfiano
assembly I got targeting skylake-avx512, but analyzed under the zen2 cost model: https://gist.github.com/mfiano/3377a7e7804f279eaa9478f88062e858
14:59:48
mfiano
Basically, resource [8], the Zen2FPU0 (first Zen2 floating point unit) is hit extremely hard here. The tightest bottleneck determines the speed, so one resource getting hit really hard, much harder than the others, is bad.
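A minimal sketch of the bottleneck reasoning above, assuming the per-iteration pressure on each resource has been copied out of the analyzer's resource pressure table by hand. The function name `bottleneck_cycles` and its parameters are made up for illustration and are not part of any tool.

```c
#include <stddef.h>

/* Given the per-iteration pressure (in cycles) on each execution resource,
 * return the largest value and store its index in *worst.
 * The busiest resource sets a lower bound on cycles per iteration,
 * no matter how idle the other resources are. */
double bottleneck_cycles(const double *pressure, size_t n, size_t *worst) {
    double max = 0.0;
    *worst = 0;
    for (size_t i = 0; i < n; i++) {
        if (pressure[i] > max) {
            max = pressure[i];
            *worst = i;
        }
    }
    return max;
}
```

If one entry is several times larger than the rest, as with resource [8] here, moving work off that unit is the only change that can lower the bound.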
22:51:45
hayley
I'm fairly sure SIMD would be effective (for more or less finding a particular byte in a haystack). It's more that, if I have to use it, am I doing something wrong in the design?
22:54:09
hayley
Arguably it's quite silly to be walking a vector 1/128 the size of the heap all the time, and needing SIMD acceleration is going to make the performance fix unportable between architectures. (Could still use SWAR, like there is for card marks.)
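For reference, a portable SWAR byte search can be sketched in plain C as below. This is the well-known "has zero byte" bit trick applied to 8 bytes at a time, not the code SBCL uses for card marks; the names `swar_has_byte` and `find_byte_swar` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff at least one of the 8 bytes in w equals b. */
static uint64_t swar_has_byte(uint64_t w, uint8_t b) {
    const uint64_t ones = 0x0101010101010101ULL;
    const uint64_t high = 0x8080808080808080ULL;
    uint64_t x = w ^ (ones * b);      /* bytes equal to b become 0x00 */
    return (x - ones) & ~x & high;    /* nonzero iff x has a zero byte */
}

/* Index of the first byte equal to b in p[0..n), or n if there is none. */
size_t find_byte_swar(const uint8_t *p, size_t n, uint8_t b) {
    size_t i = 0;
    /* Skip whole 8-byte words that cannot contain b. */
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, 8);         /* alignment-safe load */
        if (swar_has_byte(w, b))
            break;
    }
    /* Scan the matching word (or the tail) byte by byte. */
    for (; i < n; i++) {
        if (p[i] == b)
            return i;
    }
    return n;
}
```

The word test is only used as a yes/no filter, and the exact position is found with a byte loop, so the result does not depend on the host's byte order.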
23:14:26
mfiano
I was mostly just pointing out that you could see which CPU resources are getting hit right now with the scalar version.
23:14:50
mfiano
It might make assessing the situation easier and get an overview of how well it can be vectorized
23:50:10
hayley
Now that you mention it, but not quite. I have to check for two bytes, rather: the same gen and marked case, and the same gen and unmarked case. And then I do a sort of blending operation to sweep allocation bits, which would be nice with SIMD (though... I don't see why a compiler can't vectorise that, if I make the loop more obvious).
23:51:18
hayley
Perhaps better to separate my one big loop into a few to clue the compiler in. And one pass could just use memchr, sure.
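A rough sketch of what the split might look like, under assumptions not stated in the chat: one tag byte per entry in `tags`, and allocation and mark bits kept in parallel byte arrays `alloc` and `mark`. The names `first_with_tag` and `blend_alloc` and the exact blend rule are hypothetical; only `memchr` is a real library call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pass 1: find the first entry whose tag byte equals `tag`,
 * or return n if there is none. memchr is typically already
 * tuned per platform by the C library. */
size_t first_with_tag(const uint8_t *tags, size_t n, uint8_t tag) {
    const uint8_t *hit = memchr(tags, tag, n);
    return hit ? (size_t)(hit - tags) : n;
}

/* Pass 2: for entries whose tag equals `tag`, replace the allocation
 * byte with the mark byte; leave all other entries untouched.
 * The loop body is branch-free, which gives the compiler a
 * straightforward shape to auto-vectorise. */
void blend_alloc(uint8_t *alloc, const uint8_t *mark,
                 const uint8_t *tags, uint8_t tag, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint8_t m = (uint8_t)-(tags[i] == tag);   /* 0xFF or 0x00 */
        alloc[i] = (uint8_t)((alloc[i] & (uint8_t)~m) | (mark[i] & m));
    }
}
```

Whether the second loop actually gets vectorised depends on the compiler and flags, so checking the generated assembly is still worthwhile.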