libera/#sbcl - IRC Chatlog
Search
0:51:05
hayley
luis: Something to ponder. On my laptop with 8MB of cache, using a GC that never moves ends up doubling the cache miss rate and slows things down substantially. My desktop with 64MB of cache only has about 20% more misses, and is hardly slower at all. I don't really have any ideas on how to trigger defrag based on cache miss rate.
4:56:51
hayley
Also just got parallel GC running; this time it doesn't seem to want to bite my head off, but still isn't awfully fast.
6:04:30
flip214
hayley: that sounds like you already measured? if not, there's perf, ebpf, and eg https://github.com/iovisor/bcc/blob/master/tools/llcstat.py (typically in a ebpf package)
6:04:51
flip214
for triggering you'd need to use the cpu counters - I don't think they're available for userspace
6:05:38
hayley
perf can get them on my machine without needing root, but I can't exactly compare to the "ideal" miss rate either.
6:06:30
flip214
mine says "Access to performance monitoring and observability operations is limited."
6:06:53
flip214
ah, but there's /proc/sys/kernel/perf_event_paranoid ... which you can't force on your users, I guess
6:21:15
flip214
https://www.1024cores.net/home/lock-free-algorithms/stacks is what I read quite some time ago
6:22:50
hayley
A CL version isn't going to be useful, as I am very prone to ABA-ing myself. And somehow the 1024cores page refuses to show anything here.
6:22:50
flip214
if you don't need an unbounded stack, perhaps you might get away with atomic-incf and -decf and a vector?
6:32:16
hayley
Sure. I'd need CMPXCHG16B or equivalent, but the Steam hardware survey says 99.99% of machines surveyed have it.
6:33:13
hayley
Hm, we're only up to 52 bits of address space for x86-64, right? Could put a reuse counter in the most significant bits of the pointer.
6:35:27
hayley
It might be prone to an ABA problem; which is why I can't naively implement a lock-free stack either.
6:37:02
Shinmera
Anyway, the original paper even uses C++-style pseudo code. https://timharris.uk/papers/2001-disc.pdf
6:39:43
hayley
It's reclaiming memory i.e. reusing pointers that foils things if you're unlucky. The paper doesn't appear to mention it; maybe it's immune to ABA, I'm not sure.
6:40:53
Shinmera
ABA is the most basic issue when implementing a linked list, it'd be pretty bad if it wasn't immune.
6:45:14
hayley
Larry suggested that I could dodge the problem for a while by using per-thread free lists, reducing the load on a (locked) free list (for the mark queue), rather than using a lock-free free list. Might just go with that for now.
6:58:40
flip214
https://lwn.net/Articles/590243/ talks about thread-"local" locks in the linux-kernel (to reduce contention), in case that helps
7:37:33
hayley
Next lock to crack would be free_pages_lock, which is what it is. Likely that I end up needing to grab more pages, as pages aren't always all free. Could increase the page size, or I had an idea for allowing the slow path for claiming single pages to work concurrently, with a readers-writer lock. With adequate atomics, the more common single-page requests can be handled as "readers", and we only need to serialise properly for multiple-page requests.