libera/#sbcl - IRC Chatlog

0:51:05 hayley luis: Something to ponder. On my laptop with 8MB of cache, using a GC that never moves ends up doubling the cache miss rate and slows things down substantially. My desktop with 64MB of cache only has about 20% more misses, and is hardly slower at all. I don't really have any ideas on how to trigger defrag based on cache miss rate.

4:56:51 hayley Also just got parallel GC running; this time it doesn't seem to want to bite my head off, but still isn't awfully fast.

6:04:30 flip214 hayley: that sounds like you already measured? if not, there's perf, ebpf, and eg https://github.com/iovisor/bcc/blob/master/tools/llcstat.py (typically in a ebpf package)

6:04:44 hayley I did, yes.

6:04:51 flip214 for triggering you'd need to use the cpu counters - I don't think they're available for userspace

6:05:03 flip214 at least, non-uid 0

6:05:15 flip214 you could try timing via TSC or so...

6:05:38 hayley perf can get them on my machine without needing root, but I can't exactly compare to the "ideal" miss rate either.

6:06:18 flip214 is your perf setuid?

6:06:30 flip214 mine says "Access to performance monitoring and observability operations is limited."

6:06:51 hayley Doesn't look like it.

6:06:53 flip214 ah, but there's /proc/sys/kernel/perf_event_paranoid ... which you can't force on your users, I guess

6:07:04 hayley One thing at a time, still. Trying to work out parallel GC again.

6:13:02 hayley Any recommendations if I do end up needing a lock-free stack?

6:21:15 flip214 https://www.1024cores.net/home/lock-free-algorithms/stacks is what I read quite some time ago

6:21:53 flip214 and google gives me https://gist.github.com/basp1/bebf1ecf43ddcd2e9b9a583d98833d08

6:22:50 hayley A CL version isn't going to be useful, as I am very prone to ABA-ing myself. And somehow the 1024cores page refuses to show anything here.

6:22:50 flip214 if you don't need an unbounded stack, perhaps you might get away with atomic-incf and -decf and a vector?

6:23:36 hayley I don't think I can provide reasonable bounds.

6:27:33 flip214 https://github.com/skeeto/lstack perhaps?

6:32:16 hayley Sure. I'd need CMPXCHG16B or equivalent, but the Steam hardware survey says 99.99% of machines surveyed have it.

6:33:13 hayley Hm, we're only up to 52 bits of address space for x86-64, right? Could put a reuse counter in the most significant bits of the pointer.

6:33:21 Shinmera hayley: luckless' caslist can be used as a stack

6:33:51 hayley Shinmera: Indeed, but I need a lock-free list in C specifically.

6:34:04 Shinmera translating it to C should be trivial

6:35:27 hayley It might be prone to an ABA problem; which is why I can't naively implement a lock-free stack either.

6:35:40 Shinmera Why would it be?

6:37:02 Shinmera Anyway, the original paper even uses C++-style pseudo code. https://timharris.uk/papers/2001-disc.pdf

6:39:43 hayley It's reclaiming memory i.e. reusing pointers that foils things if you're unlucky. The paper doesn't appear to mention it; maybe it's immune to ABA, I'm not sure.

6:40:53 Shinmera ABA is the most basic issue when implementing a linked list, it'd be pretty bad if it wasn't immune.

6:45:14 hayley Larry suggested that I could dodge the problem for a while by using per-thread free lists, reducing the load on a (locked) free list (for the mark queue), rather than using a lock-free free list. Might just go with that for now.

6:58:40 flip214 https://lwn.net/Articles/590243/ talks about thread-"local" locks in the linux-kernel (to reduce contention), in case that helps

7:37:33 hayley Next lock to crack would be free_pages_lock, which is what it is. Likely that I end up needing to grab more pages, as pages aren't always all free. Could increase the page size, or I had an idea for allowing the slow path for claiming single pages to work concurrently, with a readers-writer lock. With adequate atomics, the more common single-page requests can be handled as "readers", and we only need to serialise properly for multiple-page requests.