libera/#sicl - IRC Chatlog
Search
14:29:01
beach
Here are some things to think about, perhaps mainly for hayley, but everyone's thoughts are welcome: What if we did not have per-thread nurseries. We could have a free list of dyads per thread instead to avoid locks on dyads as much as possible. How much more expensive is it to unlink a dyad from a free list than to bump a pointer?
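A minimal sketch of the two allocation paths being compared (the names, the `nursery` struct, and the 16-byte dyad layout are illustrative assumptions, not SICL's actual code):

```c
#include <stddef.h>

/* Illustrative only: a "dyad" here is a 16-byte cons-like pair of
   pointers (on a 64-bit machine); none of these names come from SICL. */
typedef struct dyad { struct dyad *car, *cdr; } dyad;

/* Bump allocation: a limit check and a pointer increment. */
typedef struct { char *free, *limit; } nursery;

static dyad *bump_alloc(nursery *n) {
    if (n->free + sizeof(dyad) > n->limit)
        return NULL;                 /* would trigger a collection */
    dyad *d = (dyad *)n->free;
    n->free += sizeof(dyad);
    return d;
}

/* Free-list allocation: unlink the head.  Also only a few
   instructions, but successive allocations land wherever earlier
   dyads were freed, not at consecutive addresses. */
static dyad *free_list_alloc(dyad **head) {
    dyad *d = *head;
    if (d == NULL)
        return NULL;                 /* refill from a shared pool */
    *head = d->cdr;
    return d;
}
```

In raw instruction count the two paths are close; the difference is mostly the extra dependent load (`d->cdr`) on the free-list path and the loss of allocation-order locality.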
14:30:17
moon-child
the expense does not come when allocating. It comes when freeing, and it comes as a whole-program locality loss
14:30:22
beach
That's better for the hardware cache too. Moving objects is not great. And, of course, moving objects takes instructions.
14:32:31
moon-child
sure, but it still has to be done. With semispace gc (say), you do not need to touch objects you free
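The point about never touching freed objects can be made concrete with a toy evacuation step (a deliberate simplification: a flat root set, no interior pointers, not any real collector's code):

```c
#include <stddef.h>

/* Toy semispace evacuation: only reachable objects are copied; dead
   objects are never visited, and the whole from-space is recycled
   wholesale afterwards, so "freeing" has no per-object cost. */
typedef struct obj { long value; struct obj *forward; } obj;

/* Copy `o` into to-space unless it was already copied, in which case
   its forwarding pointer gives the new location. */
static obj *evacuate(obj *o, obj *to_space, size_t *top) {
    if (o->forward != NULL)
        return o->forward;
    obj *new_obj = &to_space[(*top)++];
    new_obj->value = o->value;
    new_obj->forward = NULL;
    o->forward = new_obj;
    return new_obj;
}
```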
14:32:31
beach
Also, how expensive are malloc() and free() in Doug Lea's allocator? My guess is that they are very cheap.
14:35:27
beach
Typically, the cost of a read/write barrier is not taken into account, nor is the cost of moving objects, nor the effects on the hardware cache.
14:37:37
moon-child
I would be _shocked_ if a copying gc exhibited worse caching behaviour than one such as you propose
14:38:17
moon-child
regarding overhead: it's true that it's there, but empirically, GCs which elide such mechanisms, such as BDW and D's GC, perform worse than GCs such as those used by Java, which do not
14:38:26
beach
Oh? How can not moving objects be worse than moving objects, when it comes to cache performance?
14:40:49
beach
I wonder how much of this data was collected when memory accesses were as cheap as register operations.
14:41:26
moon-child
it is based on an assumption regarding program behaviour, which tends to be correct. Two assumptions. The first is made by the gc, which assumes that objects which were allocated close in time will be accessed close in time; so it retains allocation order. The second is made by the cpu, which assumes that when memory at one location is accessed, nearby memory will be accessed close in time too; so
14:41:51
moon-child
moreover a dyad is 16 bytes; a cache line is 64, so even without prefetching, you benefit from locality
14:50:22
beach
So, again, I think I know all the arguments for and against. I asked this question to try to start a more in-depth reflection, and perhaps to find ways to measure the effects of my suggestion.
14:52:02
beach
One way to figure things out, which I think we should do, is to have the first version of SICL use only the global collector. It will make things much easier to debug and we could use such a version to attempt to measure the performance impact of this choice.
14:55:57
moon-child
the effect of such a change is heavily dependent on the performance of the overall system. If generic functions are still slow and there is lots of redundant type-checking, the impact of fragmentation will be lessened. So I think such comparisons ought to be deferred until more core optimizations are performed and the quality of the generated code is reasonable
15:32:55
Bike
thought about it a bit more and i think compile/eval might make keeping source csts/forms in IR more essential. like, you can have (compile nil (make-lambda)), and then there may very well be no source text, but error messages still ought to be as helpful as possible under the circumstances
19:23:23
hayley
beach: In the case of thread local nurseries, I am not sure how you benefit from freeing on another thread, if the mutator thread is blocked. But you mention not having a write barrier, which implies not having a generational collector and thus no thread local nurseries.
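For reference, the write barrier being discussed is commonly implemented as a card-marking store, roughly like this (heap layout, card size, and names are invented for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a card-marking write barrier, the usual mechanism behind
   a generational collector: each pointer store into the heap marks
   the "card" covering the stored-into slot, and the collector scans
   only dirty cards for old-to-young pointers instead of the whole
   old generation.  Sizes here are arbitrary. */
enum { CARD_SHIFT = 9 };             /* 512-byte cards */
#define NCARDS 128
static uint8_t card_table[NCARDS];
static uintptr_t heap_base;

/* Called (conceptually, after every pointer store) with the address
   of the slot that was written. */
static void write_barrier(void *slot) {
    card_table[((uintptr_t)slot - heap_base) >> CARD_SHIFT] = 1;
}
```

The barrier itself is a shift, a subtract, and a byte store; the cost hayley alludes to is that it runs on every pointer store, whether or not a collection ever looks at the card.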
19:28:32
hayley
There are more recent results showing compaction improving locality, e.g. the many collectors compared in <https://dl.acm.org/doi/10.1145/1375581.1375586>.
19:31:51
hayley
Re-linking the dyads does not appear free to me, but it can probably be done in an efficient manner. Also, I recall the paper on the Compressor argued that it had some locality, as either background threads would handle fixing up a page, or a mutator thread which trapped fixes it up just before using it.
19:34:11
hayley
Amusingly I was thinking about how to implement a global backup compaction pass with little overhead when we don't compact, as no one seems to have investigated compaction for a persistent system like CLOSOS, which could run for years in one heap.
19:35:17
hayley
Rather, they haven't investigated fragmentation over years. And I worry that it will add up over time.
19:39:13
hayley
I had also thought the thread local nurseries were supposed to be about the size of L2 cache, which eliminates most cache misses.
19:41:13
hayley
Another investigation into compaction I recall is <https://shipilev.net/jvm/anatomy-quarks/11-moving-gc-locality/> but it is a very exaggerated test in my opinion.
19:43:17
hayley
There, objects in an array are rearranged to be in the same order as the array, which improves the locality and performance of walking the array later.
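A sketch of that rearrangement (hypothetical names; the actual experiment is run against HotSpot's copying collectors):

```c
#include <stddef.h>

typedef struct { long payload; } node;

/* If a copying GC evacuates objects in the order an array refers to
   them, the copies end up contiguous in to-space, so a later walk
   over the array becomes a sequential scan.  Illustrative only. */
static void evacuate_in_array_order(node **array, size_t n,
                                    node *to_space) {
    for (size_t i = 0; i < n; i++) {
        to_space[i] = *array[i];     /* copy the object */
        array[i] = &to_space[i];     /* fix up the reference */
    }
}
```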
20:00:01
hayley
I guess that is too exaggerated, as GC costs aren't considered at all, and the micro-benchmark only uses one array. But then, I'd suspect the positioning of cons cells in a list may also cause programs to be more dependent on locality than if they were written with vectors (or CDR coding, since I've already mentioned paging and MMU traps this morning, so why not?)
21:55:31
hayley
That said, there is this paper on a Standard ML implementation using a mark-sweep nursery, which achieves results in the same ballpark as copying: <https://www.pllab.riec.tohoku.ac.jp/papers/icfp2011UenoOhoriOtomoAuthorVersion.pdf>
22:02:25
hayley
However, they only test without multi-threading, and I suspect synchronisation costs are going to be important for a multi-threaded collector. But, as you say, you can avoid it by using allocation buffers of some sort. My current hunch is that compacting only when there are enough small holes is the best strategy.
22:05:39
hayley
With regards to "What if we did not have per-thread nurseries" though: a thread local nursery requires no synchronisation and can fit in cache, so it should be a good idea based on those properties. We'd need to know how well utilised thread local nurseries are, and how slow the write barrier is, if we want to make a comparison. And, again, I think it would be very useful preliminary work to collate multi-threaded benchmarks which could be used in
22:13:16
hayley
I'm also currently wondering if performing partial collections, as in the Mature Object Space (aka Train) collector, could help with locality while running the garbage collector. Such a collector requires a more complex write barrier, but e.g. Garbage-First on a HotSpot JVM uses something similar, and it was made the default collector in 2017.
22:29:11
zephyr
i have been getting into GCs a lot at work lately, and the economy of them is pretty interesting. it's a little conspicuous that there isn't "one true GC" and that their performance can vary so widely by workload
23:06:19
hayley
There are now generational versions of ZGC and Shenandoah which reduce the memory overhead in my experience (which consists of playing Minecraft).
23:17:34
hayley
You could also get a free copy of the Azul JVM, which has a generational collector with basically the same design (C4). But I can't legally disclose my thoughts on it.