libera/#sicl - IRC Chatlog
Search
14:29:01
beach
Here are some things to think about, perhaps mainly for hayley, but everyone's thoughts are welcome: What if we did not have per-thread nurseries. We could have a free list of dyads per thread instead to avoid locks on dyads as much as possible. How much more expensive is it to unlink a dyad from a free list than to bump a pointer?
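A minimal sketch of the two allocation paths being compared (the names, the `nursery` struct, and the 16-byte dyad layout are illustrative assumptions, not SICL's actual code):

```c
#include <stddef.h>

/* Illustrative only: a "dyad" here is a 16-byte cons-like pair of
   pointers (on a 64-bit machine); none of these names come from SICL. */
typedef struct dyad { struct dyad *car, *cdr; } dyad;

/* Bump allocation: a limit check and a pointer increment. */
typedef struct { char *free, *limit; } nursery;

static dyad *bump_alloc(nursery *n) {
    if (n->free + sizeof(dyad) > n->limit)
        return NULL;                 /* would trigger a collection */
    dyad *d = (dyad *)n->free;
    n->free += sizeof(dyad);
    return d;
}

/* Free-list allocation: unlink the head.  Also only a few
   instructions, but successive allocations land wherever earlier
   dyads were freed, not at consecutive addresses. */
static dyad *free_list_alloc(dyad **head) {
    dyad *d = *head;
    if (d == NULL)
        return NULL;                 /* refill from a shared pool */
    *head = d->cdr;
    return d;
}
```

In raw instruction count the two paths are close; the difference is mostly the extra dependent load (`d->cdr`) on the free-list path and the loss of allocation-order locality.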
14:30:17
moon-child
the expense does not come when allocating. It comes when freeing, and it comes as a whole-program locality loss
14:30:22
beach
That's better for the hardware cache too. Moving objects is not great. And, of course, moving objects takes instructions.
14:32:31
moon-child
sure, but it still has to be done. With semispace gc (say), you do not need to touch objects you free
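The point about never touching freed objects can be made concrete with a toy evacuation step (a deliberate simplification: a flat root set, no interior pointers, not any real collector's code):

```c
#include <stddef.h>

/* Toy semispace evacuation: only reachable objects are copied; dead
   objects are never visited, and the whole from-space is recycled
   wholesale afterwards, so "freeing" has no per-object cost. */
typedef struct obj { long value; struct obj *forward; } obj;

/* Copy `o` into to-space unless it was already copied, in which case
   its forwarding pointer gives the new location. */
static obj *evacuate(obj *o, obj *to_space, size_t *top) {
    if (o->forward != NULL)
        return o->forward;
    obj *new_obj = &to_space[(*top)++];
    new_obj->value = o->value;
    new_obj->forward = NULL;
    o->forward = new_obj;
    return new_obj;
}
```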
14:32:31
beach
Also, how expensive are malloc() and free() in Doug Lea's allocator? My guess is that they are very cheap.
14:35:27
beach
Typically, the cost of a read/write barrier is not taken into account, nor is the cost of moving objects, nor the effects on the hardware cache.
14:37:37
moon-child
I would be _shocked_ if a copying gc exhibited worse caching behaviour than one such as you propose
14:38:17
moon-child
regarding overhead: it's true that it's there, but empirically, GCs which elide such mechanisms, such as BDW and D's GC, perform worse than GCs such as those used by Java, which do not
14:38:26
beach
Oh? How can not moving objects be worse than moving objects, when it comes to cache performance?
14:40:49
beach
I wonder how much of this data was collected when memory accesses were as cheap as register operations.
14:41:26
moon-child
it is based on an assumption regarding program behaviour, which tends to be correct. Two assumptions. The first is made by the gc, which assumes that objects which were allocated close in time will be accessed close in time; so it retains allocation order. The second is made by the cpu, which assumes that when memory at one location is accessed, nearby memory will be accessed close in time too; so
14:41:51
moon-child
moreover a dyad is 16 bytes; a cache line is 64, so even without prefetching, you benefit from locality
14:50:22
beach
So, again, I think I know all the arguments for and against. I asked this question to try to start a more in-depth reflection, and perhaps to find ways to measure the effects of my suggestion.
14:52:02
beach
One way to figure things out, which I think we should do, is to have the first version of SICL use only the global collector. It will make things much easier to debug and we could use such a version to attempt to measure the performance impact of this choice.
14:55:57
moon-child
the effect of such a change is heavily dependent on the performance of the overall system. If generic functions are still slow and there is lots of redundant type-checking, the impact of fragmentation will be lessened. So I think such comparisons ought to be deferred until more core optimizations are performed and the quality of the generated code is reasonable
15:32:55
Bike
thought about it a bit more and i think compile/eval might make keeping source csts/forms in IR more essential. like, you can have (compile nil (make-lambda)), and then there may very well be no source text, but error messages still ought to be as helpful as possible under the circumstances
19:23:23
hayley
beach: In the case of thread local nurseries, I am not sure how you benefit from freeing on another thread, if the mutator thread is blocked. But you mention not having a write barrier, which implies not having a generational collector and thus no thread local nurseries.
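For reference, the write barrier being discussed is commonly implemented as a card-marking store, roughly like this (heap layout, card size, and names are invented for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a card-marking write barrier, the usual mechanism behind
   a generational collector: each pointer store into the heap marks
   the "card" covering the stored-into slot, and the collector scans
   only dirty cards for old-to-young pointers instead of the whole
   old generation.  Sizes here are arbitrary. */
enum { CARD_SHIFT = 9 };             /* 512-byte cards */
#define NCARDS 128
static uint8_t card_table[NCARDS];
static uintptr_t heap_base;

/* Called (conceptually, after every pointer store) with the address
   of the slot that was written. */
static void write_barrier(void *slot) {
    card_table[((uintptr_t)slot - heap_base) >> CARD_SHIFT] = 1;
}
```

The barrier itself is a shift, a subtract, and a byte store; the cost hayley alludes to is that it runs on every pointer store, whether or not a collection ever looks at the card.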
19:28:32
hayley
There are more recent results showing compaction improving locality, e.g. the many collectors compared in <https://dl.acm.org/doi/10.1145/1375581.1375586>.
19:31:51
hayley
Re-linking the dyads does not appear free to me, but it can probably be done in an efficient manner. Also, I recall the paper on the Compressor argued that it had some locality, as either background threads would handle fixing up a page, or a mutator thread which trapped fixes it up just before using it.
19:34:11
hayley
Amusingly I was thinking about how to implement a global backup compaction pass with little overhead when we don't compact, as no one seems to have investigated compaction for a persistent system like CLOSOS, which could run for years in one heap.
19:35:17
hayley
Rather, they haven't investigated fragmentation over years. And I worry that it will add up over time.
19:39:13
hayley
I had also thought the thread local nurseries were supposed to be about the size of L2 cache, which eliminates most cache misses.
19:41:13
hayley
Another investigation into compaction I recall is <https://shipilev.net/jvm/anatomy-quarks/11-moving-gc-locality/> but it is a very exaggerated test in my opinion.
19:43:17
hayley
There, objects in an array are rearranged to be in the same order as the array, which improves the locality and performance of walking the array later.
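A sketch of that rearrangement (hypothetical names; the actual experiment is run against HotSpot's copying collectors):

```c
#include <stddef.h>

typedef struct { long payload; } node;

/* If a copying GC evacuates objects in the order an array refers to
   them, the copies end up contiguous in to-space, so a later walk
   over the array becomes a sequential scan.  Illustrative only. */
static void evacuate_in_array_order(node **array, size_t n,
                                    node *to_space) {
    for (size_t i = 0; i < n; i++) {
        to_space[i] = *array[i];     /* copy the object */
        array[i] = &to_space[i];     /* fix up the reference */
    }
}
```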
20:00:01
hayley
I guess that is too exaggerated, as GC costs aren't considered at all, and the micro-benchmark only uses one array. But then, I'd suspect the positioning of cons cells in a list may also cause programs to be more dependent on locality than if they were written with vectors (or CDR coding, since I've already mentioned paging and MMU traps this morning, so why not?)
21:55:31
hayley
That said, there is this paper on a Standard ML implementation using a mark-sweep nursery, which achieves results in the same ballpark as copying: <https://www.pllab.riec.tohoku.ac.jp/papers/icfp2011UenoOhoriOtomoAuthorVersion.pdf>
22:02:25
hayley
However, they only test without multi-threading, and I suspect synchronisation costs are going to be important for a multi-threaded collector. But, as you say, you can avoid it by using allocation buffers of some sort. My current hunch is that compacting only when there are enough small holes is the best strategy.
22:05:39
hayley
With regards to "What if we did not have per-thread nurseries" though: a thread local nursery requires no synchronisation and can fit in cache, so it should be a good idea based on those properties. We'd need to know how well utilised thread local nurseries are, and how slow the write barrier is, if we want to make a comparison. And, again, I think it would be very useful preliminary work to collate multi-threaded benchmarks which could be used in
22:13:16
hayley
I'm also currently wondering if performing partial collections, as in the Mature Object Space (aka Train) collector, could help with locality while running the garbage collector. Such a collector requires a more complex write barrier, but e.g. Garbage-First on a HotSpot JVM uses something similar, and it was made the default collector in 2017.
22:29:11
zephyr
i have been getting into GCs a lot at work lately, and the economy of them is pretty interesting. it's a little conspicuous that there isn't "one true GC" and that their performance can vary so widely by workload
23:06:19
hayley
There are now generational versions of ZGC and Shenandoah which reduce the memory overhead in my experience (which consists of playing Minecraft).
23:17:34
hayley
You could also get a free copy of the Azul JVM, which has a generational collector with basically the same design (C4). But I can't legally disclose my thoughts on it.