libera/#sicl - IRC Chatlog
19:23:23
hayley
beach: In the case of thread local nurseries, I am not sure how you benefit from freeing on another thread, if the mutator thread is blocked. But you mention not having a write barrier, which implies not having a generational collector and thus no thread local nurseries.
19:28:32
hayley
There are more recent results showing compaction improving locality, e.g. the many collectors compared in <https://dl.acm.org/doi/10.1145/1375581.1375586>.
19:31:51
hayley
Re-linking the dyads does not appear free to me, but it can probably be done in an efficient manner. Also, I recall the paper on the Compressor argued that it had some locality, as either background threads would handle fixing up a page, or a mutator thread that trapped would fix it up just before using it.
19:34:11
hayley
Amusingly I was thinking about how to implement a global backup compaction pass with little overhead when we don't compact, as no one seems to have investigated compaction for a persistent system like CLOSOS, which could run for years in one heap.
19:35:17
hayley
Rather, they haven't investigated fragmentation over years. And I worry that it will add up over time.
19:39:13
hayley
I had also thought the thread local nurseries were supposed to be about the size of L2 cache, which eliminates most cache misses.
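A minimal sketch (invented names, not SICL code) of why a thread-local nursery is cheap: allocation is a single unsynchronised pointer bump, and sizing the nursery near the L2 cache keeps freshly allocated objects cache-hot. The assumed cache size is illustrative; real code would query the CPU.

```python
L2_CACHE_BYTES = 256 * 1024  # assumed L2 size, purely illustrative

class Nursery:
    def __init__(self, capacity=L2_CACHE_BYTES):
        self.capacity = capacity
        self.top = 0  # bump pointer; no lock needed, the nursery is per-thread

    def allocate(self, nbytes):
        """Return the offset of a fresh object, or None if the nursery is
        exhausted and a minor collection is needed."""
        if self.top + nbytes > self.capacity:
            return None
        addr = self.top
        self.top += nbytes
        return addr

n = Nursery(64)
assert n.allocate(48) == 0
assert n.allocate(16) == 48
assert n.allocate(8) is None  # full: time for a minor collection
```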
19:41:13
hayley
Another investigation into compaction I recall is <https://shipilev.net/jvm/anatomy-quarks/11-moving-gc-locality/> but it is a very exaggerated test in my opinion.
19:43:17
hayley
There, objects in an array are rearranged to be in the same order as the array, which improves the locality and performance of walking the array later.
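The effect described there can be sketched as follows (invented names, not the JVM's actual evacuation code): a copying pass evacuates objects in the order an array references them, so that a later linear walk of the array touches the new heap sequentially.

```python
def compact_in_array_order(heap, array):
    """Copy the objects referenced by `array` into a fresh heap, in array
    order, and rewrite the array's references to the new locations."""
    new_heap, forwarding = [], {}
    for ref in array:
        if ref not in forwarding:        # copy each object exactly once
            forwarding[ref] = len(new_heap)
            new_heap.append(heap[ref])
    # rewrite references to point at the relocated objects
    new_array = [forwarding[r] for r in array]
    return new_heap, new_array

heap = ["a", "b", "c", "d"]
array = [3, 1, 3, 0]                      # references scattered over the heap
new_heap, new_array = compact_in_array_order(heap, array)
assert new_array == [0, 1, 0, 2]          # now mostly in ascending order
assert [new_heap[i] for i in new_array] == ["d", "b", "d", "a"]
```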
20:00:01
hayley
I guess that is too exaggerated, as GC costs aren't considered at all, and the micro-benchmark only uses one array. But then, I'd suspect the positioning of cons cells in a list also may cause programs to be more dependent on locality than if they were written with vectors (or CDR coding, since I've already mentioned paging and MMU traps this morning, so why not?)
21:55:31
hayley
That said, there is this paper on a Standard ML implementation using a mark-sweep nursery, which achieves results in the same ballpark as copying: <https://www.pllab.riec.tohoku.ac.jp/papers/icfp2011UenoOhoriOtomoAuthorVersion.pdf>
22:02:25
hayley
However, they only test without multi-threading, and I suspect synchronisation costs are going to be important for a multi-threaded collector. But, as you say, you can avoid it by using allocation buffers of some sort. My current hunch is that compacting only when there are enough small holes is the best strategy.
22:05:39
hayley
With regards to "What if we did not have per-thread nurseries" though: a thread local nursery requires no synchronisation and can fit in cache, so it should be a good idea based on those properties. We'd need to know how well utilised thread local nurseries are, and how slow the write barrier is, if we want to make a comparison. And, again, I think it would be very useful preliminary work to collate multi-threaded benchmarks which could be used in such a comparison.
22:13:16
hayley
I'm also currently wondering if performing partial collections, as in the Mature Object Space (aka Train) collector, could help with locality while running the garbage collector. Such a collector requires a more complex write barrier, but e.g. Garbage-First on a HotSpot JVM uses something similar, and it was made the default collector in 2017.
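A very rough sketch of the idea behind the write barrier such a regional collector needs (invented region size and data structures; real G1 uses card tables and barrier queues rather than Python sets): on every pointer store that crosses a region boundary, record the source in the target region's remembered set, so that one region can later be collected without scanning the whole heap.

```python
REGION_SIZE = 1024  # invented; G1 regions are megabytes

def region_of(addr):
    return addr // REGION_SIZE

class RememberedSets:
    def __init__(self):
        self.sets = {}  # region -> set of source addresses pointing into it

    def write_barrier(self, src_addr, dst_addr):
        """Run on every pointer store: src_addr now points to dst_addr."""
        src_r, dst_r = region_of(src_addr), region_of(dst_addr)
        if src_r != dst_r:  # only cross-region stores need recording
            self.sets.setdefault(dst_r, set()).add(src_addr)

rs = RememberedSets()
rs.write_barrier(100, 200)    # same region: the barrier records nothing
rs.write_barrier(100, 5000)   # region 0 -> region 4: recorded
assert rs.sets == {4: {100}}
```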
22:29:11
zephyr
i have been getting into GCs a lot at work lately, and the economy of them is pretty interesting. it's a little conspicuous there isn't "one true GC" and that their performance can vary so widely by workload
23:06:19
hayley
There are now generational versions of ZGC and Shenandoah which reduce the memory overhead in my experience (which consists of playing Minecraft).
23:17:34
hayley
You could also get a free copy of the Azul JVM, which has a generational collector with basically the same design (C4). But I can't legally disclose my thoughts on it.
2:35:55
moon-child
you could probably reduce its incidence rate by divvying up your heap into regions. While allocating from region n, link up free cells from regions n+1 and n+2 (say)
2:37:00
moon-child
and that would make some sense if you're sharding your freelists anyway (cf mimalloc)
2:37:33
moon-child
however it would also come with some space overhead, and it wouldn't get rid of the problem entirely; the collector can still slow down the mutator even when the latter is trying to be clever
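The region scheme above can be sketched like this (names invented for illustration): while the mutator allocates from region n, the collector links up free cells in later regions, staying ahead of the allocator; the sharded freelists are what would make this mesh with a mimalloc-style design.

```python
REGIONS = 4

class RegionHeap:
    def __init__(self):
        self.freelists = [[] for _ in range(REGIONS)]  # one freelist per region
        self.current = 0

    def sweep(self, region, free_cells):
        """Collector: link freed cells into that region's freelist."""
        self.freelists[region].extend(free_cells)

    def allocate(self):
        """Mutator: pop from the current region; advance when it runs dry."""
        for _ in range(REGIONS):
            if self.freelists[self.current]:
                return self.current, self.freelists[self.current].pop()
            self.current = (self.current + 1) % REGIONS
        raise MemoryError("mutator got ahead of the collector")

h = RegionHeap()
h.sweep(1, [0x100, 0x108])   # collector pre-links regions 1 and 2
h.sweep(2, [0x200])
assert h.allocate() == (1, 0x108)  # region 0 is empty, falls through to 1
```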
4:01:22
beach
hayley: I don't know what "benefit from freeing on another thread, if the mutator thread is blocked" means.
4:02:44
beach
I was just wondering what "generational" means in a collector that does not move things.
4:03:03
moon-child
presumably what happens if the mutator gets ahead of the collector, so it's just waiting for the other thread to finish re-linking the dyads
4:04:48
beach
Oh, I see. There would be a free list per thread, and when the free list runs out, either moving from one free list to another or asking for more space from the system.
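A minimal sketch of that scheme (invented names): each thread owns a free list and, when it runs dry, refills in bulk from a shared pool, so the lock is taken once per refill rather than once per cell; an empty pool is where one would ask the system for more space.

```python
import threading

class SharedPool:
    def __init__(self, cells):
        self.cells = list(cells)
        self.lock = threading.Lock()

    def refill(self, count):
        with self.lock:  # the only synchronised path
            grabbed, self.cells = self.cells[:count], self.cells[count:]
            return grabbed

class ThreadAllocator:
    REFILL = 4  # cells grabbed per refill, chosen arbitrarily

    def __init__(self, pool):
        self.pool = pool
        self.free = []  # thread-local free list: no lock on the fast path

    def allocate(self):
        if not self.free:
            self.free = self.pool.refill(self.REFILL)
            if not self.free:
                raise MemoryError("pool empty: ask the system for more space")
        return self.free.pop()

pool = SharedPool(range(10))
a = ThreadAllocator(pool)
assert a.allocate() == 3  # popped from the first bulk refill [0, 1, 2, 3]
```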
4:05:51
beach
I am a bit skeptical about the "compaction improves locality" thing, because Paul Wilson's results show that, as he puts it, "Objects that are allocated together, die together".
4:07:08
beach
Relinking dyads is not done by the mutator thread, so it should be attributed to the threads and cores used by the garbage collector.
4:08:26
beach
Compaction is possible, but more difficult, so it may have to be done offline. But that might be acceptable for CLOSOS.
4:08:58
moon-child
'wondering what "generational" means in a collector that does not move things' I think hayley's proposed mechanism of an inline cache predicting allocation lifetime could be used to implement generations
4:10:03
beach
hayley: Yes, the idea that a nursery is the size of a cache will be lost without nurseries.
4:12:01
beach
moon-child: I just don't understand how to determine which generation a dead object used to belong to, and then not collect it if it is old when a younger generation is collected. Nor do I understand why it would be useful to wait.
4:13:59
beach
hayley: Aside from locality concerns, I think it has been shown that mark-and-sweep collectors and copying collectors perform roughly equally well.
4:17:56
beach
zephyr: Exactly. What I am trying to do here is to avoid the trap of believing one particular result for one particular language with one particular workload that does not have the characteristics of SICL with its separation of dyads and racks.
4:19:42
beach
moon-child: Yes, dividing up the heap may make sense, especially if the division is flexible.
4:42:00
beach
So I think I would like to use a strategy like "always measure one level deeper" here. Instead of comparing one type of garbage collector to another, try to measure the share of overall performance that the garbage collector accounts for, and what aspect of the garbage collector is responsible for that performance.
4:43:00
beach
I mean, I want to avoid the maintenance of a more complicated collector that is perhaps faster than a simpler one, but where the performance of the collector matters very little to overall performance.
4:44:59
beach
I know this thinking is contrary to common wisdom. But I fear that a lot of work is being done (in many Common Lisp implementations) that matters little to overall performance, and that increases the maintenance burden.
4:45:50
beach
The reason I think that is that "improvements" seem to be decided upon based on information other than measurements.
4:46:36
beach
And, perhaps contrary to the strategy of other implementations, I am willing to trade some performance for better maintainability.
4:47:54
sm2n
On that note, you may find this amusing: <https://zeux.io/2022/01/08/on-proebstings-law/>
4:49:04
sm2n
"LLVM 11 tends to take 2x longer to compile code with optimizations, and as a result produces code that runs 10-20% faster (with occasional outliers in either direction), compared to LLVM 2.7 which is more than 10 years old. This may be a general rule, something specific to highly tuned code, or something specific to meshoptimizer algorithms."
4:49:53
sm2n
of course, this is not a proper study, but it is still funny especially considering how bad the story is for projects building on top of llvm wrt api stability and such
4:50:04
beach
sm2n: Thanks for the link. Yes, I am more interested in things like debugging experience than in performance at all cost.
4:51:18
beach
Yes, I am very pleased that I didn't take the advice from drmeister to use LLVM as a backend for SICL.
5:32:30
beach
moon-child: Locality of reference matters as you pointed out, but only for the first reference to some datum, right? Once it is in the cache, locality no longer matters. Or am I missing something?