libera/#sicl - IRC Chatlog

21:55:31 hayley That said, there is this paper on a Standard ML implementation using a mark-sweep nursery, which achieves results in the same ballpark as copying: <https://www.pllab.riec.tohoku.ac.jp/papers/icfp2011UenoOhoriOtomoAuthorVersion.pdf>

22:02:25 hayley However, they only test without multi-threading, and I suspect synchronisation costs are going to be important for a multi-threaded collector. But, as you say, you can avoid it by using allocation buffers of some sort. My current hunch is that compacting only when there are enough small holes is the best strategy.

22:05:39 hayley With regards to "What if we did not have per-thread nurseries" though: a thread local nursery requires no synchronisation and can fit in cache, so it should be a good idea based on those properties. We'd need to know how well utilised thread local nurseries are, and how slow the write barrier is, if we want to make a comparison. And, again, I think it would be very useful preliminary work to collate multi-threaded benchmarks which could be used in

22:05:39 hayley such tests.

22:13:16 hayley I'm also currently wondering if performing partial collections, like in the Mature Old Space (aka Train) could help with locality while running the garbage collector. Such a collector requires more a complex write barrier, but e.g. Garbage-First on a HotSpot JVM uses something similar, and was made the default collector in 2017.

22:29:11 zephyr i have been getting into GCs a lot at work lately, and the economy of them is pretty interesting. it's a little conspicuous there isn't "one true GC" and that their performance can vary so widely by workload

22:29:33 zephyr ZGC is the one i've been monitoring and using the most lately. weird thing.

23:06:19 hayley There are now generational versions of ZGC and Shenandoah which reduce the memory overhead in my experience (which consists of playing Minecraft).

23:06:43 hayley Well, both are in development and you need to build your own JVM, but they exist.

23:13:13 zephyr oh, interesting

23:17:34 hayley You could also get a free copy of the Azul JVM, which has a generational collector with basically the same design (C4). But I can't legally disclose my thoughts on it.

23:18:17 hayley (It's free as in it costs zero dollars, and proprietary still.)

0:22:53 moon-child oh there's generational zgc now!

2:34:02 moon-child hm, regarding off-thread freeing

2:34:41 moon-child it occurs to me that that would suffer from false sharing

2:35:55 moon-child you could probably reduce its incidence rate by divvying up your heap into regions. While allocating from region n, link up free cells from regions n+1 and n+2 (say)

2:37:00 moon-child and that would make some sense if you're sharding your freelists anyway (cf mimalloc)

2:37:33 moon-child however it would also come with some space overhead, and it wouldn't get rid of the problem entirely; the collector can still slow down the mutator even when the latter is trying to be clever

4:00:16 beach Good morning everyone!

4:01:22 beach hayley: I don't know what "benefit from freeing on another thread, if the mutator thread is blocked" means.

4:02:44 beach I was just wondering what "generational" means in a collector that does not move things.

4:03:03 moon-child presumably what happens if the mutator gets ahead of the collector, so it's just waiting for the other thread to finish re-linking the dyads

4:04:48 beach Oh, I see. There would be a free list per thread, and when the free list runs out, either moving from one free list to another or asking for more space from the system.

4:05:51 beach I am a bit skeptical about the "compaction improves locality" thing, because Paul Wilson's results show that, as he puts it, "Objects that are allocated together, die together".

4:07:08 beach Relinking dyads is not done by the mutator thread, so it should be attributed to the threads and cores used by the garbage collector.

4:08:26 beach Compaction is possible, but more difficult, so it may have to be done offline. But that might be acceptable for CLOSOS.

4:08:58 moon-child 'wondering what "generational" means in a collector that does not move things' I think hayley's proposed mechanism of an inline cache predicting allocation lifetime could be used to implement generations

4:10:03 beach hayley: Yes, the idea that a nursery is the size of a cache will be lost without nurseries.

4:12:01 beach moon-child: I just don't understand how to determine to which generation a dead object used to belong to and so then not collect it if it is old when a younger generation is collected? Nor do I understand why it would be useful to wait.

4:13:59 beach hayley: Aside from locality concerns, I think it has been shown that mark-and-sweep collectors and copying collectors perform roughly equally well.

4:14:04 moon-child I think you could still improve locality even if you only ever did full-heap gcs

4:16:23 beach Sounds plausible.

4:17:56 beach zephyr: Exactly. What I am trying to do here is to avoid the trap of believing one particular result for one particular language with one particular workload and which does not have the characteristics of SICL with its separation of dyads and racks.

4:19:42 beach moon-child: Yes, dividing up the heap may make sense, especially if the division is flexible.

4:21:54 beach ACTION goes to make some coffee.

4:42:00 beach So I think I would like to use a strategy like "always measure one level deeper" here. And instead of measuring one type of garbage collector to another type, try to measure the part that the garbage collect takes of overall performance, and what aspect of the garbage collector is responsible for that performance.

4:43:00 beach I mean, I want to avoid the maintenance of a more complicated collector that is perhaps faster than a simpler one, but where the performance of the collector matters very little to overall performance.

4:44:59 beach I know this thinking is contrary to common wisdom. But I fear that a lot of work is being done (in many Common Lisp implementations) that matters little to overall performance, and that increases the maintenance burden.

4:45:50 beach The reason I think that is that "improvements" seems to be decided upon based on information other than measurements.

4:46:36 beach And, perhaps contrary to the strategy of other implementations, I am willing to trade some performance for better maintainability.

4:47:54 sm2n On that note, you may find this amusing: <https://zeux.io/2022/01/08/on-proebstings-law/>

4:49:04 sm2n "LLVM 11 tends to take 2x longer to compile code with optimizations, and as a result produces code that runs 10-20% faster (with occasional outliers in either direction), compared to LLVM 2.7 which is more than 10 years old. This may be a general rule, something specific to highly tuned code, or something specific to meshoptimizer algorithms."

4:49:53 sm2n of course, this is not a proper study, but it is still funny especially considering how bad the story is for projects building on top of llvm wrt api stability and such

4:50:04 beach sm2n: Thanks for the link. Yes, I am more interested in things like debugging experience than in performance at all cost.

4:51:18 beach Yes, I am very pleased that I didn't take the advice from drmeister to use LLVM as a backend for SICL.

5:32:30 beach moon-child: Locality of reference matters as you pointed out, but only for the first reference to some datum, right? Once it is in the cache, locality no longer matters. Or am I missing something?

5:50:57 hayley Good morning beach!

5:51:38 hayley I just finished with a student meetup in the city.

5:52:09 beach Was it good?

5:52:36 hayley I'd better wait until I am at home to read the backlog, and not on the bus still.

5:53:00 beach Sure.

5:53:02 moon-child beach: not if data are smaller than a cache line

5:53:22 beach moon-child: Yes, I see. Good point.

5:53:56 moon-child and, consider iterating over a list: if the nodes are all sequential, the cpu will detect the linear access pattern and prefetch later nodes

5:54:10 hayley It was great. Quite nice to see people in person after mostly just knowing them by name and voice.

5:54:21 beach moon-child: Yes, that's what I meant by the first time.

5:54:32 moon-child ok

5:54:51 beach hayley: I see, yes. That's one reason I want to go to Porto rather than doing ELS online.

5:57:51 hayley All a generational collector requires, to me, is that a young generation can be collected independent of old generation(s). Hence a generational collector does not require moving; you just need to be able to partition the heap somehow.

5:58:14 moon-child it would still probably require barriers, though

5:58:26 beach True. And there is the "fake moving" collector identified by Paul Wilson.

5:58:49 hayley The scheme I have been thinking of is "pretenuring", where we predict that objects allocated by one particular allocation point will tend to live longer, based on runtime feedback.

5:59:40 moon-child '"fake moving" collector'?

6:00:01 moon-child I can not find any references to that

6:00:02 beach Yes, that's what he called it.

6:00:34 hayley Yes, you'd still need a write barrier to identify old-to-new references. (Well, the Go developers experimented with hashing...somehow to identify updated pages without a barrier, but it didn't work too well)

6:00:52 beach moon-child: I forget the paper in which he mentions it. Maybe the GC survey?

6:02:57 hayley But while objects that are allocated live together may tend to die together, it might not be the case that the mutator will use them in the order they were allocated. Shipilëv's test program produces a worst case of sorts, wherein the iteration order is completely random with regards to allocation order.

6:03:23 moon-child I find a reference only to 'fake copying'; perhaps that was it?

6:03:39 beach Ah, yes, sorry!

6:03:53 hayley ("allocated together", not "allocated live together")

6:08:05 moon-child that is interesting. Though I object to the assertion that '[j]ust as in a copy collector space reclamation is implicit'; the reclaimed objects are still operated upon

6:09:02 moon-child (in particular, they must be relinked when yet-reachable objects are moved to the alternate 'semispace')

6:11:20 moon-child and the addition of a colour bit is a bit of a giveaway, imo; this really seems closer to mark-and-sweep

6:12:38 hayley Is it much like Baker's treadmill collector?

6:13:44 moon-child I don't know. https://www.cs.cmu.edu/~fp/courses/15411-f08/misc/wilson94-gc.pdf page 13

6:13:51 hayley ...wherein the fromspace and tospace are represented by linking objects on doubly linked list. But maybe not, if there is a colour bit.

6:14:52 hayley It seems to me that Wang and Baker proposed the same idea, and Wilson just wrote the description in this survey.

6:16:03 beach I think they are very similar, and I don't know which one was first.

6:17:10 beach I am still intrigued by this idea of taking into account multiple cores. I know moon-child said "but the work has to be done" or something like that, but it is looking like we are going to get more and more cores, and that it is going to be hard to get them all to do useful work.

6:17:24 beach If that is the case, then a non-moving collector is perfect.

6:17:48 beach Each thread just has to do the marking of the roots.

6:18:01 hayley But with regards to compacting or not, it appears that the benchmarks taken in <https://www.cs.utexas.edu/users/speedway/DaCapo/papers/immix-pldi-2008.pdf>, notably on pages 1 and 6, suggest that compaction lowers the time and cache misses taken by the mutator. I don't know if it is due to locality, though.

6:19:39 hayley What about memory bandwidth? There are already "allocation walls" wherein memory bandwidth simply prevents one from writing a faster GC, without improving cache utilisation (perhaps by thread-local GC).

6:20:18 beach Yes, another good question.

6:21:01 beach So then the question becomes: What is the relative importance of improved mutator performance due to moving/copying/generational etc. and improved collection performance by having free cores do all the work?

6:21:02 moon-child ha, treadmill paper refers to 'brenda baker'. No relation, apparently

6:21:47 beach I mean, I don't expect any answers. But we should think about such issues.

6:21:53 hayley I recall Cliff Click claimed "one of the things Azul had (with 864-core machines) was the magical property of having a f**kton of bandwidth." (His words, not mine!) Azul machines also had a lot of spare cores to run GC and JIT compilation on, apparently.

6:23:04 beach Also, if all a mutator thread has to do is mark the roots, then pauses will be very brief in most cases.

6:23:20 moon-child yeah, that seems like a somewhat ideal position for a jit to be in

6:23:33 moon-child beach: if the mutator only marks the roots, and other marking happens off-thread, then you need barriers

6:23:44 hayley But the hardware and software eventually implemented generational GC, so I guess there were still gains to be had.

6:23:51 beach moon-child: I don't see that.

6:25:13 hayley A mutator thread still has to enforce some invariant like "no black objects refer directly to white objects", and a barrier would still be necessary to enforce it. But the slow path of such a barrier is still not too slow.

6:25:17 moon-child Suppose I have X, Y, Z. X Y are roots, and we have Y -> Z. First the mark thread traces X. Then the mutator creates a reference X -> Z, and destroys the reference Y -> Z. Then the mark thread traces Y. Then it is done. Z is still white

6:25:42 beach Ah, yes, I see.

6:27:42 hayley Hence, a concurrent copying collector "just" has another kind of barrier, which is usually a read barrier. We suspect that read barriers might be slower, as reads are more frequent than writes, but mutator threads are still not doing much collection work.

6:31:42 moon-child something that occurs to me: when the mutator trips a barrier, it fixes up its pointer. This is a bit like the go strategy of moving gc work onto a mutator thread if the latter gets ahead of the former, except that you only do it when you have to

6:32:05 moon-child (and of course, go does not copy)

6:32:17 hayley Right, the aim is to get the collector to eventually catch up with the mutator.

6:37:04 hayley (Must say, the jokes in the meetup were _awful_, as it was entirely comprised of computer science students. We went for lunch, and I proposed to collect all the rubbish and take it to the bin. When I came back, I couldn't tell what the other students were discussing, and proclaimed "Damn concurrent collection, you've started talking again without me!")

6:51:16 hayley Is Wilson's claim that "objects that are allocated together, die together" making a claim about temporal locality or spatial locality though? i.e. is it "objects allocated around the same time die around the same time", or "objects allocated nearby in memory die around the same time"? A bump allocator makes "allocated around the same time" and "allocated nearby in memory" very similar, but I am not so sure for a mark-sweep collector.

6:52:27 moon-child I can't imagine he was referring to spatial locality

6:53:44 hayley I think good locality is achieved when "objects allocated nearby in memory are used around the same time", which is an opposite of sorts of the latter interpretation.

6:54:48 moon-child I agree

6:58:27 hayley So, "objects that are allocated at around the same time also die around the same time" is useful for predicting the lifetimes of objects (i.e. pretenuring), but I am not sure how it interacts with locality with a mark-sweep collector. The interaction with locality with compacting collection (or moreso bump allocation, but the two are correlated) is easier to spot, as objects allocated around the same time are also nearby in memory.

7:07:21 beach I think Wilson is talking about time.

7:08:31 beach And the example of traversing a list is interesting, because you can't have both the CAR and the CDR nearby.

7:08:48 beach So a compacting GC needs to make a choice.

7:09:02 moon-child well, depends on how near you need them to be :P

7:09:48 hayley I recall reading a study about making the decision to line up CARs or CDRs in Interlisp, which then presented the entropy of the resulting words in the heap.

7:10:26 beach But that was before the processor cache hierarchy we now have, right?

7:10:50 hayley Right, compacting was used to avoid paging instead of cache misses.

7:11:11 beach Ah, yes, it all comes back. :)

7:11:21 moon-child same hierarchy, we just added a hard drive on the bottom ;)

7:11:59 beach Heh.

7:13:45 hayley Admittedly I have to wonder if designs that avoid paging are still useful for avoiding cache misses. One reason against such logic is that reading a random page off disk is slower than reading a random cache line out of primary memory, and disk pages tend to be at least a magnitude larger.

7:25:27 moon-child For better or worse, the cache is much more 'dynamic'. You can not enquire what is currently in it, and it is more possible to page memory in, lazily and asynchronously

7:26:22 moon-child I suppose cache also complicates concurrency. Were there any multiprocessors without cache?

7:27:40 White_Flame maybe the transputer?

7:28:15 White_Flame anything with message passing doesn't need consistency

7:29:24 moon-child wiki says 'Unlike the earlier models, the T9000 had a true 16 KB high-speed cache'

7:33:53 White_Flame yeah, the earlier had only "ram" on chip, not a cache

7:38:24 hayley The Raspberry Pi Pico has two cores, no cache, and no CAS or LL/SC instructions.

7:44:24 moon-child ;o

7:44:26 moon-child how does _that_ work?

7:44:29 moon-child just hope you get lucky?

7:47:02 hayley Same way it works in hardware; you have to employ some other algorithm for mutual exclusion, like the bakery algorithm.

7:47:26 hayley ACTION taps http://lamport.azurewebsites.net/pubs/lamport-mutual-solved.pdf

7:54:11 hayley The runtime library provided by the Pi Foundation includes a mutex implementation, but I haven't checked how they actually do it.