libera/#sicl - IRC Chatlog

3:00:16 beach Good morning everyone!

3:00:33 hayley Good morning beach!

4:26:48 beach hayley: Remember that we think that the current plan for the SICL garbage collector will have a problem when the write barrier is tripped and objects need to be promoted. As you pointed out, there are likely many situations where a nursery collection will be triggered way too often.

4:27:03 beach So here is another idea: We do not trigger a nursery collection, and instead we just promote objects, and mark them as such in the nursery. Then, the compiler inserts code, following each function call, that checks live variables referring to nursery objects to see whether they have been promoted and, if so, updates the pointer. Only when the nursery is full do we trigger a nursery collection.

4:27:05 beach It might be possible to avoid many of these checks by some compilation analyses. In particular, if the nursery contains only CONS cells (as I hinted the other day might be sufficient), then type inference might help.

4:30:21 beach Also, some function calls might be known not to trip the write barrier, so in that case, no checks are needed after it returns.

4:33:39 hayley That's a good idea. There is a sort of incremental stack scanning for concurrent GC, where only a small set of stack frames is scanned at a time, in order to reduce pause time. The return address of the oldest frame in each set is modified to return to a barrier routine, which then scans the next set of frames, and installs the routine there. The end result is that the stack is scanned lazily, and it seems similar to what you suggest.

4:40:52 hayley It still appears necessary to fix up references to promoted objects coming from other objects in the nursery. I linked a paper a while ago about using a...sort of remembered set to find the set of objects that may need to be fixed up; but I wasn't entirely convinced because the entire stack would still have to be scanned. Using incremental stack scanning would solve that issue, and would also reduce the number of times that older stack frames have

4:40:53 hayley to be scanned.

4:46:39 hayley I still think avoiding any pointer fixing, by having functions that create objects which usually escape to the global heap allocate directly into the global heap, would be better, but I have no idea what sort of analysis or runtime feedback would still produce enough local allocations to justify having local heaps.

4:51:15 beach Why wouldn't recursively promoting any object pointed to solve that problem?

4:51:30 beach ... the problem of fixing up pointers, I mean.

4:52:10 beach And, yes, allocating directly in the global heap when it is likely that an object will be promoted is a good idea.

4:52:58 hayley Say, just observing the function causing allocations wouldn't work too well, if some functions like MAP can create objects of very different lifetimes. I'd assume ML has a similar issue, but the ML people looked into region-based memory management, which is already polymorphic, rather than pre-tenuring, I think.

4:53:05 nyx_land_ ** NICK nyx_land

4:53:14 beach For example: (push <mumble> *variable*). Here, the CONS cell being allocated is going to be referred to by *variable* which may already be in the global heap.

4:55:32 hayley The problem would arise in something like (let* ((x (make-something)) (y (list x))) (promote x) ...) as the value bound to Y still references the original value of X, and not the copy in the global heap.

4:56:01 hayley The problem involves references to the promoted object, not references from the promoted object.

5:03:22 beach Well, the idea was to check for that situation whenever code has been executed that may promote an object. There would be a bitmap in the nursery, indicating what objects have been promoted. The code following a function call would check a bit in that bitmap.

5:04:10 beach So any lexical variable that the compiler thinks may point to the nursery would then be checked.

5:05:33 hayley Right then. How would we set the bit corresponding to the value of Y?

5:08:22 beach Ah, yes, I see what you mean. So I guess when a lexical variable is assigned to, then a check would also have to be made. That might be a lot more costly indeed.

5:08:22 moon-child ACTION whispers: read barrier

5:08:24 moon-child ACTION hides

5:09:01 moon-child (if you have a read barrier for concurrent gc in the global heap, you may be able to reuse it for objects which have moved out of the nursery)

5:09:59 beach Read barriers sound costly though. I might be wrong of course. Perhaps the check is dwarfed by the memory operation.

5:10:05 beach I need to think about that.

5:10:42 moon-child as I recall, there was some discussion of this in the past, and hayley reported that steve blackburn(?) gave some reason why read barriers are not so expensive as they seem

5:11:25 moon-child I don't remember the details now, though

5:12:26 hayley This situation is handled by the remembered set I mentioned, I think. The thread-local nursery is split up into a part which was allocated after the last time the write barrier was trapped, and a part which was allocated before. Assignments causing an object in the latter part to refer to an object in the former part cause the object (or slot) to be logged somehow. When the barrier is tripped, it is only necessary to scan the former part of the

5:12:27 hayley heap, and the logged objects.

5:12:29 moon-child hmm, actually. It's not per se a problem to have one copy of an object in the nursery and another in the global heap, as long as all you're doing is reading from other. So all you need is a write+EQ barrier

5:12:35 moon-child *from either

5:12:55 hayley moon-child: That sounds a bit like "replication copying".

5:13:27 hayley Though somewhat simpler, as the collector doesn't have to manage concurrent copying.

5:15:24 beach Thanks for the feedback. More thinking required on my part. Please feel free to continue the discussion though. Also, this is Monday, and Monday mornings are chaotic around here. So I'll be off and on for the next few hours.

5:16:33 moon-child hayley: yeah, that's where I got the idea from

5:16:41 moon-child (though I don't think I had it attached to that particular name, so thanks)

5:16:54 hayley moon-child: I had a discussion with Gil Tene (the CTO of Azul) on read barriers, because he had claimed that the read barrier for Pauseless/C4 had lower overhead than the write barrier for G1.

5:17:10 CodeBitCookie[m] Good Morning Everyone!

5:17:11 moon-child ah, yes, gil tene, that was it

5:18:24 hayley https://twitter.com/nodefunallowed/status/1433930547265376258 The summary is that the read barrier can only be tripped once per slot per GC cycle, whereas the write barrier can be tripped once per assignment to a slot.

5:18:27 hayley Good morning CodeBitCookie!

5:18:30 CodeBitCookie[m] *no-defun-allowed*: :)

5:19:20 moon-child ACTION nods

5:19:24 hayley However, I am not sure if the Pauseless barrier specifically is appropriate for incrementally fixing up pointers to a few relocated objects, as opposed to doing a proper garbage collection.

5:22:41 hayley Similarly I'd have to think about if the barriers for replication copying would suffice. The issue is that, in incremental copying collection, the collector is often given a "head start" as the roots are always moved to new-space. I also thought there was a read barrier for replication copying, but I don't remember.

5:23:32 hayley .oO( To the Handbook! )

5:24:35 hayley "Replication copying collectors ... relax this requirement by allowing the mutator to continue operating against fromspace originals even while the collector is copying them to newspace." There we go.

5:27:40 hayley But I don't see how that could work here. We have a copy in the global heap, and writes to the global copy should (eventually) become visible when reading the local copy somehow.

5:38:09 beach I think I should implement only the global collector initially, and then I'll put you two in charge of coming up with a reasonable nursery technique, including synchronization, barriers, etc.

5:38:53 beach I am pretty attached to the technique of the global heap, so I am not willing to budge in that respect. At least not initially.

5:39:59 beach But I do think we need to "always measure one level deeper" whenever possible.

5:40:16 hayley The Nettles et al paper on replication copying suggests "write operations must modify both to-space and from-space replicas", which I wouldn't know how to do atomically; as we are still lacking transactional memory in 2022.

5:46:30 hayley Supposing there is a way to atomically update two locations at once (the dreaded "double CAS"), we could still try to fix up any locations that might have pointers to the original object, similar to what beach suggested, reducing the frequency that the barrier is tripped, similar to the property the read barrier for Pauseless has.

5:49:44 hayley Fixing the registers is likely to be quite fast compared to emulating a double CAS (or taking a lock while writing, or something else that I haven't thought of). If we still need to perform a lot of replicated writes, we could perform a nursery collection and/or adjust our heuristics for allocating into the global heap.

6:28:01 contrapunctus I've been using `eclector.parse-result:read` to read forms in files; but I can't figure out how to get skipped input (specifically comments), or even its location. I've been trying out different things based on the manual (defining a `make-skipped-input-result` method, trying to access the second return value of `read`), but to no avail...

7:18:10 beach contrapunctus: I think you need to wait for scymtym to have time to answer.

7:18:25 contrapunctus Thanks, I will 🙂

7:27:25 hayley moon-child: Do you know any literature on double-CAS or otherwise doing writes to two replicas atomically?

7:29:57 moon-child I haven't read it, but someone

7:30:03 moon-child linked https://www.cl.cam.ac.uk/research/srg/netos/papers/2002-casn.pdf a while ago

7:32:14 moon-child 'reserves a small and constant amount of space in each word updated (either 0 or 2 bits)' can they be spared? If not, maybe an out-of-line bitmap

7:32:23 hayley One other complication is that writes to global objects from other threads would require a check, to see if we need to replicate the write to another nursery...and we'd need to store that nursery too, oh dear. It's also necessary to have some sort of synchronisation with the other nursery, so that doing the atomic update won't race with compaction. I guess we could have a hash table of local originals of global replicas, a bit like lazy promotion

7:32:23 hayley for address hashing.

7:33:09 hayley The relevant tables would be very sparse, I think.

7:33:40 moon-child that said--I guess the idea is something gets copied to the global heap, but a copy stays in t1's nursery, and then t2 tries to modify it? That seems fraught

7:33:51 hayley Yep, that's the idea.

7:35:13 hayley We double-CAS to keep the two replicas synchronised. If there are too many DCAS updates, we suck it up and do a nursery collection, to remove replicas.

7:35:56 moon-child that write constitutes a concurrent modification. If it isn't done with any strong or atomic memory ordering, I think the memory model may allow the state to be incoherent

7:36:37 moon-child (or, rather, it does if t1 is writing the object at the same time, which is what would cause the problems)

7:37:18 hayley Sure, but we don't exactly have a memory model, and I think the Java memory model wants us to pick a winning write. But I haven't read the JMM properly, and I would guess replication copiers (like Sapphire) for Java have to handle this problem too.

7:38:32 moon-child (t1 could have a pointer both to the nursery copy of the object and to the global copy of the object. If t2 updates both in an unsynchronised fashion, t1 could observe the writes in either order, so it could observe a new state followed by the old state. I'm pretty sure the memory model has to allow this, in order to permit things like mem->register promotion)

7:39:21 moon-child hayley: well, there is bike's proposal. I suspect that's the most relevant thing here, even if it is not finished yet

7:39:47 moon-child 'pick a winning write' well--at _some_ point, you would synchronise; that would happen when you do a nursery collection (or if you have a fence or w/e)

7:40:28 moon-child and I'll add, value in nursery being different from value in global heap is very similar to value in one core's cache being different to the value in another core's cache. So I would _guess_ this would be ok

7:41:38 hayley One thing I need to have a think about is it would appear that writing to a replicated object and doing a compaction cannot occur concurrently, which suggests that each nursery requires a lock. In theory this is distressing, as nursery collections are no longer lock-free, but the critical section for updating a replica perhaps isn't too long.

7:43:00 moon-child hmm can't it?

7:43:20 moon-child ideally the nursery compactor would just fight with the other thread for the specific replicated object

7:43:31 moon-child rather than locking the whole nursery

7:43:36 moon-child which means you can move other objects while you wait

7:43:54 hayley Hm, it might suffice to disable replication for global objects just before compaction; I'd have to re-read beach's paper on the collection algorithm. But more sinister is a situation like T1 starts a DCAS, T2 starts compacting, T1 now updates into somewhere random, because it didn't know that the object is no longer being replicated.

7:44:19 hayley ACTION grumbles how many times she says "replicated object", when it has nothing to do with her published description of a "replicated object system".

7:45:22 moon-child oh hmm right

7:45:42 moon-child see, this is why I was dubious of letting threads peek into each other's nurseries :P

7:45:49 moon-child but presumably there is a backpointer

7:45:59 moon-child so the global replica is the 'canonical' one

7:46:15 hayley There is, yes.

7:46:16 moon-child t1 and t2 race to set a flag on that one, and whichever one wins gets to diddle or move it

7:46:38 moon-child if t2 loses, it leaves the object alone while it compacts the others

7:47:19 moon-child most objects are not replicated, so it's a free branch most of the time

7:47:33 hayley Okay, so the nursery collector has to lock every object before compacting? It's quite likely to win, if we disable replication just before, and it'd only lose the race if some other thread races which is unlikely.

7:51:14 moon-child hmm. So a couple more things: 1, maybe replicated objects should always be moved into the global heap following a minor gc (so at that point you just flush your write and then set a forwarding pointer); 2, alternately, can we demote a replicated object back to a nursery object?

7:51:42 hayley I think the current idea is to do #1 as we are trying to promote objects "lazily".

7:56:37 moon-child hmm, it can be done with a single atomic: reserve a lock bit in the pointer from the global replica to the local one. I don't know how much that helps. But I think the main important point is that if you lose the lock, you can keep doing work; don't have to wait for it

7:58:57 moon-child !

7:59:16 moon-child if the nursery is semispace rather than sliding, then you can just leave the old object there, and don't care if another thread reads a stale pointer

7:59:31 moon-child only have to synchronise with the _next_ minor collection, which should not be a problem

8:00:33 hayley Hm, now I don't know if that DCAS will be atomic with respects to ordinary reads. If we have to do something fancy to make reads synchronise properly, we may as well have some other sort of read barrier.

8:07:58 moon-child you are probably right. But I still suspect that whatever memory model is ultimately adopted will be weak enough that it doesn't have to be atomic

8:12:12 hayley Perhaps so. But then, how do we have to behave around a fence? I was hoping to have atomic updates, to avoid the hard problems.

8:13:01 moon-child fence could check if any updates have been performed on objects within this nursery

8:13:29 moon-child maybe that's too expensive? hmm

8:13:51 hayley Have I complained about the lack of transactoinal memory enough today?

8:13:58 moon-child :D

8:14:01 hayley *transactional

8:14:15 moon-child spitballing--maybe fencing can increment some generation number, and if another thread wants to mutate a replicated object and its generation number is too old, it has to do extra work?

8:14:46 hayley Though, I could not say what to do if the micro-architecture hates us, or we otherwise keep failing to commit.

8:17:26 hayley We could predict that, based on how caches and transactional memory tend to work, this scheme won't spuriously fail, but there is no guarantee in the architecture specification, I think.

8:19:44 beach So we don't have hardware transactional memory, but do we have speculative lock elision? If so, locks might be very inexpensive.

8:20:57 hayley Perhaps in our dreams. Intel had speculative lock elision, but I think they had to disable it due to bugs.

8:21:09 beach Oh! :(

8:21:50 beach But do we know whether it will be activated again in the future?

8:22:30 hayley However, locking wouldn't help in this situation either. The plan I had was to have some sort of double-CAS, such that unsynchronised reads would observe either the old or the new value in both replicas, without the possibility to read the old in one replica, and the new in another replica.

8:22:47 beach Yes, I see.

8:23:59 hayley Wikipedia claims there will be a "TSXLDTRK" instruction set for Xeon processors set to be released in the second half of this year. Of course, the issue is that it will be a feature you only get on server CPUs, which will not be useful for many users.

8:24:46 beach Hmm, OK.

8:26:01 moon-child why don't they put the software people in charge of hardware design

8:26:06 hayley I believe someone once said "PC manufacturers cannot invent acronyms."

8:26:07 moon-child :/

8:27:04 moon-child in fairness, computer scientists (like physicists) are known for being absolutely awful at naming things

8:39:12 hayley It would appear Sapphire does unsynchronised writes. It also has to replicate writes of unboxed slots, too.

8:53:58 hayley At the least, we've established that fixing up pointers incrementally is quite similar to replication copying. And the description of Sapphire in the paper already assumes there can be a part of the heap we don't want to copy, and that the whole world doesn't need to be stopped for phase changes, so hopefully we don't have to invent too much now.

8:55:03 hayley The paper claims "Programs with data races may exhibit new behaviors under Sapphire, but the effects are similar to those allowed when optimizing ([12, p. 378]). Further, the values stored into any field are ones stored by some thread, so Sapphire cannot violate Java type-safety. In any case, for properly synchronized programs, Sapphire produces results consistent with the Java Virtual Machine Specification." So I think we are safe in that regard,

8:55:03 hayley too.

9:18:04 hayley Hm, the...other paper by Hudson and Moss claims that volatile slots in Java are handled by ensuring that the operations are performed on the newspace replica, if one exists. This is a bit like a read barrier, though it might not appear in generated code many times, if we only have to emit it near barriers.

9:19:13 moon-child what is 'near'?

9:20:19 hayley Good question. I'm only certain that we need the read barrier when a fence is performed.

9:21:35 moon-child I don't disagree, but you don't know, at any given point, if a fence was in fact performed

9:21:49 moon-child so don't you need to barrier in case it was?

9:22:02 hayley Right.

9:22:58 moon-child hmmm. Maybe compile two versions of every function: one with read barriers, and one without. After you perform a fence, you osr to the read-barrier code. When you do a minor gc, you switch back

9:23:28 moon-child if doing a lot of fences, you should stay in the read-barrier code pretty much all the time (and maybe the minor gc has some heuristic to not switch back then), so you don't pay so many return mispredicts

9:25:18 hayley I probably should see what the Sapphire people did. The paper only mentions non-volatile and volatile fields in Java, and nothing on other sorts of atomic access.

9:25:29 moon-child ACTION nods

9:33:09 hayley Maybe the issue could be avoided by making sure a mutator flips its roots to newspace, before fencing. But that doesn't help us at all, as we are trying very hard to not flip roots.

9:57:16 hayley My wild guess at the moment is that the read barrier is only necessary when performing atomic updates, but I have no idea.

10:01:09 hayley But then something like (let ((x 'old)) (atomically-setf x 'new) x) could return either OLD or NEW in the resulting memory model, which is pretty weird.

10:20:43 hayley ...no it couldn't, what am I saying? One "issue" occurs when a replicated write occurs concurrently with two unsynchronised reads to either replica, and the reads read both the old and new value. But this is fine, as the write can be observed to happen in between reads.

10:24:57 hayley Another issue occurs when two threads try to perform replicated writes concurrently, in which case the replicas can remain out of sync indefinitely. I think we could get away with locking, if the critical section is only two writes, and the locks have fine granularity (say, one lock bit per 64 bytes, for arbitrary values of 64). A similar thing applies for atomic updates (though, if the user wants lock-free updates, it might be nice to give them

10:24:57 hayley lock-free updates). So I'm starting to think we'll be fine.

10:29:13 hayley And, of course, if we spend too much time locking while performing replicated writes, we can cut our losses and just do a minor GC/promotion. Though the first issue might be a real issue, if we observe the new value, and then the old value again, by reading replicas in the opposite order to which they were written. The memory model would likely allow reordering, but not across a fence, which is the real problem here.

10:49:38 hayley I have a vague idea, but so far there isn't a chance this idea can be done efficiently. It might suffice for a fence to ensure that all locks have been released, i.e. there are no replicated writes occuring concurrently with the fence. Therefore it is not possible for reads occurring after the fence to miss any writes which occurred before the fence, I think.

10:50:45 hayley Though, again, replication copying is hopefully uncommon and required of very few objects, so a fence might just require iterating over a table of replicated objects, if there are any replicated objects to deal with.

11:09:39 hayley That would look like thread-local heaps for mutable objects, without having to fix every root when the write barrier is tripped. We only had to invent our own replica coherency protocol in software.

12:10:55 hayley Perhaps this scheme is too slow in practise; I have no idea and would have to test. Speculative lock elision could be useful, if we get a working implementation in the rest of this century. And we can still decide to avoid local allocation or stomach a minor GC if the time overhead of replication is intolerable.

13:24:46 zak1 ** NICK z4kz

13:28:32 scymtym beach: i have already written down some requirements for the environment library. i can try to make those available but i'm not sure whether i will manage to do it today

13:29:22 beach Great! No rush. I decided Clostrum is good enough for me to continue working on bootstrapping, so I won't work on it for a while.