libera/#sicl - IRC Chatlog
4:26:48
beach
hayley: Remember that we think that the current plan for the SICL garbage collector will have a problem when the write barrier is tripped and objects need to be promoted. As you pointed out, there are likely many situations where a nursery collection will be triggered way too often.
4:27:03
beach
So here is another idea: We do not trigger a nursery collection, and instead we just promote objects, and mark them as such in the nursery. Then, the compiler inserts code, following each function call, that checks live variables referring to nursery objects to see whether they have been promoted and, if so, updates the pointer. Only when the nursery is full do we trigger a nursery collection.
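A toy sketch of the scheme beach describes, with every name invented for illustration: a hash table stands in for the promotion bitmap and forwarding pointers, and only conses are modeled, following the CONS-only nursery hinted at below.

    ;; Toy model only: the real mechanism would be a per-nursery bitmap
    ;; plus forwarding pointers; a hash table from promoted objects to
    ;; their global copies stands in for both.
    (defvar *promoted* (make-hash-table :test #'eq))

    (defun copy-to-global-heap (c)
      ;; Stand-in for real promotion; models a CONS-only nursery.
      (cons (car c) (cdr c)))

    (defun promote (object)
      ;; What the write barrier would do instead of triggering a
      ;; nursery collection: copy, and record the forwarding.
      (setf (gethash object *promoted*) (copy-to-global-heap object)))

    (defmacro maybe-fixup (var)
      ;; The check the compiler inserts after each call, once per live
      ;; variable VAR that may refer to a nursery object.
      `(let ((copy (gethash ,var *promoted*)))
         (when copy (setf ,var copy))))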
4:27:05
beach
It might be possible to avoid many of these checks by some compilation analyses. In particular, if the nursery contains only CONS cells (as I hinted the other day might be sufficient), then type inference might help.
4:30:21
beach
Also, some function calls might be known not to trip the write barrier, so in that case, no checks are needed after it returns.
4:33:39
hayley
That's a good idea. There is a sort of incremental stack scanning for concurrent GC, where only a small set of stack frames is scanned at a time, in order to reduce pause time. The return address of the oldest frame in each set is modified to return to a barrier routine, which then scans the next set of frames, and installs the routine there. The end result is that the stack is scanned lazily, and it seems similar to what you suggest.
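A rough model of that return-barrier technique, with the stack as a list of frames and a cursor standing in for the patched return address; all names are invented.

    ;; A real collector overwrites the return address of the oldest
    ;; scanned frame; here a cursor plays that role.
    (defvar *stack* (list 'frame-1 'frame-2 'frame-3 'frame-4 'frame-5))
    (defvar *scan-cursor* 0)
    (defparameter *window-size* 2)

    (defun scan-frame (frame)
      (format t "scanning ~a~%" frame))

    (defun return-barrier ()
      ;; Runs when control returns past the oldest scanned frame:
      ;; lazily scan the next window, leaving the barrier below it.
      (loop repeat *window-size*
            while (< *scan-cursor* (length *stack*))
            do (scan-frame (nth *scan-cursor* *stack*))
               (incf *scan-cursor*)))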
4:40:52
hayley
It still appears necessary to fix up references to promoted objects coming from other objects in the nursery. I linked a paper a while ago about using a...sort of remembered set to find the set of objects that may need to be fixed up; but I wasn't entirely convinced, because the entire stack would still have to be scanned. Using incremental stack scanning would solve that issue, and would also reduce the number of times that older stack frames have to be scanned.
4:46:39
hayley
I still think it would be better to avoid any pointer fixing by having functions whose objects usually escape to the global heap allocate directly there, but I have no idea what sort of analysis or runtime feedback would still leave enough local allocations to justify having local heaps.
4:52:10
beach
And, yes, allocating directly in the global heap when it is likely that an object will be promoted is a good idea.
4:52:58
hayley
Just observing which function caused an allocation wouldn't work too well, since functions like MAP can create objects with very different lifetimes. I'd assume ML has a similar issue, but the ML people looked into region-based memory management, which is already polymorphic, rather than pre-tenuring, I think.
4:53:14
beach
For example: (push <mumble> *variable*). Here, the CONS cell being allocated is going to be referred to by *variable* which may already be in the global heap.
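For reference, the PUSH above is shorthand for an assignment, which is exactly what stores the fresh cons into a possibly-global location:

    ;; (push <mumble> *variable*) is shorthand for
    (setf *variable* (cons <mumble> *variable*))
    ;; so the fresh cons is immediately stored into *VARIABLE*, whose
    ;; storage may already be in the global heap.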
4:55:32
hayley
The problem would arise in something like (let* ((x (make-something)) (y (list x))) (promote x) ...) as the value bound to Y still references the original value of X, and not the copy in the global heap.
4:56:01
hayley
The problem involves references to the promoted object, not references from the promoted object.
5:03:22
beach
Well, the idea was to check for that situation whenever code has been executed that may promote an object. There would be a bitmap in the nursery, indicating what objects have been promoted. The code following a function call would check a bit in that bitmap.
5:04:10
beach
So any lexical variable that the compiler thinks may point to the nursery would then be checked.
5:08:22
beach
Ah, yes, I see what you mean. So I guess when a lexical variable is assigned to, then a check would also have to be made. That might be a lot more costly indeed.
5:09:01
moon-child
(if you have a read barrier for concurrent gc in the global heap, you may be able to reuse it for objects which have moved out of the nursery)
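A minimal sketch of such a barrier, assuming (purely for illustration) a Brooks-style indirection slot in every object; none of this is SICL's actual layout.

    ;; Minimal Brooks-style barrier: each object carries a forwarding
    ;; slot, NIL until the object is moved.
    (defstruct obj
      (forward nil)
      payload)

    (defun resolve (x)
      ;; Applied on reference loads: follow the forwarding slot if the
      ;; object was moved (out of the nursery, or by the concurrent GC).
      (if (obj-p x)
          (or (obj-forward x) x)
          x))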
5:09:59
beach
Read barriers sound costly though. I might be wrong of course. Perhaps the check is dwarfed by the memory operation.
5:10:42
moon-child
as I recall, there was some discussion of this in the past, and hayley reported that steve blackburn(?) gave some reason why read barriers are not so expensive as they seem
5:12:26
hayley
This situation is handled by the remembered set I mentioned, I think. The thread-local nursery is split into a part which was allocated after the last time the write barrier was tripped, and a part which was allocated before. Assignments causing an object in the latter part to refer to an object in the former part cause the object (or slot) to be logged somehow. When the barrier is tripped, it is only necessary to scan the former part of the nursery.
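A toy version of that logging, with the split modeled as an allocation epoch on each object; all names are invented.

    ;; Objects with EPOCH < *BARRIER-EPOCH* form the "before" part.
    ;; Stores making a before-object refer to an after-object are
    ;; logged, so only the after part plus the log needs scanning when
    ;; the barrier trips.
    (defstruct (node (:constructor make-node (epoch &optional value)))
      epoch
      value)

    (defvar *barrier-epoch* 0)  ; bumped each time the barrier trips
    (defvar *old-to-new-log* '())

    (defun store-into (container new-value)
      (when (and (node-p new-value)
                 (< (node-epoch container) *barrier-epoch*)
                 (>= (node-epoch new-value) *barrier-epoch*))
        (push container *old-to-new-log*))
      (setf (node-value container) new-value))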
5:12:29
moon-child
hmm, actually. It's not per se a problem to have one copy of an object in the nursery and another in the global heap, as long as all you're doing is reading from either. So all you need is a write+EQ barrier
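Sketched against RESOLVE above, the EQ half of that barrier might look like this; a hypothetical shape, not a committed design.

    ;; With two live replicas, identity must be tested on canonical
    ;; (resolved) pointers; writes would go through a barrier as well.
    (defun barrier-eq (a b)
      (eq (resolve a) (resolve b)))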
5:15:24
beach
Thanks for the feedback. More thinking required on my part. Please feel free to continue the discussion though. Also, this is Monday, and Monday mornings are chaotic around here. So I'll be off and on for the next few hours.
5:16:54
hayley
moon-child: I had a discussion with Gil Tene (the CTO of Azul) on read barriers, because he had claimed that the read barrier for Pauseless/C4 had lower overhead than the write barrier for G1.
5:18:24
hayley
https://twitter.com/nodefunallowed/status/1433930547265376258 The summary is that the read barrier can only be tripped once per slot per GC cycle, whereas the write barrier can be tripped once per assignment to a slot.
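The once-per-slot property comes from the barrier healing the slot it loaded from. A toy version on top of RESOLVE above (PLACE is evaluated more than once, which is fine for a sketch):

    ;; Self-healing read: after following the forwarding pointer, the
    ;; slot itself is repaired, so the next load of the same slot does
    ;; not trip the barrier again.
    (defmacro healing-read (place)
      `(let* ((raw ,place)
              (fixed (resolve raw)))
         (unless (eq raw fixed)
           (setf ,place fixed))
         fixed))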
5:19:24
hayley
However, I am not sure if the Pauseless barrier specifically is appropriate for incrementally fixing up pointers to a few relocated objects, as opposed to doing a proper garbage collection.
5:22:41
hayley
Similarly I'd have to think about if the barriers for replication copying would suffice. The issue is that, in incremental copying collection, the collector is often given a "head start" as the roots are always moved to new-space. I also thought there was a read barrier for replication copying, but I don't remember.
5:24:35
hayley
"Replication copying collectors ... relax this requirement by allowing the mutator to continue operating against fromspace originals even while the collector is copying them to newspace." There we go.
5:27:40
hayley
But I don't see how that could work here. We have a copy in the global heap, and writes to the global copy should (eventually) become visible when reading the local copy somehow.
5:38:09
beach
I think I should implement only the global collector initially, and then I'll put you two in charge of coming up with a reasonable nursery technique, including synchronization, barriers, etc.
5:38:53
beach
I am pretty attached to the technique of the global heap, so I am not willing to budge in that respect. At least not initially.
5:40:16
hayley
The Nettles et al paper on replication copying suggests "write operations must modify both to-space and from-space replicas", which I wouldn't know how to do atomically; as we are still lacking transactional memory in 2022.
5:46:30
hayley
Supposing there is a way to atomically update two locations at once (the dreaded "double CAS"), we could still try to fix up any locations that might have pointers to the original object, similar to what beach suggested, reducing the frequency that the barrier is tripped, similar to the property the read barrier for Pauseless has.
5:49:44
hayley
Fixing the registers is likely to be quite fast compared to emulating a double CAS (or taking a lock while writing, or something else that I haven't thought of). If we still need to perform a lot of replicated writes, we could perform a nursery collection and/or adjust our heuristics for allocating into the global heap.
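One way to emulate the double CAS with a single ordinary CAS on a per-object lock word, sketched with SBCL's SB-EXT:COMPARE-AND-SWAP and an invented REPLICA layout. This serialises concurrent writers; unsynchronised readers still need the ordering care discussed below.

    ;; Emulated "double CAS": take a per-object spin lock with one CAS,
    ;; write both replicas, release.  A real version would release the
    ;; lock with a barrier; layout and names are invented.
    (defstruct replica
      (lock nil)
      local    ; slot in the nursery copy
      global)  ; slot in the global-heap copy

    (defun replicated-write (r new-value)
      ;; Spin until we swap NIL -> T in the lock word.
      (loop until (null (sb-ext:compare-and-swap (replica-lock r) nil t)))
      (setf (replica-local r) new-value
            (replica-global r) new-value)
      (setf (replica-lock r) nil))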
6:28:01
contrapunctus
I've been using `eclector.parse-result:read` to read forms in files; but I can't figure out how to get skipped input (specifically comments), or even its location. I've been trying out different things based on the manual (defining a `make-skipped-input-result` method, trying to access the second return value of `read`), but to no avail...
7:27:25
hayley
moon-child: Do you know any literature on double-CAS or otherwise doing writes to two replicas atomically?
7:30:03
moon-child
linked https://www.cl.cam.ac.uk/research/srg/netos/papers/2002-casn.pdf a while ago
7:32:14
moon-child
'reserves a small and constant amount of space in each word updated (either 0 or 2 bits)' can they be spared? If not, maybe an out-of-line bitmap
7:32:23
hayley
One other complication is that writes to global objects from other threads would require a check, to see if we need to replicate the write to another nursery...and we'd need to store which nursery, too, oh dear. It's also necessary to have some sort of synchronisation with the other nursery, so that doing the atomic update won't race with compaction. I guess we could have a hash table of local originals of global replicas, a bit like lazy promotion.
7:33:40
moon-child
that said--I guess the idea is something gets copied to the global heap, but a copy stays in t1's nursery, and then t2 tries to modify it? That seems fraught
7:35:13
hayley
We double-CAS to keep the two replicas synchronised. If there are too many DCAS updates, we suck it up and do a nursery collection, to remove replicas.
7:35:56
moon-child
that write constitutes a concurrent modification. If it isn't done with any strong or atomic memory ordering, I think the memory model may allow the state to be incoherent
7:36:37
moon-child
(or, rather, it does if t1 is writing the object at the same time, which is what would cause the problems)
7:37:18
hayley
Sure, but we don't exactly have a memory model, and I think the Java memory model wants us to pick a winning write. But I haven't read the JMM properly, and I would guess replication copiers (like Sapphire) for Java have to handle this problem too.
7:38:32
moon-child
(t1 could have a pointer both to the nursery copy of the object and to the global copy of the object. If t2 updates both in an unsynchronised fashion, t1 could observe the writes in either order, so it could observe a new state followed by the old state. I'm pretty sure the memory model has to allow this, in order to permit things like mem->register promotion)
7:39:21
moon-child
hayley: well, there is bike's proposal. I suspect that's the most relevant thing here, even if it is not finished yet
7:39:47
moon-child
'pick a winning write' well--at _some_ point, you would synchronise; that would happen when you do a nursery collection (or if you have a fence or w/e)
7:40:28
moon-child
and I'll add, value in nursery being different from value in global heap is very similar to value in one core's cache being different to the value in another core's cache. So I would _guess_ this would be ok
7:41:38
hayley
One thing I need to think about: it would appear that writing to a replicated object and doing a compaction cannot occur concurrently, which suggests that each nursery requires a lock. In theory this is distressing, as nursery collections are no longer lock-free, but the critical section for updating a replica perhaps isn't too long.
7:43:20
moon-child
ideally the nursery compactor would just fight with the other thread for the specific replicated object
7:43:54
hayley
Hm, it might suffice to disable replication for global objects just before compaction; I'd have to re-read beach's paper on the collection algorithm. But more sinister is a situation like T1 starts a DCAS, T2 starts compacting, T1 now updates into somewhere random, because it didn't know that the object is no longer being replicated.
7:44:19
hayley
ACTION grumbles how many times she says "replicated object", when it has nothing to do with her published description of a "replicated object system".
7:45:42
moon-child
see, this is why I was dubious of letting threads peek into each other's nurseries :P
7:46:16
moon-child
t1 and t2 race to set a flag on that one, and whichever one wins gets to diddle or move it
7:47:33
hayley
Okay, so the nursery collector has to lock every object before compacting? It's quite likely to win if we disable replication just before, and it'd only lose if some other thread races it, which is unlikely.
7:51:14
moon-child
hmm. So a couple more things: 1, maybe replicated objects should always be moved into the global heap following a minor gc (so at that point you just flush your write and then set a forwarding pointer); 2, alternately, can we demote a replicated object back to a nursery object?
7:56:37
moon-child
hmm, it can be done with a single atomic: reserve a lock bit in the pointer from the global replica to the local one. I don't know how much that helps. But I think the main important point is that if you lose the lock, you can keep doing work; don't have to wait for it
7:59:16
moon-child
if the nursery is semispace rather than sliding, then you can just leave the old object there, and don't care if another thread reads a stale pointer
7:59:31
moon-child
only have to synchronise with the _next_ minor collection, which should not be a problem
8:00:33
hayley
Hm, now I don't know if that DCAS will be atomic with respect to ordinary reads. If we have to do something fancy to make reads synchronise properly, we may as well have some other sort of read barrier.
8:07:58
moon-child
you are probably right. But I still suspect that whatever memory model is ultimately adopted will be weak enough that it doesn't have to be atomic
8:12:12
hayley
Perhaps so. But then, how do we have to behave around a fence? I was hoping to have atomic updates, to avoid the hard problems.
8:13:01
moon-child
fence could check if any updates have been performed on objects within this nursery
8:14:15
moon-child
spitballing--maybe fencing can increment some generation number, and if another thread wants to mutate a replicated object and its generation number is too old, it has to do extra work?
8:14:46
hayley
Though, I could not say what to do if the micro-architecture hates us, or we otherwise keep failing to commit.
8:17:26
hayley
We could predict that, based on how caches and transactional memory tend to work, this scheme won't spuriously fail, but there is no guarantee in the architecture specification, I think.
8:19:44
beach
So we don't have hardware transactional memory, but do we have speculative lock elision? If so, locks might be very inexpensive.
8:20:57
hayley
Perhaps in our dreams. Intel had speculative lock elision, but I think they had to disable it due to bugs.
8:22:30
hayley
However, locking wouldn't help in this situation either. The plan I had was to have some sort of double-CAS, such that unsynchronised reads would observe either the old or the new value in both replicas, without the possibility to read the old in one replica, and the new in another replica.
8:23:59
hayley
Wikipedia claims there will be a "TSXLDTRK" instruction set for Xeon processors set to be released in the second half of this year. Of course, the issue is that it will be a feature you only get on server CPUs, which will not be useful for many users.
8:27:04
moon-child
in fairness, computer scientists (like physicists) are known for being absolutely awful at naming things
8:39:12
hayley
It would appear Sapphire does unsynchronised writes. It has to replicate writes of unboxed slots, too.
8:53:58
hayley
At the least, we've established that fixing up pointers incrementally is quite similar to replication copying. And the description of Sapphire in the paper already assumes there can be a part of the heap we don't want to copy, and that the whole world doesn't need to be stopped for phase changes, so hopefully we don't have to invent too much now.
8:55:03
hayley
The paper claims "Programs with data races may exhibit new behaviors under Sapphire, but the effects are similar to those allowed when optimizing ([12, p. 378]). Further, the values stored into any field are ones stored by some thread, so Sapphire cannot violate Java type-safety. In any case, for properly synchronized programs, Sapphire produces results consistent with the Java Virtual Machine Specification." So I think we are safe in that regard.
9:18:04
hayley
Hm, the...other paper by Hudson and Moss claims that volatile slots in Java are handled by ensuring that the operations are performed on the newspace replica, if one exists. This is a bit like a read barrier, though it might not appear in generated code many times, if we only have to emit it near barriers.
9:20:19
hayley
Good question. I'm only certain that we need the read barrier when a fence is performed.
9:21:35
moon-child
I don't disagree, but you don't know, at any given point, if a fence was in fact performed
9:22:58
moon-child
hmmm. Maybe compile two versions of every function: one with read barriers, and one without. After you perform a fence, you osr to the read-barrier code. When you do a minor gc, you switch back
9:23:28
moon-child
if doing a lot of fences, you should stay in the read-barrier code pretty much all the time (and maybe the minor gc has some heuristic to not switch back then), so you don't pay so many return mispredicts
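A toy rendering of that idea, with the OSR replaced by swapping an indirection cell; a real implementation would patch code rather than funcall through a variable. This builds on the REPLICA and HEALING-READ sketches above.

    ;; Two versions of a slot read, selected through an indirection; a
    ;; fence switches to the barrier version, a minor GC switches back.
    (defun read-slot/no-barrier (r)
      (replica-local r))

    (defun read-slot/barrier (r)
      (healing-read (replica-local r)))

    (defvar *read-slot* #'read-slot/no-barrier)

    (defun read-slot (r)
      (funcall *read-slot* r))

    (defun after-fence ()
      (setf *read-slot* #'read-slot/barrier))

    (defun after-minor-gc ()
      (setf *read-slot* #'read-slot/no-barrier))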
9:25:18
hayley
I probably should see what the Sapphire people did. The paper only mentions non-volatile and volatile fields in Java, and nothing on other sorts of atomic access.
9:33:09
hayley
Maybe the issue could be avoided by making sure a mutator flips its roots to newspace, before fencing. But that doesn't help us at all, as we are trying very hard to not flip roots.
9:57:16
hayley
My wild guess at the moment is that the read barrier is only necessary when performing atomic updates, but I have no idea.
10:01:09
hayley
But then something like (let ((x 'old)) (atomically-setf x 'new) x) could return either OLD or NEW in the resulting memory model, which is pretty weird.
10:20:43
hayley
...no it couldn't, what am I saying? One "issue" occurs when a replicated write occurs concurrently with two unsynchronised reads to either replica, and the reads read both the old and new value. But this is fine, as the write can be observed to happen in between reads.
10:24:57
hayley
Another issue occurs when two threads try to perform replicated writes concurrently, in which case the replicas can remain out of sync indefinitely. I think we could get away with locking, if the critical section is only two writes, and the locks have fine granularity (say, one lock bit per 64 bytes, for arbitrary values of 64). A similar thing applies for atomic updates (though, if the user wants lock-free updates, it might be nice to give them that).
10:29:13
hayley
And, of course, if we spend too much time locking while performing replicated writes, we can cut our losses and just do a minor GC/promotion. Though the first issue might be a real issue, if we observe the new value, and then the old value again, by reading replicas in the opposite order to which they were written. The memory model would likely allow reordering, but not across a fence, which is the real problem here.
10:49:38
hayley
I have a vague idea, but so far there isn't a chance this idea can be done efficiently. It might suffice for a fence to ensure that all locks have been released, i.e. there are no replicated writes occurring concurrently with the fence. Therefore it is not possible for reads occurring after the fence to miss any writes which occurred before the fence, I think.
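Sketched against the REPLICA lock above, and ignoring the race where a lock is re-taken after being inspected, such a fence might look like this (SB-THREAD:BARRIER is SBCL's fence macro; *REPLICAS* is an invented table of currently replicated objects):

    ;; Wait for in-flight replicated writes to drain, then fence.
    (defvar *replicas* '())  ; the currently replicated objects

    (defun replicating-fence ()
      (dolist (r *replicas*)
        (loop while (replica-lock r)))  ; two-write critical sections end
      (sb-thread:barrier (:memory)))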
10:50:45
hayley
Though, again, replication copying is hopefully uncommon and required of very few objects, so a fence might just require iterating over a table of replicated objects, if there are any replicated objects to deal with.
11:09:39
hayley
That would look like thread-local heaps for mutable objects, without having to fix every root when the write barrier is tripped. We would only have to invent our own replica coherency protocol in software.
12:10:55
hayley
Perhaps this scheme is too slow in practice; I have no idea and would have to test. Speculative lock elision could be useful, if we get a working implementation in the rest of this century. And we can still decide to avoid local allocation, or stomach a minor GC, if the time overhead of replication is intolerable.
13:28:32
scymtym
beach: i have already written down some requirements for the environment library. i can try to make those available but i'm not sure whether i will manage to do it today
13:29:22
beach
Great! No rush. I decided Clostrum is good enough for me to continue working on bootstrapping, so I won't work on it for a while.