libera/#sicl - IRC Chatlog
Search
3:09:43
moon-child
so, sbcl supports ATOMIC-INCF on conses. At high safety levels, it will generate a cas loop; at low safety, it will take advantage of XADD (not checking whether the object pointed-to is actually a fixnum). Presumably the latter behaviour is not desirable for sicl. So atomic-incf
3:35:06
Bike
couldn't you do a cas loop generally, and optimize to xadd or whatnot when type inference can prove it's a fixnum
3:37:01
moon-child
I was going to say: I can't imagine there's any meaningful case where you could prove the car of some cons was a fixnum, and you would also want to do atomics on it
3:37:33
moon-child
because it would necessarily shared, so any alias analysis you might try would be nil. Would need global analyses, which I do not think sicl intends to perform
3:38:28
moon-child
but I realised you might be able to perform such an inference in one case: when the cons is only available to closures
3:42:12
Bike
also, this would also apply to more general kinds of things, like standard objects or arrays
3:46:21
moon-child
yeah, just specifically thinking of atomics as a case where hardware acceleration is more significant. Type declaration I don't think would be very helpful
3:47:52
moon-child
(defclass c () ((x :type (cons fixnum fixnum)))) (defparameter *x* (make-instance 'c)). Ok. Now you may protect against (setf (car (slot-value *x* 'x)) 'not-a-fixnum). But how do you protect against (let ((x (slot-value *x* 'x))) (setf (car x) 'not-a-fixnum))?
4:11:48
Bike
moon-child: yeah that's near impossible. i was thinking the more obvious case of a slot being declared fixnum.
4:14:13
Bike
ok but what i meant was that without the type declaration you could still do atomic-incf with the cas loop.
4:15:47
moon-child
hm, I guess I was not entirely clear initially. I don't think there should be a primop for atomic incf on conses, if there's no context where it would generate better code than the cas loop. Cas loop can live in userspace, but a portability library can make up the difference
4:22:45
Bike
i haven't figured out the best way to represent atomics in IR. what i have now is MIR operators to do atomic loads/stores/CAS on addresses. then if you cas a car or whatever it just reduces the car to address operations.
4:24:23
moon-child
I see. I was thinking about the duplication of operators that would be necessary in a more explicit approach
4:26:41
Bike
i also haven't yet implemented atomics for lexical variables, which seems important to completeness, but might require changing a lot of things throughout the compiler
4:30:59
Bike
and i'm not even sure about exposing cas since it's kind of more "internal" than something like atomic-incf or atomic-update, and what if it's better to use load-link or something instead, and blah
4:32:01
Bike
hmm. now that i'm thinking about it maybe not exposing cas and only exposing atomic-update would be good
4:33:39
moon-child
re update: think the implementation should expose more primitive functionality, and leave more 'sugary' things to utility libraries. Also there is an existing portability library which exposes only cas
4:34:50
Bike
but what if a target machine has ll/sc instead of cas? you can emulate cas with ll/sc, but then it's not really "primitive" functionality any more
4:36:19
moon-child
sure. I mean, you always have to decide what to abstract. I think cas and ll/sc both have fairly straightforward implementations in terms of each other, so it's not a huge deal
4:44:09
Bike
i mean practically speaking, it's probably eq except fixnums and singles are eql, but then you effectively have a new predicate between eq and eql
4:45:45
moon-child
I don't think it's a new predicate between eq and eql. It's defining the behaviour of eq in a particular way (which the standard leaves unspecified)
4:46:02
moon-child
hence, your extension is not 'we have cas', but rather 'we have cas, and eq is guaranteed t for eql fixnums'
4:48:21
Bike
an atomics extension requiring a pretty much unrelated extension like that is the kind of thing that bothers my sense of cleanliness
6:37:32
hayley
On the previous topic of free-list versus sequential allocation, the Garbage Collection Handbook states "Blackburn et al [in <https://users.cecs.anu.edu.au/~steveb/pubs/papers/mmtk-sigmetrics-2004.pdf>] showed that the difference in cost between sequential and free-list allocation is small (accounting for only 1% of total execution time) and is dominated by the second order effect of improved locality, particularly for young objects which benefit
6:39:11
beach
But if "objects that are allocated together die together", then they will be sequential also with a free list.
6:43:05
hayley
Right, yes. Could a small proportion of objects that are allocated together not dying together make allocation with a free list less sequential? I haven't gotten around to writing a non-sequential allocator for practise.
6:43:51
beach
The argument was that locality improves performance because the processor will prefetch the next words in memory, right?
6:44:21
moon-child
'they will be sequential also with a free list' not necessarily. Suppose I allocate 6 conses 0123456789, and then free them all. Then I allocate a list of 3 cons cells X (012), a list of 3 cons cells Y (345). then Y is freed. then x is freed. Now the freelist is 345012678. Then I allocate a length-9 list, and it is fragmented
6:44:48
moon-child
beach: explicit prefetch does work, ocaml people found it improved the performance of list iteration, but it is not a panacea
6:45:46
hayley
*prefetching over list iteration - moon-child raced with me, and I ended up reading bogus mental data.
6:47:51
hayley
(Java also prefetches after performing an allocation, which improves allocation performance, and prefetching would probably also work to speed up allocation. But Blackburn et al found that changes in allocation throughput did not affect application throughput much, as little time is spent allocating.)
6:54:23
moon-child
but, I thought the plan was to depend on avx2 anyway, for the three-address float ops?
6:55:34
hayley
I'm going to play around with prefetching after allocation in SBCL, and I don't think they require AVX2.
6:57:33
moon-child
(nb. seems I misread the amd manual. There an amd-specific instruction, but the normal prefetches are always there)
6:58:15
hayley
According to <https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/> HotSpot uses PREFETCHNTA, but I'll just test with PREFETCHT0 for now.
7:12:48
hayley
I can't tell if any prefetch instructions help in SBCL just yet. Maybe I'll run cl-bench and find out that it again runs 0.7% faster on average, with no noticeable patterns in which benchmarks run faster. :)
7:14:43
beach
hayley: What cost did Blackburn et al attribute to copying objects to improve locality?
7:19:16
hayley
A non-generational copying collector takes more time to collect than a non-generational mark-sweep collector, unless a very large heap is used, but a generational collector with a copying nursery and mark-sweep collector for matured space performs the best.
7:21:08
hayley
Notably "cache measurements reveal that the spatial locality of objects allocated close together in time is key for nursery objects, but not as important for mature objects."
7:24:41
hayley
My first statement was my interpretation of the graphs in Figure 1 on page 7 of that paper. I think those are measurements.
7:25:09
moon-child
it doesn't follow from the generational hypothesis that most accesses are to young objects rather than old ones
7:26:51
hayley
Does it follow from the "most writes aren't dead" hypothesis, which came up when we were discussing what Gil Tene told me about read barriers? If most objects die young, and most objects are used somehow...
7:32:40
hayley
From the conclusion: "As a corollary, although many accesses go to mature objects, their performance relies on temporal locality, whereas in the nursery, allocation order provides good spatial locality for young objects that die quickly."
7:35:02
hayley
I guess, if most objects die young, then the temporal locality of accessing young objects may be poor. And there are measurements of the ages of accessed objects, and of the distribution of accesses to objects in Table 1 on page 5.
7:41:14
hayley
For what it's worth, I wouldn't be opposed to having CAS with an equality predicate other than EQ produce a loop, as it shouldn't affect lock-freedom, assuming...some sort of fairness for which thread "wins" a compare-and-swap.