libera/#sicl - IRC Chatlog
6:37:32
hayley
On the previous topic of free-list versus sequential allocation, the Garbage Collection Handbook states "Blackburn et al [in <https://users.cecs.anu.edu.au/~steveb/pubs/papers/mmtk-sigmetrics-2004.pdf>] showed that the difference in cost between sequential and free-list allocation is small (accounting for only 1% of total execution time) and is dominated by the second order effect of improved locality, particularly for young objects which benefit [...]"
6:39:11
beach
But if "objects that are allocated together die together", then they will be sequential also with a free list.
6:43:05
hayley
Right, yes. Could a small proportion of objects that are allocated together not dying together make allocation with a free list less sequential? I haven't gotten around to writing a non-sequential allocator for practice.
6:43:51
beach
The argument was that locality improves performance because the processor will prefetch the next words in memory, right?
6:44:21
moon-child
'they will be sequential also with a free list' not necessarily. Suppose I allocate 9 conses 012345678, and then free them all. Then I allocate a list of 3 cons cells X (012) and a list of 3 cons cells Y (345). Then Y is freed, then X is freed. Now the freelist is 345012678. Then I allocate a length-9 list, and it is fragmented
6:44:48
moon-child
beach: explicit prefetch does work; OCaml people found it improved the performance of list iteration, but it is not a panacea
6:45:46
hayley
*prefetching over list iteration - moon-child raced with me, and I ended up reading bogus mental data.
6:47:51
hayley
(Java also prefetches after performing an allocation, which improves allocation performance; prefetching would probably speed up allocation in other runtimes too. But Blackburn et al found that changes in allocation throughput did not affect application throughput much, as little time is spent allocating.)
6:54:23
moon-child
but, I thought the plan was to depend on avx2 anyway, for the three-address float ops?
6:55:34
hayley
I'm going to play around with prefetching after allocation in SBCL, and I don't think prefetch instructions require AVX2.
6:57:33
moon-child
(nb. seems I misread the AMD manual. There is an AMD-specific instruction, but the normal prefetches are always there)
6:58:15
hayley
According to <https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/> HotSpot uses PREFETCHNTA, but I'll just test with PREFETCHT0 for now.
7:12:48
hayley
I can't tell if any prefetch instructions help in SBCL just yet. Maybe I'll run cl-bench and find out that it again runs 0.7% faster on average, with no noticeable patterns in which benchmarks run faster. :)
7:14:43
beach
hayley: What cost did Blackburn et al attribute to copying objects to improve locality?
7:19:16
hayley
A non-generational copying collector takes more time to collect than a non-generational mark-sweep collector, unless a very large heap is used, but a generational collector with a copying nursery and mark-sweep collector for matured space performs the best.
7:21:08
hayley
Notably "cache measurements reveal that the spatial locality of objects allocated close together in time is key for nursery objects, but not as important for mature objects."
7:24:41
hayley
My first statement was my interpretation of the graphs in Figure 1 on page 7 of that paper. I think those are measurements.
7:25:09
moon-child
it doesn't follow from the generational hypothesis that most accesses are to young objects rather than old ones
7:26:51
hayley
Does it follow from the "most writes aren't dead" hypothesis, which came up when we were discussing what Gil Tene told me about read barriers? If most objects die young, and most objects are used somehow...
7:32:40
hayley
From the conclusion: "As a corollary, although many accesses go to mature objects, their performance relies on temporal locality, whereas in the nursery, allocation order provides good spatial locality for young objects that die quickly."
7:35:02
hayley
I guess, if most objects die young, then the temporal locality of accessing young objects may be poor. And there are measurements of the ages of accessed objects, and of the distribution of accesses to objects in Table 1 on page 5.
7:41:14
hayley
For what it's worth, I wouldn't be opposed to having CAS with an equality predicate other than EQ produce a loop, as it shouldn't affect lock-freedom, assuming...some sort of fairness for which thread "wins" a compare-and-swap.
11:28:38
hayley
So, the main arguments are that it is difficult to implement atomic updates with mmap, and that TLB shootdowns can be slower than syscalls?
11:31:31
hayley
There is a similar problem to the latter when implementing concurrent garbage collectors using hardware protection, wherein many mprotect calls are made in a batch, but the operating system flushes(?) TLBs after every call. And I think there is also a tradeoff with calling mmap in a grep implementation too.
11:32:11
hayley
Conventional wisdom for the latter is to read whole files into memory if they are small, and mmap if they are large.
11:32:46
hayley
Isn't madvise with the sequential option supposed to do (linear) prefetching? But I'm not sure about any other patterns.
11:33:28
moon-child
but when you know what you're going to look at, you can do better. They give an example in the paper. See also ocaml list iteration, mentioned recently :)
11:36:26
moon-child
but atomicity & transactions are the main thing I worry about. Was thinking about them a bit ago too. I have no idea how you make that work well in a large system. Databases tend to be self-contained, but ideally you also want to maintain coherent inter-process state
11:38:32
hayley
I've discussed it with someone before, and we thought to allow programs to force the supervisor to generate a checkpoint, which would ensure durability (insofar as the hardware is durable too). The other three parts of ACID are harder, but we can just use software transactional memory.
11:40:38
hayley
If you checkpoint threads too, and the supervisor automatically generates a checkpoint while a transaction is being produced, then I would expect that threads will roll forward as usual after a crash, unless some non-determinism influences execution.