freenode/#sicl - IRC Chatlog
1:57:28
Bike
::notify heisig the other trucler thing i probably need for cleavir to use it is being able to store arbitrary optimize info, like for client dependent qualities... i'll write a PR for that too i guess
5:23:59
no-defun-allowed
I can't remember exactly; are floating point addition and multiplication commutative? I know they are not associative, but I can't recall commutativity.
5:26:31
no-defun-allowed
With the existence of NaN, I understand that e.g. 1 + NaN ≠ NaN + 1, but then that reduces to NaN ≠ NaN, so it's not as if anything changed.
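A quick numeric check (an editorial sketch, not from the log) of the point being made: IEEE-754 addition and multiplication are commutative for any pair of operands in the sense that both orders produce the same value; with NaN the results compare unequal only because NaN is unequal to itself, while associativity genuinely fails.

```python
import math
import random

# Commutativity holds numerically for ordinary operands.
random.seed(0)
for _ in range(1000):
    x = random.uniform(-1e308, 1e308)
    y = random.uniform(-1e308, 1e308)
    assert x + y == y + x
    assert x * y == y * x

# With NaN, both orders yield NaN; the comparison 1 + NaN != NaN + 1
# is "true" only in the degenerate sense that NaN != NaN.
nan = float("nan")
assert math.isnan(1.0 + nan) and math.isnan(nan + 1.0)

# Associativity, by contrast, really does fail:
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)
```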
5:27:39
jackdaniel
heisig: I once started writing a test suite for cltl2 env accessors and to my surprise sbcl's implementation also had plenty of issues (afair it was mostly related to querying);
5:35:02
beach
This is what I watched this morning for my daily exercise, and I found it interesting: https://www.youtube.com/watch?v=9-IWMbJXoLM because the speaker was essentially telling the participants of a Linux conference that Unix and C are not so great, and that we should strive for better things.
5:38:49
beach
Yeah, though I attended the Australian Linux conference when it was held in Dunedin. :)
5:39:41
no-defun-allowed
Holy crap, that's a lot of files to do...something with USB devices which I forgot.
5:42:50
lukego
I've been meaning to look at LLVM one of these days. I also have no desire to interface with it via C++ but could potentially be interested in using its textual IR representation as a target for something.
5:47:37
no-defun-allowed
Also, is there a problem with defining the binary floating-point instructions to be subclasses of BINARY-OPERATION-MIXIN?
5:48:37
beach
That mixin was meant to encode the restriction of the x86 that the destination and the first operand are the same.
5:49:26
no-defun-allowed
That is also the case for SSE floating point instructions - I thought we agreed that the three-address AVX instructions would be too new.
5:50:45
no-defun-allowed
AVX was 2011, AVX2 was 2013. Now I need to double check if the three-address instructions were AVX or AVX2.
5:57:07
no-defun-allowed
I don't have a machine that doesn't have AVX instructions (though, again, some don't have AVX2), and the Steam hardware survey states that 94.77% of computers surveyed support AVX. So I suppose it should be fine.
5:59:58
beach
Also, remember that we want to do something simple, just to get an executable system soon-ish. We absolutely have to count on others implementing more things, once that first step is done.
6:04:41
no-defun-allowed
Right. However, I've now noticed that we at least have to perform the COMMUTATIVE-MIXIN preprocessing for addition and multiplication, as only the second input can be an immediate* with three-address instructions.
6:05:29
no-defun-allowed
*And by "immediate" I mean that it will have to be encoded as a memory input (whatever "m64" is called) which loads some constant value.
6:06:20
no-defun-allowed
A similar transform to what I wrote for integer multiplication and division would be done for floating-point subtraction and division.
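The preprocessing described above can be sketched roughly as follows (hypothetical names, not SICL's actual IR classes): for commutative operations like addition and multiplication, a constant in the first input position is handled by swapping the inputs, while for non-commutative subtraction and division the constant is first loaded into a fresh register, analogous to the integer transform mentioned.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for IR datums; invented for illustration.
@dataclass
class Const:
    value: float

@dataclass
class Reg:
    name: str

COMMUTATIVE = {"fadd", "fmul"}

def canonicalize(op, in1, in2, fresh_reg):
    """Ensure any constant input ends up in the second operand position,
    the only position that can be a memory/constant operand in the
    instruction encodings discussed above.
    Returns (prelude_instructions, op, in1, in2)."""
    if isinstance(in1, Const):
        if op in COMMUTATIVE:
            # fadd/fmul: just swap the inputs.
            return [], op, in2, in1
        # fsub/fdiv are not commutative: load the constant into a
        # fresh register first.
        tmp = fresh_reg()
        return [("load-constant", tmp, in1)], op, tmp, in2
    return [], op, in1, in2

prelude, op, a, b = canonicalize("fadd", Const(1.0), Reg("x"), lambda: Reg("t0"))
assert (op, a, b) == ("fadd", Reg("x"), Const(1.0)) and prelude == []
```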
6:55:11
lukego
I'm reading some SICL papers. Really fun! One is so used to reading papers from the 1980s about this stuff in a historical "what they were thinking at the time" context but not current work :)
6:56:42
lukego
I've only had a quick read through the call-site optimization paper, but when it talks about eliminating a memory data load (for accessing the symbol-function), is this at the expense of adding a memory code load (for the heap-allocated snippet object)? And if so, might that be a net loss, because an OoO CPU can better mitigate data latency than control latency (or can this be predicted in practice)?
6:56:54
beach
What I find "amusing" is how much current Common Lisp implementations are based on technology that is no longer the norm.
6:57:01
lukego
I'm not sure I've understood though, have to take another read. is this implemented btw?
6:57:53
lukego
Sorry, maybe it's control latency in both cases, i.e. you are loading the symbol-function in order to branch to it, so you can't branch until it's loaded.
6:58:35
lukego
I was thinking in LuaJIT terms where the compiler emits a hard-coded branch to the function definition it expects but guards that with a test-and-branch on the symbol-function (so to speak) to detect when this is invalid.
7:00:00
lukego
though LuaJIT has its own whole bag of tricks so it might not be an awful idea to compare notes a bit anyway.
7:03:11
lukego
When you redefine a function then couldn't you just recompile all of its callers at the same time? (I guess this doesn't need to be transitive if their own definitions haven't changed - you might have to patch callers-of-callers to the address of the new definition but it should be compatible)
7:03:54
beach
lukego: And, it is not implemented. But I am pretty sure it's a win, because the only additional work being done is with the two jumps. And we save at least 1 (in SICL, more like 4) memory loads.
7:05:01
beach
You can't recompile a caller from source. It would have to be from a minimally-compiled version of it. But that would take a lot of time because of compiler optimization. And all you would win would be two "free" jumps.
7:07:01
lukego
but again that's me being a LuaJIT hat wearer and wanting to inline everything everywhere.
7:07:33
beach
lukego: Function calls in a normal setting must have an indirection so as to allow for late binding.
7:07:55
beach
lukego: And you need to load the entry point and the static environment from the function object.
7:08:17
lukego
beach: Sort-of, right? I mean you can also do late binding by patching the early-bound code. "become:" in Smalltalk parlance.
7:08:58
beach
That is kind of what I am doing. The snippet is technically part of the caller, and it is patched when the callee changes.
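The idea beach describes can be illustrated with a toy analogue (editorial sketch only; the actual SICL mechanism patches machine-code snippets, not Python objects): each call site holds a direct reference to the callee, and redefining the function patches every registered call site, so individual calls pay no per-call indirection through a symbol-function cell.

```python
class Callee:
    """A function together with the call sites that reference it."""
    def __init__(self, fn):
        self.fn = fn
        self.call_sites = []

    def redefine(self, fn):
        self.fn = fn
        for site in self.call_sites:   # patch, like rewriting the snippet
            site.target = fn

class CallSite:
    def __init__(self, callee):
        self.target = callee.fn        # "early bound" direct reference
        callee.call_sites.append(self)

    def __call__(self, *args):
        return self.target(*args)      # direct call, no lookup

double = Callee(lambda x: 2 * x)
site = CallSite(double)
assert site(21) == 42
double.redefine(lambda x: 3 * x)       # late binding via patching
assert site(21) == 63
```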
7:09:09
lukego
Can't help but think that CPU capacity and memory bandwidth have been increasing exponentially while the amount of code allocated on the heap has not. So compared with 20 years ago it must be quite cheap now to say "let's visit every FUNCTION object on the heap and ..."
7:10:13
beach
Because, depending on the callee, you would need to modify the operations of the call sequence.
7:11:02
lukego
but can be mitigated, no? LuaJIT inlines literally every function call but with no loss of late binding nor debug information
7:12:20
beach
So the question then is: what if the new callee requires more code to be called than the previous one? Do you move the remaining code? That would end up being very close to recompiling the caller.
7:12:47
lukego
Yeah. And LuaJIT doesn't have a good answer for this. Actually *doing* late binding stuff really messes with performance in practice.
7:12:47
moon-child
in a jit, you would generate an 'unoptimized' caller which performs an indirect jump to the callee, and then an 'optimized' version which inlines the callee. When the callee changes, the optimized version of the caller is invalidated and you fall back to the unoptimized version
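That invalidation scheme can be sketched in miniature (hypothetical, not modeled on any particular JIT): the "optimized" caller runs an inlined snapshot of the callee behind a validity flag, and redefining the callee flips the flag so calls fall back to the indirect "unoptimized" path.

```python
# A global function table standing in for the symbol-function indirection.
functions = {"f": lambda x: x + 1}

class OptimizedCaller:
    def __init__(self):
        self.valid = True
        self.inlined = functions["f"]    # snapshot taken at compile time

    def __call__(self, x):
        if self.valid:
            return self.inlined(x)       # optimized: inlined fast path
        return functions["f"](x)         # deoptimized: indirect call

caller = OptimizedCaller()
assert caller(1) == 2
functions["f"] = lambda x: x * 10        # redefine the callee ...
caller.valid = False                     # ... and invalidate the caller
assert caller(1) == 10
```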
7:14:59
lukego
Hard to think about this stuff abstractly though. In a given application there will be /something/ limiting performance at the CPU level. Instruction fetching? Data fetching? ALU resources? Branch mispredictions? Instruction window space? In every case some optimizations will help and some will hinder.
7:15:27
lukego
best solution is probably to have well-defined optimizations that the application programmer can take into account. So kudos :)
7:16:15
moon-child
in the linux kernel, every compiled function begins with a NOP sequence, which can be runtime-patched into a direct jump, allowing the kernel to be upgraded without rebooting the system. I expect you could do something similar
7:16:23
beach
In this case, aside from cache effects as moon-child points out, my technique does strictly less work than the default case. So it is hard for me to see how I can lose.
7:17:54
no-defun-allowed
However, one should note that they don't use optimisations which make it look like your program isn't being run on a bytecode machine. So I suppose that is always possible for them.
7:18:21
lukego
beach: I dunno. I feel like "cache effects" and "speculative execution" are the first-order problems and e.g. the number of instructions executed is second order. You have to worry about what hazards can occur in transferring execution from the caller to the callee. Any branch via memory load is a bit scary surely.
7:18:49
no-defun-allowed
Rewording: they don't optimize in ways which prevent the creation of an equivalent virtual machine state.
7:19:06
beach
lukego: Branch via memory load is what is traditionally done. My technique avoids that.
7:19:36
lukego
I thought you have an extra branch? from caller to trampoline, from trampoline to callee?
7:20:50
lukego
oh right, both branches are like that, right? okay, that does start to sound quite nice :)
7:21:13
beach
moon-child: That NOP trick would not be useful for named calls, because the callee can be redefined arbitrarily.
7:22:10
no-defun-allowed
May I suggest taking a gander through "The design and implementation of the Self Compiler" by Craig Chambers, particularly section 13.2? That section covers how they handle redefining inlined functions.
7:22:38
beach
lukego: And when the "number of instructions" includes multiple loops over the list of arguments in order to parse keyword arguments, then I think the number of instructions becomes quite relevant indeed.
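To make the cost concrete (an editorial illustration, not SICL code): generic keyword-argument parsing scans the rest-list once per keyword, whereas a call site compiled against a known lambda list could pass the values positionally with no scanning at all.

```python
def parse_keys(rest, keywords, defaults):
    """Generic &key parsing: one pass over the rest-list per keyword.
    The leftmost occurrence of a keyword wins, as in Common Lisp."""
    out = []
    for kw in keywords:
        value = defaults[kw]
        for i in range(0, len(rest), 2):   # scan keyword/value pairs
            if rest[i] == kw:
                value = rest[i + 1]
                break
        out.append(value)
    return out

assert parse_keys([":end", 5, ":start", 2],
                  [":start", ":end"],
                  {":start": 0, ":end": None}) == [2, 5]
```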
7:23:01
lukego
beach: okay yeah this technique makes a lot of sense to me now :). one reasonable question is whether the work saved by the trampoline is worth the additional branch - icache locality argument - but if I were a betting man I'd reckon so.
7:23:46
no-defun-allowed
i.e. from page 168 (as the PDF viewer thinks it is, or page 154 on the paper) of <http://www.wolczko.com/tmp/ChambersThesis.pdf>
7:23:48
moon-child
it occurs to me that if you allocate all the trampolines in the same arena, you would get fairly good locality
7:23:59
moon-child
particularly if many of them are the standard snippet, which will have uniform size
7:24:15
lukego
and if it means people can stop writing hand-optimized compiler macros for the sake of &key processing etc then that's a massive win for the psychological wellbeing of the application programmer.
7:24:24
no-defun-allowed
And a friend and I think it is a very well written thesis, for what it's worth.
7:27:46
beach
moon-child: There is no real connection. It is just handy that code won't move, for things like the instruction cache.
7:28:11
beach
But the important part here is that threads don't have to be patched when the global GC is running.
7:31:43
lukego
Thank you for indulging these shoot-from-the-hip questions. It's very interesting work that you are doing.
7:32:03
beach
Another thing that I told drmeister about this morning is that I can now trace CAR. ...
7:32:37
beach
If I don't inline CAR, and instead put it in the snippet, when someone wants to trace it, the snippets could be altered to do a normal (traced) call.
7:33:57
beach
I was using CAR as an example, because you can't really take advantage of any knowledge of the return value, at least not in most cases.
7:40:48
Colleen
heisig: Bike said 10 hours, 16 minutes ago: i think an actual define-declaration analog would be out of scope for trucler. but trucler could have a function to read implementation-defined info for a user defined declaration, and maybe one to augment
7:40:48
Colleen
heisig: Bike said 5 hours, 43 minutes ago: the other trucler thing i probably need for cleavir to use it is being able to store arbitrary optimize info, like for client dependent qualities... i'll write a PR for that too i guess
7:56:31
heisig
::notify Bike The question is, should Trucler include functions for reading implementation-defined optimize info? Since they are implementation-defined, that particular implementation can simply subclass optimize-description and provide custom accessors.
9:38:26
heisig
Is 32.5 (error handling in standard functions) still up to date? It says that in SICL, standard functions shouldn't call other standard functions for the sake of precise error reporting.
9:44:04
heisig
Heh, also 32.7 (compiler macros) has been superseded by the recent advances in optimizing call sites :)
9:44:39
heisig
I am just pointing this out because it might confuse newcomers. They won't be able to tell which rule of the style guide is still relevant.
9:45:33
heisig
I am thinking of either rewriting them, or deleting them, or marking them as obsolete/work-in-progress.
9:45:42
beach
Now that no-defun-allowed is working on register allocation, I am trying to update the specification.