freenode/#sicl - IRC Chatlog

1:57:28 Bike ::notify heisig the other trucler thing i probably need for cleavir to use it is being able to store arbitrary optimize info, like for client dependent qualities... i'll write a PR for that too i guess

1:57:28 Colleen Bike: Got it. I'll let heisig know as soon as possible.

3:04:38 beach Good morning everyone!

3:05:15 no-defun-allowed Good morning beach!

5:23:59 no-defun-allowed I can't remember exactly; are floating point addition and multiplication commutative? I know they are not associative, but I can't recall commutativity.

5:24:47 beach Good question. I suppose they are.

5:25:25 beach lukego: Did you see my answer about LLVM?

5:26:31 no-defun-allowed With the existence of NaN, I understand that e.g. 1 + NaN ≠ NaN + 1, but then that reduces to NaN ≠ NaN, so it's not as if anything changed.

5:27:39 jackdaniel heisig: I once started writing a test suite for cltl2 env accessors and to my surprise sbcl implementations also had plenty of issues (afair it was mostly related to querying);

5:27:39 beach I tend to agree.

5:28:02 beach no-defun-allowed: ^

5:28:09 no-defun-allowed Right, thanks.

5:35:02 beach This is what I watched this morning for my daily exercise, and I found it interesting: https://www.youtube.com/watch?v=9-IWMbJXoLM because the speaker was essentially telling the participants of a Linux conference that Unix and C are not so great, and that we should strive for better things.

5:36:53 beach no-defun-allowed: I think he might be Australian, or possibly Kiwi.

5:38:16 no-defun-allowed Quite likely at an Australian Linux conference. I can hear it though.

5:38:49 beach Yeah, though I attended the Australian Linux conference when it was held in Dunedin. :)

5:39:41 no-defun-allowed Holy crap, that's a lot of files to do...something with USB devices which I forgot.

5:40:32 beach Yeah, I didn't understand all that, but I got the gist of it.

5:40:41 beach He is a good speaker I think.

5:41:19 lukego beach: I did now. thanks!

5:41:28 beach Sure.

5:42:50 lukego I've been meaning to look at LLVM one of these days. I also have no desire to interface with it via C++ but could potentially be interested in using its textual IR representation as a target for something.

5:43:07 lukego but I hear you on compilation time and overall complexity and such.

5:43:11 beach Sure, I understand.

5:47:37 no-defun-allowed Also, is there a problem with defining the binary floating-point instructions to be subclasses of BINARY-OPERATION-MIXIN?

5:48:07 beach I believe there is.

5:48:37 beach That mixin was meant to encode the restriction of the x86 that the destination and the first operand are the same.

5:49:03 beach But for floating point, we are using 3-"address" instructions as I recall.

5:49:26 no-defun-allowed That is also the case for SSE floating point instructions - I thought we agreed that the three-address AVX instructions would be too new.

5:49:44 beach Did we?

5:49:48 beach I don't remember that.

5:49:58 beach How new are they?

5:50:45 no-defun-allowed AVX was 2011, AVX2 was 2013. Now I need to double check if the three-address instructions were AVX or AVX2.

5:51:15 beach I think that's old enough that we should use them.

5:51:25 no-defun-allowed All AVX. Okay.

5:57:07 no-defun-allowed I don't have a machine that doesn't have AVX instructions (though, again, some don't have AVX2), and the Steam hardware survey states that 94.77% of computers surveyed support AVX. So I suppose it should be fine.

5:57:55 beach Yes, and that number will increase by the time we are done. :)

5:58:13 no-defun-allowed Sure :)

5:59:58 beach Also, remember that we want to do something simple, just to get an executable system soon-ish. We absolutely have to count on others implementing more things, once that first step is done.

6:04:41 no-defun-allowed Right. However, I've now noticed that we at least have to perform the COMMUTATIVE-MIXIN preprocessing for addition and multiplication, as only the second input can be an immediate* with three-address instructions.

6:05:15 beach Makes sense.

6:05:29 no-defun-allowed *And by "immediate" I mean that it will have to be encoded as a memory input (whatever "m64" is called) which loads some constant value.

6:06:20 no-defun-allowed A similar transform to what I wrote for integer multiplication and division would be done for floating-point subtraction and division.

6:06:28 beach Let's see. All non-trivial constants are already like that.

6:06:45 beach The only immediate inputs you need to deal with are small constant numbers.

6:07:57 no-defun-allowed To be clear, does "small constant numbers" include floating-point numbers?

6:08:14 beach I don't think so.

6:09:47 no-defun-allowed I'll go check then.

6:10:24 beach I am pretty sure.

6:10:40 beach It tests for integers and characters only.

6:10:57 no-defun-allowed Okay, thanks.

6:26:55 splittist Good morning the usual suspects

6:28:06 no-defun-allowed Good morning splittist.

6:36:50 beach Hello splittist.

6:55:11 lukego I'm reading some SICL papers. Really fun! One is so used to reading papers from the 1980s about this stuff in a historical "what they were thinking at the time" context but not current work :)

6:56:07 beach Thanks.

6:56:24 beach Yes, it's different.

6:56:42 lukego I've only had a quick read through the Call-site optimization paper but when it talks about eliminating a memory data load (for accessing the symbol-function) is this at the expense of adding a memory code load (for the heap-allocated snippet object)? and if so might that be a net loss because an OoO CPU can better mitigate data latency than control latency (or can this be predicted in practice?)

6:56:54 beach What I find "amusing" is how much current Common Lisp implementations are based on technology that is no longer the norm.

6:57:01 lukego I'm not sure I've understood though, have to take another read. is this implemented btw?

6:57:53 lukego Sorry maybe it's control latency in bother cases i.e. you are loading the symbol-function in order to branch to it so you can't branch until it's loaded.

6:58:35 lukego I was thinking in LuaJIT terms where the compiler emits a hard-coded branch to the function definition it expects but guards that with a test-and-branch on the symbol-function (so to speak) to detect when this is invalid.

6:58:56 beach Let me digest all that for a while...

6:59:23 lukego yeah sorry I am mixing up mental models. let me try to reframe that :)

7:00:00 lukego though LuaJIT has its own whole bag of tricks so it might not be an awful idea to compare notes a bit anyway.

7:01:29 lukego Let me scratch that whole question for a moment and ask a different one :-)

7:01:43 beach OK, that makes it easier for me. :)

7:03:11 lukego When you redefine a function then couldn't you just recompile all of its callers at the same time? (I guess this doesn't need to be transitive if their own definitions haven't changed - you might have to patch callers-of-callers to the address of the new definition but it should be compatible)

7:03:54 beach lukego: And, it is not implemented. But I am pretty sure it's a win, because the only additional work being done is with the two jumps. And we save at least 1 (in SICL, more like 4) memory loads.

7:05:01 beach You can't recompile a caller from source. It would have to be from a minimally-compiled version of it. But that would take a lot of time because of compiler optimization. And all you would win would be two "free" jumps.

7:06:30 moon-child the jumps are free in principle, but they harm locality

7:06:36 lukego Sure would be nice if function calls didn't require a memory load at all

7:06:49 moon-child important on modern architectures where cache is the categorical imperative

7:06:52 beach moon-child: I can see that, sure.

7:07:01 lukego but again that's me being a LuaJIT hat wearer and wanting to inline everything everywhere.

7:07:33 beach lukego: Function calls in a normal setting must have an indirection so as to allow for late binding.

7:07:44 lukego anyway thanks for the feedback I'll keep reading :)

7:07:55 beach lukego: And you need to load the entry point and the static environment from the function object.

7:08:17 lukego beach: Sort-of, right? I mean you can also do late binding by patching the early-bound code. "become:" in Smalltalk parlance.

7:08:58 beach That is kind of what I am doing. The snippet is technically part of the caller, and it is patched when the callee changes.

7:09:09 lukego Can't help but think that CPU capacity and memory bandwidth have been increasing exponentially while the amount of code allocated on the heap has not. So compared with 20 years ago it must be quite cheap now to say "let's visit every FUNCTION object on the heap and ..."

7:09:27 lukego Yeah, indeed, that's what got me thinking.

7:09:51 beach You could not patch the function object itself...

7:10:12 lukego I need to read the paper again and think about how it relates to inlining.

7:10:13 beach Because according to the callee, you would need to modify the operations of the call sequence.

7:10:34 beach Inlining is strictly more powerful, but defeats late binding.

7:11:02 lukego but can be mitigated, no? LuaJIT inlines literally every function call but with no loss of late binding nor debug information

7:11:34 beach Well, then you would have to patch the caller when the callee changes.

7:12:20 beach So the question then, what if the new callee requires more code to be called than the previous one? Do you move the remaining code? That would end up being very close to recompiling the caller.

7:12:47 lukego Yeah. And LuaJIT doesn't have a good answer for this. Actually *doing* late binding stuff really messes with performance in practice.

7:12:47 moon-child in a jit, you would generate a 'unoptimized' caller which performs an indirect jump to the callee and then an 'optimized' version which inlines the callee. When the callee changes, the optimized version of the caller is invalidated and you fall back to the unoptimized version

7:12:53 beach So by putting the code for the call in the snippet, I avoid such problems.

7:13:23 beach moon-child: I see.

7:13:54 beach How do they handle callers that are already on the stack being executed?

7:14:42 moon-child I don't know

7:14:53 beach Sounds messy.

7:14:59 lukego Hard to think about this stuff abstractly though. In a given application there will be /something/ limiting performance at the CPU level. Instruction fetching? Data fetching? ALU resources? Branch mispredictions? Instruction window space? In every case some optimizations will help and some will hinder.

7:15:27 lukego best solution is probably to have well-defined optimizations that the application programmer can take into account. So kudos :)

7:16:15 moon-child in the linux kernel, every compiled function begins with a NOP sequence, which can be runtime-patched into a direct jump, allowing to upgrade the kernel without rebooting the system. I expect you could do something similar

7:16:23 beach In this case, aside from cache effects as moon-child points out, my technique does strictly less work than the default case. So it is hard for me to see how I can lose.

7:16:42 no-defun-allowed In the case of Self, they reconstruct the stack and registers.

7:17:03 beach That sounds really messy.

7:17:21 beach But that would be a requirement for sane semantics.

7:17:54 no-defun-allowed However, one should note that they don't use optimisations which make it look like your program isn't being run on a bytecode machine. So I suppose that is always possible for them.

7:18:21 lukego beach: I dunno. I feel like "cache effects" and "speculative execution" are the first-order problems and e.g. number of instructions executed is second order. have to worry about what hazards can occur in transferring execution from the caller to the callee. any branch via memory load is a bit scary surely.

7:18:49 no-defun-allowed Rewording: they don't optimize in ways which prevent the creation of an equivalent virtual machine state.

7:18:50 lukego but then this stuff is all way too complicated to work out on irc anyway :)

7:19:06 beach lukego: Branch via memory load is what is traditionally done. My technique avoids that.

7:19:36 lukego I thought you have an extra branch? from caller to trampoline, from trampoline to callee?

7:20:03 beach Yes, but it's an unconditional jump to a fixed, constant, address.

7:20:17 beach Not via a memory load.

7:20:50 lukego oh right both branches are like that, right? okay that does start to sounds quite nice :)

7:21:02 beach Yes, both are like that.

7:21:13 beach moon-child: That NOP trick would not be useful for named calls, because the callee can be redefined arbitrarily.

7:21:22 beach That's what Common Lisp late binding does.

7:22:10 no-defun-allowed May I suggest taking a gander through "The design and implementation of the Self Compiler" by Craig Chambers, particuarly section 13.2? That section covers how they handle redefining inlined functions.

7:22:38 beach lukego: And when the "number of instructions" include multiple loops over the list of argument in order to parse keyword arguments, then I think the number of instructions becomes quite relevant indeed.

7:23:01 lukego beach: okay yeah this technique makes a lot of sense to me now :). one reasonable question is whether the work saved by the trampoline is worth the additional branch - icache locality argument - but if I were a betting man I'd reckon so.

7:23:11 beach no-defun-allowed: Good idea. I'll see when I get the energy to do that.

7:23:46 no-defun-allowed i.e. from page 168 (as the PDF viewer thinks it is, or page 154 on the paper) of <http://www.wolczko.com/tmp/ChambersThesis.pdf>

7:23:48 moon-child it occurs to me that if you allocate all the traampolines in the same arena, you would get fairly good locality

7:23:56 beach lukego: I am pleased that I was able to explain it in sufficient detail.

7:23:59 moon-child particularly if many of them are the standard snippet, which will have uniform size

7:24:12 beach moon-child: I can't control that easily.

7:24:15 lukego and if it means people can stop writing hand-optimized compiler macros for the sake of &key processing etc then that's a massive win for the psychological wellbeing of the application programmer.

7:24:24 no-defun-allowed And a friend and I think it is a very well written thesis, for what it's worth.

7:24:29 beach no-defun-allowed: Thanks for the link.

7:24:30 moon-child beach: why not?

7:25:08 beach Because the global heap is a Doug Leaa-type malloc/free heap.

7:25:55 beach I mean, I guess I could leave holes after the caller, but that sounds complicated.

7:26:27 moon-child why use such a heap design?

7:26:44 beach So that code won't move.

7:27:05 moon-child I don't see the connection

7:27:07 beach ... and so that I don't have to synchronize between threads.

7:27:46 beach moon-child: There is real connection. It is just handy that code won't move for things like instruction cache.

7:28:11 beach But the important part here is that threads don't have to be patched when the global GC is running.

7:28:26 beach So the global GC can be parallel and concurrent.

7:28:45 moon-child ahh

7:29:33 beach *There is NO real connection. At least not with the call-site stuff.

7:29:46 beach But the design of the global heap makes things so much easier.

7:31:20 beach lukego: Exactly. You will essentially get automatic compiler macros for free.

7:31:43 lukego Thank you for indulging these shoot-from-the-hip questions. It's very interesting work that you are doing.

7:31:52 beach Thank you. Pleasure!

7:32:03 beach Another thing, that I told drmeister about this morning was that I can now trace CAR. ...

7:32:37 beach If I don't inline CAR, and instead put it in the snippet, when someone wants to trace it, the snippets could be altered to do a normal (traced) call.

7:32:59 beach So the negative effects of inlining can be largely mitigated.

7:33:57 beach I was using CAR as an example, because you can't really take advantage of any knowledge of the return value, at least not in most cases.

7:40:48 heisig Good morning!

7:40:48 Colleen heisig: Bike said 10 hours, 16 minutes ago: i think an actual define-declaration analog would be out of scope for trucler. but trucler could have a function to read implementation-defined info for a user defined declaration, and maybe one to augment

7:40:48 Colleen heisig: Bike said 5 hours, 43 minutes ago: the other trucler thing i probably need for cleavir to use it is being able to store arbitrary optimize info, like for client dependent qualities... i'll write a PR for that too i guess

7:40:59 beach Hello heisig.

7:45:55 no-defun-allowed Good morning heisig!

7:56:31 heisig ::notify Bike The question is, should Trucler include functions for reading implementation-defined optimize info? Since they are implementation-defined, that particular implementation can simply subclass optimize-description and provide custom accessors.

7:56:31 Colleen heisig: Got it. I'll let Bike know as soon as possible.

9:38:26 heisig Is 32.5 (error handling in standard functions) still up to date? It says that in SICL, standard functions shouldn't call other standard functions for the sake of precise error reporting.

9:39:27 heisig I'd have to rewrite many sequence functions if we are strict about this.

9:43:39 beach I don't think that is such a strong requirement anymore.

9:44:04 heisig Heh, also 32.7 (compiler macros) has been superseded by the recent advances in optimizing call sites :)

9:44:39 heisig I am just pointing this out because it might confuse newcomers. They won't be able to tell which rule of the style guide is still relevant.

9:45:20 beach Yes, you are right.

9:45:33 heisig I am thinking of either rewriting them, or deleting them, or marking them as obsolete/work-in-progress.

9:45:42 beach Now that no-defun-allowed is working on register allocation, I am trying to update the specification.