freenode/#clasp - IRC Chatlog
12:35:22
drmeister
I can only do about 2,000 unwinds/second on this machine with nothing else going on.
12:36:25
drmeister
scymtym: Hi - yeah - but I think our compiler can get rid of 90% of what we are throwing right now.
12:37:08
drmeister
Currently (block foo (let ((*special* ...)) ... (return-from foo ...) ...)) throws an exception.
12:38:25
drmeister
Currently I think Clasp's compiler budget is dominated by eclector because of this.
12:39:34
drmeister
That's not knocking eclector - it's a very common pattern. We need to compile it better.
12:40:04
drmeister
I think it's true in general. It's really hard to measure this. It's a "death by a thousand cuts".
12:41:35
scymtym
yeah, i see. i didn't take it as criticizing eclector. that said, i may have to look into performance issues at some point anyway
12:42:24
drmeister
I think something to keep in mind is "am I using the regular return path for the common case".
12:43:06
drmeister
Also, we have a tool now that lets us find the code that is doing the most throwing.
12:43:50
drmeister
In C++ it's really obvious when you are returning the normal way and when you are throwing an exception.
12:45:55
drmeister
Indeed ... the (block foo (let ((*special* ...)) ... (return-from foo ...) ...)) pattern currently throws because the return-from crosses a function boundary. Once Bike is done with it - it won't.
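A rough C++ analogy for the pattern above (all names invented for illustration): the LET of a special variable becomes an RAII rebinding that must be undone on exit, so a RETURN-FROM that crosses it is compiled as a throw whose unwind runs the "destructor" (the unbinding).

```cpp
// Sketch, not Clasp's actual implementation: models why
// (block foo (let ((*special* ...)) (return-from foo ...))) throws.
thread_local int special_var = 1;       // models *special*

struct SpecialRebind {                  // models (let ((*special* v)) ...)
  int saved;
  explicit SpecialRebind(int v) : saved(special_var) { special_var = v; }
  ~SpecialRebind() { special_var = saved; }  // unbinding runs during unwind
};

struct ReturnFromFoo {};                // models (return-from foo ...)

int foo() {
  try {
    SpecialRebind bind(1234);
    throw ReturnFromFoo{};              // nonlocal exit must unwind the rebind
  } catch (ReturnFromFoo&) {
    return special_var;                 // old binding is already restored here
  }
}
```

The point of the sketch: the nonlocal exit cannot simply jump over the rebinding frame; something (here, stack unwinding) has to run the restore, which is exactly the cost being discussed.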
12:48:24
drmeister
In that flame graph above - most of the time is lost throwing exceptions in eclector, cleavir and asdf.
12:53:21
scymtym
the combination of special variables and many embedded state machines (i.e. TAGBODY+GO/RETURN) is probably the main reason for this in eclector
14:55:23
Bike
Another ridiculous aspect of this is that the machinery that parses unwinding info is complex enough that some things could be better expressed with... exceptions
14:57:44
Bike
drmeister: Since the thread local exception space won't include pointers to any Lisp objects, should I just use a regular thread-local variable instead of putting it in our thread local data structure gizmo?
14:59:44
Bike
well actually there's a bunch of non lisp things in there already, so now i'm not sure what it's for exactly
15:34:41
drmeister
Bike: If there are no pointers to lisp objects then you can put it in a thread-local variable - yes. That will save you an indirection.
15:35:18
drmeister
Everything accessible from my_thread->xxx requires following a pointer - but you can put lisp objects in the my_thread->xxx area.
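A minimal sketch of the trade-off being described, with invented names: state reached through a thread-local pointer (the my_thread->xxx style) costs one extra indirection compared with a plain thread_local variable, which is safe precisely when the value holds no Lisp pointers the GC would need to scan.

```cpp
// Illustration only; these names are not Clasp's.
struct ThreadLocalStateSketch {
  int unwind_count = 0;   // lives behind a pointer, GC could scan this area
};

// Style 1: one aggregate object reached through a pointer, like my_thread->xxx.
thread_local ThreadLocalStateSketch state_object;
thread_local ThreadLocalStateSketch* my_thread_sketch = &state_object;

// Style 2: a plain thread_local POD value -- no indirection, never GC-scanned.
thread_local int unwind_count_direct = 0;

int bump_via_pointer() { return ++my_thread_sketch->unwind_count; } // 2 loads
int bump_direct()      { return ++unwind_count_direct; }            // 1 load
```

Both styles behave identically; the difference is the extra pointer load on every access in style 1, which is the indirection drmeister mentions saving.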
15:37:41
drmeister
I'm keeping them both running until llvm10 or llvm11 when we can get rid of "object" if we choose.
15:38:22
drmeister
Bike: I just put them there - it didn't seem like a big deal. It would be better to put them somewhere else.
15:38:53
drmeister
https://github.com/clasp-developers/clasp/blob/dev/include/clasp/gctools/threadlocal.fwd.h#L79
15:39:21
drmeister
https://github.com/clasp-developers/clasp/blob/dev/include/clasp/gctools/threadlocal.h#L49
15:42:39
drmeister
I've made a point of keeping all the thread local stuff accessible through my_thread (ThreadLocalState*) and my_thread_low_level (ThreadLocalStateLowLevel*)
15:43:24
drmeister
It is - but when I started, thread local storage support was still kind of squirrely - so I didn't want a lot of different thread local objects to deal with.
15:43:57
drmeister
Also, we need to store some Common Lisp pointers in the thread local storage. That is best done by allocating it at the top of the stack and letting the GC deal with it that way.
15:45:18
drmeister
So how about we move the POD/C++ objects from ThreadLocalState to ThreadLocalStateLowLevel and put them directly in thread local storage?
15:45:50
drmeister
It's real simple. We just move the field over and the compiler will complain about every my_thread->_foo access and we change it to my_thread_low_level._foo.
15:46:46
drmeister
Then we avoid an indirection and the GC doesn't have to scan these values and potentially confuse them with pointers. That's less of a problem in our 64-bit world.
15:48:03
drmeister
We can also take them out of ThreadLocalStateLowLevel entirely and have a bunch of thread local values.
15:48:23
Bike
means more #including, and having things only one system needs being stored in some random other file
15:48:52
Bike
I mean, look at the threadlocal structure now - it's just a potpourri of random things
15:49:00
drmeister
The ThreadLocalState struct for Common Lisp pointers is convenient for allocating it at the top of the stack (stack allocates down).
15:49:54
drmeister
But when I wasn't sure that I could put anything but a pointer in thread local storage - it made sense to have a 'god' object. Now it makes less sense.
15:53:55
drmeister
I can't tell you how much of a comfort my_thread->xxx is. I immediately know that it's thread local.
17:25:46
Bike
alright, so this isn't as stable as i thought, __cxa_begin_catch terminates when you try to quit sldb
17:26:56
Bike
because, i don't know, it needs to maintain a global stack of caught exceptions for whatever damn reason, and it can't do that with foreign exceptions
17:30:38
Bike
and it only can't handle foreign exceptions because the itanium abi design doesn't consider the possibility of catching a foreign exception in enough detail. they could have just had their own linked list of _Unwind_Exceptions, but no, they made the cxa exceptions list nodes themselves
19:22:48
drmeister
I guess we figured it was better to use the interpreter than to compile a load-time-value and run it?
19:26:58
drmeister
I'm going to hack ASDF and change probe-file* and resolve-dependency-name. If loading time is cut in half then I will know that unwinds are what are killing our performance.
19:31:32
drmeister
(time (foo 10000)) -> Time real(7.817 secs) run(7.817 secs) consed(1040000 bytes) unwinds(10000)
19:32:17
drmeister
https://github.com/clasp-developers/clasp/blob/dev/src/lisp/kernel/clos/conditions.lsp#L823
19:33:07
drmeister
https://github.com/clasp-developers/clasp/blob/dev/src/lisp/kernel/clos/conditions.lsp#L339
19:36:16
drmeister
Anything that is throwing 1215 times in 0.675 seconds is spending all of its time throwing exceptions.
19:38:57
drmeister
And there is nothing I can do about this, is there? Until your approach to avoiding c-w-v-b lands at least.
19:42:14
Bike
also it seems unlikely that asdf is going to put in a change like whatever you're thinking of. can you be patient? I know you want things to be fast now, but I don't want to try rewriting everything to be uglier, especially when we have an actual path moving forward that doesn't require it.
19:53:10
drmeister
Better to show this: Time real(0.828 secs) run(0.828 secs) consed(37821544 bytes) unwinds(939)
19:57:10
drmeister
For reference: (defparameter *a* 1) (defun foo (num) (dotimes (i num) (block ff (let ((*a* 1234)) (return-from ff nil)))))
19:57:26
drmeister
(time (foo 939)) -> Time real(0.639 secs) run(0.639 secs) consed(0 bytes) unwinds(939)
19:58:40
drmeister
(time (load "~/quicklisp/setup.lisp")) -> Time real(0.513 secs) run(0.513 secs) consed(37518464 bytes) unwinds(289)
20:11:38
drmeister
kpoeck: We are making progress speeding things up. Unwinding the stack is tremendously expensive; we finally have a tool to measure it, and we are making changes to reduce it.
20:12:10
drmeister
kpoeck: I changed TIME so that it prints the number of Unwinds in the current thread.
20:12:32
drmeister
Watch that number. If it gets up around 1000-2000/second then performance is dominated by unwinding.
20:13:18
drmeister
I added a clasp/src/performance/do-flame-throw script that helps us figure out what code does the most unwinding.
20:16:11
drmeister
Right - but that doesn't reveal the problem because the unwinds are at the very tips of the flames and I chop them off at 400 frames or so. Also - they are thrown all over the place. It's "death by a thousand throws".
20:18:49
drmeister
Oh - I generate the do-flame-throw with an idea that I haven't seen before (I think) - I record backtraces every time cxa_throw is entered.
20:19:14
drmeister
Then I turn them upside down, chop off everything above the 20th frame and then generate a flame graph from THAT.
20:19:56
drmeister
So the bottom is cxa_throw and above that is cc_unwind and above that is whatever called cc_unwind sorted and grouped together.
20:20:15
drmeister
The boxes then represent the relative frequency with which cxa_throw is entered via a particular code path.
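The pipeline described above can be sketched as follows (data layout and helper name are invented; this assumes each recorded backtrace lists the outermost frame first): flip each trace so the throw site comes first, keep only the first 20 frames, and tally identical call paths into flame-graph-style "path count" entries.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Sketch of the inverted flame graph preparation: one semicolon-joined
// path per recorded __cxa_throw backtrace, grouped and counted.
std::map<std::string, int>
tally_throw_paths(std::vector<std::vector<std::string>> traces) {
  std::map<std::string, int> counts;
  for (auto& t : traces) {
    std::reverse(t.begin(), t.end());   // turn upside down: cxa_throw first
    if (t.size() > 20) t.resize(20);    // chop everything above the 20th frame
    std::string path;
    for (auto& f : t)                   // join frames into "a;b;c" form
      path += (path.empty() ? "" : ";") + f;
    ++counts[path];                     // group identical code paths
  }
  return counts;
}
```

Feeding the resulting "path count" lines to a flame graph renderer puts cxa_throw at the bottom and its callers stacked above, grouped, as described.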
20:21:46
drmeister
Back to throw - right now it looks like compile-file-parallel is slow because of eclector.
20:22:26
drmeister
eclector has a lot of perfectly reasonable code that currently throws exceptions (soon Bike will fix that).
20:27:15
drmeister
Surprising things like loading compiled quicklisp are dominated by the time to unwind.
20:27:54
drmeister
We need to get unwinding down to where it's at most a few tens to a hundred throws/sec.
20:53:21
drmeister
Kevslinger: How are things going? If you want to talk about the xeos/jupyter thing - I have a bit more time in the next week.
21:16:14
drmeister
Bike: Do you ever anticipate optimizing this away? (catch 'something (funcall (lambda () (throw 'something nil))))
21:16:38
drmeister
Will that give me a reliable way of generating a throw - so I can measure the time it takes on a system?
21:19:02
drmeister
I want to add something to TIME to print something when we are approaching unwind dominated timing.
21:32:37
drmeister
I need a better clock. C++ has this std library called 'chrono' - it is pretty cool.
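A small sketch of measuring the throw/catch round trip with std::chrono (the function name is invented): steady_clock is the monotonic clock, so it won't jump if the wall clock is adjusted mid-measurement.

```cpp
#include <chrono>
#include <stdexcept>

// Rough per-throw cost in nanoseconds, averaged over `iterations` throws.
long long nanos_per_throw(int iterations) {
  using clock = std::chrono::steady_clock;
  auto start = clock::now();
  for (int i = 0; i < iterations; ++i) {
    try { throw std::runtime_error("probe"); }  // force a real unwind
    catch (const std::runtime_error&) {}        // caught immediately
  }
  auto elapsed = clock::now() - start;
  return std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count()
         / iterations;
}
```

One caveat for using this as a calibration baseline: a throw caught in the same frame exercises far less unwinder machinery than one crossing many frames, so it gives a lower bound rather than a realistic cost.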
21:59:16
kpoeck
(time (dotimes (x 25)(cl-bench.gabriel:run-ctak))) -> Time real(1.032 secs) run(1.032 secs) consed(4866400 bytes) unwinds(304150)
22:06:04
drmeister
kpoeck: I'm not sure what's going on there. That shakes my certainty about the time unwinds take.
22:07:33
Shinmera
I know very little about the current situation, but I wonder if it would be possible to do an optimistic unwind strategy, where you first unwind as if you were not dealing with C++, and only if you encounter C++ frames fall back to using the exceptions mechanism for unwinding.
22:08:04
Bike
I don't think we have a way of identifying C++ frames that doesn't rely on the parts of the exception mechanism that are slow.
22:09:03
Shinmera
If you can make lisp frame identification fast you can also identify C++ frames fast.
22:10:07
Bike
I wonder. We should be able to get the return addresses reasonably quickly I think. Not sure we can get the function addresses from those quickly, though.
22:10:36
Shinmera
could have a form of alternate stack where you mark that when you start a new lisp frame.
22:11:37
Shinmera
basically just a static memory array the size of max stack frames where you enter the addresses whenever you start a frame. Adds cost to every call of course, but I wonder if it's really impactful.
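A minimal sketch of the idea Shinmera describes, with all names invented: a per-thread side array where each Lisp frame pushes its stack address on entry and pops on exit, so an optimistic unwinder can cheaply check whether the next frame it would cross is a Lisp frame (address present) or a foreign/C++ frame (absent), falling back to the slow exception machinery only in the latter case.

```cpp
#include <cstddef>

// Hypothetical shadow stack of Lisp frame addresses; fixed-size as suggested.
constexpr std::size_t kMaxFrames = 1 << 16;
thread_local void* g_lisp_frames[kMaxFrames];
thread_local std::size_t g_lisp_frame_top = 0;

// Called on every Lisp frame entry -- this is the per-call cost in question.
inline void push_lisp_frame(void* frame_addr) {
  g_lisp_frames[g_lisp_frame_top++] = frame_addr;
}
inline void pop_lisp_frame() { --g_lisp_frame_top; }

// The unwinder's fast check: is the frame about to be crossed a Lisp frame?
inline bool topmost_is_lisp(void* frame_addr) {
  return g_lisp_frame_top > 0 &&
         g_lisp_frames[g_lisp_frame_top - 1] == frame_addr;
}
```

The per-call overhead is one store and one increment of thread-local data, which is the cost Shinmera suggests may not be very impactful in practice.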
22:12:07
Bike86
i'll think about it. at the moment any special variable binding or unwind protect means C++ frames so i'm going to focus on eliminating that first, though
22:15:14
Shinmera
Once I get |3b|'s sdf+bmfont libraries working together I'd like to see whether Clasp can run Alloy