freenode/#clasp - IRC Chatlog
12:35:22
drmeister
I can only do about 2,000 unwinds/second on this machine with nothing else going on.
12:36:25
drmeister
scymtym: Hi - yeah - but I think our compiler can get rid of 90% of what we are throwing right now.
12:37:08
drmeister
Currently (block foo (let ((*special* ...)) ... (return-from foo ...) ...)) throws an exception.
12:38:25
drmeister
Currently I think Clasp's compiler budget is dominated by eclector because of this.
12:39:34
drmeister
That's not knocking eclector - it's a very common pattern. We need to compile it better.
12:40:04
drmeister
I think it's true in general. It's really hard to measure this. It's a "death by a thousand cuts".
12:41:35
scymtym
yeah, i see. i didn't take it as criticizing eclector. that said, i may have to look into performance issues at some point anyway
12:42:24
drmeister
I think something to keep in mind is "am I using the regular return path for the common case".
12:43:06
drmeister
Also, we have a tool now that lets us find the code that is doing the most throwing.
12:43:50
drmeister
In C++ it's really obvious when you are returning the normal way and when you are throwing an exception.
12:45:55
drmeister
Indeed ... the (block foo (let ((*special* ...)) ... (return-from foo ...) ...)) pattern currently throws because the return-from crosses a function boundary. Once Bike is done with it - it won't.
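A rough C++ analogy for the pattern above (all names invented for illustration): the LET of a special variable becomes an RAII rebinding that must be undone on exit, so a RETURN-FROM that crosses it is compiled as a throw whose unwind runs the "destructor" (the unbinding).

```cpp
// Sketch, not Clasp's actual implementation: models why
// (block foo (let ((*special* ...)) (return-from foo ...))) throws.
thread_local int special_var = 1;       // models *special*

struct SpecialRebind {                  // models (let ((*special* v)) ...)
  int saved;
  explicit SpecialRebind(int v) : saved(special_var) { special_var = v; }
  ~SpecialRebind() { special_var = saved; }  // unbinding runs during unwind
};

struct ReturnFromFoo {};                // models (return-from foo ...)

int foo() {
  try {
    SpecialRebind bind(1234);
    throw ReturnFromFoo{};              // nonlocal exit must unwind the rebind
  } catch (ReturnFromFoo&) {
    return special_var;                 // old binding is already restored here
  }
}
```

The point of the sketch: the nonlocal exit cannot simply jump over the rebinding frame; something (here, stack unwinding) has to run the restore, which is exactly the cost being discussed.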
12:48:24
drmeister
In that flame graph above - most of the time is lost throwing exceptions in eclector, cleavir and asdf.
12:53:21
scymtym
the combination of special variables and many embedded state machines (i.e. TAGBODY+GO/RETURN) is probably the main reason for this in eclector
14:55:23
Bike
Another ridiculous aspect of this is that the machinery that parses unwinding info is complex enough that some things could be better expressed with... exceptions
14:57:44
Bike
drmeister: Since the thread local exception space won't include pointers to any Lisp objects, should I just use a regular thread-local variable instead of putting it in our thread local data structure gizmo?
14:59:44
Bike
well actually there's a bunch of non lisp things in there already, so now i'm not sure what it's for exactly
15:34:41
drmeister
Bike: If there are no pointers to lisp objects then you can put it in a thread-local variable - yes. That will save you an indirection.
15:35:18
drmeister
Everything accessible from my_thread->xxx requires following a pointer - but you can put lisp objects in the my_thread->xxx area.
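A minimal sketch of the trade-off being described, with invented names: state reached through a thread-local pointer (the my_thread->xxx style) costs one extra indirection compared with a plain thread_local variable, which is safe precisely when the value holds no Lisp pointers the GC would need to scan.

```cpp
// Illustration only; these names are not Clasp's.
struct ThreadLocalStateSketch {
  int unwind_count = 0;   // lives behind a pointer, GC could scan this area
};

// Style 1: one aggregate object reached through a pointer, like my_thread->xxx.
thread_local ThreadLocalStateSketch state_object;
thread_local ThreadLocalStateSketch* my_thread_sketch = &state_object;

// Style 2: a plain thread_local POD value -- no indirection, never GC-scanned.
thread_local int unwind_count_direct = 0;

int bump_via_pointer() { return ++my_thread_sketch->unwind_count; } // 2 loads
int bump_direct()      { return ++unwind_count_direct; }            // 1 load
```

Both styles behave identically; the difference is the extra pointer load on every access in style 1, which is the indirection drmeister mentions saving.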
15:37:41
drmeister
I'm keeping them both running until llvm10 or llvm11 when we can get rid of "object" if we choose.
15:38:22
drmeister
Bike: I just put them there - it didn't seem like a big deal. It would be better to put them somewhere else.
15:38:53
drmeister
https://github.com/clasp-developers/clasp/blob/dev/include/clasp/gctools/threadlocal.fwd.h#L79
15:39:21
drmeister
https://github.com/clasp-developers/clasp/blob/dev/include/clasp/gctools/threadlocal.h#L49
15:42:39
drmeister
I've made a point of keeping all the thread local stuff accessible through my_thread (ThreadLocalState*) and my_thread_low_level (ThreadLocalStateLowLevel*)
15:43:24
drmeister
It is - but when I started, thread local storage support was still kind of squirrely - so I didn't want a lot of different thread local objects to deal with.
15:43:57
drmeister
Also, we need to store some Common Lisp pointers in the thread local storage. That is best done by allocating it at the top of the stack and letting the GC deal with it that way.
15:45:18
drmeister
So how about we move the POD/C++ objects from ThreadLocalState to ThreadLocalStateLowLevel and put them directly in thread local storage?
15:45:50
drmeister
It's real simple. We just move the field over and the compiler will complain about every my_thread->_foo access and we change it to my_thread_low_level._foo.
15:46:46
drmeister
Then we avoid an indirection and the GC doesn't have to scan these values and potentially confuse them with pointers. That's less of a problem in our 64-bit world.
15:48:03
drmeister
We can also take them out of ThreadLocalStateLowLevel entirely and have a bunch of thread local values.
15:48:23
Bike
means more #including, and having things only one system needs being stored in some random other file
15:48:52
Bike
I mean, look at the threadlocal structure now - it's just a potpourri of random things
15:49:00
drmeister
The ThreadLocalState struct for Common Lisp pointers is convenient for allocating it at the top of the stack (stack allocates down).
15:49:54
drmeister
But when I wasn't sure that I could put anything but a pointer in thread local storage - it made sense to have a 'god' object. Now it makes less sense.
15:53:55
drmeister
I can't tell you how much of a comfort my_thread->xxx is. I immediately know that it's thread local.
17:25:46
Bike
alright, so this isn't as stable as i thought, __cxa_begin_catch terminates when you try to quit sldb
17:26:56
Bike
because, i don't know, it needs to maintain a global stack of caught exceptions for whatever damn reason, and it can't do that with foreign exceptions
17:30:38
Bike
and it only can't handle foreign exceptions because the itanium abi design doesn't consider the possibility of catching a foreign exception in enough detail. they could have just had their own linked list of _Unwind_Exceptions, but no, they made the cxa exceptions list nodes themselves
19:22:48
drmeister
I guess we figured it was better to use the interpreter than to compile a load-time-value and run it?
19:26:58
drmeister
I'm going to hack ASDF and change probe-file* and resolve-dependency-name. If loading time is cut in half then I will know that unwinds are what are killing our performance.
19:31:32
drmeister
(time (foo 10000)) -> Time real(7.817 secs) run(7.817 secs) consed(1040000 bytes) unwinds(10000)
19:32:17
drmeister
https://github.com/clasp-developers/clasp/blob/dev/src/lisp/kernel/clos/conditions.lsp#L823
19:33:07
drmeister
https://github.com/clasp-developers/clasp/blob/dev/src/lisp/kernel/clos/conditions.lsp#L339
19:36:16
drmeister
Anything that is throwing 1215 times in 0.675 seconds is spending all of its time throwing exceptions.
19:38:57
drmeister
And there is nothing I can do about this, is there? Until your approach to avoiding c-w-v-b lands at least.
19:42:14
Bike
also it seems unlikely that asdf is going to put in a change like whatever you're thinking of. can you be patient? I know you want things to be fast now, but I don't want to try rewriting everything to be uglier, especially when we have an actual path moving forward that doesn't require it.
19:53:10
drmeister
Better to show this: Time real(0.828 secs) run(0.828 secs) consed(37821544 bytes) unwinds(939)
19:57:10
drmeister
For reference: (defparameter *a* 1) (defun foo (num) (dotimes (i num) (block ff (let ((*a* 1234)) (return-from ff nil)))))
19:57:26
drmeister
(time (foo 939)) -> Time real(0.639 secs) run(0.639 secs) consed(0 bytes) unwinds(939)
19:58:40
drmeister
(time (load "~/quicklisp/setup.lisp")) -> Time real(0.513 secs) run(0.513 secs) consed(37518464 bytes) unwinds(289)
20:11:38
drmeister
kpoeck: We are making progress speeding things up. Unwinding the stack is tremendously expensive; we finally have a tool to measure it, and we are making changes to reduce it.
20:12:10
drmeister
kpoeck: I changed TIME so that it prints the number of Unwinds in the current thread.
20:12:32
drmeister
Watch that number. If it gets up around 1000-2000/second then performance is dominated by unwinding.
20:13:18
drmeister
I added a clasp/src/performance/do-flame-throw script that helps us figure out what code does the most unwinding.
20:16:11
drmeister
Right - but that doesn't reveal the problem because the unwinds are at the very tips of the flames and I chop them off at 400 frames or so. Also - they are thrown all over the place. It's "death by a thousand throws".
20:18:49
drmeister
Oh - I generate the do-flame-throw with an idea that I haven't seen before (I think) - I record backtraces every time cxa_throw is entered.
20:19:14
drmeister
Then I turn them upside down, chop off everything above the 20th frame and then generate a flame graph from THAT.
20:19:56
drmeister
So the bottom is cxa_throw and above that is cc_unwind and above that is whatever called cc_unwind sorted and grouped together.
20:20:15
drmeister
The boxes then represent the relative frequency with which cxa_throw is entered via a particular code path.
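The pipeline described above can be sketched as follows (data layout and helper name are invented; this assumes each recorded backtrace lists the outermost frame first): flip each trace so the throw site comes first, keep only the first 20 frames, and tally identical call paths into flame-graph-style "path count" entries.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Sketch of the inverted flame graph preparation: one semicolon-joined
// path per recorded __cxa_throw backtrace, grouped and counted.
std::map<std::string, int>
tally_throw_paths(std::vector<std::vector<std::string>> traces) {
  std::map<std::string, int> counts;
  for (auto& t : traces) {
    std::reverse(t.begin(), t.end());   // turn upside down: cxa_throw first
    if (t.size() > 20) t.resize(20);    // chop everything above the 20th frame
    std::string path;
    for (auto& f : t)                   // join frames into "a;b;c" form
      path += (path.empty() ? "" : ";") + f;
    ++counts[path];                     // group identical code paths
  }
  return counts;
}
```

Feeding the resulting "path count" lines to a flame graph renderer puts cxa_throw at the bottom and its callers stacked above, grouped, as described.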
20:21:46
drmeister
Back to throw - right now it looks like compile-file-parallel is slow because of eclector.
20:22:26
drmeister
eclector has a lot of perfectly reasonable code that currently throws exceptions (soon Bike will fix that).
20:27:15
drmeister
Surprising things like loading compiled quicklisp are dominated by the time to unwind.
20:27:54
drmeister
We need to get unwinding down to where it's at most a few tens to a hundred throws/sec.
20:53:21
drmeister
Kevslinger: How are things going? If you want to talk about the xeos/jupyter thing - I have a bit more time in the next week.
21:16:14
drmeister
Bike: Do you ever anticipate optimizing this away? (catch 'something (funcall (lambda () (throw 'something nil))))
21:16:38
drmeister
Will that give me a reliable way of generating a throw - so I can measure the time it takes on a system?
21:19:02
drmeister
I want to add something to TIME to print something when we are approaching unwind dominated timing.
21:32:37
drmeister
I need a better clock. C++ has this std library called 'chrono' - it is pretty cool.
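A small sketch of measuring the throw/catch round trip with std::chrono (the function name is invented): steady_clock is the monotonic clock, so it won't jump if the wall clock is adjusted mid-measurement.

```cpp
#include <chrono>
#include <stdexcept>

// Rough per-throw cost in nanoseconds, averaged over `iterations` throws.
long long nanos_per_throw(int iterations) {
  using clock = std::chrono::steady_clock;
  auto start = clock::now();
  for (int i = 0; i < iterations; ++i) {
    try { throw std::runtime_error("probe"); }  // force a real unwind
    catch (const std::runtime_error&) {}        // caught immediately
  }
  auto elapsed = clock::now() - start;
  return std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count()
         / iterations;
}
```

One caveat for using this as a calibration baseline: a throw caught in the same frame exercises far less unwinder machinery than one crossing many frames, so it gives a lower bound rather than a realistic cost.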
21:59:16
kpoeck
(time (dotimes (x 25)(cl-bench.gabriel:run-ctak))) -> Time real(1.032 secs) run(1.032 secs) consed(4866400 bytes) unwinds(304150)
22:06:04
drmeister
kpoeck: I'm not sure what's going on there. That shakes my certainty about the time unwinds take.
22:07:33
Shinmera
I know very little about the current situation, but I wonder if it would be possible to do an optimistic unwind strategy, where you first unwind as if you were not dealing with C++, and only if you encounter C++ frames fall back to using the exceptions mechanism for unwinding.
22:08:04
Bike
I don't think we have a way of identifying C++ frames that doesn't rely on the parts of the exception mechanism that are slow.
22:09:03
Shinmera
If you can make lisp frame identification fast you can also identify C++ frames fast.
22:10:07
Bike
I wonder. We should be able to get the return addresses reasonably quickly I think. Not sure we can get the function addresses from those quickly, though.
22:10:36
Shinmera
could have a form of alternate stack where you mark that when you start a new lisp frame.
22:11:37
Shinmera
basically just a static memory array the size of max stack frames where you enter the addresses whenever you start a frame. Adds cost to every call of course, but I wonder if it's really impactful.
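A minimal sketch of the idea Shinmera describes, with all names invented: a per-thread side array where each Lisp frame pushes its stack address on entry and pops on exit, so an optimistic unwinder can cheaply check whether the next frame it would cross is a Lisp frame (address present) or a foreign/C++ frame (absent), falling back to the slow exception machinery only in the latter case.

```cpp
#include <cstddef>

// Hypothetical shadow stack of Lisp frame addresses; fixed-size as suggested.
constexpr std::size_t kMaxFrames = 1 << 16;
thread_local void* g_lisp_frames[kMaxFrames];
thread_local std::size_t g_lisp_frame_top = 0;

// Called on every Lisp frame entry -- this is the per-call cost in question.
inline void push_lisp_frame(void* frame_addr) {
  g_lisp_frames[g_lisp_frame_top++] = frame_addr;
}
inline void pop_lisp_frame() { --g_lisp_frame_top; }

// The unwinder's fast check: is the frame about to be crossed a Lisp frame?
inline bool topmost_is_lisp(void* frame_addr) {
  return g_lisp_frame_top > 0 &&
         g_lisp_frames[g_lisp_frame_top - 1] == frame_addr;
}
```

The per-call overhead is one store and one increment of thread-local data, which is the cost Shinmera suggests may not be very impactful in practice.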
22:12:07
Bike86
i'll think about it. at the moment any special variable binding or unwind protect means C++ frames so i'm going to focus on eliminating that first, though
22:15:14
Shinmera
Once I get |3b|'s sdf+bmfont libraries working together I'd like to see whether Clasp can run Alloy