libera/#sbcl - IRC Chatlog
Search
3:46:37
hayley
occ: Here is the gist of it in one file: https://plaster.tymoon.eu/view/3736#3736 And the SB-X86-64-ASM::CPUID is just a symbol that the assembler uses.
11:05:09
lukego_
It's interesting trying to decide if code is "optimized." What do you use for reference? I have a little whimsically-coded simulator and the best I can say is that it runs in O(n) time and the constant factor seems to be ~100 cpu cycles per step. Fundamentally each step just does some floating point operations.
11:05:39
lukego_
I'm tempted to guesstimate that "morally" each step is about ten cycles worth of computation and so there might well be an order of magnitude of waste going on at the moment.
11:06:21
hayley
If you can picture the assembly, <https://agner.org/optimize/instruction_tables.pdf>?
11:08:31
lukego_
hayley: it will be fun to reach that level :). I can't immediately picture the fundamental instructions though because I'm a bit of a floating point n00b and using operations like LOG that I don't have a cpu-uarch mental model for yet
11:09:49
lukego_
the fundamental step I'm referring to here is basically "gaussian log likelihood" with double floats. There's probably a much better way to write that function than my quick one, but it's also likely that a bigger problem is e.g. that the return value is being boxed or something.
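The actual function never appears in the log, so purely as a hedge, here is one plausible shape for a fully type-declared Gaussian log-likelihood in SBCL. The name matches the GAUSSIAN-LOG-LIKELIHOOD mentioned later in the conversation, but the body and declarations are a guess, not lukego_'s code:

```lisp
;; Illustrative sketch only -- not the actual code from the commit.
;; log N(x; mu, sigma) = -log(sigma) - (1/2)log(2*pi) - (x-mu)^2 / (2*sigma^2)
(declaim (inline gaussian-log-likelihood))
(defun gaussian-log-likelihood (x mu sigma)
  (declare (type double-float x mu sigma)
           (optimize (speed 3)))
  (let ((z (/ (- x mu) sigma)))
    (declare (type double-float z))
    (- (- (log sigma))
       #.(* 0.5d0 (log (* 2 pi)))   ; constant folded at read time
       (* 0.5d0 z z))))
```

With the DECLAIM INLINE placed before the DEFUN and all arguments declared DOUBLE-FLOAT, SBCL can open-code the arithmetic at inlined call sites; whether the return value still gets boxed at out-of-line calls is exactly the question raised below.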
11:12:23
lukego_
I can probably grab an order-of-magnitude speedup using e.g. lparallel to apply more cores. but I'll just lose that again when running multiple simulations in parallel. so probably better to focus on single-threaded perf.
11:14:26
lukego_
I think that ideally I'd like about 1000x more performance than what I'm seeing now. Have to scratch my head a bit about how to achieve that. Probably smarter algorithms and/or bigger iron.
11:17:55
lukego_
but anyway I did more than double the speed compared with a couple of hours ago. and on reflection it might be more like 1,000,000x performance that I want and then I'm definitely going to need a whiteboard.
11:19:01
lukego_
(Sorry, I know this is not very entertaining without context, I will strive to make amends somehow someday.)
11:21:50
|3b|
removing and/or understanding all the optimization notes at (speed 3) is where i usually stop
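For anyone following along, the "optimization notes" are SBCL's efficiency notes, emitted when SPEED outweighs the other optimize qualities and the compiler has to fall back on generic operations. A minimal, made-up example of the before/after:

```lisp
;; At (speed 3), SBCL prints an efficiency note here: A and B have
;; unknown types, so it must call the generic + instead of open-coding.
(defun slow-add (a b)
  (declare (optimize (speed 3)))
  (+ a b))

;; Declaring the types makes the note (and the generic call) go away.
(defun fast-add (a b)
  (declare (type double-float a b) (optimize (speed 3)))
  (+ a b))
```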
11:21:58
lukego_
actually the context is just this commit https://github.com/permo-io/permo/commit/4699ec7b8c1ab977d2e9246e19fe4e3f7a3fc006
11:23:56
lukego_
yeah I haven't reached that point. I'm also not sure if there's any remedy for boxing double-float return values besides inlining? I may have been a bit optimistic about which bizarre topological concoctions are inlinable in SBCL. I have function A passing lambdas into function B. I want B to inline those lambdas. I try to achieve this by inlining B into A so that everything should be local. dunno if that's sane though.
11:24:55
|3b|
pre-allocate (possibly on stack or something) a typed vector in caller, and pass that in to store the return value
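A sketch of that pattern, assuming the hot function produces DOUBLE-FLOATs (all names here are made up for illustration): the caller owns a specialized vector and the callee stores into it, so no double-float result ever needs to be boxed on return.

```lisp
;; Hypothetical illustration of the suggestion: write results into a
;; caller-supplied (simple-array double-float (*)) instead of
;; returning a (boxed) double-float.
(defun step-into (out i x)
  (declare (type (simple-array double-float (*)) out)
           (type fixnum i) (type double-float x)
           (optimize (speed 3)))
  (setf (aref out i) (* x x))
  (values))                       ; return nothing worth boxing

(defun run-steps (n)
  (declare (type fixnum n))
  (let ((out (make-array n :element-type 'double-float)))
    (dotimes (i n out)
      (step-into out i (float i 1d0)))))
```

Adding a DYNAMIC-EXTENT declaration for OUT in the caller is what makes the "possibly on stack" variant possible, when the vector doesn't escape.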
11:26:20
lukego_
I'll push those onto the list of ideas to consider for the next rewrite. This is kind of prototype code more for teasing out ideas than really going to town on.
11:27:16
|3b|
if the function isn't small enough to inline, it probably isn't fast enough to care about call overhead (including boxing return) :)
11:29:33
lukego_
Question on my mind is if/when/how SBCL is able to inline FUNCALLs to objects that a function received as arguments? like (defun foo (inline-me) (funcall inline-me ...)). My working assumption is that it would work in e.g. (flet ((inline-me (...) ...)) (declare (inline foo)) (foo #'inline-me))
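Spelled out as compilable code, the experiment looks something like this (function names taken from the message; the bodies are placeholders):

```lisp
;; FOO funcalls a functional argument. The hope: inlining FOO into
;; CALLER turns the FUNCALL into a local call to INLINE-ME, which
;; SBCL can then inline too. Whether it actually does is the open
;; question here.
(declaim (inline foo))
(defun foo (f x)
  (funcall (the function f) x))

(defun caller (x)
  (declare (type double-float x) (optimize (speed 3)))
  (flet ((inline-me (y)
           (declare (type double-float y))
           (* y y)))
    (foo #'inline-me x)))
```

(disassemble 'caller) is the quickest way to see whether the closure allocation and indirect call survived.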
11:30:10
lukego_
Exploring that question might be a good excuse to poke my nose into how SBCL works.
11:30:55
|3b|
ACTION wouldn't be surprised if it did, if foo was inlined in the flet, but also wouldn't be surprised if it didn't
11:35:28
lukego_
Or easy to surprise, if you measure surprise in entropy, and more particularly in expected entropy, since then you are expecting the maximum 1 bit of surprise from that 1 bit answer to that question
11:36:04
|3b|
nah, just that particular idea triggers enough 'yes' neurons and 'no' neurons to think both answers are plausible, but only peripherally, so can't say anything definite
11:37:55
lukego_
yeah the role of block compilation is another interesting factor. I had everything wrapped in a SERAPEUM:BLOCK-COMPILE but took that out partly because it confuses emacs about what a defun/top-level-form is
11:38:32
lukego_
Could be that I should just make this a macro that expands into a Fortran AST and then pretty print and compile that. but not entertaining such radical ideas yet.
11:40:48
lukego_
Could also be that I'm doing everything wrong and e.g. my OPTIMIZE DEBUG setting is completely inhibiting inlining or something.
11:41:49
lukego_
|3b|: here's the macroexpansion of LINE: https://gist.github.com/lukego/4746bcc8cf5ec9409411e677808e9d0e
11:42:09
|3b|
actually, i guess one other high-level comment, since i haven't seen it mentioned yet: if you haven't profiled it, you haven't even /started/ optimizing :p
11:43:00
lukego_
indeed. I have profiled it but not especially thoroughly. didn't get rid of all the generic arithmetic yet for example.
11:44:34
lukego_
but yeah, goal here is more to feed my mental model of sbcl compilation, less to really squeeze this code
11:48:30
lukego_
"you declared this inline but I already compiled five calls to it earlier in the file"
11:50:35
lukego_
This is also an embarrassingly parallel problem and it's not yet clear to what extent I should exploit that with threads / ILP / SIMD / CUDA / ...
11:50:36
|3b|
ACTION sees GAUSSIAN-LOG-LIKELIHOOD declared inline locally, before it is defined and with no declaim
11:51:18
|3b|
if you need doubles, then the extent you should use CUDA probably depends on the extent you have someone paying for 'pro' gpus :/
11:52:16
lukego_
oh yeah that's another big question: do I need doubles or not? I'm not sure, I'm using them now because it seems like they are basically free on CPUs w/o SIMD.
11:52:55
lukego_
ordinarily I'd run screaming from single-floats but maybe when working in the log domain it's not such a biggie, haven't thought that through
11:53:46
lukego_
Just idly googling for papers where people GPU-optimized code of this nature, i.e. particle filters, it seemed like they were often happy with one order of magnitude speedup which wouldn't be worth the trouble from my perspective
11:57:57
moon-child
I think shader compilers like to fuck up float math (so eg (x + y) - y might turn into plain x), and sometimes they don't have a fused multiply-add but just a separate multiply followed by an add (rounding twice)
11:58:59
lukego_
anecdotally: it seems like SBCL /can/ do that INLINE-ME optimization mentioned above. in my simple test it only happened when I globally made the callee inline with DECLAIM and not just locally at the callsite with DECLARE.
11:59:25
moon-child
right but the point is you need it to be faithfully rounded to implement the double-single properly
12:00:17
moon-child
stuff like addition gets the last bit wrong, ok--whatever. But replacing fma with multiply+add means you only get half your bits for the double-single multiply
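moon-child's point, made concrete: double-single arithmetic represents a value as an unevaluated sum hi + err of two floats, and its TWO-PRODUCT step needs the *exact* rounding error of a multiply. With FMA that error is just fma(a, b, -hi); a separate multiply+add rounds twice and loses the low half. Standard CL has no FMA operator, so this sketch recovers the exact error through rational arithmetic purely to show what the operation must produce:

```lisp
;; Error-free product: returns HI and ERR such that a*b = hi + err
;; exactly (valid in round-to-nearest, barring over/underflow).
;; Hardware FMA computes ERR as fma(a, b, -hi) in one instruction.
(defun two-product (a b)
  (declare (type single-float a b))
  (let* ((hi (* a b))                               ; rounded product
         (err (float (- (* (rational a) (rational b)) ; exact residual
                        (rational hi))
                     1f0)))
    (values hi err)))
```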
12:00:19
|3b|
and the only time you need a local declaration is when you also did declaim notinline after the function definition (or declared it locally notinline in an enclosing lexical scope)