libera/#sbcl - IRC Chatlog
Search
3:46:37
hayley
occ: Here is the gist of it in one file: https://plaster.tymoon.eu/view/3736#3736 And the SB-X86-64-ASM::CPUID is just a symbol that the assembler uses.
11:05:09
lukego_
It's interesting trying to decide if code is "optimized." What do you use for reference? I have a little whimsically-coded simulator and the best I can say is that it runs in O(n) time and the constant factor seems to be ~100 cpu cycles per step. Fundamentally each step just does some floating point operations.
11:05:39
lukego_
I'm tempted to guesstimate that "morally" each step is about ten cycles worth of computation and so there might well be an order of magnitude of waste going on at the moment.
11:06:21
hayley
If you can picture the assembly, <https://agner.org/optimize/instruction_tables.pdf>?
11:08:31
lukego_
hayley: it will be fun to reach that level :). I can't immediately picture the fundamental instructions though because I'm a bit of a floating point n00b and using operations like LOG that I don't have a cpu-uarch mental model for yet
11:09:49
lukego_
the fundamental step I'm referring to here is basically "gaussian log likelihood" with double floats. There's probably a much better way to write that function than my quick one, but it's also likely that a bigger problem is e.g. that the return value is being boxed or something.
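The actual function never appears in the log, so purely as a hedge, here is one plausible shape for a fully type-declared Gaussian log-likelihood in SBCL. The name matches the GAUSSIAN-LOG-LIKELIHOOD mentioned later in the conversation, but the body and declarations are a guess, not lukego_'s code:

```lisp
;; Illustrative sketch only -- not the actual code from the commit.
;; log N(x; mu, sigma) = -log(sigma) - (1/2)log(2*pi) - (x-mu)^2 / (2*sigma^2)
(declaim (inline gaussian-log-likelihood))
(defun gaussian-log-likelihood (x mu sigma)
  (declare (type double-float x mu sigma)
           (optimize (speed 3)))
  (let ((z (/ (- x mu) sigma)))
    (declare (type double-float z))
    (- (- (log sigma))
       #.(* 0.5d0 (log (* 2 pi)))   ; constant folded at read time
       (* 0.5d0 z z))))
```

With the DECLAIM INLINE placed before the DEFUN and all arguments declared DOUBLE-FLOAT, SBCL can open-code the arithmetic at inlined call sites; whether the return value still gets boxed at out-of-line calls is exactly the question raised below.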
11:12:23
lukego_
I can probably grab an order-of-magnitude speedup using e.g. lparallel to apply more cores. but I'll just lose that again when running multiple simulations in parallel. so probably better to focus on single-threaded perf.
11:14:26
lukego_
I think that ideally I'd like about 1000x more performance than what I'm seeing now. Have to scratch my head a bit about how to achieve that. Probably smarter algorithms and/or bigger iron.
11:17:55
lukego_
but anyway I did more than double the speed compared with a couple of hours ago. and on reflection it might be more like 1,000,000x performance that I want and then I'm definitely going to need a whiteboard.
11:19:01
lukego_
(Sorry, I know this is not very entertaining without context, I will strive to make amends somehow someday.)
11:21:50
|3b|
removing and/or understanding all the optimization notes at (speed 3) is where i usually stop
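For anyone following along, the "optimization notes" are SBCL's efficiency notes, emitted when SPEED outweighs the other optimize qualities and the compiler has to fall back on generic operations. A minimal, made-up example of the before/after:

```lisp
;; At (speed 3), SBCL prints an efficiency note here: A and B have
;; unknown types, so it must call the generic + instead of open-coding.
(defun slow-add (a b)
  (declare (optimize (speed 3)))
  (+ a b))

;; Declaring the types makes the note (and the generic call) go away.
(defun fast-add (a b)
  (declare (type double-float a b) (optimize (speed 3)))
  (+ a b))
```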
11:21:58
lukego_
actually the context is just this commit https://github.com/permo-io/permo/commit/4699ec7b8c1ab977d2e9246e19fe4e3f7a3fc006
11:23:56
lukego_
yeah I haven't reached that point. I'm also not sure if there's any remedy for boxing double-float return values besides inlining? I may have been a bit optimistic about which bizarre topological concoctions are inlinable in SBCL. I have function A passing lambdas into function B. I want B to inline those lambdas. I try to achieve this by inlining B into A so that everything should be local. dunno if that's sane though.
11:24:55
|3b|
pre-allocate (possibly on stack or something) a typed vector in caller, and pass that in to store the return value
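A sketch of that pattern, assuming the hot function produces DOUBLE-FLOATs (all names here are made up for illustration): the caller owns a specialized vector and the callee stores into it, so no double-float result ever needs to be boxed on return.

```lisp
;; Hypothetical illustration of the suggestion: write results into a
;; caller-supplied (simple-array double-float (*)) instead of
;; returning a (boxed) double-float.
(defun step-into (out i x)
  (declare (type (simple-array double-float (*)) out)
           (type fixnum i) (type double-float x)
           (optimize (speed 3)))
  (setf (aref out i) (* x x))
  (values))                       ; return nothing worth boxing

(defun run-steps (n)
  (declare (type fixnum n))
  (let ((out (make-array n :element-type 'double-float)))
    (dotimes (i n out)
      (step-into out i (float i 1d0)))))
```

Adding a DYNAMIC-EXTENT declaration for OUT in the caller is what makes the "possibly on stack" variant possible, when the vector doesn't escape.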
11:26:20
lukego_
I'll push those onto the list of ideas to consider for the next rewrite. This is kind of prototype code more for teasing out ideas than really going to town on.
11:27:16
|3b|
if the function isn't small enough to inline, it probably isn't fast enough to care about call overhead (including boxing return) :)
11:29:33
lukego_
Question on my mind is if/when/how SBCL is able to inline FUNCALLs to objects that a function received as arguments? like (defun foo (inline-me) (funcall inline-me ...)). My working assumption is that it would work in e.g. (flet ((inline-me (...) ...)) (declare (inline foo)) (foo #'inline-me))
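Spelled out as compilable code, the experiment looks something like this (function names taken from the message; the bodies are placeholders):

```lisp
;; FOO funcalls a functional argument. The hope: inlining FOO into
;; CALLER turns the FUNCALL into a local call to INLINE-ME, which
;; SBCL can then inline too. Whether it actually does is the open
;; question here.
(declaim (inline foo))
(defun foo (f x)
  (funcall (the function f) x))

(defun caller (x)
  (declare (type double-float x) (optimize (speed 3)))
  (flet ((inline-me (y)
           (declare (type double-float y))
           (* y y)))
    (foo #'inline-me x)))
```

(disassemble 'caller) is the quickest way to see whether the closure allocation and indirect call survived.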
11:30:10
lukego_
Exploring that question might be a good excuse to poke my nose into how SBCL works.
11:30:55
|3b|
ACTION wouldn't be surprised if it did, if foo was inlined in the flet, but also wouldn't be surprised if it didn't
11:35:28
lukego_
Or easy to surprise, if you measure surprise in entropy, and more particularly in expected entropy, since then you are expecting the maximum 1 bit of surprise from that 1 bit answer to that question
11:36:04
|3b|
nah, just that particular idea triggers enough 'yes' neurons and 'no' neurons to think both answers are plausible, but only peripherally, so can't say anything definite
11:37:55
lukego_
yeah the role of block compilation is another interesting factor. I had everything wrapped in a SERAPEUM:BLOCK-COMPILE but took that out partly because it confuses emacs about what a defun/top-level-form is
11:38:32
lukego_
Could be that I should just make this a macro that expands into a Fortran AST and then pretty print and compile that. but not entertaining such radical ideas yet.
11:40:48
lukego_
Could also be that I'm doing everything wrong and e.g. my OPTIMIZE DEBUG setting is completely inhibiting inlining or something.
11:41:49
lukego_
|3b|: here's the macroexpansion of LINE: https://gist.github.com/lukego/4746bcc8cf5ec9409411e677808e9d0e
11:42:09
|3b|
actually, i guess one other high-level comment, since i haven't seen it mentioned yet: if you haven't profiled it, you haven't even /started/ optimizing :p
11:43:00
lukego_
indeed. I have profiled it but not especially thoroughly. didn't get rid of all the generic arithmetic yet for example.
11:44:34
lukego_
but yeah, goal here is more to feed my mental model of sbcl compilation, less to really squeeze this code
11:48:30
lukego_
"you declared this inline but I already compiled five calls to it earlier in the file"
11:50:35
lukego_
This is also an embarrassingly parallel problem and it's not yet clear to what extent I should exploit that with threads / ILP / SIMD / CUDA / ...
11:50:36
|3b|
ACTION sees GAUSSIAN-LOG-LIKELIHOOD declared inline locally, before it is defined and with no declaim
11:51:18
|3b|
if you need doubles, then the extent you should use CUDA probably depends on the extent you have someone paying for 'pro' gpus :/
11:52:16
lukego_
oh yeah that's another big question: do I need doubles or not? I'm not sure, I'm using them now because it seems like they are basically free on CPUs w/o SIMD.
11:52:55
lukego_
ordinarily I'd run screaming from single-floats but maybe when working in the log domain it's not such a biggie, haven't thought that through
11:53:46
lukego_
Just idly googling for papers where people GPU-optimized code of this nature, i.e. particle filters, it seemed like they were often happy with one order of magnitude speedup which wouldn't be worth the trouble from my perspective
11:57:57
moon-child
I think shader compilers like to fuck up float math (so eg (x + y) - y might turn into plain x), and sometimes they don't have a fused multiply-add but just a separate multiply followed by an add (rounding twice)
11:58:59
lukego_
anecdotally: it seems like SBCL /can/ do that INLINE-ME optimization mentioned above. in my simple test it only happened when I globally made the callee inline with DECLAIM and not just locally at the callsite with DECLARE.
11:59:25
moon-child
right but the point is you need it to be faithfully rounded to implement the double-single properly
12:00:17
moon-child
stuff like addition gets the last bit wrong, ok--whatever. But replacing fma with multiply+add means you only get half your bits for the double-single multiply
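moon-child's point, made concrete: double-single arithmetic represents a value as an unevaluated sum hi + err of two floats, and its TWO-PRODUCT step needs the *exact* rounding error of a multiply. With FMA that error is just fma(a, b, -hi); a separate multiply+add rounds twice and loses the low half. Standard CL has no FMA operator, so this sketch recovers the exact error through rational arithmetic purely to show what the operation must produce:

```lisp
;; Error-free product: returns HI and ERR such that a*b = hi + err
;; exactly (valid in round-to-nearest, barring over/underflow).
;; Hardware FMA computes ERR as fma(a, b, -hi) in one instruction.
(defun two-product (a b)
  (declare (type single-float a b))
  (let* ((hi (* a b))                               ; rounded product
         (err (float (- (* (rational a) (rational b)) ; exact residual
                        (rational hi))
                     1f0)))
    (values hi err)))
```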
12:00:19
|3b|
and the only time you need a local declaration is when you also did declaim notinline after the function definition (or declared it locally notinline in an enclosing lexical scope)