freenode/#clasp - IRC Chatlog
Search
10:38:33
drmeister
They probably have some kind of microcode - but PTX is the documented, best low-level target.
10:43:41
heisig
cl-cuda generates CUDA C code and calls the NVIDIA compiler on it. It has nice CFFI wrappers for it, too.
10:46:55
heisig
That would be a huge improvement over the status quo. The question is how far you want to go. Should GPUs be able to signal conditions to the host? Should GPUs be able to run generic functions?
10:49:38
heisig
GPUs are not particularly good at running general-purpose code. A technique I have seen is to have a Lisp interpreter on the GPU, as a fallback for the tricky stuff.
10:52:22
heisig
On a CPU, you typically have a cache coherence protocol that ensures some order of reads and writes. GPUs are much more liberal when it comes to that.
10:54:29
drmeister
I'm going to reduce the problem of molecular design to some table look ups and matrix multiplications and distance calculations.
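[The reduction described here can be sketched in plain Python: pairwise squared distances decompose via the identity ||a − b||² = ||a||² + ||b||² − 2(a·b), so the distance table becomes one matrix multiplication plus lookups of precomputed norms - exactly the operations GPUs are optimized for. A minimal, pure-Python illustration (function names are made up for the sketch; on a GPU the matmul would be the hot kernel):]

```python
# Sketch: a pairwise squared-distance table via the identity
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b),
# which turns distance calculation into one matrix multiplication.
# Pure-Python stand-in for what a GPU matmul kernel would do.

def matmul(A, B):
    """Multiply an (n x d) matrix by a (d x m) matrix (lists of rows)."""
    n, d, m = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(d)) for j in range(m)]
            for i in range(n)]

def pairwise_sq_dists(points):
    """Squared-distance table for a list of d-dimensional points."""
    sq_norms = [sum(x * x for x in p) for p in points]   # table lookup part
    transposed = [list(col) for col in zip(*points)]
    dots = matmul(points, transposed)                    # Gram matrix (matmul part)
    n = len(points)
    return [[sq_norms[i] + sq_norms[j] - 2.0 * dots[i][j]
             for j in range(n)]
            for i in range(n)]

pts = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]
table = pairwise_sq_dists(pts)
# table[0][1] == 25.0, i.e. distance 5 squared
```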
10:55:04
drmeister
Then I'm going to generate custom kernels on the fly to search for solutions to specific problems.
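[Generating kernels on the fly typically means splicing a problem-specific expression into a kernel source template and handing the result to a compiler such as nvcc or nvrtc - this is essentially what cl-cuda does by emitting CUDA C. A hedged sketch of the source-generation step only (the template, function names, and expression are illustrative, not a real Clasp or cl-cuda API; the compilation step is omitted):]

```python
# Sketch of on-the-fly kernel generation: specialize a CUDA C template
# with a per-problem scoring expression, producing source that could
# then be compiled with nvcc/nvrtc. Names are hypothetical.

KERNEL_TEMPLATE = """\
extern "C" __global__
void {name}(const float *in, float *out, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {{
        float x = in[i];
        out[i] = {expression};
    }}
}}
"""

def make_kernel_source(name, expression):
    """Return CUDA C source specialized for one search problem."""
    return KERNEL_TEMPLATE.format(name=name, expression=expression)

src = make_kernel_source("score", "x * x + 2.0f * x")
# src now holds compilable CUDA C text for this specific expression.
```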
10:56:02
heisig
For scientific computing, it is probably sufficient if you have a function CLASP-CUDA:COMPILE that signals an error if the given lambda expression contains anything other than arithmetic functions on floats or fixnums, IF, and TAGBODY.
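[The restricted-subset idea above can be illustrated with a small whitelist check over an expression tree: accept a lambda only if its body is built from arithmetic, comparisons, and conditionals, and signal an error before any GPU code generation otherwise. A Python analogue using the standard `ast` module (CLASP-CUDA:COMPILE itself is a proposal, not an existing function; this only demonstrates the validation pattern):]

```python
# Python analogue of the proposed restriction: reject any lambda whose
# body uses anything but arithmetic, comparisons, and conditionals.
# Illustrates the whitelist idea only; not a real Clasp API.
import ast

ALLOWED = (ast.Expression, ast.Lambda, ast.arguments, ast.arg,
           ast.Name, ast.Load, ast.Constant,
           ast.BinOp, ast.UnaryOp, ast.IfExp, ast.Compare,
           ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
           ast.Lt, ast.Gt, ast.LtE, ast.GtE, ast.Eq)

def check_compilable(source):
    """Raise ValueError if the lambda uses a non-whitelisted form."""
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise ValueError(f"not GPU-compilable: {type(node).__name__}")
    return True

check_compilable("lambda x, y: x * x + y if x > 0 else -y")  # accepted
try:
    check_compilable("lambda x: print(x)")   # function calls are rejected
    rejected = ""
except ValueError as e:
    rejected = str(e)
```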
10:58:00
heisig
Other projects that might be relevant for such an undertaking: https://github.com/cbaggers/varjo and https://github.com/digego/extempore.
10:58:23
drmeister
Sure. With Cleavir generic functions it will be very straightforward to write some additional methods and generate code for a new backend.
10:59:19
heisig
The former project compiles a subset of Lisp to OpenGL shaders, the latter is (among other things) a high-performance compiler for audio and physics processing.
11:13:24
heisig
If you have the time and resources and a straightforward problem, ASICs or FPGAs are fastest, but typically no one has that much time and resources.
11:14:05
heisig
GPUs are pretty damn fast, but at the same power budget one should not compare a GPU to a single CPU, but to two 18-core CPUs.
11:23:03
drmeister
Society is going to need really smart molecules to solve its problems in the coming decades. I think the resources will be available.
11:24:20
Shinmera
GPUs are good at doing really simple, very low-branching arithmetic in massively parallel fashion
11:27:48
heisig
My colleague (who works on molecular dynamics and with whom I share my office) also recommends GPUs :)
11:35:27
drmeister
Oh - I'm not delusional - I'm not selecting between GPUs, FPGAs, or ASICs like I have a choice right now.
11:36:07
drmeister
I'm talking at NVIDIA on Friday - I want to get the lay of the high-performance landscape.
11:37:46
drmeister
Cando has multi-threading - so I can use 18-core CPUs no problem. Clasp's arithmetic isn't so good right now - but I can write custom math (and I do a lot of this) in C++ and run it from Common Lisp.
11:41:27
drmeister
This guy made a lot of money and then had custom ASICs built to simulate molecular dynamics.
11:42:46
drmeister
The value of this has been dubious. GPUs are the current sweet spot for performance/effort.
11:45:05
heisig
Yes, I think you are right. The only problem I see in the long run is that GPU currently means CUDA, which in turn is a proprietary toolchain that puts you at the whim of a single company.
11:57:10
drmeister
And there are no good comparisons between OpenCL on other GPUs vs. CUDA on NVIDIA GPUs, because that's a really, really hard thing to compare.
12:14:20
Shinmera
OpenCL has the advantage that you can run it practically everywhere, even on laptops where you typically only have Intel CPUs.