freenode/#sicl - IRC Chatlog
0:33:50
fiddlerwoaroof
beach, scymtym: on #slime, there's some talk about using eclector as a backend for some slime features
0:47:07
luis
At this point, I'm wondering if it makes sense to grab user-defined reader macros from cl:*readtable* into the Eclector readtable to be able to play with code that uses such reader macros or if there's a better strategy. (E.g., do it the other way around and inject bits of Eclector into cl:*readtable*)
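One half of luis's idea (grabbing user-defined reader macros out of `cl:*readtable*`) can be sketched with nothing but the standard readtable API. The `copy-user-macro-characters` helper and the `[ ... ]` macro below are hypothetical illustrations, not Eclector code; Eclector's own readtable protocol mirrors these standard functions, so the same approach should carry over to an Eclector readtable.

```lisp
;; Sketch: copy user-defined reader macros from one readtable to another
;; using only the standard CL readtable API. Hypothetical helper, assuming
;; you know (or guess) which characters carry user macros.
(defun copy-user-macro-characters (from to chars)
  "Copy the macro functions bound to CHARS in readtable FROM into TO."
  (dolist (char chars to)
    (multiple-value-bind (function non-terminating-p)
        (get-macro-character char from)
      (when function
        (set-macro-character char function non-terminating-p to)))))

;; Example: a user-defined [ ... ] reader macro that reads a list.
(let* ((source (copy-readtable nil))
       (target (copy-readtable nil)))
  (set-macro-character #\[ (lambda (stream char)
                             (declare (ignore char))
                             (read-delimited-list #\] stream t))
                       nil source)
  (set-macro-character #\] (get-macro-character #\)) nil source)
  (copy-user-macro-characters source target '(#\[ #\]))
  (let ((*readtable* target))
    (read-from-string "[1 2 3]")))   ; => (1 2 3)
```

The catch, as the later discussion notes, is the other direction: a copied macro function will call `cl:read` recursively rather than Eclector's reader, so delegation needs more care than a plain copy.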
0:50:09
luis
Also, is there a code walker that takes CSTs as input? It doesn't seem too hard to adapt an existing code walker to use CSTs, but I guess I'm fishing for code walker recommendations as well.
3:46:40
beach
jcowan: I removed any mention of 32-bit platforms for floats, and I removed the suggested non-IEEE formats. I have yet to decide about strings.
3:49:07
beach
luis: The "code walker" I am thinking of for Second Climacs is Cleavir. It takes a CST and converts it to an AST, using a first-class global environment. I am thinking of making an incremental implementation of the first-class global environment protocol so that it would be possible to restart the parser after any top-level form in a file/buffer.
4:50:28
jcowan
beach: you might also want to consider providing IEEE 16-bit floats as short-floats, as there is now some hardware support for them on x86_64 (specifically an instruction that properly rounds a 16-bit float to a 32-bit float). Every float16 value has an exact float32 counterpart.
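jcowan's exactness claim is easy to check by hand: a half-float has a 10-bit significand and exponent range well inside single-float's, so decoding never loses bits. The decoder below is a sketch for normals and subnormals only (Inf/NaN left out); `half-to-single` is a hypothetical name, and the hardware support jcowan mentions is presumably the F16C `VCVTPH2PS`/`VCVTPS2PH` instruction pair on x86_64.

```lisp
(defun half-to-single (bits)
  "Decode a 16-bit IEEE-754 half-float bit pattern into a SINGLE-FLOAT.
Every finite half value is exactly representable in single precision,
which is why the 16 -> 32 direction involves no rounding at all."
  (let* ((sign (ldb (byte 1 15) bits))
         (e    (ldb (byte 5 10) bits))
         (frac (ldb (byte 10 0) bits))
         (magnitude
           (cond ((= e 31)
                  (error "Inf/NaN left out of this sketch"))
                 ((= e 0)              ; zero and subnormals: frac * 2^-24
                  (* (float frac 1f0) (expt 2f0 -24)))
                 (t                    ; normals: (1 + frac/1024) * 2^(e-15)
                  (* (+ 1f0 (/ (float frac 1f0) 1024f0))
                     (expt 2f0 (- e 15)))))))
    (if (= sign 1) (- magnitude) magnitude)))

(half-to-single #x3C00)   ; => 1.0
(half-to-single #xC000)   ; => -2.0
```

Since `frac` is at most 1023 and the exponent range is -24..16, every intermediate here is exact in single precision, confirming the claim.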
4:59:41
no-defun-allowed
yeah, there's a lot of research into how many bits evaluation and training of neural networks requires
5:00:22
|3b|
interesting light values range from candle to direct sun, but don't need much precision for the direct sun case
5:01:32
no-defun-allowed
i don't believe there's hardware support except for custom logic, but 8 bits is the hip thing in NNs now
5:02:54
beach
Y'all seem to know a lot. Do you happen to know whether there is an x86 instruction for multiplying two complex double floats?
5:04:15
no-defun-allowed
https://stackoverflow.com/questions/10329903/efficient-complex-arithmetic-in-x86-assembly
5:04:50
jcowan
this page looks relevant: https://stackoverflow.com/questions/10329903/efficient-complex-arithmetic-in-x86-assembly
5:05:00
no-defun-allowed
admittedly, i don't know much complex arithmetic besides the output of bordeaux-fft so i can't really comment on that
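For reference, the operation beach is asking about is just the textbook expansion (a+bi)(c+di) = (ac-bd) + (ad+bc)i, i.e. four multiplies and two adds; the Stack Overflow answers linked above map that onto SSE3-style shuffle/add-subtract sequences rather than a single instruction. Common Lisp has native complex floats, so the expansion can be written out and checked against the built-in `*` (the `complex-mul` name is just for illustration):

```lisp
;; (a+bi)(c+di) = (ac - bd) + (ad + bc)i, spelled out by hand.
;; This is the scalar form of what an SSE sequence computes in parallel.
(defun complex-mul (x y)
  "Multiply two complex doubles via the textbook four-multiply formula."
  (let ((a (realpart x)) (b (imagpart x))
        (c (realpart y)) (d (imagpart y)))
    (complex (- (* a c) (* b d))
             (+ (* a d) (* b c)))))

(complex-mul #c(1d0 2d0) #c(3d0 4d0))   ; => #C(-5.0d0 10.0d0)
(* #c(1d0 2d0) #c(3d0 4d0))             ; => #C(-5.0d0 10.0d0)
```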
5:06:38
|3b|
though possibly not worth it for doubles depending on brand and price of the GPU, since that is a 'pro' feature on some brands :(
5:09:19
|3b|
oddly 16-bit float is a 'pro' feature too, since NNs like them, so might not be faster than single :/
5:11:38
|3b|
yeah, if latency isn't an issue, you can probably do a lot of sound processing on a GPU :)
5:13:22
beach
I am not planning to use this information any time soon, but it changes how I think about some of my very low-priority projects.
5:14:34
|3b|
GPU are optimized for working on large chunks of data at once, so you tend to want to work on longer segments, and you also have latency for transferring to/from GPU memory, and for going through the API
5:14:42
beach
Another thing that has changed the game is multi-core processors. Producing sound is a highly parallel procedure.
5:15:11
|3b|
yeah, CPU are pretty decent at that sort of thing too, especially if you can use their SIMD features
5:18:22
|3b|
(and just to be clear, when i mentioned latency i mean from starting processing to getting results, so delay when processing a live stream or playing live... once you start though, you should be able to produce a continuous stream without gaps assuming it runs in realtime to start with)
5:20:36
|3b|
so good for offline generation/processing of sound, or realtime generation/processing if an initial delay is OK
5:21:16
|3b|
for other cases, might be OK, but i'd suggest to do some tests before investing a bunch of effort into it :)
5:22:38
|3b|
(graphics, particularly VR, is doing latencies in the 10s of ms range or less, but doesn't have to copy back to CPU to send to another API for output, or at worst goes through some optimized path in drivers)
5:24:44
|3b|
yeah, probably could get into the few ms range, but i'm not sure exactly without actually trying it
5:25:33
beach
Like I said, I am not going to do anything about this soon, but I'll keep these things in mind when contemplating future work.
5:25:37
|3b|
and also depends on the amount of work, have to be doing a lot of work per sample for a few ms of sound to be a good fit for GPU
9:10:53
beach
So from looking around a bit, I see that a context switch for a traditional kernel-based operating system takes around a μs (order of magnitude).
9:13:23
beach
But recent Linux definitely can't do it without special kernel modules and other settings.
9:15:02
no-defun-allowed
I don't think Linux will immediately switch from the sound emitter to the mixer to the driver though.
9:16:08
no-defun-allowed
If a kernel could trace message passes like that and try to schedule the involved programs sequentially, it might have more luck though.
9:17:08
no-defun-allowed
I think so, because the context switch might not immediately run the message recipient.
9:18:20
no-defun-allowed
So, prioritising messages sent to mixers (which PulseAudio does, it has a nice level of -11) and/or jumping immediately into running the recipient process could lower latency.
9:18:57
no-defun-allowed
I imagine the second option might decrease throughput since the sender may have more to say, though.
9:20:59
beach
I am not sure that in a system like CLOSOS sender and receiver would have to be in different processes, nor even in different threads.
9:21:48
no-defun-allowed
No, I don't think that would be needed either -- except for mixing sources, yes.
9:22:39
no-defun-allowed
But the mixer->driver path could be eliminated, and the source->mixer path could be handled specially by the system hypothetically.
9:23:25
beach
... the scheduler in a traditional OS like Linux would need to give each process hundreds of μs in order to minimize overhead due to context switches.
9:25:57
beach
So that kind of time slice already has a negative impact on sound synthesis. If there are several ready processes, then we are talking milliseconds, and that is dangerously near the tolerance level for sound.
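beach's millisecond worry can be checked with quick arithmetic: a time slice of a given length corresponds to a fixed number of audio samples that must already be buffered before the producing thread runs again. The `samples-per-slice` helper below is a hypothetical illustration, assuming a 48 kHz output rate:

```lisp
;; How many output samples elapse during one scheduler time slice?
;; A sound-producing thread must have at least this many samples queued
;; before it is preempted, or the output underruns.
(defun samples-per-slice (slice-seconds &optional (rate 48000))
  "Number of samples at RATE covered by a time slice of SLICE-SECONDS."
  (* slice-seconds rate))

(samples-per-slice 1/1000)      ; 1 ms slice  => 48 samples at 48 kHz
(samples-per-slice 100/1000000) ; 100 μs slice => 24/5 samples
```

So a handful of millisecond-long slices ahead of a thread in the ready queue already costs hundreds of samples of buffering, which is why shorter slices (as beach suggests next) directly lower the achievable latency.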
9:27:18
beach
So each "process" (or rather "thread" in CLOSOS) could be given shorter time slices, thereby lowering response time.
9:27:56
no-defun-allowed
Yes, you could easily get more switches in if only normal registers have to be replaced.
9:28:23
beach
Therefore, even with a dumb scheduler, if there are few enough ready threads, there would be no problem with delays for sound synthesis.
9:29:23
|3b|
i think there are also issues with drivers blocking things in kernel for longer than you might want
9:29:43
no-defun-allowed
I remember there being a comparison of the speeds of various parts of context switches on osdev.org, but I can't find it now.
9:30:24
beach
And do we have reasons to believe that this problem would be less of one in a system like CLOSOS?
9:31:54
|3b|
i think some of it is being fixed already, so it wouldn't have to apply to closos, and some more is due to supporting lots of old/flaky hardware
9:34:53
beach
In half an hour, my lunch guests will arrive and I have to cook for them. So unfortunately, I need to suspend this discussion for several hours. :( Interesting stuff though.
10:58:41
luis
beach: right, Cleavir might be too heavy-handed at this point. I'm not looking to have all editing commands operate on the CST/AST or anything like that. But definitely something to consider in the future. Shinmera's Staple seems promising.
11:04:24
Shinmera
Or rather: I leave it up to the user to install the appropriate extensions to Staple to make it work correctly.
11:05:41
Shinmera
Using an Eclector readtable that delegates to CL's current one on unknown reader macros could maybe, perhaps, work somewhat but I can see lots of ways for that to go wrong too.