libera/#sbcl - IRC Chatlog

21:08:06 stassats how come non-tail calls are significantly slower

21:12:18 stassats i barely see any difference between not calling anything and tail calling on M1, but a normal call is really slow

21:12:33 stassats similar things on x86-64, although i do see a difference between not calling and tail-call

21:12:54 stassats so, what prediction hardware are we defeating

21:14:12 edgar-rft ** NICK all

21:14:21 all ** NICK Guest5617

21:14:46 Guest5617 ** NICK edgar-rft

21:21:55 stassats i guess i need to check what clang does before chasing geese

22:26:27 stassats having STR CFP, [CSP]; LDR CFP, [CSP] seems to slow things down

22:36:14 john-a-carroll stassats: thanks for looking at function calls on M1! A little while ago I tried the tak benchmark and was surprised that M1 was slower than x86-64. (However I thought nothing further of it when I found that the M1 version of my system ran so much faster overall)

22:36:42 stassats john-a-carroll: how little?

22:36:58 stassats because m1 function calls are faster for me

22:38:11 stassats anyway, the thing i'm seeing is equally slow on x86-64 and arm64

22:40:03 stassats maybe LDR CFP, [CSP]; LDR CFP, [CFP] is the slow bit, it can't prefetch

22:40:26 stassats although x86-64 uses the same calling convention as C, surely it should be able to

22:40:31 stassats or it's not really the same then

22:43:42 john-a-carroll this was in one of the first m1 releases: 2.1.5 comparing the m1 native version to x86_64 under Rosetta 2. (time (dotimes (n 10000) (tak 18 12 6))) -> 2.4 secs in emulated x86_64, 4.7 seconds in native m1
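
For readers who do not know the benchmark quoted above: tak is the Takeuchi function, which does almost nothing except call itself, so its run time is dominated by call and return overhead. The following is a minimal C sketch of it, useful mainly for comparing against clang output; it is not the Lisp code that was timed in the channel. The name `run_tak` and the use of `volatile` arguments are assumptions made for illustration.

```c
#include <stdint.h>

/* Takeuchi function: the three inner calls are ordinary (non-tail)
   calls, and only the outer call is in tail position. */
int tak(int x, int y, int z) {
    if (!(y < x))
        return z;
    return tak(tak(x - 1, y, z),
               tak(y - 1, z, x),
               tak(z - 1, x, y));
}

/* Hypothetical driver: evaluates tak(18, 12, 6) `reps` times and sums
   the results. The arguments are read from volatile variables so the
   compiler cannot evaluate the call once and reuse the value. */
int64_t run_tak(int reps) {
    volatile int x = 18, y = 12, z = 6;
    int64_t sum = 0;
    for (int i = 0; i < reps; i++)
        sum += tak(x, y, z);
    return sum;
}
```

Calling `run_tak(10000)` mirrors the `(dotimes (n 10000) (tak 18 12 6))` loop quoted above, but the timings are not expected to match the SBCL figures.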

22:43:57 stassats oh yeah, that's too old

22:44:29 john-a-carroll Ah, OK

22:47:21 stassats currently, rosetta 7.835, m1 5.588

22:51:59 stassats using a different stack is probably not going to help the hardware

22:52:25 stassats i need to concoct a test that uses the C stack
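
A test of the kind described here could look roughly like the sketch below. It is an assumption about the shape of such a test, not the one stassats wrote: every function name is hypothetical, and `__attribute__((noinline))` is a GCC/Clang extension used to force a real call and return on the C stack.

```c
#include <stdint.h>
#include <time.h>

/* Forced out of line, so every use is a real call and return. */
__attribute__((noinline)) static uint64_t step_called(uint64_t x) {
    return x * 3 + 1;
}

/* Same arithmetic, left for the compiler to inline. */
static inline uint64_t step_inlined(uint64_t x) {
    return x * 3 + 1;
}

/* `iters` dependent steps, each one going through a call. */
uint64_t run_called(uint64_t iters) {
    uint64_t acc = 1;
    for (uint64_t i = 0; i < iters; i++)
        acc = step_called(acc);
    return acc;
}

/* The same chain of steps with no call in the loop body. */
uint64_t run_inlined(uint64_t iters) {
    uint64_t acc = 1;
    for (uint64_t i = 0; i < iters; i++)
        acc = step_inlined(acc);
    return acc;
}

/* CPU seconds spent in fn(iters). The result is written through
   `out` so the measured loop cannot be discarded as dead code. */
double time_run(uint64_t (*fn)(uint64_t), uint64_t iters, uint64_t *out) {
    clock_t start = clock();
    *out = fn(iters);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing `time_run(run_called, n, &r)` with `time_run(run_inlined, n, &r)` for a large `n` gives a rough per-call cost on the C stack. It is worth checking the disassembly first to confirm that the call really is present in one loop and absent in the other.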

22:53:23 john-a-carroll looks good. I'm still with 2.1.5, and for my system's main benchmark the figures are rosetta 16m46s, m1 15m24s

23:03:18 aeth_ ** NICK aeth

1:10:48 stassats loading and storing the same register on the stack in quick succession doesn't seem to be great

1:11:17 stassats i guess there's no way around that, except for making leaf functions not touch the stack

1:12:08 stassats and supposedly in normal code there's something between function calls and returns

1:14:12 stassats so even spilling an iteration variable onto the stack isn't great

1:19:47 stassats if i insert five division instructions between str/ldr, then they do not matter

1:19:57 stassats so, small functions are bad

1:20:15 stassats they -- ldr/str
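
The store/reload observation can be approximated in C with the sketch below. It rests on several assumptions: a `volatile` local stands in for the stack slot, the function names are hypothetical, and the divisions are written between the store and the load in source order only. The compiler is free to schedule them differently, so the generated assembly should be inspected before drawing conclusions.

```c
#include <stdint.h>

/* Baseline: the accumulator can stay in a register. */
uint64_t loop_register(uint64_t iters) {
    uint64_t acc = 1;
    for (uint64_t i = 0; i < iters; i++)
        acc = acc * 3 + 1;
    return acc;
}

/* The accumulator is stored to a stack slot and reloaded at once;
   volatile forces the compiler to emit both accesses. */
uint64_t loop_spill(uint64_t iters) {
    volatile uint64_t slot;
    uint64_t acc = 1;
    for (uint64_t i = 0; i < iters; i++) {
        slot = acc;          /* store */
        acc = slot * 3 + 1;  /* reload of the same slot */
    }
    return acc;
}

/* Five dependent divisions per iteration, no spill. d must be nonzero. */
uint64_t loop_divs(uint64_t iters, uint64_t d) {
    uint64_t acc = 1;
    for (uint64_t i = 0; i < iters; i++) {
        uint64_t q = acc;
        q /= d; q /= d; q /= d; q /= d; q /= d;
        acc = acc * 3 + 1 + q;
    }
    return acc;
}

/* The same divisions, written between the store and the reload.
   d must be nonzero. */
uint64_t loop_spill_divs(uint64_t iters, uint64_t d) {
    volatile uint64_t slot;
    uint64_t acc = 1;
    for (uint64_t i = 0; i < iters; i++) {
        slot = acc;          /* store */
        uint64_t q = acc;
        q /= d; q /= d; q /= d; q /= d; q /= d;
        acc = slot * 3 + 1 + q;  /* reload after the divisions */
    }
    return acc;
}
```

If the behaviour described above carries over, `loop_spill` should be measurably slower than `loop_register`, while `loop_spill_divs` and `loop_divs` should take about the same time. The divisor is passed at run time so that the compiler cannot replace the divisions with cheaper instructions.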

1:25:15 stassats and it means it's hard to measure performance between different routines in a loop

1:26:35 stassats i mean, with different out of order paths it's always difficult, but here the call/return/save the iteration variable just dominate the computation being measured