Friday, 9th of November 2018, 21:39:37 UTC
21:41:55
stassats
ok, sigill does land in ldb
21:42:41
stassats
so, a sigill in a foreign thread would not be caught
21:43:26
stassats
i only have 1 core in my vm, let's increase that
21:46:29
asarch
ACTION whispers: "Damn Hotmail!"
21:47:15
stassats
no failure with two cores, but it appears to be even slower
21:52:27
stassats
but openbsd still thinks there's one core
21:52:57
joshe
I see what you mean about being used to the old host-2 output
21:53:07
joshe
I keep thinking it's about to dump the cold core
21:53:41
stassats
when you stare at the same thing for over a decade you get used to it
22:01:41
asarch
Gotcha!: "Blocked for security reasons!"
22:02:21
stassats
can't be sending illegal instructions
22:07:15
joshe
I wonder if the kernel would send SIGILL for other reasons
22:08:04
joshe
oh, I think that's how something with an invalid stack pointer is killed
22:08:35
stassats
why is it surfacing with multiple threads only?
22:08:36
joshe
outside of a region with a special mmap flag
22:08:49
stassats
why doesn't it happen to me?
22:09:46
stassats
do i need an smp kernel to get multiple cores or something?
22:10:37
stassats
i had only one core during installation
22:10:49
asarch_
This is the link from Dropbox: https://www.dropbox.com/s/wcwohmr0km2sdpv/debug.tar.gz?dl=0
22:11:00
joshe
oh, yes if you added cores then you need to change kernels
22:11:32
joshe
mv /bsd /bsd.sp && mv /bsd.mp /bsd
22:12:17
joshe
which is all the installer would do on the next upgrade anyway
22:12:44
stassats
ok, now i need to find bsd.mp
22:12:57
joshe
oh right, it wasn't installed
22:13:35
stassats
i can just download it
22:15:16
joshe
ftp http://cdn.openbsd.org/pub/OpenBSD/$(uname -r)/$(machine)/{bsd.mp,SHA256.sig}
22:15:18
joshe
signify -C -p /etc/signify/openbsd-$(uname -r | tr -d .)-base.pub -x SHA256.sig bsd.mp
22:22:09
joshe
if I'm reading this right, you get killed with an uncatchable SIGILL if the kernel can't write to your stack while delivering a signal
22:22:54
joshe
so where'd the bad RSP value come from?
22:23:00
stassats
that shouldn't be the case, it can write to the stack
22:24:03
joshe
gdb shows no memory mapped where RSP points in the core dump I have
22:25:07
stassats
1 is the main thread, we move the stack without informing the os
22:25:10
joshe
also, you can pkg_add gdb as root to get a less-old gdb installed as 'egdb'
22:31:54
stassats
and it's non deterministic
22:38:55
stassats
well, $rsp is kinda not where the control stack is supposed to be, but it may be just sigaltstack
22:41:16
stassats
but it has no trouble receiving signals
22:45:11
stassats
well, if $rsp is not even dumped
22:45:41
stassats
i actually see some output in the console
22:46:20
stassats
trap [sbcl]47135/509372 type 6: sp 220f3fdf8 not inside 220d50000-220f40000 |
22:46:20
stassats
trap [sbcl]39758/313918 type 6: sp 244f77178 not inside 244d88000-244f78000
22:49:38
stassats
(< #X244d88000 #x244f77178 #x244f78000) => T
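[Editor's note: the Lisp form above can be reproduced as a small range check. This is a minimal sketch using the addresses from the second console message. The helper names `sp_inside` and `bytes_below_top` are hypothetical, and the strict comparison mirrors the Lisp `<` form rather than whatever bounds test the kernel actually applies.]

```python
def sp_inside(start: int, sp: int, end: int) -> bool:
    """Strict range test, equivalent to the Lisp form (< start sp end)."""
    return start < sp < end


def bytes_below_top(sp: int, end: int) -> int:
    """Distance in bytes from the stack pointer to the upper bound of the range."""
    return end - sp


# Values copied from the second kernel console message in the log.
stack_start = 0x244D88000
stack_ptr = 0x244F77178
stack_end = 0x244F78000

print(sp_inside(stack_start, stack_ptr, stack_end))   # True
print(bytes_below_top(stack_ptr, stack_end))          # 3720
```

[The reported sp lies inside the range the kernel printed, 3720 bytes below its upper bound, so the "not inside" message cannot be explained by these numbers alone.]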
22:55:31
asarch
Sorry, sorry. dhclient went crazy
22:56:19
stassats
that is pretty close to the end, but still, what is it talking about?
22:57:02
stassats
asarch: no need anymore
22:57:57
asarch
Can you fix the problem? :-)
22:59:37
stassats
i already pry into the internals of the thread struct on darwin
22:59:46
joshe
anyway, it's just a test failure
22:59:52
joshe
go ahead and install and use it
23:00:02
stassats
to set the stack boundaries, maybe openbsd needs that as well
23:00:34
stassats
even if MAP_STACK is used, the check appears to be expensive if it's out of specified bounds
23:01:57
stassats
i probably don't have access to p_spstart
23:02:17
stassats
i would assume it's in the kernel
23:04:11
stassats
oh, it updates p->p_spstart
23:04:18
stassats
uvm_map_check_stack_range
23:04:25
stassats
so, it shouldn't be always expensive
23:05:22
stassats
now, it only happens on dualcore, can it be that it's accessing p->p_spstart while it's being updated?
23:06:28
stassats
that would explain (< #X244d88000 #x244f77178 #x244f78000) => T
23:07:10
joshe
I'd guess it'll be covered by the big kernel lock or a process subsystem-wide lock
23:10:34
stassats
and the kernel can't be interrupted, can it?
23:11:40
stassats
well, the process has been running for some time, the main thread p_spstart should have settled down
23:12:06
stassats
and new threads are already born with the right stack locations specified
23:13:56
stassats
and it's fine for the main thread, or for "bsd"
23:23:00
stassats
i have no more new clues
23:23:56
joshe
so, is it correct that sbcl will use a lisp stack allocated out of the heap?
23:24:19
stassats
well, you did change it to use MAP_STACK
23:24:24
joshe
so that RSP might point outside the thread struct?
23:24:38
stassats
no, that shouldn't happen
23:39:15
stassats
threads.pure also fails
23:40:30
stassats
it says type 6, which is sigsegv
23:40:53
stassats
but the main thread should have received countless sigsegvs by that point
23:50:33
stassats
https://gist.github.com/stassats/1081ccb72c9468f1f7f4c06698b7fc16
23:52:51
joshe
run-program is a known problem on the openbsd side, the fix is to stop disabling PIE
23:53:12
stassats
well, it's something about the environment
23:53:20
stassats
anyway, i blame the OS
23:54:00
joshe
http://openbsd-archive.7691.n7.nabble.com/Wrong-linkage-to-environ-with-Wl-nopie-td338325.html
23:56:41
stassats
i think i would need to recompile the kernel to get more insight
23:57:06
joshe
I've already added a few printfs but they weren't enlightening
23:57:44
joshe
uvm_map_check_stack_range() is returning false because of the uvm_map_lookup_entry() case
23:57:44
stassats
i wanted to know where exactly in https://github.com/openbsd/src/blob/master/sys/uvm/uvm_map.c#L1774 it returns
23:58:15
joshe
maybe the stack page was unmapped somehow?
23:58:49
stassats
well, for sbcl, not for the os to do
23:59:21
joshe
or it ran off the end of the first MAP_STACK page and into a normal data page, and the kernel only noticed on the next signal delivery?
0:00:44
stassats
you're mapping the whole thing map_stack
0:00:51
joshe
anyway, in that file it's returning on line 1788
0:00:52
stassats
there's nowhere it can go outside of it
0:01:45
stassats
what is p->p_vmspace->vm_map.serial?
0:01:59
stassats
so, i assume it gets there by p->p_vmspace->vm_map.serial != p->p_spserial
0:02:01
joshe
would sbcl mprotect() any part of the stack later?
0:02:18
stassats
the guard pages, but that's only on stack overflow, which is not the case here
0:02:46
joshe
u_int serial; /* signals stack changes */
0:03:16
stassats
altstack, that we do use
0:03:29
stassats
but it's inside the thread struct as well
0:05:22
joshe
it's a serial number which exists to invalidate those p_sp* members
0:05:47
joshe
it's incremented when MAP_STACK is added or removed
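[Editor's note: the invalidation scheme joshe describes can be modelled as a cached range guarded by a serial number. This is a toy sketch built only from the description in this conversation, not the OpenBSD implementation. `ToyMap`, `ToyThread` and `check_stack_range` are hypothetical names, and the half-open bounds are an assumption.]

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyMap:
    """Stand-in for the address space: stack ranges plus a change counter."""
    serial: int = 0
    stacks: List[Tuple[int, int]] = field(default_factory=list)

    def add_stack(self, start: int, end: int) -> None:
        self.stacks.append((start, end))
        self.serial += 1          # a stack mapping was added

    def remove_stack(self, start: int, end: int) -> None:
        self.stacks.remove((start, end))
        self.serial += 1          # a stack mapping was removed

    def lookup(self, sp: int) -> Optional[Tuple[int, int]]:
        for start, end in self.stacks:
            if start <= sp < end:
                return (start, end)
        return None


@dataclass
class ToyThread:
    """Stand-in for the cached p_spstart / p_spend / p_spserial members."""
    sp_start: int = 0
    sp_end: int = 0
    sp_serial: int = -1           # never matches a real serial at first


def check_stack_range(thread: ToyThread, vm: ToyMap, sp: int) -> bool:
    # Fast path: the cache is still valid and sp falls inside the cached range.
    if thread.sp_serial == vm.serial and thread.sp_start <= sp < thread.sp_end:
        return True
    # Slow path: look the address up again and refresh the cache on success.
    found = vm.lookup(sp)
    if found is None:
        return False
    thread.sp_start, thread.sp_end = found
    thread.sp_serial = vm.serial
    return True
```

[In this model any stack mapping added or removed anywhere in the process bumps the serial, so every thread's cached range is rechecked on its next test, and a thread whose stack has been unmapped in the meantime fails the lookup.]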
0:06:09
stassats
new threads are MAP_STACKed
0:06:28
joshe
p_spserial should be per-thread
0:06:29
stassats
is that actually needed?
0:07:08
joshe
there's also some magic to automatically apply MAP_STACK to whatever you pass to sigaltstack()
0:09:15
stassats
i would assume pthread_attr_setstack does the job of map_stack, and map_stack is only needed for the main thread
0:20:58
stassats
if you're saying 1788, then sp is completely unmapped, nevermind MAP_STACK
0:25:37
stassats
can it be caused by thread destruction?
0:26:08
joshe
so I added a printf to sbcl of the initial thread struct address
0:26:09
joshe
pid 36644 initial stack 0x290fbe000-0x2914b00b8
0:26:18
joshe
versus the kernel message:
0:26:19
joshe
trap [sbcl]36644/487632 type 6: sp 2cfa37a78 not inside 2cf848000-2cfa38000
0:26:21
stassats
we schedule thread unmapping when another thread dies, but there's a window when a second thread may have just died but the os is still writing something
0:28:59
stassats
something like 88a92a8129b6559c140e16e8d0e01bb7bba56f6a
0:34:15
stassats
it would be good to know what's the PC when that "type 6" sigsegv is delivered
0:41:53
joshe
it's the same as in the core dump, <futex+10>
0:42:42
stassats
well, it's in the main thread, the main thread shouldn't die, should it?
0:42:55
stassats
why is it starting from 1 and not 0?
0:42:57
joshe
if it's not in the initial thread then it must be in a thread struct which was unmapped after the thread finalized
0:43:44
joshe
I think that thread numbering is gdb's, the OS uses random pid-like IDs
0:44:54
stassats
gdb indeed starts from 1
0:45:13
stassats
so, there's no way thread 1 is being deallocated
0:48:01
stassats
a new message syscall [sbcl]83491/452323 sp 254937cd8 not inside 254748000-254938000
0:49:28
joshe
well I'm getting a trap, presumably the page fault from trying to retq with sp pointing to unmapped memory
0:50:26
joshe
how is that happening though, the sp value it's being killed for isn't even inside the main thread's stack
0:55:41
stassats
removed free_thread_struct(post_mortem);, can no longer crash in info.impure.lisp
1:00:49
stassats
we are calling pthread_join, so it shouldn't be existing anymore at all
1:00:53
stassats
before free_thread_struct
1:01:02
stassats
so, how come and why the main thread is receiving it?
1:07:27
stassats
well, i'm still puzzled and will leave it at that for today
1:12:28
joshe
this isn't happening in the main thread
1:12:50
stassats
well, duh, i never believed that anyway
1:13:23
joshe
the thread didn't make it into the core dump
Saturday, 10th of November 2018, 9:39:37 UTC