freenode/#clasp - IRC Chatlog
Search
12:25:46
frgo
::notify drmeister Re MPS and signals: I think we need to change behavior in file src/gctools/interrupt.cc: ADD_SIGNAL( SIGSEGV, "+SIGSEGV+", ext::_sym_segmentation_violation); - Am I right that this estalishes a handler represented by the symbol ext::_sym_segmentation_violation? If so: when using MPS on Linux, we're not allowed to do that. If not: I'd like to know what this line actually does.
13:05:04
frgo
Because MPS on Linux relies on SIGSEGV being not handled by someone else. It uses SIGSEGV to manage memory.
13:06:44
frgo
Yes, it does. But if you install another handler, then this leads to MPS being prevented from doing its job.
13:50:57
Colleen
drmeister: frgo said 1 hour, 25 minutes ago: Re MPS and signals: I think we need to change behavior in file src/gctools/interrupt.cc: ADD_SIGNAL( SIGSEGV, "+SIGSEGV+", ext::_sym_segmentation_violation); - Am I right that this estalishes a handler represented by the symbol ext::_sym_segmentation_violation? If so: when using MPS on Linux, we're not allowed to do that. If not: I'd like to know what this line actually does.
13:54:36
drmeister
My friend at Ravenbrook got back to me and wants to see a backtrace. The machine I'm using is in the Amazon Cloud - so I can give him access as well.
13:58:48
drmeister
It doesn't reproduce the problem with the cases I tried last night - it works on simple cases.
13:59:57
drmeister
Nope - it does - I was using cclasp - it behaves differently. In aclasp it fails like it did last night.
14:06:28
drmeister
I can reproduce the problem and I passed it on to David along with stassats' observation.
14:12:41
frgo
Error >>>>>>>> In file included from /opt/common-lisp/lang/clasp/src/clasp/src/gctools/interrupt.cc:2:
14:12:42
frgo
In file included from /opt/common-lisp/lang/clasp/src/externals-clasp/llvm50/include/llvm/Support/ErrorHandling.h:18:
14:13:31
drmeister
I just realized something - I can create Amazon Cloud machines with Clasp running and give people access to them. Great for debugging.
14:14:01
drmeister
frgo: That was the same problem that you had last night - this is with the new externals-clasp build?
14:16:03
frgo
LLVM_CONFIG_BINARY = "/opt/common-lisp/lang/clasp/src/externals-clasp/llvm50/build-release/bin/llvm-config"
14:17:39
drmeister
Yes - that is all fine - you can remove the EXTERNALS_CLASP_DIR line - that's not used anymore.
14:19:20
drmeister
This is the contents of the llvm/Config directory that I think your system wants:
14:19:49
drmeister
The peculiar thing is that I don't have an llvm-config.h file and I don't see the problem that you do.
14:21:14
frgo
AsmParsers.def.in AsmPrinters.def.in Disassemblers.def.in Targets.def.in abi-breaking.h.cmake config.h.cmake llvm-config.h.cmake
14:22:59
frgo
As soon as you actually build LLVM there is a llvm-config.h there... - in build-release/include/llvm/Config/llvm-config.h
14:24:43
drmeister
That's my /externals-clasp/llvm50/build-release/include/llvm/Config/ directory - and yes there is an llvm-config.h
14:25:24
frgo
No - it's there: AsmParsers.def AsmPrinters.def Disassemblers.def Targets.def abi-breaking.h config.h llvm-config.h
14:26:07
frgo
It's just that the directory ".../externals-clasp/llvm50/include" is not set as an include dir by wscript.
14:29:21
Bike
it seems that the include has to be built. the source only has whatever kind of pre file.
15:32:15
drmeister
I see problems when I try to create >50 threads on OS X and then there are the problems that we ran into on Linux.
15:39:56
drmeister
frgo: I can give you access to the Linux machine that has Clasp built and exhibits the problem - would that help?
15:47:04
frgo
Later on - I am setting up a small app that helps demonstrate the issue. For not havimg to build clasp over and over ;-)
17:25:45
drmeister
It writes out JITted symbols and their addresses and sizes to symbolicate backtraces.
18:58:12
clasper
drmeister: according to this video: https://vimeo.com/216547984 Azul has done a lot of work on the llvm for jited languages
18:58:52
clasper
it is nice to know that they are making the llvm mare amenable to managed languages
19:18:52
drmeister
printf("%s:%d caught signal sig=%d SEGV_ACCERR=%d siginfo_t=%p context=%p \n", __FILE__, __LINE__, sig, SEGV_ACCERR, info, context);
19:22:50
drmeister
So, what am I looking at here? Is sigHandler called because the program touched memory that had a barrier over it?
19:23:22
frgo
and it is ok to get a SIGSEGV. What's the backtrace at that point? As we have a core we should be able to see what happened.
19:31:13
stassats
but they always change, so it's something not doing enough book-keeping, and not search_cache being broken
19:34:17
drmeister
I'm checking to see if that cache is still thread local and being set up properly.
19:34:21
stassats
now that i'm just aborting on bad addresses, i'm always getting faults in search_cache
19:36:34
drmeister
Could it be that search_cache is doing something that is not MPS, moving garbage collection, safe?
19:37:24
drmeister
It's an open hashed hash table of selector keys to effective method functions - for single dispatch C++ methods.
19:37:41
drmeister
It is not location aware (yet) - but I was hoping to get rid of it and use fastgf.
19:37:42
stassats
now i'm faulting at core::DynamicBindingStack::pop_binding , tried a different test case
19:38:43
drmeister
Could you explain a bit more about these fault addresses? These are addresses that the system is trying to read and causing a SIGSEGV?
19:39:32
drmeister
Where do you find the address? In the siginfo_t structure or the context passed to sigHandle in protli.c?
19:40:57
stassats
most of the addresses are pretty low, i'm thinking you are accessing zeroed memory with some offsets
19:43:48
drmeister
The pointer to it (a root) is stored at the top of the stack. This is certainly true for the main thread. I'll recheck the code for threads.
19:43:58
stassats
this->_ThreadLocalBindings.resize(index+1,_NoThreadLocalBinding<T_O>()); does that still happen in mps memory?
19:45:54
drmeister
https://github.com/drmeister/clasp/blob/dev/include/clasp/gctools/threadLocalStacks.h#L33
19:46:19
drmeister
It's a gctools::Vec0<T_sp>. That's a stretchy vector that is maintained with the GC managed memory.
19:47:06
drmeister
https://github.com/drmeister/clasp/blob/dev/include/clasp/gctools/threadlocal.h#L14
19:47:33
drmeister
The entire ThreadLocalState is stored in the stack of the thread, right when the thread starts up.
19:50:01
stassats
drmeister: if ThreadLocalState is allocated before mps_root_create_thread_tagged, what happens?
19:53:11
drmeister
Oh sh*t - I don't even have the allocation points initialized when I declare the ThreadLocalState on the stack.
20:12:09
stassats
i'm clearly seeing three threads calling core::MDArray_O::vectorPushExtend with this being 0x7fc24a957020
20:14:14
drmeister
Could you give me the test case that you are using to generate this problem? But it's not completely reproducible - is it?
20:14:54
stassats
(loop repeat 1000 do (mp:process-run-function nil #'(lambda () (core:bformat nil "module%s" 10))))
20:25:23
drmeister
I'm not however - I've been doing fine with this. Could it be the guards that I have in place?
20:31:48
drmeister
I don't understand how three threads can have the same _BFormatStringOutputStream - each thread gets its own.
20:33:58
drmeister
I put in a printf statement right after the _BFormatStringOutputStream is initialized - this is what I get when I run your testcase:
20:38:22
drmeister
next-run-time-module-name is accessing a dynamic variable in a thread unsafe way.
20:41:04
drmeister
I should get rid of that special variable then and every module will have the same name.
20:41:32
drmeister
Although this can't be the true problem because it was happening in aclasp as well - which doesn't do dispatch.
20:42:18
drmeister
What would you recommend for the naming scheme? A thread local name with a counter that is thread local? Or put a lock around the name calculation?
20:46:19
drmeister
So, it's a function that takes a symbol value and does an atomic incf on its symbol-value
20:47:49
drmeister
Now that you mention it, I will check the exact sequence against the one in mpPackage.cc
20:59:25
stassats
my_thread->_BFormatStringOutputStream seems to be indeed different, but the vector to which it push is the same
21:01:30
drmeister
And the main thread doesn't appear to call my_thread->initialize_thread() searching...
21:11:18
drmeister
Hmm, no - that's not it. The ThreadLocalState does not invoke the MPS allocators - so the order is fine and the order I had in start_thread previously was fine.
21:21:32
drmeister
With this: (loop repeat 1000 do (mp:process-run-function nil #'(lambda () (core:bformat nil "module%s" 10))))
21:23:55
stassats
i'm currently using (loop repeat 100 do (mp:process-run-function nil #'(lambda () (catch 'x (eval '(throw 'x 10))))))
21:26:23
drmeister
There are differences - but allocation points are created before allocations take place.
21:29:28
drmeister
That takes more work - and I have to share the start_thread code - that is possibly giving us trouble.
21:30:51
drmeister
The main thread code isn't a simple function like start_thread - it's more scattered and comes in from main(...)
21:44:40
stassats
just need to restrict the number of backtrace frames printed, otherwise i'm losing the numbers out of sight
21:45:19
stassats
crashing at this->_Data->rowMajorAset(idx+this->_DisplacedIndexOffset,newElement);
21:59:27
drmeister
I'm single stepping through the initialize_thread method where the _BFormatStringOutputStream is created - I noticed there is a thread unsafe increment of a counter that I put in that counts whenever a class is allocated - I will make that thread safe using an atomic variable
22:18:00
stassats
(mp:process-run-function nil #'(lambda () (let ((x (make-string-output-stream))) (write-char #\a x) (loop (assert (find #\a (get-output-stream-string x)))))))
22:30:00
stassats
(mp:process-run-function nil #'(lambda () (loop (core:bformat "a%p" 10)))) crashes on its own
22:51:39
drmeister
If you want a simpler test case - I can get it to crash in the interpreter - with no loaded Common Lisp.
22:52:07
drmeister
(let ((c 1000)) (tagbody top (mp:process-run-function nil #'(lambda () (eval '(list 1 2 3 4)))) (setq c (- c 1)) (if (> c 0) (go top))))
22:54:51
drmeister
This means the fault happened in frame #4? Or can the signal handler be caused by a fault in one of the other 6 threads?
22:56:26
drmeister
Interesting - now it created about 100 threads and then failed with a different error
22:59:07
stassats
the thread struct, with local bindings, bformatStringOutputStream, should be a root itself
23:02:29
stassats
even then, &stack_base isn't guaranteed to be the bottom of the stack where you put your other stuff