freenode/#clasp - IRC Chatlog

13:50:57 drmeister Hello

13:50:57 Colleen drmeister: frgo said 1 hour, 25 minutes ago: Re MPS and signals: I think we need to change behavior in file src/gctools/interrupt.cc: ADD_SIGNAL( SIGSEGV, "+SIGSEGV+", ext::_sym_segmentation_violation); - Am I right that this establishes a handler represented by the symbol ext::_sym_segmentation_violation? If so: when using MPS on Linux, we're not allowed to do that. If not: I'd like to know what this line actually does.

13:52:30 drmeister I'll try this.

13:52:46 frgo Morning drmeister

13:53:18 frgo stassats already pointed out that MPS actually receives SIGSEGV.

13:53:30 stassats and it works on the main thread, so

13:53:42 frgo (I don't have linux here right now to test)

13:53:46 drmeister Ok.

13:54:36 drmeister My friend at Ravenbrook got back to me and wants to see a backtrace. The machine I'm using is in the Amazon Cloud - so I can give him access as well.

13:54:46 drmeister I'm just waking up - need tea.

13:55:04 stassats drmeister: what i've uncovered, it actually receives multiple faults

13:55:12 stassats and resignals them

13:55:17 stassats ending with a fault at zero

13:55:23 drmeister Interesting...

13:55:35 drmeister I built a version with guards on - it took 7 hours.

13:55:42 drmeister It's like the bad old days.

13:55:46 Shinmera Like good old times.

13:56:00 stassats you have different attitudes

13:56:09 Shinmera I'm just sarcastic.

13:58:48 drmeister It doesn't reproduce the problem with the cases I tried last night - it works on simple cases.

13:59:57 drmeister Nope - it does - I was using cclasp - it behaves differently. In aclasp it fails like it did last night.

14:06:28 drmeister I can reproduce the problem and I passed it on to David along with stassats' observation.

14:12:40 frgo drmeister: I seem to have a problem with wscript in clasp: I get:

14:12:41 frgo Error >>>>>>>> In file included from /opt/common-lisp/lang/clasp/src/clasp/src/gctools/interrupt.cc:2:

14:12:42 frgo In file included from /opt/common-lisp/lang/clasp/src/externals-clasp/llvm50/include/llvm/Support/ErrorHandling.h:18:

14:12:42 frgo #include "llvm/Config/llvm-config.h"

14:12:42 frgo ^~~~~~~~~~~~~~~~~~~~~~~~~~~

14:13:00 frgo well, the file *is* there.

14:13:31 drmeister I just realized something - I can create Amazon Cloud machines with Clasp running and give people access to them. Great for debugging.

14:13:46 frgo I don't see script setting the include path for llvm...

14:14:00 frgo s/script/wscript.

14:14:01 drmeister frgo: That was the same problem that you had last night - this is with the new externals-clasp build?

14:14:10 frgo Yes.

14:14:21 drmeister Could you paste your wscript.config file?

14:14:32 frgo Sure.

14:16:03 frgo LLVM_CONFIG_BINARY = "/opt/common-lisp/lang/clasp/src/externals-clasp/llvm50/build-release/bin/llvm-config"

14:16:03 frgo EXTERNALS_CLASP_DIR = "/opt/common-lisp/lang/clasp/src/externals-clasp"

14:16:03 frgo LLVM5_ORC_NOTIFIER_PATCH = True

14:16:03 frgo SBCL = "/usr/local/bin/sbcl"

14:16:03 frgo LTO_OPTION = "thinlto"

14:16:40 frgo The llvm-config binary runs fine.

14:17:39 drmeister Yes - that is all fine - you can remove the EXTERNALS_CLASP_DIR line - that's not used anymore.

14:19:20 drmeister This is the contents of the llvm/Config directory that I think your system wants:

14:19:21 drmeister https://www.irccloud.com/pastebin/DuUWgeSB/

14:19:49 drmeister The peculiar thing is that I don't have an llvm-config.h file and I don't see the problem that you do.

14:20:16 drmeister What does your externals-clasp/llvm50/include/llvm/Config/ directory look like?

14:21:13 frgo Huh? No llvm-config.h? well...

14:21:14 frgo AsmParsers.def.in AsmPrinters.def.in Disassemblers.def.in Targets.def.in abi-breaking.h.cmake config.h.cmake llvm-config.h.cmake

14:22:59 frgo As soon as you actually build LLVM there is a llvm-config.h there... - in build-release/include/llvm/Config/llvm-config.h

14:23:21 drmeister Ah - ok

14:23:23 drmeister Right

14:24:18 drmeister https://www.irccloud.com/pastebin/Uus7cxg4/

14:24:43 drmeister That's my /externals-clasp/llvm50/build-release/include/llvm/Config/ directory - and yes there is an llvm-config.h

14:24:47 drmeister you are missing this?

14:25:24 frgo No - it's there: AsmParsers.def AsmPrinters.def Disassemblers.def Targets.def abi-breaking.h config.h llvm-config.h

14:26:07 frgo It's just that the directory ".../externals-clasp/llvm50/include" is not set as an include dir by wscript.

14:26:15 drmeister What do you get when you type this:

14:26:16 drmeister https://www.irccloud.com/pastebin/J4eews7H/

14:27:08 drmeister path-to-externals-clasp-bin-dir/llvm-config --includedir

14:28:07 frgo "/opt/common-lisp/lang/clasp/src/externals-clasp/llvm50/include"

14:28:12 frgo and this ok

14:28:25 Bike shouldn't it be the build directory?

14:28:57 frgo Bike: TRUE!

14:29:11 drmeister Yeah - shouldn't it return the ... what Bike said

14:29:21 Bike it seems that the include has to be built. the source only has whatever kind of pre file.

14:29:22 frgo Holy cow. What's happening here ...

14:29:49 drmeister frgo: Do you have the log for the externals clasp build?

14:29:56 drmeister I can generate one and we can diff them.

14:30:17 drmeister Because this looks like an externals-clasp build issue.

14:32:30 frgo No - that's gone

14:33:01 drmeister Could you wipe it out and rebuild it - I'll do the same here and we can compare.

14:33:09 frgo Sure

14:33:52 drmeister I'm going to clone another copy and build that .

14:35:48 drmeister https://www.youtube.com/watch?v=qwnyIOoL-LM

14:36:00 drmeister "Away we go"

14:38:20 frgo This video is blocked.

14:38:39 drmeister Jackie Gleason singing "Away we go"

15:04:00 drmeister stassats: How did you figure out that it gets the signal multiple times?

15:04:52 drmeister Ah - you hacked protli.c

15:16:30 frgo drmeister: Sent you email with build log for externals-clasp attached.

15:17:12 drmeister Did it create an llvm-config.h?

15:17:29 drmeister Or sorry - what does llvm-config --includedirs return

15:20:08 drmeister llvm-config --includedir

15:21:08 drmeister Because there is no indication in your log of any problem

15:21:43 frgo Ouch. I had misconfigured the path to llvm-config in my wscript.config.

15:31:14 drmeister Ok, once it's fixed use: ./waf configure build_cmps

15:31:45 frgo Thx again. Build is running fine now (356/366) ...

15:31:46 drmeister Although I'm starting to think there may be an issue with MPS and threading.

15:32:15 drmeister I see problems when I try to create >50 threads on OS X and then there are the problems that we ran into on Linux.

15:32:16 frgo There is. The way signals are delivered to threads.

15:32:50 frgo As we have pthreads we need to use pthread_kill() to send signals to threads.

15:33:31 frgo SIGSEGV is currently not delivered to threads in a pthreads-safe way.

15:33:45 frgo I just was looking into this.

15:34:57 drmeister Ah - excellent - please don't let me interrupt you.

15:35:41 frgo ;-) hehe - you interrupt me? What am I doing all day long with you?

15:39:56 drmeister frgo: I can give you access to the Linux machine that has Clasp built and exhibits the problem - would that help?

15:47:04 frgo Later on - I am setting up a small app that helps demonstrate the issue. For not having to build clasp over and over ;-)

16:11:24 drmeister I have to run an errand for an hour - I'll be back after that.

16:12:18 drmeister A friend is giving us a fruit (persimmon) tree - we need to pick it up.

17:19:48 stassats i have /tmp filled with clasp-log-14357

17:25:12 drmeister Ah - yeah - that's a build feature for debugging - I'll turn that off.

17:25:45 drmeister It writes out JITted symbols and their addresses and sizes to symbolicate backtraces.

18:16:01 drmeister The Ravenbrook folks sound pretty busy - we may be on our own for a while.

18:16:12 drmeister I'm reading MPS documentation again...

18:58:12 clasper drmeister: according to this video: https://vimeo.com/216547984 Azul has done a lot of work on the llvm for jited languages

18:58:52 clasper it is nice to know that they are making the llvm mare amenable to managed languages

18:59:12 clasper mare/more

18:59:51 drmeister clasper - thanks - I'm watching it now.

19:18:51 drmeister I inserted this into sigHandler:

19:18:52 drmeister printf("%s:%d caught signal sig=%d SEGV_ACCERR=%d siginfo_t=%p context=%p \n", __FILE__, __LINE__, sig, SEGV_ACCERR, info, context);

19:19:05 drmeister I get this:

19:19:36 drmeister https://www.irccloud.com/pastebin/Y3UPJ9Uj/

19:20:19 drmeister Why is SEGV_ACCERR=2? I thought it should be 11 (SIGSEGV)
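
For context on the question above: SEGV_ACCERR is an si_code value (the reason for the fault), while SIGSEGV is the signal number, so the two are not expected to match. A minimal C sketch, assuming a POSIX <signal.h>; the helper name segv_code_name is made up for illustration and is not part of Clasp or the MPS:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/* Hypothetical helper: name the si_code that accompanies a SIGSEGV.
 * si_code says why the fault happened; it is not the signal number. */
static const char *segv_code_name(int si_code) {
    switch (si_code) {
    case SEGV_MAPERR: return "SEGV_MAPERR"; /* address not mapped to an object */
    case SEGV_ACCERR: return "SEGV_ACCERR"; /* invalid permissions for a mapped object */
    default:          return "other";
    }
}
```

A handler would compare info->si_code against these constants and compare its sig argument against SIGSEGV; printing the SEGV_ACCERR constant itself only shows the value of the constant, which matches the 2 seen in the output above.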

19:20:41 frgo That's looking good. 0x7f.... addresses are on the stack.

19:20:57 drmeister This is on linux.

19:21:04 drmeister Does what you say still hold?

19:21:13 frgo So, these are pointers. Yes, it does.

19:22:50 drmeister So, what am I looking at here? Is sigHandler called because the program touched memory that had a barrier over it?

19:23:22 frgo and it is ok to get a SIGSEGV. What's the backtrace at that point? As we have a core we should be able to see what happened.

19:23:38 frgo Can't say without the backtrace.

19:24:09 frgo As stassats did: we'd need to look at what siginfo is telling us.

19:25:21 drmeister This is all I get from the backtrace when I look at the core file:

19:25:23 drmeister https://www.irccloud.com/pastebin/UkOKjgOe/

19:26:19 frgo Ok, so the trap that made us get a SIGSEGV is in another thread.

19:26:48 frgo We need to run lldb and execute cclasp in it.

19:28:02 stassats that will end badly

19:28:08 frgo ?

19:28:24 drmeister I don't get the same behavior when running within a debugger.

19:28:51 drmeister Also - I haven't gotten lldb to run on linux - it wants the llvm server running.

19:28:58 drmeister I use gdb on linux

19:29:00 stassats i'm preventing it from resignalling, the fault is at 0x88

19:29:02 frgo Oh.

19:29:48 stassats cmp r14,QWORD PTR [rsi+0x8]

19:29:51 stassats the instruction

19:30:09 stassats p $rsi

19:30:10 stassats $1 = 128

19:30:35 stassats the function is core::Cache_O::search_cache(core::CacheRecord*&):

19:31:13 stassats but they always change, so it's something not doing enough book-keeping, and not search_cache being broken

19:31:15 drmeister Ah - ok. That should be a thread local cache.

19:31:55 stassats now a fault at 0x7f4032e43ea8

19:32:03 drmeister But it's an old ECL style dispatching cache that I use for C++ messages.

19:32:22 stassats gctools::smart_ptr<core::T_O>::nilp (this=0x7f4032e43ea8) called by search_cache

19:33:54 stassats now a fault in 0x1c8, search_cache again

19:34:17 drmeister I'm checking to see if that cache is still thread local and being set up properly.

19:34:21 stassats now that i'm just aborting on bad addresses, i'm always getting faults in search_cache

19:35:18 drmeister Could it be that search_cache is doing something not thread safe?

19:35:41 stassats there's only one thread really running

19:36:34 drmeister Could it be that search_cache is doing something that is not MPS, moving garbage collection, safe?

19:36:53 stassats not happening on the main thread

19:37:24 drmeister It's an open hashed hash table of selector keys to effective method functions - for single dispatch C++ methods.

19:37:41 drmeister It is not location aware (yet) - but I was hoping to get rid of it and use fastgf.

19:37:42 stassats now i'm faulting at core::DynamicBindingStack::pop_binding , tried a different test case

19:38:07 stassats the fault address is 0x1fc

19:38:43 drmeister Could you explain a bit more about these fault addresses? These are addresses that the system is trying to read and causing a SIGSEGV?

19:39:23 stassats https://en.wikipedia.org/wiki/General_protection_fault

19:39:32 drmeister Where do you find the address? In the siginfo_t structure or the context passed to sigHandle in protli.c?

19:39:38 drmeister Reading... thank you.

19:40:04 stassats in siginfo

19:40:57 stassats most of the addresses are pretty low, i'm thinking you are accessing zeroed memory with some offsets
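
The fault address discussed above is carried in the si_addr field of siginfo_t when the handler is installed with SA_SIGINFO. The following is a rough C sketch of that mechanism on Linux, not the code in protli.c: it protects one page on purpose, records si_addr and si_code in the handler, and lifts the protection so the access can be retried. All names (on_segv, install_segv_handler, make_guarded_page) are hypothetical, and calling mprotect from a signal handler is an assumption that holds in practice on Linux rather than something POSIX guarantees.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static void *volatile last_fault_addr;         /* si_addr seen by the handler */
static volatile sig_atomic_t last_fault_code;  /* si_code seen by the handler */
static char *guarded_page;                     /* page protected on purpose */
static size_t guarded_len;

static void on_segv(int sig, siginfo_t *info, void *context) {
    (void)context;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)guarded_page;
    last_fault_addr = info->si_addr;
    last_fault_code = info->si_code;
    if (guarded_page != NULL && addr >= base && addr < base + guarded_len) {
        /* Our own barrier: lift it so the faulting instruction is retried. */
        mprotect(guarded_page, guarded_len, PROT_READ | PROT_WRITE);
    } else {
        /* Not ours: restore the default action; the retried access
         * faults again and terminates the process. */
        signal(sig, SIG_DFL);
    }
}

/* Install the handler with SA_SIGINFO so that siginfo_t is passed in. */
static int install_segv_handler(void) {
    struct sigaction sa;
    sa.sa_sigaction = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Map one inaccessible page; any access to it raises SIGSEGV. */
static volatile char *make_guarded_page(void) {
    void *p;
    guarded_len = (size_t)sysconf(_SC_PAGESIZE);
    p = mmap(NULL, guarded_len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    guarded_page = p;
    return (volatile char *)p;
}
```

A low si_addr such as 0x88 or 0x1c8 would take the "not ours" branch in a sketch like this, which is consistent with the reading that a null or zeroed pointer was dereferenced at a small offset.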

19:42:29 stassats are you registering your binding stack with mps?

19:42:51 stassats does it know about the values it is holding?

19:43:07 drmeister The binding stack is allocated within MPS memory.

19:43:48 drmeister The pointer to it (a root) is stored at the top of the stack. This is certainly true for the main thread. I'll recheck the code for threads.

19:43:58 stassats this->_ThreadLocalBindings.resize(index+1,_NoThreadLocalBinding<T_O>()); does that still happen in mps memory?

19:44:23 drmeister Checking...

19:45:54 drmeister Yes, it must be in MPS memory. _ThreadLocalBindings are defined here:

19:45:54 drmeister https://github.com/drmeister/clasp/blob/dev/include/clasp/gctools/threadLocalStacks.h#L33

19:46:19 drmeister It's a gctools::Vec0<T_sp>. That's a stretchy vector that is maintained with the GC managed memory.

19:46:32 drmeister It's like std::vector<xxx> but it works in the GC managed memory.

19:46:51 drmeister The DynamicBindingStack is defined here:

19:47:06 drmeister https://github.com/drmeister/clasp/blob/dev/include/clasp/gctools/threadlocal.h#L14

19:47:33 drmeister The entire ThreadLocalState is stored in the stack of the thread, right when the thread starts up.

19:47:40 drmeister For the child threads that is here:

19:47:54 drmeister https://github.com/drmeister/clasp/blob/dev/src/core/mpPackage.cc#L109

19:48:35 drmeister The base of the stack is defined just before the my_thread_local_state.

19:49:00 drmeister And the stack is registered using mps_root_create_thread_tagged here:

19:49:07 drmeister https://github.com/drmeister/clasp/blob/dev/src/core/mpPackage.cc#L124

19:49:22 drmeister I'm doing this for my benefit mostly to document that I'm doing this correctly.

19:50:01 stassats drmeister: if ThreadLocalState is allocated before mps_root_create_thread_tagged, what happens?

19:50:28 drmeister Hmmm, that might be a problem.

19:51:05 drmeister The ThreadLocalState constructor does allocate memory using MPS calls...

19:53:11 drmeister Oh sh*t - I don't even have the allocation points initialized when I declare the ThreadLocalState on the stack.

19:53:16 drmeister Rearranging...

19:54:07 drmeister I think the order will be...

19:54:24 stassats another test, now i have three threads crashing at 0x240

19:56:00 stassats all coming from NEXT-RUN-TIME-MODULE-NAME...->vectorPushExtend

19:56:08 drmeister I changed the code to this...

19:56:21 drmeister https://www.irccloud.com/pastebin/xvmgyUzQ/

19:56:41 stassats all three threads vectorPushExtend into the same thing

19:57:00 drmeister I was seeing it crash in there as well.

19:57:03 stassats the same string-output-stream, i assume

19:57:11 drmeister vectorPushExtend

19:57:30 drmeister Compiling and linking...

20:01:35 drmeister I'm still getting segfaults

20:02:23 drmeister My backtraces are a lot less informative for the past couple of hours:

20:02:24 drmeister https://www.irccloud.com/pastebin/tz8J22rJ/

20:03:47 drmeister Can I get you any info that might be helpful?

20:05:12 stassats so, where is that string-output-stream coming from?

20:05:57 stassats if (destination.nilp()) {

20:05:57 stassats output = my_thread->bformatStringOutputStream();

20:06:02 stassats that's probably it

20:07:13 drmeister Is there anything I can tell you about that?

20:07:26 stassats why do multiple threads share it?

20:07:52 stassats why does a single share it?

20:07:53 drmeister Investigating and trying to remember...

20:10:40 drmeister Every thread gets its own _BFormatStringOutputStream

20:10:47 drmeister So they don't share it.

20:11:00 drmeister https://github.com/drmeister/clasp/blob/dev/src/gctools/threadlocal.cc#L208

20:12:08 drmeister It's just a thread local string-output-stream for (core:bformat nil ...)

20:12:09 stassats i'm clearly seeing three threads calling core::MDArray_O::vectorPushExtend with this being 0x7fc24a957020

20:12:39 drmeister Hmmm.

20:12:56 drmeister So they are all stomping on the same string-output-stream.

20:13:20 stassats from bformat

20:14:14 drmeister Could you give me the test case that you are using to generate this problem? But it's not completely reproducible - is it?

20:14:54 stassats (loop repeat 1000 do (mp:process-run-function nil #'(lambda () (core:bformat nil "module%s" 10))))

20:15:12 stassats i don't think that's the bformat call, some other call is actually causing

20:15:53 drmeister And you are using cclasp - starting it with iclasp-mps ?

20:16:01 stassats yes

20:18:16 drmeister Here's what I see - I don't think I'm reproducing what you see:

20:18:17 drmeister https://www.irccloud.com/pastebin/PeW69zgE/

20:19:17 stassats you don't have an abort in sigHandle

20:19:21 drmeister But I have guards on - I should rebuild with them off

20:19:27 drmeister What does your sigHandle look like?

20:19:55 stassats a print and an abort after it doesn't know how to handle it

20:20:27 drmeister For curiosity sake: What do you print?

20:20:46 stassats the address, naturally

20:21:52 drmeister So - like this?

20:21:53 drmeister https://www.irccloud.com/pastebin/OqOr4uV4/

20:22:34 stassats that's not after it doesn't know how to handle it

20:22:35 drmeister Or the abort is below the if(info->si_code == SEGV_ACCERR) {...} block

20:23:28 drmeister Right - so this:

20:23:30 drmeister https://www.irccloud.com/pastebin/Asi6wQPE/

20:24:05 stassats don't print the address before

20:24:09 stassats you'll drown in it

20:25:23 drmeister I'm not however - I've been doing fine with this. Could it be the guards that I have in place?

20:25:37 drmeister I only see 2 or zero print statements.

20:25:49 drmeister I'll move it down nonetheless.

20:27:16 drmeister I'm compiling this now:

20:27:17 drmeister https://www.irccloud.com/pastebin/DyA4ToHP/

20:29:32 drmeister The spacing is weird because I'm running this on a AWS machine.

20:31:48 drmeister I don't understand how three threads can have the same _BFormatStringOutputStream - each thread gets its own.

20:33:58 drmeister I put in a printf statement right after the _BFormatStringOutputStream is initialized - this is what I get when I run your testcase:

20:34:00 drmeister https://www.irccloud.com/pastebin/YPHMjrQP/

20:35:42 stassats that's not the bformat that's actually failing

20:35:49 stassats for me

20:36:09 stassats i have failures without calling bformat directly

20:36:14 stassats just from NEXT-RUN-TIME-MODULE-NAME

20:38:22 drmeister next-run-time-module-name is accessing a dynamic variable in a thread unsafe way.

20:38:30 drmeister *run-time-module-counter*

20:39:02 stassats do the names have to be unique?

20:39:16 drmeister No

20:39:25 stassats then it's safe enough for this exercise

20:39:42 drmeister But there should only be one compilation going on here.

20:40:16 stassats it's from dispatch

20:40:38 drmeister Ah - well, that can come from anywhere.

20:41:04 drmeister I should get rid of that special variable then and every module will have the same name.

20:41:27 stassats not a good naming scheme

20:41:32 drmeister Although this can't be the true problem because it was happening in aclasp as well - which doesn't do dispatch.

20:42:18 drmeister What would you recommend for the naming scheme? A thread local name with a counter that is thread local? Or put a lock around the name calculation?

20:43:45 stassats atomic-incf

20:44:37 drmeister Ok, I'll have to write that.

20:46:19 drmeister So, it's a function that takes a symbol value and does an atomic incf on its symbol-value
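
To illustrate the atomic-incf idea in isolation: a small C11 sketch of a counter that several threads bump without a lock. It is not Clasp code; module_counter is a hypothetical stand-in for *run-time-module-counter*, and the helper names are invented for this example.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the shared module counter. */
static atomic_ulong module_counter;

/* Atomically increment the counter and return the new value. */
static unsigned long next_module_id(void) {
    return atomic_fetch_add(&module_counter, 1) + 1;
}

/* Thread body: bump the counter *arg times. */
static void *bump_many(void *arg) {
    long n = *(long *)arg;
    for (long i = 0; i < n; i++)
        next_module_id();
    return NULL;
}

/* Reset the counter, run nthreads threads that each bump it iters times,
 * and return the final value, or -1 if setup fails. */
static long count_from_threads(int nthreads, long iters) {
    pthread_t tids[64];
    int created = 0;
    int failed = 0;
    if (nthreads < 1 || nthreads > 64) return -1;
    atomic_store(&module_counter, 0);
    for (; created < nthreads; created++) {
        if (pthread_create(&tids[created], NULL, bump_many, &iters) != 0) {
            failed = 1;
            break;
        }
    }
    for (int i = 0; i < created; i++)
        pthread_join(tids[i], NULL);
    return failed ? -1 : (long)atomic_load(&module_counter);
}
```

With a plain (non-atomic) increment the final count could come out lower than nthreads * iters because concurrent read-modify-write sequences can overwrite each other; the atomic fetch-and-add avoids that without taking a lock.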

20:46:21 stassats does the main thread call start_thread too?

20:46:53 drmeister No, it doesn't. It has its own complicated startup and shutdown process.

20:47:28 stassats does it repeat all the same stuff, dynamic bindings, mps registration?

20:47:49 drmeister Now that you mention it, I will check the exact sequence against the one in mpPackage.cc

20:48:41 drmeister I'll just get atomic_incf implemented and building and get right on it.

20:48:57 stassats postpone atomic_incf

20:49:02 stassats it won't solve anything

20:49:21 drmeister Ok.

20:50:36 drmeister Checking the order now...

20:59:25 stassats my_thread->_BFormatStringOutputStream seems to be indeed different, but the vector to which it pushes is the same

21:00:47 drmeister There are a couple of differences.

21:01:01 drmeister The main thread initializes the allocation-points too late.

21:01:30 drmeister And the main thread doesn't appear to call my_thread->initialize_thread() searching...

21:02:07 drmeister Ah - no - it does - checking the order relative to everything else.

21:11:18 drmeister Hmm, no - that's not it. The ThreadLocalState does not invoke the MPS allocators - so the order is fine and the order I had in start_thread previously was fine.

21:11:34 drmeister Which could be why rearranging things didn't make any difference.

21:11:53 drmeister Double checking my work and thinking...

21:18:45 drmeister stassats: In your core dumps - do you see more than two threads?

21:20:56 stassats when i created more than two

21:21:32 drmeister With this: (loop repeat 1000 do (mp:process-run-function nil #'(lambda () (core:bformat nil "module%s" 10))))

21:21:47 drmeister I only tried it once - I only got two threads in the core dump

21:23:55 stassats i'm currently using (loop repeat 100 do (mp:process-run-function nil #'(lambda () (catch 'x (eval '(throw 'x 10))))))

21:25:56 drmeister I've double checked the order of startup - everything appears fine.

21:26:23 drmeister There are differences - but allocation points are created before allocations take place.

21:26:45 drmeister I'm rearranging the start_thread code to exactly mirror the main thread code.

21:27:05 stassats just share the code?

21:28:13 drmeister I will do that next/now.

21:29:28 drmeister That takes more work - and I have to share the start_thread code - that is possibly giving us trouble.

21:29:40 drmeister The main thread code is more convoluted and does a lot more stuff.

21:30:51 drmeister The main thread code isn't a simple function like start_thread - it's more scattered and comes in from main(...)

21:42:08 stassats vectorPushExtend is called on (core::MDArray_O *) 0x7fc3286e9020

21:42:46 stassats i forget where i was going with that

21:43:12 drmeister It happens to the best of us.

21:44:26 drmeister Is there a way to remove any memory protection on an address?

21:44:40 stassats just need to restrict the number of backtrace frames printed, otherwise i'm losing the numbers out of sight

21:44:52 stassats ok, these two threads push to different strings

21:44:57 drmeister Oh wait - wrong address

21:45:03 stassats but crash at the same place and at the same address

21:45:19 stassats crashing at this->_Data->rowMajorAset(idx+this->_DisplacedIndexOffset,newElement);

21:46:33 drmeister Is there anything I can tell you about that?

21:51:05 stassats this time the crash is at call QWORD PTR [rax+0x1a0]

21:51:17 stassats the vtable?

21:51:55 drmeister I don't know -

21:52:22 stassats it comes from the argument to the function

21:52:34 stassats anyhow, the stream is corrupted, it appears

21:53:39 drmeister This is the _BFormatStringOutputStream?

21:54:26 stassats yes

21:59:27 drmeister I'm single stepping through the initialize_thread method where the _BFormatStringOutputStream is created - I noticed there is a thread unsafe increment of a counter that I put in that counts whenever a class is allocated - I will make that thread safe using an atomic variable

21:59:50 drmeister But nothing depended on it.

22:01:11 stassats ok, i don't think there's any concurrent access happening

22:01:33 stassats just the gc not properly managing thread's memory

22:02:00 drmeister So the MPS has a problem?

22:02:12 stassats you have a problem with mps

22:02:18 drmeister ACTION finally goes there

22:02:33 drmeister I have a problem with the MPS or the MPS has a problem?

22:02:50 drmeister How did you reach that conclusion?

22:03:15 stassats took all inputs, ran the neural network on it

22:03:29 drmeister The one behind your eyes?

22:03:56 stassats you're not properly setting it up

22:04:20 drmeister Ok - any ideas where/how/what I'm doing wrong?

22:04:30 stassats is it able to stop the world correctly?

22:04:51 drmeister The MPS can - yes, there is a function call for that - would that help?

22:12:24 Bike wait, there is?

22:12:49 drmeister There is - what? A function to stop the world... I'm looking for it.

22:13:13 stassats it doesn't always crash, so at least something is working

22:13:28 stassats but it regularly crashes at bformatStringOutputStream

22:13:45 stassats suggesting it's either not pinned down or improperly allocated

22:17:51 stassats and i have a better test case

22:18:00 drmeister Ok

22:18:00 stassats (mp:process-run-function nil #'(lambda () (let ((x (make-string-output-stream))) (write-char #\a x) (loop (assert (find #\a (get-output-stream-string x)))))))

22:18:03 stassats (gctools:garbage-collect)

22:18:08 stassats The assertion (FIND #\a (GET-OUTPUT-STREAM-STRING X)) failed

22:23:00 stassats wait, get-output-stream-string should clear any characters, why is there a delay

22:23:21 drmeister Oh yeah.

22:23:30 stassats ok, bad test case, i don't have a way to check the string without clearing it

22:23:57 drmeister Do you want a get-output-stream-string-dont-clear ?

22:24:09 stassats not really

22:25:34 drmeister Put the write-char in the loop?

22:26:53 stassats ok, everything appears to work

22:27:38 drmeister What's that?

22:28:15 stassats whatever's working works

22:28:22 stassats i haven't found a non working piece of code

22:28:33 stassats other than the original test case with bformat

22:29:22 stassats btw, can't exit with threads running, mps throws a fit

22:29:50 drmeister I don't kill threads properly yet.

22:29:58 drmeister It needs that - right?

22:30:00 stassats (mp:process-run-function nil #'(lambda () (loop (core:bformat "a%p" 10)))) crashes on its own

22:31:22 stassats huh, no, that's still from CREATE-RUN-TIME-MODULE-FOR-COMPILE

22:32:28 stassats huh

22:32:31 stassats it's the main thread

22:33:07 stassats the main thread crashes

22:33:13 stassats when printing COMMON-LISP-USER

22:33:20 stassats >

22:35:21 stassats now the crash is at 0xfffffffffffffff0

22:36:01 stassats i saw in some disassembly output mov reg, 0xffffffffffffffff asl reg,4

22:36:10 stassats it struck me as very peculiar

22:37:08 stassats like, why would a c compiler even produce such a sequence

22:37:33 drmeister Where?

22:37:51 stassats i don't remember

22:39:06 drmeister Do you have any recommendations at this point?

22:40:11 stassats looking at how the thread allocation points are initialized

22:46:14 stassats different test cases crash in different places

22:51:39 drmeister If you want a simpler test case - I can get it to crash in the interpreter - with no loaded Common Lisp.

22:52:07 drmeister (let ((c 1000)) (tagbody top (mp:process-run-function nil #'(lambda () (eval '(list 1 2 3 4)))) (setq c (- c 1)) (if (> c 0) (go top))))

22:53:10 drmeister I'm doing this now with guards on and using ./waf build_imps_d (debugging -O0)

22:54:13 drmeister So - I have 7 threads and thread 1 starts with a * next to it.

22:54:16 drmeister It has this backtrace:

22:54:18 drmeister https://www.irccloud.com/pastebin/SGe4EsVB/

22:54:51 drmeister This means the fault happened in frame #4? Or can the signal handler be caused by a fault in one of the other 6 threads?

22:55:13 stassats no, frame 4

22:56:26 drmeister Interesting - now it created about 100 threads and then failed with a different error

22:56:37 stassats yeah, i think we can stop with test cases

22:57:01 drmeister Why - do you have a best one? Or some other idea?

22:57:44 stassats they all point to the same problem — memory not being handled properly

22:58:54 drmeister Ok. There is something else to do - ASLR

22:59:01 drmeister ALSR

22:59:06 drmeister One of those - turn it off

22:59:07 stassats the thread struct, with local bindings, bformatStringOutputStream, should be a root itself

22:59:41 drmeister So - putting it on the stack is a bad idea?

22:59:51 drmeister I can make it a root itself.

22:59:53 stassats is it on the stack?

23:00:02 drmeister Yes, I put it on the stack.

23:00:15 stassats Process_O* my_claspProcess = (Process_O*)claspProcess;

23:00:15 stassats Process_sp process(my_claspProcess);

23:00:16 stassats void* stack_base = &stack_base;

23:00:16 drmeister I thought it was a cheap and easy way to make it a root and thread local

23:00:25 stassats is your &stack_base really the stack_base?

23:00:44 stassats is process below or above it?

23:00:52 drmeister That's what I pass to mps_root_create_thread_tagged

23:01:22 drmeister Process_O is allocated in the MPS memory - uh hang on...

23:01:25 stassats should've used at least &process

23:02:03 drmeister Wait - Process_O isn't allocated by the thread I don't think. Checking...

23:02:29 stassats even then, &stack_base isn't guaranteed to be the bottom of the stack where you put your other stuff

23:03:17 drmeister I thought it was - or do you mean things can be reordered.

23:03:19 drmeister Oh shit.

23:03:42 drmeister Right - it's in the frame - but where in the frame relative to the ThreadLocalState is anybody's guess.

23:04:26 drmeister That's a difference between the main thread and the child threads setup!

23:04:43 drmeister The bottom of the stack is declared in an outer function.

23:05:26 stassats could be defeated with inlining

23:05:39 drmeister Hang on - I can check this... but what I'm doing is certainly unpredictable.

23:05:55 drmeister I could do a notinline - or what do you recommend?

23:06:30 drmeister The MPS docs must say something about this.

23:06:44 stassats the most robust? get the stack bounds from pthreads

23:07:09 drmeister Ok, hang on. Let's walk through this...

23:07:20 drmeister If you will indulge me.

23:07:21 stassats https://stackoverflow.com/questions/1102049/order-of-local-variable-allocation-on-the-stack?answertab=votes#tab-top

23:08:13 drmeister process-run-function is here and it calls Process_O::make_process

23:08:56 drmeister That allocates the Process_O in the parent thread.

23:09:08 drmeister Then process->enable() is called.

23:09:52 drmeister Sorry - the Process_O::make_process call is here: https://github.com/drmeister/clasp/blob/dev/src/core/mpPackage.cc#L283

23:10:13 drmeister Process_O::enable() is what calls pthread_create: https://github.com/drmeister/clasp/blob/dev/include/clasp/core/mpPackage.h#L140

23:10:25 drmeister pthread_create calls start_thread.

23:11:38 drmeister In start_thread I declare stack_base on the stack - but who knows where that ends up on the stack relative to the ThreadLocalState - which must, MUST be above it (on X86_64 below it because the stack grows down).

23:12:05 drmeister Ok, so get the stack from pthreads - looking that up

23:12:29 stassats wait, i'll show it to you

23:13:02 stassats drmeister: https://github.com/sbcl/sbcl/blob/master/src/runtime/thread.c#L502
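
As a rough illustration of "get the stack bounds from pthreads": the C sketch below asks glibc for the calling thread's stack range. It is not the SBCL code linked above. It assumes Linux with glibc, since pthread_getattr_np is a non-portable extension, and the helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: report the calling thread's stack range.
 * *lowest is the lowest address of the stack; *cold_end is one past the
 * highest, which is the cold end where the stack grows down (x86-64).
 * Returns 0 on success or an errno-style code. */
static int current_thread_stack_bounds(void **lowest, void **cold_end) {
    pthread_attr_t attr;
    void *addr = NULL;
    size_t size = 0;
    int rc = pthread_getattr_np(pthread_self(), &attr);
    if (rc != 0) return rc;
    rc = pthread_attr_getstack(&attr, &addr, &size);
    pthread_attr_destroy(&attr);
    if (rc != 0) return rc;
    *lowest = addr;
    *cold_end = (char *)addr + size;
    return 0;
}

/* Returns 1 if p lies within the calling thread's stack, 0 if not,
 * and -1 if the bounds could not be determined. */
static int on_current_thread_stack(const void *p) {
    void *lo, *hi;
    if (current_thread_stack_bounds(&lo, &hi) != 0) return -1;
    return (uintptr_t)p >= (uintptr_t)lo && (uintptr_t)p < (uintptr_t)hi;
}
```

The appeal of this route is that the cold end no longer depends on where the compiler happens to place a local such as stack_base within the frame.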

23:13:27 stassats i'm trying it now

23:19:48 drmeister The Ravenbrook MPS docs call this the "cold end of the stack" - nice.

23:19:51 drmeister http://www.ravenbrook.com/project/mps/master/manual/html/topic/root.html

23:21:03 drmeister They say you find it the way I do it for the main thread - in a function that calls the function that calls mps_root_create_thread_tagged and that you should ensure that the inner function that calls mps_root_create_thread_tagged is not inlined.
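
The pattern described above can be sketched in C as an outer function that owns a marker variable and a non-inlined inner function that does the real work. This is only an illustration with invented names (thread_entry, run_with_cold_end, example_body); it makes no MPS calls, and it assumes a GCC- or Clang-style noinline attribute and a stack that grows down.

```c
#include <stddef.h>
#include <stdint.h>

/* Work to run inside the thread; it receives the cold-end marker. */
typedef void (*thread_body_fn)(void *cold_end, void *arg);

/* Inner function: kept out of line so that its locals, and those of
 * everything it calls, live in frames hotter than the caller's marker.
 * In a real program, root registration would happen in here. */
__attribute__((noinline))
static void run_with_cold_end(void *cold_end, thread_body_fn body, void *arg) {
    body(cold_end, arg);
}

/* Outer function: the marker lives in this frame, so its address can
 * serve as a conservative cold end for everything called below. */
static void thread_entry(thread_body_fn body, void *arg) {
    void *marker = NULL;
    run_with_cold_end(&marker, body, arg);
}

/* Example body: records whether one of its locals sits below the marker,
 * which is what a downward-growing stack should give. */
static void example_body(void *cold_end, void *arg) {
    int local = 0;
    *(int *)arg = (uintptr_t)&local < (uintptr_t)cold_end;
}
```

Anything that must be scanned as part of the stack, such as a thread-local state struct, would have to be declared in the inner function or deeper, never alongside the marker in the outer frame, because the order of locals within one frame is up to the compiler.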

23:21:28 drmeister I think I got lucky with the main thread - checking...

23:23:25 stassats another thing, can gc hit during start_thread?

23:23:44 stassats so that before you are even registering roots it doesn't blow off claspProcess

23:24:13 drmeister I'll check that in a moment.

23:25:13 drmeister I'll add __attribute__((noinline)) to initializeMemoryPoolSystem and initializeBoehm

23:26:27 drmeister Boehm seems to figure out the thread stacks on its own.

23:27:29 drmeister I have all of the Boehm code that would call the API to set the stack base commented out and I recall that Boehm was pretty smart about this stuff.

23:27:54 stassats hurray, i can load slime with :spawn

23:28:16 drmeister I guess that means it worked?

23:28:20 stassats yes

23:28:38 stassats the test cases no longer crash

23:28:53 drmeister Freaking awesome! Thank you!

23:29:18 stassats so, what i think should happen to be totally robust, no allocation before mps_root_create_thread is called

23:29:20 drmeister So what did you do to find the cold end of the stack? I was girding my loins to copy that code you pasted from sbcl.

23:29:45 stassats all allocation then comes from a noninlined function called after mps_root_create_thread

23:29:48 drmeister Ok, I am certain that I don't do any allocation before mps_root_create_thread is called.

23:29:52 stassats that way you can grab &address

23:30:00 drmeister I double, triple checked that earlier today.

23:30:15 stassats drmeister: i copied the code from sbcl

23:30:51 stassats drmeister: and Process_O* my_claspProcess = (Process_O*)claspProcess;?

23:30:56 stassats where is claspProcess allocated?

23:31:09 drmeister claspProcess is allocated in the parent thread.

23:31:33 stassats and if it exits before certain mps_root_create_thread runs?

23:31:39 stassats s/certain//

23:31:52 stassats or unwinds

23:31:57 drmeister Good point - how do I ensure that?

23:32:20 stassats sbcl stops the gc during thread creation

23:32:44 drmeister Ok.

23:32:50 drmeister There's an API for that.

23:33:07 drmeister parking the arena or something - I've used it for debugging.

23:33:37 stassats since you can probably avoid allocating claspProcess, but you can't really avoid passing the function

23:33:53 stassats so either way you need to pin things down before mps_root_create_thread

23:34:53 stassats you don't really have to disable the gc to do that

23:35:19 stassats you can use a mutex or something to wait until the child thread is initialized

23:35:25 stassats and not exit the parent function

23:35:49 drmeister Ok, so (1) copy the code from sbcl to find the cold end of the stack. (2) move everything that allocates into a separate noinline function and call that from start_thread

23:36:02 stassats if you do 2, you don't really need 1

23:37:11 stassats unless you have a rogue c compiler who insists on inlining everything

23:37:11 drmeister Just having a pointer on the stack to a thing pins it down. So I just need to make sure the thread has initialized before I leave the Process_O::enable()

23:37:44 drmeister Right - option (2) is like what the MPS docs suggest and then I don't need to do (1).

23:38:09 stassats and (3) use a mutex to hold the parent thread's horses

23:38:45 stassats or a semaphore, or a condition variable, what's in vogue these days

23:39:01 drmeister A mutex? Or a condition variable? I need some way of telling the parent thread that the child thread is ready to go.

23:39:16 drmeister Right - I'll figure it out.

23:39:56 stassats they're all the same under the hood
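
A minimal sketch of option (3) with a condition variable: the parent does not leave its function until the child reports that it has finished initializing. This uses plain std::thread rather than Clasp's Process_O, and StartupGate and spawn_and_wait are hypothetical names. The line that sets child_initialized stands in for whatever the child must do first, such as registering its stack root.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot gate: the child opens it, the parent waits on it.
struct StartupGate {
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;

  void signal_ready() {
    {
      std::lock_guard<std::mutex> lock(m);
      ready = true;
    }
    cv.notify_one();
  }

  void wait_ready() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [this] { return ready; });  // predicate guards against spurious wakeups
  }
};

// Spawn a child and block until it has finished its initialization step.
// Returns what the parent observed after the wait.
bool spawn_and_wait() {
  StartupGate gate;
  bool child_initialized = false;
  std::thread child([&] {
    child_initialized = true;  // stand-in for root registration in the child
    gate.signal_ready();
  });
  gate.wait_ready();           // parent's frame, and everything it references, stays live here
  bool seen = child_initialized;
  child.join();                // join before gate goes out of scope
  return seen;
}

// Usage sketch.
void demo_startup_gate() {
  assert(spawn_and_wait());
}
```

While the parent is blocked in wait_ready, its stack frame still holds the pointers the child was handed, which is what keeps those objects reachable for a conservative stack scan.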

23:40:00 drmeister Besides getting this working - is it a reasonable idea to put the ThreadLocalState struct at the top of the stack like this?

23:40:26 drmeister Or should I declare it as a collection of roots. I know I need to get the cold end of the stack correctly regardless.

23:40:45 stassats if it's in a separate function, should be fine

23:41:25 drmeister Is it a reasonable idea to use conservative GC to keep the ThreadLocalState alive?

23:41:59 drmeister I guess the answer is "it's not a terrible idea".

23:42:16 stassats doesn't really matter how you enliven it

0:26:27 drmeister I went with option (2) and I declared: std::atomic<ProcessPhase> _Phase; and I use a spin lock in Process_O::enable() to check for when the process becomes Active.
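
A rough sketch of the atomic-phase wait described above, not the actual Clasp change. Only the type name ProcessPhase and the Active state come from the log; the other enumerators, the ProcessSketch struct, and the member names phase and thread are assumptions for illustration (the log's member is called _Phase).

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Only "Active" is named in the log; the other states are assumed.
enum class ProcessPhase { Inactive, Booting, Active };

// Hypothetical stand-in for Process_O.
struct ProcessSketch {
  std::atomic<ProcessPhase> phase{ProcessPhase::Inactive};
  std::thread thread;

  // Start the child and spin until it reports Active.
  void enable() {
    phase.store(ProcessPhase::Booting, std::memory_order_release);
    thread = std::thread([this] {
      // Stand-in for the child's setup (e.g. registering its stack root).
      phase.store(ProcessPhase::Active, std::memory_order_release);
    });
    while (phase.load(std::memory_order_acquire) != ProcessPhase::Active) {
      std::this_thread::yield();  // be polite while spinning
    }
  }

  void join() {
    if (thread.joinable()) thread.join();
  }
};

// Returns true if the process is Active by the time enable() returns.
bool enable_returns_after_active() {
  ProcessSketch p;
  p.enable();
  bool active = p.phase.load(std::memory_order_acquire) == ProcessPhase::Active;
  p.join();
  return active;
}

// Usage sketch.
void demo_phase_spin() {
  assert(enable_returns_after_active());
}
```

Spinning is simple and adequate for a short startup window; a condition variable avoids burning CPU if the child's setup can take long.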

0:26:48 drmeister It works better, but I'm still getting weird problems.

0:27:07 drmeister It's not segfaulting however:

0:27:09 drmeister https://www.irccloud.com/pastebin/YcoMrRjd/

0:27:53 drmeister You don't see any problems at all?

0:28:42 stassats none

0:29:07 stassats but any other problems can be due to thread safety

0:30:09 drmeister So there may be unsafe code lurking in there like the thing for naming modules.

0:31:10 stassats one bug at a time

0:31:57 stassats push the stack changes, i'll review them tomorrow, or whenever you push them

0:32:47 drmeister I will - I'm just going through them - I did some on the linux machine and some on my machine - I have to get everything resolved.

0:33:19 stassats enough clasp for me for the day

0:33:28 drmeister Thank you so, so much.

0:42:32 drmeister I pushed the changes through to dev and testing

0:49:29 Bike so the problem was the arcane TLS stuff?

0:49:46 stassats not really

0:50:08 stassats not telling mps where the whole stack is

0:51:22 Bike ah.

0:51:49 stassats the usual &stuff

1:10:05 drmeister Hi - I'm back from a walk - it's a beautiful night.

1:10:33 drmeister Yeah - you need to register the "cold end of the stack" and all roots need to be above it.

1:11:25 drmeister I'd forgotten that to do this properly you need to do it carefully, (1) get the address of a local variable on the stack, (2) pass it to a noinline function that does all allocation.
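
The two steps above can be sketched as follows. This is an illustration under assumptions, not Clasp's start_thread: ThreadLocalStateSketch, thread_body and thread_entry_sketch are made-up names, the root registration is only indicated by a comment, and the check assumes a downward-growing stack (as on x86-64) and a compiler that supports the GCC/Clang noinline attribute.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the thread-local state that must be scanned as a root.
struct ThreadLocalStateSketch {
  void* slots[8];
};

// Step (2): everything that allocates or holds roots lives in this function.
// Because it is not inlined, its frame sits entirely on the "hot" side of cold_end.
__attribute__((noinline))
bool thread_body(void* cold_end) {
  // A real runtime would register the stack root here, passing cold_end
  // (e.g. via mps_root_create_thread_tagged in MPS); omitted in this sketch.
  ThreadLocalStateSketch tls = {};
  auto tls_addr = reinterpret_cast<std::uintptr_t>(&tls);
  auto cold_addr = reinterpret_cast<std::uintptr_t>(cold_end);
  // On a downward-growing stack, the callee's locals are at lower addresses.
  return tls_addr < cold_addr;
}

// Step (1): take the address of a local in the outer function and pass it down.
__attribute__((noinline))
bool thread_entry_sketch() {
  void* marker = nullptr;  // its address serves as the cold end of the stack
  return thread_body(&marker);
}

// Usage sketch.
void demo_cold_end() {
  assert(thread_entry_sketch());
}
```

If the marker and the thread-local state were declared in the same function, the compiler would be free to order them either way within the frame, which is the failure described in the following messages.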

1:12:15 drmeister I was declaring the local variable and allocating objects in the same function and the compiler rearranged them so that roots were below what it thought was the cold end of the stack.

1:12:54 drmeister And so very important thread local data structures were being collected and everything broke.

1:15:42 Bike one function does all allocation? or just all thread local allocation?

1:16:35 stassats a function call should be separating the point of registering stack roots and using them

1:16:59 Bike i see...

1:16:59 drmeister The function that set the stack_base/cold_end_of_the_stack also allocated the ThreadLocalState on the stack - the order is determined by the compiler and not the order I expected or wanted.

1:17:29 Bike oh, so it reordered the stack frame so the stack wasn't where you thought it was, i really do see.

1:17:33 stassats my fix was using pthread_attr_getstack

1:17:45 drmeister Yes - and I did that for the main thread for years - but the multithreading is new and I forgot that rule.

1:17:51 Bike how useful

1:18:52 drmeister stassats: An atomic-incf function - it should take a symbol and operate on the symbol-value of that symbol - correct?

1:19:17 Bike then you can't atomic-incf a structure slot or cons slot or anything

1:19:27 stassats drmeister: a place

1:19:30 drmeister Right - that's what was nagging me.

1:20:09 drmeister A place - how do I do it on a place? That needs a macro and I do atomic things in C++.

1:20:47 drmeister I'm missing something simple.

1:21:13 stassats you define different operations on different places

1:21:15 Bike you could have a macro that expands into whatever based on the place. like setf. i think sbcl does this.

1:22:10 drmeister So we need a new macro facility - like setf. I can see the value of it. We should steal sbcl's