freenode/#clasp - IRC Chatlog

12:25:46 frgo ::notify drmeister Re MPS and signals: I think we need to change behavior in file src/gctools/interrupt.cc: ADD_SIGNAL( SIGSEGV, "+SIGSEGV+", ext::_sym_segmentation_violation); - Am I right that this establishes a handler represented by the symbol ext::_sym_segmentation_violation? If so: when using MPS on Linux, we're not allowed to do that. If not: I'd like to know what this line actually does.

12:25:46 Colleen frgo: Got it. I'll let drmeister know as soon as possible.

12:28:32 Shinmera Why are you not allowed to do that?

13:05:04 frgo Because MPS on Linux relies on SIGSEGV not being handled by anyone else. It uses SIGSEGV to manage memory.

13:05:53 stassats mps handles sigsegv fine

13:06:08 stassats receives sigsegv fine, not sure as to how it handles it

13:06:44 frgo Yes, it does. But if you install another handler, then this leads to MPS being prevented from doing its job.

13:07:08 stassats that's not the case here

13:50:57 drmeister Hello

13:50:57 Colleen drmeister: frgo said 1 hour, 25 minutes ago: Re MPS and signals: I think we need to change behavior in file src/gctools/interrupt.cc: ADD_SIGNAL( SIGSEGV, "+SIGSEGV+", ext::_sym_segmentation_violation); - Am I right that this establishes a handler represented by the symbol ext::_sym_segmentation_violation? If so: when using MPS on Linux, we're not allowed to do that. If not: I'd like to know what this line actually does.

13:52:30 drmeister I'll try this.

13:52:46 frgo Morning drmeister

13:53:18 frgo stassats already pointed out that MPS actually receives SIGSEGV.

13:53:30 stassats and it works on the main thread, so

13:53:42 frgo (I don't have linux here right now to test)

13:53:46 drmeister Ok.

13:54:36 drmeister My friend at Ravenbrook got back to me and wants to see a backtrace. The machine I'm using is in the Amazon Cloud - so I can give him access as well.

13:54:46 drmeister I'm just waking up - need tea.

13:55:04 stassats drmeister: what i've uncovered, it actually receives multiple faults

13:55:12 stassats and resignals them

13:55:17 stassats ending with a fault at zero

13:55:23 drmeister Interesting...

13:55:35 drmeister I built a version with guards on - it took 7 hours.

13:55:42 drmeister It's like the bad old days.

13:55:46 Shinmera Like good old times.

13:56:00 stassats you have different attitudes

13:56:09 Shinmera I'm just sarcastic.

13:58:48 drmeister It doesn't reproduce the problem with the cases I tried last night - it works on simple cases.

13:59:57 drmeister Nope - it does - I was using cclasp - it behaves differently. In aclasp it fails like it did last night.

14:06:28 drmeister I can reproduce the problem and I passed it on to David along with stassats' observation.

14:12:40 frgo drmeister: I seem to have a problem with wscript in clasp: I get:

14:12:41 frgo Error >>>>>>>> In file included from /opt/common-lisp/lang/clasp/src/clasp/src/gctools/interrupt.cc:2:

14:12:42 frgo In file included from /opt/common-lisp/lang/clasp/src/externals-clasp/llvm50/include/llvm/Support/ErrorHandling.h:18:

14:12:42 frgo #include "llvm/Config/llvm-config.h"

14:12:42 frgo ^~~~~~~~~~~~~~~~~~~~~~~~~~~

14:13:00 frgo well, the file *is* there.

14:13:31 drmeister I just realized something - I can create Amazon Cloud machines with Clasp running and give people access to them. Great for debugging.

14:13:46 frgo I don't see script setting the include path for llvm...

14:14:00 frgo s/script/wscript.

14:14:01 drmeister frgo: That was the same problem that you had last night - this is with the new externals-clasp build?

14:14:10 frgo Yes.

14:14:21 drmeister Could you paste your wscript.config file?

14:14:32 frgo Sure.

14:16:03 frgo LLVM_CONFIG_BINARY = "/opt/common-lisp/lang/clasp/src/externals-clasp/llvm50/build-release/bin/llvm-config"

14:16:03 frgo EXTERNALS_CLASP_DIR = "/opt/common-lisp/lang/clasp/src/externals-clasp"

14:16:03 frgo LLVM5_ORC_NOTIFIER_PATCH = True

14:16:03 frgo SBCL = "/usr/local/bin/sbcl"

14:16:03 frgo LTO_OPTION = "thinlto"

14:16:40 frgo The llvm-config binary runs fine.

14:17:39 drmeister Yes - that is all fine - you can remove the EXTERNALS_CLASP_DIR line - that's not used anymore.

14:19:20 drmeister This is the contents of the llvm/Config directory that I think your system wants:

14:19:21 drmeister https://www.irccloud.com/pastebin/DuUWgeSB/

14:19:49 drmeister The peculiar thing is that I don't have an llvm-config.h file and I don't see the problem that you do.

14:20:16 drmeister What does your externals-clasp/llvm50/include/llvm/Config/ directory look like?

14:21:13 frgo Huh? No llvm-config.h? well...

14:21:14 frgo AsmParsers.def.in AsmPrinters.def.in Disassemblers.def.in Targets.def.in abi-breaking.h.cmake config.h.cmake llvm-config.h.cmake

14:22:59 frgo As soon as you actually build LLVM there is a llvm-config.h there... - in build-release/include/llvm/Config/llvm-config.h

14:23:21 drmeister Ah - ok

14:23:23 drmeister Right

14:24:18 drmeister https://www.irccloud.com/pastebin/Uus7cxg4/

14:24:43 drmeister That's my /externals-clasp/llvm50/build-release/include/llvm/Config/ directory - and yes there is an llvm-config.h

14:24:47 drmeister you are missing this?

14:25:24 frgo No - it's there: AsmParsers.def AsmPrinters.def Disassemblers.def Targets.def abi-breaking.h config.h llvm-config.h

14:26:07 frgo It's just that the directory ".../externals-clasp/llvm50/include" is not set as an include dir by wscript.

14:26:15 drmeister What do you get when you type this:

14:26:16 drmeister https://www.irccloud.com/pastebin/J4eews7H/

14:27:08 drmeister path-to-externals-clasp-bin-dir/llvm-config --includedir

14:28:07 frgo "/opt/common-lisp/lang/clasp/src/externals-clasp/llvm50/include"

14:28:12 frgo and this ok

14:28:25 Bike shouldn't it be the build directory?

14:28:57 frgo Bike: TRUE!

14:29:11 drmeister Yeah - shouldn't it return the ... what Bike said

14:29:21 Bike it seems that the include has to be built. the source only has whatever kind of pre file.

14:29:22 frgo Holy cow. What's happening here ...

14:29:49 drmeister frgo: Do you have the log for the externals clasp build?

14:29:56 drmeister I can generate one and we can diff them.

14:30:17 drmeister Because this looks like an externals-clasp build issue.

14:32:30 frgo No - that's gone

14:33:01 drmeister Could you wipe it out and rebuild it - I'll do the same here and we can compare.

14:33:09 frgo Sure

14:33:52 drmeister I'm going to clone another copy and build that.

14:35:48 drmeister https://www.youtube.com/watch?v=qwnyIOoL-LM

14:36:00 drmeister "Away we go"

14:38:20 frgo This video is blocked.

14:38:39 drmeister Jackie Gleason singing "Away we go"

15:04:00 drmeister stassats: How did you figure out that it gets the signal multiple times?

15:04:52 drmeister Ah - you hacked protli.c

15:16:30 frgo drmeister: Sent you email with build log for externals-clasp attached.

15:17:12 drmeister Did it create an llvm-config.h?

15:17:29 drmeister Or sorry - what does llvm-config --includedirs return

15:20:08 drmeister llvm-config --includedir

15:21:08 drmeister Because there is no indication in your log of any problem

15:21:43 frgo Ouch. I had misconfigured the path to llvm-config in my wscript.config.

15:31:14 drmeister Ok, once it's fixed use: ./waf configure build_cmps

15:31:45 frgo Thx again. Build is running fine now (356/366) ...

15:31:46 drmeister Although I'm starting to think there may be an issue with MPS and threading.

15:32:15 drmeister I see problems when I try to create >50 threads on OS X and then there are the problems that we ran into on Linux.

15:32:16 frgo There is. The way signals are delivered to threads.

15:32:50 frgo As we have pthreads we need to use pthread_kill() to send signals to threads.

15:33:31 frgo SIGSEGV is currently not delivered to threads in a pthreads-safe way.

15:33:45 frgo I just was looking into this.
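
Note: a minimal sketch of directing a signal at one specific thread with pthread_kill(), which is the call frgo mentions above. The handler, the worker loop and every name in it are hypothetical and are not taken from Clasp or MPS. Keep in mind that a SIGSEGV caused by a memory fault is generated synchronously for the faulting thread; pthread_kill() matters for signals that one thread sends to another on purpose.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <pthread.h>
#include <unistd.h>

// All names below are hypothetical; this is not Clasp or MPS code.
static std::atomic<bool> g_ready{false};         // worker has started
static std::atomic<bool> g_done{false};          // tells the worker to exit
static std::atomic<bool> g_handler_ran{false};   // handler was invoked
static std::atomic<bool> g_ran_on_target{false}; // handler ran on the worker
static pthread_t g_target;

// Records which thread the handler actually ran on.
static void on_usr1(int) {
  g_ran_on_target.store(pthread_equal(pthread_self(), g_target) != 0);
  g_handler_ran.store(true);
}

static void* worker(void*) {
  g_ready.store(true);
  while (!g_done.load()) usleep(1000); // idle until told to stop
  return nullptr;
}

// Sends SIGUSR1 to the worker thread only and reports whether the
// handler ran on that thread.
bool signal_reaches_target_thread() {
  struct sigaction sa {}, old {};
  sa.sa_handler = on_usr1;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGUSR1, &sa, &old) != 0) return false;
  if (pthread_create(&g_target, nullptr, worker, nullptr) != 0) return false;
  while (!g_ready.load()) usleep(1000);

  int rc = pthread_kill(g_target, SIGUSR1); // directed at one thread

  // Wait (bounded) for the handler to run, then shut the worker down.
  for (int i = 0; i < 2000 && !g_handler_ran.load(); ++i) usleep(1000);
  g_done.store(true);
  pthread_join(g_target, nullptr);
  sigaction(SIGUSR1, &old, nullptr);
  return rc == 0 && g_handler_ran.load() && g_ran_on_target.load();
}

// Usage: assert(signal_reaches_target_thread());
```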

15:34:57 drmeister Ah - excellent - please don't let me interrupt you.

15:35:41 frgo ;-) hehe - you interrupt me? What am I doing all day long with you?

15:39:56 drmeister frgo: I can give you access to the Linux machine that has Clasp built and exhibits the problem - would that help?

15:47:04 frgo Later on - I am setting up a small app that helps demonstrate the issue. For not having to build clasp over and over ;-)

16:11:24 drmeister I have to run an errand for an hour - I'll be back after that.

16:12:18 drmeister A friend is giving us a fruit (persimmon) tree - we need to pick it up.

17:19:48 stassats i have /tmp filled with clasp-log-14357

17:25:12 drmeister Ah - yeah - that's a build feature for debugging - I'll turn that off.

17:25:45 drmeister It writes out JITted symbols and their addresses and sizes to symbolicate backtraces.

18:16:01 drmeister The Ravenbrook folks sound pretty busy - we may be on our own for a while.

18:16:12 drmeister I'm reading MPS documentation again...

18:58:12 clasper drmeister: according to this video: https://vimeo.com/216547984 Azul has done a lot of work on the llvm for jited languages

18:58:52 clasper it is nice to know that they are making the llvm mare amenable to managed languages

18:59:12 clasper mare/more

18:59:51 drmeister clasper - thanks - I'm watching it now.

19:18:51 drmeister I inserted this into sigHandler:

19:18:52 drmeister printf("%s:%d caught signal sig=%d SEGV_ACCERR=%d siginfo_t=%p context=%p \n", __FILE__, __LINE__, sig, SEGV_ACCERR, info, context);

19:19:05 drmeister I get this:

19:19:36 drmeister https://www.irccloud.com/pastebin/Y3UPJ9Uj/

19:20:19 drmeister Why is SEGV_ACCERR=2? I thought it should be 11 (SIGSEGV)

19:20:41 frgo That's looking good. 0x7f.... addresses are on the stack.

19:20:57 drmeister This is on linux.

19:21:04 drmeister Does what you say still hold?

19:21:13 frgo So, these are pointers. Yes, it does.

19:22:50 drmeister So, what am I looking at here? Is sigHandler called because the program touched memory that had a barrier over it?

19:23:22 frgo and it is ok to get a SIGSEGV. What's the backtrace at that point? As we have a core we should be able to see what happened.

19:23:38 frgo Can't say without the backtrace.

19:24:09 frgo As stassats did: we'd need to look at what siginfo is telling us.

19:25:21 drmeister This is all I get from the backtrace when I look at the core file:

19:25:23 drmeister https://www.irccloud.com/pastebin/UkOKjgOe/

19:26:19 frgo Ok, so the trap that made us get a SIGSEGV is in another thread.

19:26:48 frgo We need to run lldb and execute cclasp in it.

19:28:02 stassats that will end badly

19:28:08 frgo ?

19:28:24 drmeister I don't get the same behavior when running within a debugger.

19:28:51 drmeister Also - I haven't gotten lldb to run on linux - it wants the llvm server running.

19:28:58 drmeister I use gdb on linux

19:29:00 stassats i'm preventing it from resignalling, the fault is at 0x88

19:29:02 frgo Oh.

19:29:48 stassats cmp r14,QWORD PTR [rsi+0x8]

19:29:51 stassats the instruction

19:30:09 stassats p $rsi

19:30:10 stassats $1 = 128

19:30:35 stassats the function is core::Cache_O::search_cache(core::CacheRecord*&):

19:31:13 stassats but they always change, so it's something not doing enough book-keeping, and not search_cache being broken

19:31:15 drmeister Ah - ok. That should be a thread local cache.

19:31:55 stassats now a fault at 0x7f4032e43ea8

19:32:03 drmeister But it's an old ECL style dispatching cache that I use for C++ messages.

19:32:22 stassats gctools::smart_ptr<core::T_O>::nilp (this=0x7f4032e43ea8) called by search_cache

19:33:54 stassats now a fault in 0x1c8, search_cache again

19:34:17 drmeister I'm checking to see if that cache is still thread local and being set up properly.

19:34:21 stassats now that i'm just aborting on bad addresses, i'm always getting faults in search_cache

19:35:18 drmeister Could it be that search_cache is doing something not thread safe?

19:35:41 stassats there's only one thread really running

19:36:34 drmeister Could it be that search_cache is doing something that is not MPS, moving garbage collection, safe?

19:36:53 stassats not happening on the main thread

19:37:24 drmeister It's an open hashed hash table of selector keys to effective method functions - for single dispatch C++ methods.

19:37:41 drmeister It is not location aware (yet) - but I was hoping to get rid of it and use fastgf.

19:37:42 stassats now i'm faulting at core::DynamicBindingStack::pop_binding , tried a different test case

19:38:07 stassats the fault address is 0x1fc

19:38:43 drmeister Could you explain a bit more about these fault addresses? These are addresses that the system is trying to read and causing a SIGSEGV?

19:39:23 stassats https://en.wikipedia.org/wiki/General_protection_fault

19:39:32 drmeister Where do you find the address? In the siginfo_t structure or the context passed to sigHandle in protli.c?

19:39:38 drmeister Reading... thank you.

19:40:04 stassats in siginfo

19:40:57 stassats most of the addresses are pretty low, i'm thinking you are accessing zeroed memory with some offsets
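
Note: a minimal Linux-only sketch of where these values live. The signal number (11 for SIGSEGV on Linux) arrives as the handler's first argument, while SEGV_ACCERR (2 on Linux) is a possible value of info->si_code meaning "mapped page, access not permitted", which is why the printf above shows 2 rather than 11. The fault address is info->si_addr. An access to an unmapped low address such as 0x88 would normally report SEGV_MAPERR instead. The handler below lifts the protection itself so the faulting instruction can be retried; that is only an illustration of the barrier idea and is not MPS's sigHandle. All names are hypothetical.

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical globals filled in by the handler.
static volatile sig_atomic_t g_signo = 0; // signal number (handler argument)
static volatile sig_atomic_t g_code = 0;  // info->si_code
static void* volatile g_addr = nullptr;   // info->si_addr, the fault address
static void* g_page = nullptr;
static std::size_t g_page_size = 0;

static void on_segv(int sig, siginfo_t* info, void*) {
  g_signo = sig;
  g_code = info->si_code;
  g_addr = info->si_addr;
  // Lift the protection so the faulting store succeeds when it is retried.
  if (mprotect(g_page, g_page_size, PROT_READ | PROT_WRITE) != 0) _exit(1);
}

// Maps one page with no access rights, writes to it once, and checks what
// the handler saw. Returns true if the fault was reported as
// SIGSEGV / SEGV_ACCERR at the page's address.
bool probe_protected_page() {
  g_page_size = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  g_page = mmap(nullptr, g_page_size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (g_page == MAP_FAILED) return false;

  struct sigaction sa {}, old {};
  sa.sa_sigaction = on_segv;
  sa.sa_flags = SA_SIGINFO; // ask for the siginfo_t argument
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGSEGV, &sa, &old) != 0) return false;

  *static_cast<volatile char*>(g_page) = 1; // faults exactly once

  sigaction(SIGSEGV, &old, nullptr);
  void* page = g_page;
  munmap(g_page, g_page_size);
  return g_signo == SIGSEGV && g_code == SEGV_ACCERR && g_addr == page;
}

// Usage: assert(probe_protected_page());
```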

19:42:29 stassats are you registering your binding stack with mps?

19:42:51 stassats does it know about the values it is holding?

19:43:07 drmeister The binding stack is allocated within MPS memory.

19:43:48 drmeister The pointer to it (a root) is stored at the top of the stack. This is certainly true for the main thread. I'll recheck the code for threads.

19:43:58 stassats this->_ThreadLocalBindings.resize(index+1,_NoThreadLocalBinding<T_O>()); does that still happen in mps memory?

19:44:23 drmeister Checking...

19:45:54 drmeister Yes, it must be in MPS memory. _ThreadLocalBindings are defined here:

19:45:54 drmeister https://github.com/drmeister/clasp/blob/dev/include/clasp/gctools/threadLocalStacks.h#L33

19:46:19 drmeister It's a gctools::Vec0<T_sp>. That's a stretchy vector that is maintained with the GC managed memory.

19:46:32 drmeister It's like std::vector<xxx> but it works in the GC managed memory.

19:46:51 drmeister The DynamicBindingStack is defined here:

19:47:06 drmeister https://github.com/drmeister/clasp/blob/dev/include/clasp/gctools/threadlocal.h#L14

19:47:33 drmeister The entire ThreadLocalState is stored in the stack of the thread, right when the thread starts up.

19:47:40 drmeister For the child threads that is here:

19:47:54 drmeister https://github.com/drmeister/clasp/blob/dev/src/core/mpPackage.cc#L109

19:48:35 drmeister The base of the stack is defined just before the my_thread_local_state.

19:49:00 drmeister And the stack is registered using mps_root_create_thread_tagged here:

19:49:07 drmeister https://github.com/drmeister/clasp/blob/dev/src/core/mpPackage.cc#L124

19:49:22 drmeister I'm doing this for my benefit mostly to document that I'm doing this correctly.

19:50:01 stassats drmeister: if ThreadLocalState is allocated before mps_root_create_thread_tagged, what happens?

19:50:28 drmeister Hmmm, that might be a problem.

19:51:05 drmeister The ThreadLocalState constructor does allocate memory using MPS calls...

19:53:11 drmeister Oh sh*t - I don't even have the allocation points initialized when I declare the ThreadLocalState on the stack.

19:53:16 drmeister Rearranging...

19:54:07 drmeister I think the order will be...

19:54:24 stassats another test, now i have three threads crashing at 0x240

19:56:00 stassats all coming from NEXT-RUN-TIME-MODULE-NAME...->vectorPushExtend

19:56:08 drmeister I changed the code to this...

19:56:21 drmeister https://www.irccloud.com/pastebin/xvmgyUzQ/

19:56:41 stassats all three threads vectorPushExtend into the same thing

19:57:00 drmeister I was seeing it crash in there as well.

19:57:03 stassats the same string-output-stream, i assume

19:57:11 drmeister vectorPushExtend

19:57:30 drmeister Compiling and linking...

20:01:35 drmeister I'm still getting segfaults

20:02:23 drmeister My backtraces are a lot less informative for the past couple of hours:

20:02:24 drmeister https://www.irccloud.com/pastebin/tz8J22rJ/

20:03:47 drmeister Can I get you any info that might be helpful?

20:05:12 stassats so, where is that string-output-stream coming from?

20:05:57 stassats if (destination.nilp()) {

20:05:57 stassats output = my_thread->bformatStringOutputStream();

20:06:02 stassats that's probably it

20:07:13 drmeister Is there anything I can tell you about that?

20:07:26 stassats why do multiple threads share it?

20:07:52 stassats why does a single share it?

20:07:53 drmeister Investigating and trying to remember...

20:10:40 drmeister Every thread gets its own _BFormatStringOutputStream

20:10:47 drmeister So they don't share it.

20:11:00 drmeister https://github.com/drmeister/clasp/blob/dev/src/gctools/threadlocal.cc#L208

20:12:08 drmeister It's just a thread local string-output-stream for (core:bformat nil ...)

20:12:09 stassats i'm clearly seeing three threads calling core::MDArray_O::vectorPushExtend with this being 0x7fc24a957020

20:12:39 drmeister Hmmm.

20:12:56 drmeister So they are all stomping on the same string-output-stream.

20:13:20 stassats from bformat

20:14:14 drmeister Could you give me the test case that you are using to generate this problem? But it's not completely reproducible - is it?

20:14:54 stassats (loop repeat 1000 do (mp:process-run-function nil #'(lambda () (core:bformat nil "module%s" 10))))

20:15:12 stassats i don't think that's the bformat call, some other call is actually causing it

20:15:53 drmeister And you are using cclasp - starting it with iclasp-mps ?

20:16:01 stassats yes

20:18:16 drmeister Here's what I see - I don't think I'm reproducing what you see:

20:18:17 drmeister https://www.irccloud.com/pastebin/PeW69zgE/

20:19:17 stassats you don't have an abort in sigHandle

20:19:21 drmeister But I have guards on - I should rebuild with them off

20:19:27 drmeister What does your sigHandle look like?

20:19:55 stassats a print and an abort after it doesn't know how to handle it

20:20:27 drmeister For curiosity's sake: What do you print?

20:20:46 stassats the address, naturally

20:21:52 drmeister So - like this?

20:21:53 drmeister https://www.irccloud.com/pastebin/OqOr4uV4/

20:22:34 stassats that's not after it doesn't know how to handle it

20:22:35 drmeister Or the abort is below the if(info->si_code == SEGV_ACCERR) {...} block

20:23:28 drmeister Right - so this:

20:23:30 drmeister https://www.irccloud.com/pastebin/Asi6wQPE/

20:24:05 stassats don't print the address before

20:24:09 stassats you'll drown in it

20:25:23 drmeister I'm not however - I've been doing fine with this. Could it be the guards that I have in place?

20:25:37 drmeister I only see 2 or zero print statements.

20:25:49 drmeister I'll move it down nonetheless.

20:27:16 drmeister I'm compiling this now:

20:27:17 drmeister https://www.irccloud.com/pastebin/DyA4ToHP/

20:29:32 drmeister The spacing is weird because I'm running this on an AWS machine.

20:31:48 drmeister I don't understand how three threads can have the same _BFormatStringOutputStream - each thread gets its own.

20:33:58 drmeister I put in a printf statement right after the _BFormatStringOutputStream is initialized - this is what I get when I run your testcase:

20:34:00 drmeister https://www.irccloud.com/pastebin/YPHMjrQP/

20:35:42 stassats that's not the bformat that's actually failing

20:35:49 stassats for me

20:36:09 stassats i have failures without calling bformat directly

20:36:14 stassats just from NEXT-RUN-TIME-MODULE-NAME

20:38:22 drmeister next-run-time-module-name is accessing a dynamic variable in a thread unsafe way.

20:38:30 drmeister *run-time-module-counter*

20:39:02 stassats do the names have to be unique?

20:39:16 drmeister No

20:39:25 stassats then it's safe enough for this exercise

20:39:42 drmeister But there should only be one compilation going on here.

20:40:16 stassats it's from dispatch

20:40:38 drmeister Ah - well, that can come from anywhere.

20:41:04 drmeister I should get rid of that special variable then and every module will have the same name.

20:41:27 stassats not a good naming scheme

20:41:32 drmeister Although this can't be the true problem because it was happening in aclasp as well - which doesn't do dispatch.

20:42:18 drmeister What would you recommend for the naming scheme? A thread local name with a counter that is thread local? Or put a lock around the name calculation?

20:43:45 stassats atomic-incf

20:44:37 drmeister Ok, I'll have to write that.

20:46:19 drmeister So, it's a function that takes a symbol value and does an atomic incf on its symbol-value
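
Note: a minimal sketch of the atomic-increment idea in C++, assuming the counter can live in a std::atomic rather than in a Lisp symbol value. next_module_name and count_unique_names are hypothetical names used only for this illustration; the real *run-time-module-counter* is a Lisp special variable, and an atomic-incf on a symbol value would need its own implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the module counter.
static std::atomic<unsigned long> g_module_counter{0};

// fetch_add returns the previous value and increments in one indivisible
// step, so no two callers can ever observe the same number.
std::string next_module_name() {
  unsigned long n = g_module_counter.fetch_add(1, std::memory_order_relaxed);
  return "module" + std::to_string(n);
}

// Generates names from several threads at once and counts how many
// distinct names were produced.
std::size_t count_unique_names(int n_threads, int per_thread) {
  std::set<std::string> seen;
  std::mutex seen_lock; // protects only the result set, not the counter
  std::vector<std::thread> pool;
  for (int t = 0; t < n_threads; ++t) {
    pool.emplace_back([&] {
      for (int i = 0; i < per_thread; ++i) {
        std::string name = next_module_name();
        std::lock_guard<std::mutex> guard(seen_lock);
        seen.insert(name);
      }
    });
  }
  for (auto& th : pool) th.join();
  return seen.size();
}

// Usage: assert(count_unique_names(8, 1000) == 8000);
```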

20:46:21 stassats does the main thread call start_thread too?

20:46:53 drmeister No, it doesn't. It has its own complicated startup and shutdown process.

20:47:28 stassats does it repeat all the same stuff, dynamic bindings, mps registration?

20:47:49 drmeister Now that you mention it, I will check the exact sequence against the one in mpPackage.cc

20:48:41 drmeister I'll just get atomic_incf implemented and building and get right on it.

20:48:57 stassats postpone atomic_incf

20:49:02 stassats it won't solve anything

20:49:21 drmeister Ok.

20:50:36 drmeister Checking the order now...

20:59:25 stassats my_thread->_BFormatStringOutputStream seems to be indeed different, but the vector to which it pushes is the same

21:00:47 drmeister There are a couple of differences.

21:01:01 drmeister The main thread initializes the allocation-points too late.

21:01:30 drmeister And the main thread doesn't appear to call my_thread->initialize_thread() searching...

21:02:07 drmeister Ah - no - it does - checking the order relative to everything else.

21:11:18 drmeister Hmm, no - that's not it. The ThreadLocalState does not invoke the MPS allocators - so the order is fine and the order I had in start_thread previously was fine.

21:11:34 drmeister Which could be why rearranging things didn't make any difference.

21:11:53 drmeister Double checking my work and thinking...

21:18:45 drmeister stassats: In your core dumps - do you see more than two threads?

21:20:56 stassats when i created more than two

21:21:32 drmeister With this: (loop repeat 1000 do (mp:process-run-function nil #'(lambda () (core:bformat nil "module%s" 10))))

21:21:47 drmeister I only tried it once - I only got two threads in the core dump

21:23:55 stassats i'm currently using (loop repeat 100 do (mp:process-run-function nil #'(lambda () (catch 'x (eval '(throw 'x 10))))))

21:25:56 drmeister I've double checked the order of startup - everything appears fine.

21:26:23 drmeister There are differences - but allocation points are created before allocations take place.

21:26:45 drmeister I'm rearranging the start_thread code to exactly mirror the main thread code.

21:27:05 stassats just share the code?

21:28:13 drmeister I will do that next/now.

21:29:28 drmeister That takes more work - and I have to share the start_thread code - that is possibly giving us trouble.

21:29:40 drmeister The main thread code is more convoluted and does a lot more stuff.

21:30:51 drmeister The main thread code isn't a simple function like start_thread - it's more scattered and comes in from main(...)

21:42:08 stassats vectorPushExtend is called on (core::MDArray_O *) 0x7fc3286e9020

21:42:46 stassats i forget where i was going with that

21:43:12 drmeister It happens to the best of us.

21:44:26 drmeister Is there a way to remove any memory protection on an address?

21:44:40 stassats just need to restrict the number of backtrace frames printed, otherwise i'm losing the numbers out of sight

21:44:52 stassats ok, these two threads push to different strings

21:44:57 drmeister Oh wait - wrong address

21:45:03 stassats but crash at the same place and at the same address

21:45:19 stassats crashing at this->_Data->rowMajorAset(idx+this->_DisplacedIndexOffset,newElement);

21:46:33 drmeister Is there anything I can tell you about that?

21:51:05 stassats this time the crash is at call QWORD PTR [rax+0x1a0]

21:51:17 stassats the vtable?

21:51:55 drmeister I don't know -

21:52:22 stassats it comes from the argument to the function

21:52:34 stassats anyhow, the stream is corrupted, it appears

21:53:39 drmeister This is the _BFormatStringOutputStream?

21:54:26 stassats yes

21:59:27 drmeister I'm single stepping through the initialize_thread method where the _BFormatStringOutputStream is created - I noticed there is a thread unsafe increment of a counter that I put in that counts whenever a class is allocated - I will make that thread safe using an atomic variable

21:59:50 drmeister But nothing depended on it.

22:01:11 stassats ok, i don't think there's any concurrent access happening

22:01:33 stassats just the gc not properly managing thread's memory

22:02:00 drmeister So the MPS has a problem?

22:02:12 stassats you have a problem with mps

22:02:18 drmeister ACTION finally goes there

22:02:33 drmeister I have a problem with the MPS or the MPS has a problem?

22:02:50 drmeister How did you reach that conclusion?

22:03:15 stassats took all inputs, ran the neural network on it

22:03:29 drmeister The one behind your eyes?

22:03:56 stassats you're not properly setting it up

22:04:20 drmeister Ok - any ideas where/how/what I'm doing wrong?

22:04:30 stassats is it able to stop the world correctly?

22:04:51 drmeister The MPS can - yes, there is a function call for that - would that help?

22:12:24 Bike wait, there is?

22:12:49 drmeister There is - what? A function to stop the world... I'm looking for it.

22:13:13 stassats it doesn't always crash, so at least something is working

22:13:28 stassats but it regularly crashes at bformatStringOutputStream

22:13:45 stassats suggesting it's either not pinned down or improperly allocated

22:17:51 stassats and i have a better test case

22:18:00 drmeister Ok

22:18:00 stassats (mp:process-run-function nil #'(lambda () (let ((x (make-string-output-stream))) (write-char #\a x) (loop (assert (find #\a (get-output-stream-string x)))))))

22:18:03 stassats (gctools:garbage-collect)

22:18:08 stassats The assertion (FIND #\a (GET-OUTPUT-STREAM-STRING X)) failed

22:23:00 stassats wait, get-output-stream-string should clear any characters, why is there a delay

22:23:21 drmeister Oh yeah.

22:23:30 stassats ok, bad test case, i don't have a way to check the string without clearing it

22:23:57 drmeister Do you want a get-output-stream-string-dont-clear ?

22:24:09 stassats not really

22:25:34 drmeister Put the write-char in the loop?

22:26:53 stassats ok, everything appears to work

22:27:38 drmeister What's that?

22:28:15 stassats whatever's working works

22:28:22 stassats i haven't found a non working piece of code

22:28:33 stassats other than the original test case with bformat

22:29:22 stassats btw, can't exit with threads running, mps throws a fit

22:29:50 drmeister I don't kill threads properly yet.

22:29:58 drmeister It needs that - right?

22:30:00 stassats (mp:process-run-function nil #'(lambda () (loop (core:bformat "a%p" 10)))) crashes on its own

22:31:22 stassats huh, no, that's still from CREATE-RUN-TIME-MODULE-FOR-COMPILE

22:32:28 stassats huh

22:32:31 stassats it's the main thread

22:33:07 stassats the main thread crashes

22:33:13 stassats when printing COMMON-LISP-USER

22:33:20 stassats >

22:35:21 stassats now the crash is at 0xfffffffffffffff0

22:36:01 stassats i saw in some disassembly output mov reg, 0xffffffffffffffff asl reg,4

22:36:10 stassats it struck me as very peculiar

22:37:08 stassats like, why would a c compiler even produce such a sequence

22:37:33 drmeister Where?

22:37:51 stassats i don't remember

22:39:06 drmeister Do you have any recommendations at this point?

22:40:11 stassats looking at how the thread allocation points are initialized

22:46:14 stassats different test cases crash in different places

22:51:39 drmeister If you want a simpler test case - I can get it to crash in the interpreter - with no loaded Common Lisp.

22:52:07 drmeister (let ((c 1000)) (tagbody top (mp:process-run-function nil #'(lambda () (eval '(list 1 2 3 4)))) (setq c (- c 1)) (if (> c 0) (go top))))

22:53:10 drmeister I'm doing this now with guards on and using ./waf build_imps_d (debugging -O0)

22:54:13 drmeister So - I have 7 threads and thread 1 starts with a * next to it.

22:54:16 drmeister It has this backtrace:

22:54:18 drmeister https://www.irccloud.com/pastebin/SGe4EsVB/

22:54:51 drmeister This means the fault happened in frame #4? Or can the signal handler be caused by a fault in one of the other 6 threads?

22:55:13 stassats no, frame 4

22:56:26 drmeister Interesting - now it created about 100 threads and then failed with a different error

22:56:37 stassats yeah, i think we can stop with test cases

22:57:01 drmeister Why - do you have a best one? Or some other idea?

22:57:44 stassats they all point to the same problem — memory not being handled properly

22:58:54 drmeister Ok. There is something else to do - ASLR

22:59:01 drmeister ALSR

22:59:06 drmeister One of those - turn it off

22:59:07 stassats the thread struct, with local bindings, bformatStringOutputStream, should be a root itself

22:59:41 drmeister So - putting it on the stack is a bad idea?

22:59:51 drmeister I can make it a root itself.

22:59:53 stassats is it on the stack?

23:00:02 drmeister Yes, I put it on the stack.

23:00:15 stassats Process_O* my_claspProcess = (Process_O*)claspProcess;

23:00:15 stassats Process_sp process(my_claspProcess);

23:00:16 stassats void* stack_base = &stack_base;

23:00:16 drmeister I thought it was a cheap and easy way to make it a root and thread local

23:00:25 stassats is your &stack_base really the stack_base?

23:00:44 stassats is process below or above it?

23:00:52 drmeister That's what I pass to mps_root_create_thread_tagged

23:01:22 drmeister Process_O is allocated in the MPS memory - uh hang on...

23:01:25 stassats should've used at least &process

23:02:03 drmeister Wait - Process_O isn't allocated by the thread I don't think. Checking...

23:02:29 stassats even then, &stack_base isn't guaranteed to be the bottom of the stack where you put your other stuff

23:03:17 drmeister I thought it was - or do you mean things can be reordered.

23:03:19 drmeister Oh shit.
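
Note: a minimal sketch of the point stassats is making. The compiler is free to lay out the locals of one function in any order, so `void* stack_base = &stack_base;` says nothing about whether `process` or the thread-local state sits inside the range that gets scanned. One common way around this is to take the marker in an outer frame and keep every GC-relevant local in a separate, non-inlined function called from it. ThreadState, register_stack_root, thread_body and start_thread_sketch are hypothetical stand-ins, not Clasp or MPS code; register_stack_root only marks the spot where a call such as mps_root_create_thread_tagged would go. The sketch assumes a downward-growing stack and a GCC or Clang compiler (for the noinline attribute).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-thread state (bindings, streams, ...).
struct ThreadState {
  void* bindings = nullptr;
  void* format_stream = nullptr;
};

// Hypothetical stand-in for the root registration call; it only records
// the address it was given.
static void* g_registered_cold_end = nullptr;
static void register_stack_root(void* cold_end) {
  g_registered_cold_end = cold_end;
}

// Everything that may hold references to GC-managed objects lives in this
// frame or deeper. noinline keeps it a separate frame from the caller's.
__attribute__((noinline))
static bool thread_body(void* cold_end) {
  ThreadState state;
  volatile std::uintptr_t state_addr =
      reinterpret_cast<std::uintptr_t>(&state);
  std::uintptr_t cold = reinterpret_cast<std::uintptr_t>(cold_end);
  // On a downward-growing stack, deeper frames have lower addresses, so
  // the state must lie below the marker to fall inside the scanned range.
  return state_addr < cold;
}

// Outer frame: holds only the marker, registers it, then calls the body.
bool start_thread_sketch() {
  void* marker = nullptr;       // lives in the outer frame
  register_stack_root(&marker); // registered before any allocation happens
  return thread_body(&marker);
}

// Usage: assert(start_thread_sketch());
```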