freenode/#sbcl - IRC Chatlog

14:23:31 stassats not copying &rest above the frame is not without its problems

14:23:48 stassats since if there's nothing to move it touches more memory, to establish a new frame

14:25:49 stassats and tail-calling becomes a headache

14:26:56 stassats the other option is burning a register on a second frame pointer

14:29:03 stassats wonder if i should pursue it further or cut my losses

14:30:21 stassats on a large number of &rest it's a clear win, but it requires modifying a lot of call/return vops and is slightly slower when the number is small

14:34:33 flip214 Would it make sense to move &rest to some heap-allocated frame instead of holding them on the stack?

14:34:48 stassats absolutely not

14:35:49 flip214 apropos, I'd really appreciate a pair of functions that would allow splitting a closure up into a code and an environment pointer, so that C callback structures that use a single (void*) argument for many functions (like a C++ vtable) can be handled more easily

14:38:08 stassats why does it have to be a closure?

14:39:39 pfdietz Is (lambda (&rest r) … (apply #'fn r)) optimized as a tail call?

14:39:54 stassats yes

14:41:41 stassats it's not very optimized, as it has to copy R twice

14:44:59 flip214 stassats: because I need to pass some state around

14:45:12 stassats are closures the only way to pass state?

14:48:25 flip214 stassats: well, in this case I'm pushing multiple data items to C, and the C functions will run the callbacks at some later time, possibly one after another.

14:48:34 flip214 So I can't use a global or special variable.

14:49:00 stassats so other than closures or global variables, you can't pass anything?

14:49:30 flip214 and instead of allocating some struct manually, storing my data in there, and passing its locked address around as a void*, I hoped that I could do that implicotly

14:49:40 flip214 *implicitly via the environment of closures

14:50:06 stassats and if closures is all you have, what's wrong with them anyway?

14:52:08 flip214 instead of creating fresh (C-api) structures of closures all the time, I'd like to have one static structure with all the functions set up, and get the environment passed in via the (void*)

14:52:49 stassats i still don't understand because i guess you're describing a solution, not the problem

14:53:24 stassats is all you want a callback on #'funcall or something?

14:54:04 flip214 there's a (foreign) structure that stores 8 or 10 function pointers. I allocate one, store closures in there, and call the C api with it. many, many times.

14:54:45 flip214 I would have hoped that I could instead allocate _one_ of these structures, store function pointers in there, and pass that _single_ structure to the C api _every_ time.

14:55:21 flip214 but then I need the environment passed around via the available (void*)

14:56:07 flip214 as an implicit way to pass the required state on, instead of allocating some class with the data myself

14:57:01 stassats i don't get it, why can't you allocate it once now?

14:57:53 flip214 because these are queued up and called at some later time

14:58:15 stassats ok, and how can you do that without allocating anything then?

15:00:26 flip214 I'd like to avoid _manually_ allocating something to keep the state. The dynamic environment has all the information, so I'd like to pack that into a void* and send it on

15:00:58 stassats send closures?

15:01:29 stassats closures are the environment, so what else do you need?

15:02:44 flip214 I guess it would be more easy to explain in person.... sadly you won't be at SBCL20, but perhaps I'll get a chance at the next ELS

15:03:52 stassats i doubt it, i'm not even close to getting how the normal closures are not suitable

15:11:33 flip214 yeah, seems I can't explain coherently enough... thanks for the patience, anyway!
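
One possible reading of what flip214 is after, sketched in portable Common Lisp: keep a single static callback on the C side and pass a small integer handle through the `(void*)`, which a Lisp-side table maps back to the closure. All names here (`*callback-table*`, `*next-handle*`, `register-closure`, `dispatch-callback`) are hypothetical and do not come from the log or from SBCL. The foreign-callback glue is left out, and the table is not thread-safe as written.

```lisp
;; Hypothetical names throughout; a sketch, not an SBCL or CFFI API.
(defvar *callback-table* (make-hash-table)
  "Maps integer handles to Lisp closures waiting to be called back.")

(defvar *next-handle* 0
  "Last handle handed out; incremented for every registration.")

(defun register-closure (closure)
  "Store CLOSURE and return an integer handle that could travel to C as the void*."
  (let ((handle (incf *next-handle*)))
    (setf (gethash handle *callback-table*) closure)
    handle))

(defun dispatch-callback (handle &rest args)
  "What the one static C-visible callback would do: find the closure and call it.
The entry is removed first, assuming each queued callback fires exactly once."
  (let ((closure (gethash handle *callback-table*)))
    (unless closure
      (error "No closure registered for handle ~D" handle))
    (remhash handle *callback-table*)
    (apply closure args)))

;; Usage: the closure captures STATE, and only the handle crosses the C boundary.
;; (let* ((state 40)
;;        (handle (register-closure (lambda (x) (+ state x)))))
;;   (dispatch-callback handle 2))   ; => 42
```

With 8 or 10 functions sharing one environment, the table entry could hold a vector of closures instead of a single one, so the foreign structure of function pointers is allocated only once.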

15:14:53 pfdietz One problem with closures is they can't be serialized.

15:15:19 stassats` for some value of can't

15:16:01 pfdietz Support for closure savings/loading in fasls would be really nice for some things I've tried to do.

15:16:30 stassats` fasls are not required to do so, so...

15:16:43 pfdietz Right.

15:31:03 puchacz hi, can somebody please tell me if I am on the right track? I am trying to prepare a bug report for something that happens only if I run my program for about 30 - 60 minutes on a 32-core (virtual) computer. I did not isolate it yet, but I tried to follow http://www.sbcl.org/manual/#Signal-Related-Bugs - I recompiled sbcl with :sb-show, :sb-show-assem, :sb-qshow, :sb-xref-for-internals and :sb-hash-table-debug

15:32:03 puchacz with these settings, I delivered image to the 32 core server and when the bug appeared, it did not corrupt the image this time, but I still tried to follow the steps with gdb and got this: https://paste.ubuntu.com/p/GJphMPXRwV/

15:33:36 puchacz before when hunchentoot unbound session secret appeared, and USOCKET:BAD-FILE-DESCRIPTOR-ERROR signal, the image was corrupted with the message "continuing with fingers crossed", e.g. https://paste.ubuntu.com/p/cbC7jWjsmz/

15:34:23 puchacz maybe these extra safety settings prevented image corruption or I did not wait for long enough

15:43:02 stassats` or because sb-show slows things down

15:43:25 puchacz stassats: shall I remove it?

15:43:35 puchacz sorry, stassats`

15:43:38 stassats` then you won't know what's going on

15:45:02 puchacz if I manage to isolate it, shall I submit the program that triggers it even if all I get is something like https://paste.ubuntu.com/p/cbC7jWjsmz/ or not worth it?

15:45:21 puchacz mind it never happened on my desktop PC, only 32 core computer

15:45:24 stassats` but you already know where the error happens, in 0x539f3163

15:46:06 puchacz I never debugged with ldb or gdb

15:46:19 stassats` you'll get the function name with (sb-di::code-header-from-pc (sb-sys:int-sap #x539f3163))

15:47:37 puchacz okay, I will recompile without these extras and try to provoke it again - I don't have the previous image anymore

15:47:49 stassats` yes, i you can reliably get cbC7jWjsmz/, then that's all that's needed

15:47:54 stassats` if you

15:48:27 stassats` also run with --lose-on-corruption

15:48:37 puchacz so it will quit

15:48:46 puchacz and then run again, ask for the function name, right?

15:48:53 stassats` no, it won't quit

15:49:01 stassats` it will descend into ldb

15:49:21 puchacz this is lisp, not ldb: (sb-di::code-header-from-pc (sb-sys:int-sap #x539f3163))

15:49:42 stassats` you'll have to run that in gdb

15:49:51 stassats` not that, but a different name

15:49:59 stassats` component_ptr_from_pc

15:50:43 stassats` gotta add that as a command to ldb

15:50:52 puchacz and then get name?

15:51:12 stassats` you'll get an address, you can then add a lowtag and print it

15:51:14 stassats` easy!

15:51:39 stassats` or with 15 and "print" the result in ldb

15:52:05 puchacz I don't know the syntax, I will stick to Lisp form: (sb-di::code-header-from-pc (sb-sys:int-sap #x539f3163))

15:52:11 puchacz with the right address of course

15:53:20 puchacz is there a manual for ldb?

15:54:29 stassats` help

15:55:34 puchacz tks, recompiling sbcl first with regular --fancy, nothing extra

16:01:04 stassats` another way to get rid of &rest copying, if the caller knows how much stack space the callee allocates and do that on its behalf, and then push all the arguments

16:03:08 stassats` difficult with all the different function types, closures, symbols, callable instances

17:31:01 puchacz okay, I got sbcl corruption and as stassats advised, I checked the function where it happened:

17:31:06 puchacz 0] (sb-di::code-header-from-pc (sb-sys:int-sap #X53af25c3))

17:31:06 puchacz #<code id=1175 [1] SB-IMPL::OUTPUT-UNSIGNED-BYTE-FULL-BUFFERED {53AF245F}>

17:31:30 puchacz is it useful or do I still need to isolate a small program that triggers it?

18:19:47 pfdietz Do you have more than one thread writing to the same stream?

18:20:38 puchacz pfdietz: it happens inside hunchentoot, and I think it is some sort of internet error, like I recorded yesterday, here: https://paste.ubuntu.com/p/cbC7jWjsmz/

18:20:44 puchacz see, bad-file-descriptor

18:22:09 puchacz so I don't do anything

18:25:31 flip214 puchacz: you are sure the machine is okay, hardware-wise? did you run a memtest already?

18:26:30 puchacz flip214 - it is a virtual computer at a provider, I rented it because I wanted to run a simulation on 32 cores

18:26:37 puchacz it never happened on my PC

18:28:15 puchacz but I deleted it and restored it a few times, I tried 2 locations, so it must be different hardware. unlikely it is a "hardware" (virtualised or otherwise) problem

18:29:57 pfdietz Does it work with CCL?

18:31:28 puchacz I haven't tried full setup, but it looked to me that for my problem CCL (when I ran on my PC) was about 10 times slower, so I gave up.

18:35:48 puchacz I still have it up, if you have a suggestion what I shall type into REPL

18:39:45 pfdietz Is it reproducibly that same failure, in SBCL?

18:39:58 puchacz it looks like it is the same failure, but

18:40:32 puchacz I did not isolate it, so I observed with my full program running and

18:40:44 puchacz it only happens on the 32 core virtual computer, not my PC

18:40:55 puchacz it happens after about an hour of running

18:43:23 flip214 puchacz: and you're running full-speed against hunchentoot there? or a limited amount of traffic?

18:45:11 puchacz full speed

18:45:50 puchacz hunchentoot is an "interface" between my program on that virtual computer that runs simulations with different parameters, and Mathematica running on my PC that provides these parameters via HTTP

18:46:25 puchacz so no simultaneous calls but hundreds and then thousands of calls in rapid succession.

18:47:22 flip214 so basically single-threaded call patterns? hmmm

18:47:35 puchacz yes, but my simulation is 64 threaded

18:47:59 puchacz split the work in hunchentoot handler into 64 pieces, then join them all and respond

18:48:08 flip214 via cl-parallel, or manually?

18:48:12 puchacz manually

18:48:21 flip214 do you know whether you trigger GC often or very seldom?

18:48:34 puchacz no idea, but I create a lot of garbage.

18:48:35 flip214 so using sb-thread:join-thread et al

18:48:48 puchacz I don't call anything like this explicitly

18:49:08 flip214 well, how do you run 64 threads in parallel then?

18:51:05 puchacz flip214: like this: https://paste.ubuntu.com/p/Zny76fJpyf/

18:51:38 puchacz locks and then notify when every function ends

18:51:51 puchacz so hunchentoot thread waits
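
For comparison, a minimal fork/join sketch using `sb-thread:make-thread` and `sb-thread:join-thread`, the approach flip214 asks about. This is not puchacz's code: the paste is not reproduced in the log, and by his description it uses locks and notification instead. The function name `run-pieces` and its arguments are hypothetical, and the sketch assumes an SBCL built with thread support.

```lisp
;; Hypothetical helper; requires SBCL with threads (sb-thread).
(defun run-pieces (piece-fn n-pieces)
  "Call (funcall PIECE-FN i) for each i below N-PIECES, one thread per piece.
Returns the list of results in piece order once every thread has finished."
  (let ((threads
          (loop for i below n-pieces
                collect (let ((i i)) ; fresh binding so each closure sees its own i
                          (sb-thread:make-thread
                           (lambda () (funcall piece-fn i))
                           :name (format nil "piece-~D" i))))))
    ;; JOIN-THREAD blocks until the thread exits and returns its value.
    (mapcar #'sb-thread:join-thread threads)))

;; Usage: the calling (handler) thread waits here until all pieces are done.
;; (run-pieces (lambda (i) (* i i)) 4)   ; => (0 1 4 9)
```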

19:51:50 puchacz okay, it seems I will need to prepare an isolated test case and send. Realistically - next weekend :)

19:58:54 stassats puchacz: running with --lose-on-corruption will give you a backtrace

20:00:44 stassats bad file descriptor, faulting when writing to a buffer, it appears you're touching a stream that has been already closed

20:02:18 puchacz stassats, I will try again with --lose-on-corruption

20:04:53 stassats puchacz: how many simultaneous connections does it have?

20:05:24 puchacz one at a time, but I spawn 64 threads.

20:05:38 puchacz let me re-paste what I explained before

20:05:56 puchacz hunchentoot is an "interface" between my program on that virtual computer that runs simulations with different parameters, and Mathematica running on my PC that provides these parameters via HTTP

20:06:09 puchacz so no simultaneous calls but hundreds and then thousands of calls in rapid succession.

20:06:26 puchacz my simulation is 64 threaded, split the work in hunchentoot handler into 64 pieces, then join them all and respond

20:06:28 puchacz done

20:08:12 stassats next i'd modify HUNCHENTOOT::*SUPPORTS-THREADS-P* to always default to NIL and try that

20:08:44 puchacz I will try to get ldb stack trace

20:08:48 puchacz first

20:09:12 puchacz but you are right, hunchentoot does not need to be parallel, it is not a bottleneck

20:09:14 stassats sure, but i have a feeling it's not going to help, but it's useful to know that it's not going to help anyway

20:09:39 puchacz stassats, maybe I will learn something about low level debugging :)

20:09:41 stassats does usocket use finalizers or something?

20:09:56 puchacz I did not check, all standard quicklisp

20:12:13 stassats sb-bsd-sockets uses finalizers, so, the answer would be yes

20:14:02 stassats but all that does is closing the socket, might explain the bad-file-descriptor-error, but not the segfault

20:14:37 puchacz stassats, did you see I recovered the function name as you advised?

20:14:38 puchacz #<code id=1175 [1] SB-IMPL::OUTPUT-UNSIGNED-BYTE-FULL-BUFFERED {53AF245F}>

20:14:45 stassats yes

20:14:48 puchacz ok

20:15:08 stassats not really useful

20:15:31 stassats but, paste the (disassemble #'SB-IMPL::OUTPUT-UNSIGNED-BYTE-FULL-BUFFERED) anyway

20:20:01 cosimone_ ** NICK cosimone

20:20:27 stassats ok, not necessary

20:20:33 puchacz just done it

20:20:35 stassats so, the buffer is also finalized

20:21:29 puchacz https://paste.ubuntu.com/p/qrYP9Wvq9Z/

20:22:57 stassats ok, faulting at (setf (sap-ref-8 (buffer-sap obuf) tail) byte)

20:24:09 puchacz (it is running the simulations now with --lose-on-corruption by the way, if we are lucky it will corrupt it in 30 minutes maybe)

20:24:44 puchacz it ALWAYS corrupted it on 32 cores, just with different times, but not too long

20:25:35 stassats but you're connecting more than once?

20:26:36 puchacz mathematica sends something like http://my-ip/simulate?x1=3423.324&x2=34.7&x3=.... etc. and waits

20:26:51 puchacz hunchentoot starts 64 threads to simulate with the x1 x2 x3 etc.

20:27:02 stassats how often does it send it?

20:27:17 puchacz when all threads complete, it sends back in the same handler one number as a result, simulation score

20:27:38 puchacz mathematica immediately after receiving it sends another http request, with different values of x1 x2 etc.

20:27:45 puchacz and so on

20:27:47 stassats ok

20:31:37 puchacz it seems to be completing one simulation (so one http request / 64 threads / response cycle) within say 3 seconds

20:31:59 stassats ok, don't really need the frequency, just that it's more than once

20:32:08 puchacz ok

20:35:54 cosimone_ ** NICK cosimone__

20:36:43 cosimone_ ** NICK cosimone

21:04:56 stassats puchacz: what sbcl version?

21:05:06 puchacz 1.5.8, but I also tried 1.5.5

21:20:21 puchacz crashed into ldb

21:23:05 puchacz okay, the stacktrace in ldb tells me https://paste.ubuntu.com/p/4wNywMKBWB/

21:23:35 puchacz that it was another requester, totally different from Mathematica trying to reach my regular web application at "/"

21:23:50 puchacz it tried to activate postgres but there is no postgres on this computer

21:24:23 puchacz so the reason it worked on my PC was simply that there were no web spiders etc. randomly trying to knock at port 8080

21:25:12 pfdietz Huh!

21:25:23 puchacz whereas on the rented computer at the provider site, it is being inspected by all sort of spiders

21:25:37 puchacz I mean it is quite unexpected that it crashed sbcl :)

21:25:59 puchacz rather than saying something like no postgres connection, which is a socket protocol, no native library loaded

21:26:16 puchacz but the problem is probably easily solvable for me

21:26:28 puchacz just disable all other entry points

21:27:22 puchacz in Lisp I have a habit of having one image.... just save-and-die, load it with all the extra baggage I don't need

21:27:44 puchacz thank you all for your support!!!

21:28:53 puchacz as a side note, I would like to learn more about low level debugging

21:29:25 puchacz may come in handy. any pointers other than help command in ldb?

21:29:43 pfdietz There's a CL library for handling ELF format. The dream: a gdb replacement written in CL.

21:30:05 pfdietz I think there still needs to be a DWARF2 parser.

21:30:19 puchacz pfdietz: shall I browse a book about working with gdb maybe?

21:37:10 puchacz (and sorry for confusion, it is a regular bug in my application after all, to leave the other handlers in)

21:37:55 |3b| "regular bug" shouldn't corrupt the image :)

21:38:14 |3b| (assuming no SAFETY 0 involved)

21:38:50 puchacz no, I had (sb-ext:restrict-compiler-policy 'safety 3 3) and same for debug

21:40:31 |3b| was it compiled with that?

21:40:35 puchacz yes

21:40:42 |3b| looks like that cl-postgres function does do SAFETY 0

21:40:58 puchacz yes, but the sb-ext function should override it, I was told

21:41:23 |3b| yeah, assuming it was in effect when it was built

21:41:42 |3b| including when any .fasl were made, so not always obvious

21:41:43 puchacz I removed ~/.cache/common-lisp before building

21:41:52 |3b| ok, that should catch it then :)

21:41:56 pkhuong |3b|: the backtrace looks like it might be a cached stream

21:42:40 |3b| pkhuong: still seems like something sbcl should catch though

21:43:22 |3b| trying to execute the query even though there isn't a postgres server available does sound like user code bug though

21:43:23 puchacz when I prepare the image for the simulation, I clear fasl cache, then I start everything up (which takes long as it has to compile all that is required from quicklisp again), then I load data from postgres into a global variable, and save-and-die

21:43:43 puchacz ah, and I unload all native libraries before saving the image

21:43:46 |3b| ok, so you probably left a live postgres connection cached in the image

21:43:51 puchacz so the connection was left there

21:44:10 puchacz it is just a strange habit to have the same image for a web application and numerical simulation :)

21:44:51 |3b| nah, being lazy and reusing something that is already set up doesn't sound that strange :)

21:46:17 puchacz anyway, if anybody wants more isolated program that would trigger the same crash, I can try to prepare it on next weekend, ping me at piotr.wasik@gmail.com

21:46:50 puchacz if it is worth fixing even, but ldb trace tells it all I think

21:59:50 |3b| yeah, looks like writing to streams left open from when image was saved gets corruption warning on linux

22:00:57 puchacz can streams even theoretically survive saving the image?

22:01:02 |3b| doesn't seem to object as much on windows

22:01:30 |3b| gray-streams could

22:02:13 |3b| file maybe could try to reopen and seek to same place or something, but i suspect that would be worse than erroring as often as not

22:02:43 puchacz Lisp level errors are OK, low level errors scare me off

22:02:43 |3b| network probably not even theoretically

22:02:48 |3b| right

22:05:08 |3b| though in this case it looks like it is the (foreign?) buffer rather than the stream that isn't surviving the save

22:06:03 puchacz okay, I am slowly running out of steam :)

22:06:18 puchacz thanks - and talk to you next time

22:06:33 |3b| yeah, seems to be isolated well enough now

22:07:08 puchacz I have enough knowledge now to run my simulations without undercover spy spiders interference

22:07:21 puchacz and if somebody wants a test case anyway, pls ping me.

22:07:31 |3b| maybe file a bug that writing to old streams after image save/reload should be handled better

22:08:08 |3b| test case = (defvar x (open ...)) (s-l-a-d) (write-byte 1 x)

22:09:15 puchacz very good, I copied it to a file

22:09:40 puchacz will raise it

22:25:57 |3b| ACTION wonders if those buffers get flushed on image save or not

22:45:07 pfdietz I would recommend not building a deliverable from a development image. You should have a script that builds the deliverable from scratch.

22:47:34 |3b| sounded like it was more or less doing that, just that part of the build involved grabbing data from a DB

22:48:00 |3b| and it happened to also include code that would try to reuse that db connection in some unexpected cases

22:50:12 |3b| arguably the db lib should clear its connection cache on image saving, but not many libs bother with that sort of thing :/
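
A minimal sketch of what "clearing a connection cache on image saving" could look like on SBCL, using `sb-ext:*save-hooks*`, the list of functions SBCL calls before writing a core. The variable `*cached-connection*` and the function `drop-cached-connection` are hypothetical stand-ins, not part of cl-postgres or any other library mentioned in the log.

```lisp
;; Hypothetical stand-in for a library's cached connection stream.
(defvar *cached-connection* nil
  "A stream kept around for reuse, or NIL when nothing is cached.")

(defun drop-cached-connection ()
  "Close and forget the cached connection so a restarted image cannot reuse it."
  (when *cached-connection*
    ;; The peer may already be gone; closing is best effort.
    (ignore-errors (close *cached-connection* :abort t))
    (setf *cached-connection* nil)))

;; Run the cleanup whenever a core is saved (SBCL only).
#+sbcl
(pushnew 'drop-cached-connection sb-ext:*save-hooks*)
```

After a restart the cache is empty, so code that needs the database has to reconnect and gets an ordinary Lisp-level error if no server is reachable.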

23:55:59 pkhuong BTS tracing isn't too bad https://gist.github.com/pkhuong/1ce34e33c6df4b9be3bc9beb22415a47