freenode/#clasp - IRC Chatlog
Search
15:40:44
drmeister
After converting about a dozen invocations of TYPEP to xxxx-p predicates in the AST->HIR->native code, compile-file-parallel gets up to using 450% CPU.
15:45:27
beach
Also, like I said, TYPEP with a constant type descriptor can be implemented efficiently without calling FIND-CLASS.
15:47:25
beach
Secretly creating a generic function for some types, much like the ones you created manually.
15:48:13
beach
Bike: It appears to be the lock itself used in the traditional solution to multiple-reader/single-writer problems.
15:49:25
drmeister
Yeah. I watched a lecture on it last night. Acquiring a read lock involves writing to shared memory. That forces exclusive access on that memory for one CPU and invalidates caches in all the other CPU's.
15:49:32
beach
Right, but there must be a short lock to make sure there are no writers. That one appears to be the problem.
15:49:45
drmeister
So the exclusive access to the memory for the read lock ping-pongs around the different CPU's.
15:51:26
beach
Bike: You are right that there may be no control path from the first instruction of a TAGBODY to the last one, but if you don't include some special kind of edge, aren't we going to have problems with ownership calculations again?
15:52:32
drmeister
There something called Read-Copy-Update that they start to describe here: https://www.youtube.com/watch?v=BcAED2f3z0I
15:53:08
drmeister
It's used in the Linux kernel for this purpose - lots of readers, occasional updating and you want reading to be very fast.
15:53:09
Bike
beach: I mean say we have (progn (tagbody loop ... (go loop)) (foo ...)) - the foo call is never actually reached so we don't have to worry about who owns it cos it's deleted. but in (progn (tagbody ... end) (foo ...)) the end tag will just have the foo call as a normal successor. so everything should be fine.
15:53:28
drmeister
That lecture I just posted illustrates the problem with multiple-reader/writer locks.
15:54:49
Bike
i'm just a little worried there are things treating them as equivalent ot the arguments. i guess i'll see.
15:55:04
beach
drmeister: Oh, is this another case of a context switch being prohibitively expensive in "modern" operating systems?
15:56:10
scymtym
drmeister: do you already know whether readers are slowed down by the overhead of the read/write lock mechanism or whether writers actually hold the lock for extended periods of time?
15:57:45
drmeister
It's from the macOS tool "Activity Monitor" - it's like a backtrace and trace together.
15:59:41
drmeister
Ah - well, it looks like i'm not going anywhere for a while - the trains (that come every 30 min) are 36 min late.
16:02:01
Bike_
by the way, the generic function typep thing beach mentioned is what we already do for defstruct, so you could look at defstruct.lsp to see how it works (it's pretty simple though)
16:03:15
Bike
maybe i should try my make-instance biznis again, that would mostly eliminate find-class in make-instance
16:05:00
drmeister
What's happening is the exclusive access to the read-lock for find-class is ping-ponging all over the place.
16:06:42
drmeister
The child threads in compile-file-parallel are doing (TYPEP x 'cleavir-class) all over the place - I figured it was the same problem.
16:07:37
drmeister
That lecture I posted - the second half explains where the problem comes from - it's the CPU's cache algorithm that makes the locks possible that causes the problem.
16:08:25
drmeister
So I started replacing (TYPEP x '<cleavir-class>) with (<cleavir-class>-p x) and adding <cleavir-class>-p predicates.
16:12:15
Bike_
well, that's what i mean, i thought the locks would actually allow parallel use, but they aren't
16:12:32
Bike_
and i suppose there's nothing we can do about it really since they're just system locks
16:12:48
scymtym
do you store classes in cells? if so, couldn't FIND-CLASS be transformed into (load-time-value (find-class-cell …))?
16:17:35
Bike
and we have (typep x 'foo) => (classp x (car (load-time-value (gethash 'foo *class-name-hash*)))) or so
16:18:18
scymtym
the association name -> cell would never change so it could be looked up at compile-time (is the idea at least)
16:19:25
Bike
isn't it also what you were doing in sicl for classes? i thought sicl did that for practically everything
16:23:08
beach
Bike: I don't think I have gotten that far in my thinking. I hadn't imagined that FIND-CLASS would be a performance bottleneck.
16:25:30
Bike
well, it would come up with anything doing find-class in parallel- compilation is just what we're doing heavily
16:26:06
scymtym
as an aside: http://www.radgametools.com/telemetry.htm may be interesting for diagnosing these kinds of performance issues (until our clim-based tools are ready). caveats: it's probably not cheap and i don't know the platform constraints
16:28:27
Bike
honestly i'm kind of glad the problem is fairly easy to understand - i was worried it was going to be some undocumented llvm internal
16:43:58
kpoeck_
There is nearly no clos in them, perhaps thats a pattern that plays nice with the current implementation
16:50:07
drmeister
The problem with compile-file-parallel is that the algorithms for optimizing HIR call find-class like crazy.
16:55:39
drmeister
The Activity Monitor on macOS has this "sample process" function that is really nice - but it has a terrible flaw.
16:56:15
drmeister
It took me until last night to finally figured out how to read it. It's a backtrace and a trace rolled into one.
16:57:32
drmeister
They take a bunch of backtraces and merge together the common parts to get that. The top part is the backtrace and the bottom part that zig-zags partial backtraces that who each time sample.
16:57:59
drmeister
The horrible flaw is when running Cleavir - the backtrace part can get really, really deep.
17:08:38
drmeister
But it has a <Save> option - so I can dump it to a text file and bring it up in emacs.
17:14:51
beach
Bike and I are discussing new stuff that might be make it possible to remove CALL-WITH-VARIABLE-BOUND.
17:19:12
drmeister
Here is the next lecture on operating systems that describes Read-Copy-Update: https://www.youtube.com/watch?v=Jkoc0iWQtr8
17:41:43
drmeister
frgo: Have you used Read-Copy-Update for lock free reading of a data structure that is updated very infrequently?
17:43:50
drmeister
It requires that updates to the data structure happen when all threads that could access that data structure is blocked or at a safe point.
17:47:53
drmeister
I'm puzzling over how to implement this for the class table - currently a hash-table.
17:49:41
drmeister
I have an idea. If I put an alist in front of the class-table that contains add/remove class instructions and search that first and then search the class-table.
17:50:03
drmeister
Then - if I could block all of the threads and use the add/remove class instructions to update the class-table.
17:50:56
drmeister
AND I keep a pointer to the alist/class-table in thread-local storage for each thread. So find-class would go through this pointer in thread-local storage.
17:55:17
drmeister
Question: Can I set things up so that all threads reach a safe-point and pause so that I can update the class-table, update all of the thread local storage pointers to the class-table and then allow the threads to continue on their way?
18:12:09
drmeister
It's something to do with a global atomic pointer to a linked list of set-class instructions followed by the class table.
18:13:51
drmeister
It's a bit long. To summarize - the compile-file-parallel was experiencing a lock contention for the read-lock for the class table.
18:14:57
drmeister
Have you ever used Read-Copy-Update? https://en.wikipedia.org/wiki/Read-copy-update . It sounds like it's easier to implement in operating systems than user land.
18:18:31
drmeister
I would continue to use a hash-table for the class-table - but first find-class would search a linked list of "set-class" instructions that would shadow anything in the class hash-table.
18:19:03
drmeister
I can atomically update a global variable to the linked list of "set-class" instructions using CAS.
18:19:46
drmeister
Occasionally I would want to update the class hash-table using the set-class instructions so that the list of set-class instructions doesn't get too long.
18:20:03
frgo
Hm - IIUC then this requires some sophisticated handling of sequencing the updates. Yeah, if this could be done by appending to a list ...
18:21:13
frgo
What I don't get is how we access the linked list by multiple threads in a lock-free way ..
18:21:23
drmeister
I think the updating of the class hash-table using the set-class list is the tricky part. I need to get all threads to a safe point and then do the update, then erase the set-class list and then tell every thread to use the class hash-table.
18:23:05
frgo
You'd need to interrupt a thread, re-orient the class access to the class hash-table, and the tell the thread to continue.
18:25:12
drmeister
Can I tell every thread - the next time you reach a safe-point pause and wait until all threads have reached a safe-point and then update the class-table and then proceed?
18:25:54
frgo
So, establishing a signal handler in each thread - this handler is the "I need to wait for class access reorientation and this is signaled by a condition variable."
18:27:11
drmeister
Is this a terrible idea? If there is one thread that is off doing something compute intensive and it doesn't reach a safe-point for an hour - everything will block for that time.
18:27:12
frgo
I can't imagine how one can "insert" a safepoint into normal C++ code. Except I manipulate at a higehr level, AST, IR, ...
18:29:04
drmeister
It's a call to this: https://github.com/clasp-developers/clasp/blob/dev/src/gctools/interrupt.cc#L225
18:30:41
drmeister
https://github.com/clasp-developers/clasp/blob/dev/include/clasp/gctools/threadlocal.h#L49
18:36:22
frgo
Coming back to your point on a thread running for an hour: That's why I think you would have to interrupt the threads by a signal. You can't just wait until they reach a safepoint.
18:38:27
frgo
Still - tricky. What if you have 10000 threads ... Delivering signals to hat nr of threads is not exactly a cheap/ fast oepration.
18:39:29
frgo
Fortunately we don't support Windows. We had a case where exactly this scenario caused Win10 to crash - very reliably so. Microsoft is working on it ...
18:43:56
drmeister
I have to ensure that a safe point call is made periodically. It’s the eternal problem of c++ exceptions and Unix signal handlers not working together
18:44:50
frgo
Exactly. I would shy away for this very reason. Too many unknowns to get this stable - tells me my experience.
18:46:09
drmeister
It would only come up when redefining classes. Then all the threads would have to sync to update the class table
18:47:39
drmeister
I have an interim solution. Replace typep and find-class calls in the compiler with predicate tests.
18:49:12
drmeister
Changing code and avoiding using TYPEP and FIND-CLASS in future code until we implement Read-Copy-Update for the class table.
18:50:11
drmeister
If you want to run parallel code - make sure it doesn't contain TYPEP or FIND-CLASS calls - and that it avoid reader/writer locks - they are useless.
18:51:18
drmeister
No - it's not attractive. The only attractive solution I see requires read-update-copy.
18:52:32
drmeister
That requires synchronizing threads and that requires judicious invocation of safe-point.
20:02:50
drmeister
::notify Bike - I'm replacing the use of *invocation-context* special variable with code that passes it as a second argument to introduce-invoke-aux here...
20:07:07
drmeister
scymtym: What was the deal with transforming FIND-CLASS into (load-time-value (find-class-cell ...)) ?
20:07:55
drmeister
Instead of storing class-symbol-name/class we store class-symbol-name/(CONS class nil) in the class table?
20:09:56
drmeister
Yeah - Bike said... "and we have (typep x 'foo) => (classp x (car (load-time-value (gethash 'foo *class-name-hash*)))) or so"
20:14:37
drmeister
I realized that RCU is even more complicated with the moving garbage collection of MPS.
20:15:20
drmeister
Any GETHASH can trigger a REHASH - which would also have to be deferred to the class-table synchronization time.
20:16:00
drmeister
This is because objects move in memory and if they move and they can't be found because the hashing algorithm has them at a different place in the hash-table - then that would normally force a REHASH.
20:16:43
drmeister
I know how to handle this but it would require dropping down to a linear search of the class-table until the class-table synchronization happens.
20:39:08
drmeister
::notify Bike I've changed how classes are stored in the class table. Now (typep x 'foo) --> (classp x (car (load-time-value (ext:find-class-cell 'foo))))
21:00:18
drmeister
I wonder what the consequences would be if I made every CONS cell have an atomic CAR and CDR.
21:26:19
drmeister
It's problematic to return a cons cell containing a class in the car. Unless the car is atomic there could be race conditions.
21:50:54
Colleen
Bike: drmeister said 1 hour, 48 minutes ago: - I'm replacing the use of *invocation-context* special variable with code that passes it as a second argument to introduce-invoke-aux here...
21:50:54
Colleen
Bike: drmeister said 1 hour, 47 minutes ago: Tell me what you think when you see this.
21:50:54
Colleen
Bike: drmeister said 1 hour, 11 minutes ago: I've changed how classes are stored in the class table. Now (typep x 'foo) --> (classp x (car (load-time-value (ext:find-class-cell 'foo))))
21:52:21
drmeister
I've saved my previous work and now I'm in a separate branch exploring making the car and cdr of cons cells atomic. Also Symbol_O::_GlobalValue is now atomic.
21:53:12
Bike
introduce invoke will be going away with the changes to hir, so dont sweat it too much
21:54:22
Bike
with the cells we would juzt need atomic accessors, it's ok if the regular accessors aren't, though i dont' know how it goes on the C++ ends
21:56:38
Bike
http://www.sbcl.org/manual/#index-compare_002dand_002dswap has a list of places sbcl can access atomically or through cas
21:58:29
drmeister
That sbcl thing is all very well and good - but in C++ land I think I need to declare the types as atomic - then the compiler should make sure no funny business happens with them.
21:58:45
drmeister
I don't think I can use atomic operations on non-atomic data - I think that would be dangerous.
22:00:18
Shinmera
That reminds me of another thing that's vastly complicated by not having a GC: lockless datastructures
22:01:07
Shinmera
I mean, it can get pretty dicey even with a GC, but trying to read some papers on how to do the same in C++ without a GC made me think a noose would be a better option
22:03:34
drmeister
What's nice about C++ is once I declare those slots as std::atomic<T_sp> - then the compiler starts spitting out all sorts of compile time error messages when the code violates the type rules.
22:05:28
Shinmera
If you guys expose a CAS to lisp that works on svref and struct places, let me know.
22:05:46
drmeister
That's the exploring part - I keep making changes to the code to spread the "atomicness" around. If things keep making sense more and more code compiles. If it doesn't - then I stop and trash these changes.
22:07:12
drmeister
https://github.com/clasp-developers/clasp/blob/hashtable/src/llvmo/intrinsics.cc#L113
22:19:11
Bike
i mean a symbol value read is a single atomic op, but we have two steps, 1 grab reference 2 read it
22:47:12
drmeister
Hmm, I would have to change how bclasp works with special variables if I wanted to proceed.
22:48:36
drmeister
I cant store classes in cons cells - because I can't guarantee that they will be read atomically.
22:51:05
drmeister
I think I do - the C++ compiler forced me to make Symbol_O::_GlobalValue atomic if I make the car and cdr of a cons cell atomic. This is because of how the symbol bindings work in clasp now that we have multithreading.
22:53:36
drmeister
No - every access to a cons cell would have to be through atomic operations. .load() .store(val) and .compare_exchange_strong(expected,new_value)
22:54:07
drmeister
Fooling around with making their car and cdr std::atomic<T_sp> - whaddayathinkofthat?
22:54:50
drmeister
Making the change and then fixing a bunch of compiler errors to propagate the atomicness out to the rest of the code.
22:55:08
stassats
if that'll allow to use std::compare_and_swap_something, then sure, but if it forces to serialize on every write, then not so sure
22:55:28
drmeister
But I hit a snag in how bclasp works with values - it hands out references to them - and I can't tell the receiver that these are atomic references.
22:56:28
drmeister
Yeah - how bad would it be if it forces to serialize on every write? I think that's what will happen. The compiler will ensure that the car and cdr are always read and written atomically. Would that be a huge drag?
23:02:39
stassats
but in general, making things that are not going to be used atomically atomic is not the best idea
23:08:07
drmeister
Yeah - that assignment involves this call: call std::__atomic_base<unsigned long>::operator=(unsigned long)
23:12:58
drmeister
Ok, does that help me though? Should I make them atomic and then use memory_ordered_relaxed everywhere that I don't care about memory order?
23:14:17
drmeister
I don't have a way to deal with atomic values in Common Lisp yet. I haven't made the conceptual leap.
23:14:41
drmeister
stassats: Do you have a few minutes - I can explain where this all came from - it's a good story.
23:15:44
drmeister
I figured out why my compile-file-parallel was only giving me about 1.5x speedup - it's the read lock in FIND-CLASS that is causing lock contention.
23:16:37
drmeister
So I've been trying to figure out how to fix all of the calls to (typep x '<cleavir-class>) that are invoking find-class and that are leading to lock contention.
23:17:04
drmeister
I started by replacing (typep x '<cleavir-class>) with predicates (<cleavir-class>-p x) - but that grew tiresome.
23:19:26
drmeister
I could - yes - for these cases. I thought instead though that I would put each class in a cons cell - and then use (classp x (car (load-time-value (ext:find-class-cell '<cleavir-class>))))
23:21:10
drmeister
In case some thread updates the class and code somewhere holds the cell for that class...
23:21:59
drmeister
I understand that these are cleavir classes and we aren't going to be modifying them - but still...
23:22:52
stassats
but even if it's atomic, what if you get the wrong value after you return from CAR?
23:23:50
stassats
suppose you have (when (typep x 'class) do-things-to-x) and something is redefined it at (when (typep x 'class) [here] do-things-to-x)
23:24:08
stassats
your atomicity won't help, it's the job of whoever does concurrent class changes to make things thread-safe
23:28:17
drmeister
Do you think it's a reasonable idea to store the class in the class table in the car of a cons cell and to let that cons cell out into Common Lisp land?
23:30:04
drmeister
Why not use a cons cell? I've been using them for cells in a couple of other places.
23:33:25
drmeister
Hmm, that seems a bit overkillish. Why would you do that? I'm asking because your rational may influence other things I'm thinking about.
23:33:43
stassats
modifying a class is not going to become atomic by declared a wrapper structure atomic, so you'll have a lock of some sort
23:34:46
stassats
why do that? that's how you program things, by creating structure out of amorphousness
23:38:42
drmeister
Then (load-time-value (find-class-holder '<cleavir-class>)) will return one of those.
23:39:07
drmeister
I'll still need a lock of some sort to read it - correct? That's what you were talking about
23:39:40
drmeister
(with-class-holder-lock-get-class (load-time-value (find-class-holder '<cleavir-class>)))
23:41:15
drmeister
So a function would work? (get-the-class-and-dont-give-a-crap-if-anyone-writes-to-the-class-holder (load-time-value (find-class-holder '<cleavir-class>)))
23:42:05
drmeister
And then (setf (find-class 'foo) new-class) uses serializes writes to the class-table with a lock?
23:43:42
drmeister
This seems dangerous - I was investigating the Read-Copy-Update approach for the class-table. https://en.wikipedia.org/wiki/Read-copy-update
23:44:54
drmeister
I was reading that wiki page and I was watching a couple of lectures on designing operating systems where they described Read-Copy-Update.
23:46:12
drmeister
I don't know, I thought if I don't declare things std::atomic that BAD THINGS would happen.
23:47:12
stassats
the problem with reordering is say you're changing a class from 3 slots to 5 slots, you're writing 3 to NSLOTS, then another code is iterating over all slots, it sees NSLOTS as being 5, but the slot vector isn't visible yet, and it's still holding 3 slots => error
23:47:36
drmeister
Seriously though - I thought worried about subtle race conditions that would be very difficult to debug if I don't use std::atomic.
23:49:59
drmeister
I probably can't plaster over the code with a consistent memory model at this point.
23:52:38
drmeister
I really appreciate the help here. So I'll create a ClassHolder_O class and have it store a class - I'll put those into the class-table. I'll read them without a lock (I'm nervous about this) but write to them with a lock.
23:56:12
drmeister
I don't have fences. I have pthreads mutexes and I have std::atomic and copy-and-swap.
23:58:30
stassats
i can't say that it'd be the right decision not now, but now, that's what i would've done
0:00:13
stassats
the only possible issue with this - the class itself is not decent yet, but the value will always be right
0:01:34
stassats
and you're attaching it to the symbol as well, if that works, then the wrapper cell would automagically work
0:05:27
drmeister
That just seemed way too simple. I could write the class into a slot in the symbol - that would be wasteful I guess because most symbols would not have a class.
0:05:47
drmeister
I could write it into the property list of the symbol - but people could screw with that.
0:07:29
drmeister
I'm starting to lose my train of thought. I am going to break for a bit and think about it.
0:24:18
drmeister
Hmmm, of course none of this class-holder stuff is portable - so we can't use it in the cleavir code. I'll have to talk with Bike about this.
0:28:26
drmeister
I figured out how to read the macOS Activity Monitor <sample process> output. It's really useful for figuring out what is going on.
0:33:16
stassats
if it doesn't exist you can actually stuff a function to run there, which then changes itself to run something else after it's determined what exists