freenode/#clasp - IRC Chatlog

3:58:44 beach Good morning everyone!

3:59:15 beach drmeister: So why not try something really simple, use a multiplicative growth instead of trying to guess the initial size.

5:01:27 drmeister Hi - I'm trying

5:02:19 drmeister I've switched clasp's hash tables to open addressing - I can control the size now.

5:06:22 beach Why could you not control the size when it was linked?

5:09:09 drmeister Hmm, you are right - I could of - I had the count of entries.

5:09:53 drmeister Anyway - I'm trying to track down why the child threads compiling AST->native code appear to slow down the main thread.

5:18:30 drmeister I'm having a heck of a time trying to figure out why I can compile the AST's in 0.3 seconds but when I launch children that compile the AST's to native code the AST generation appears to take 10x longer.

6:15:21 drmeister beach: How would you implement a find-class for multi-threading?

6:16:25 drmeister I'm getting some evidence that my contention problem is the class database - it has an upgradable read/write lock using two mutexes.

6:16:30 beach I guess just lock the table of classes in the environment.

6:16:40 beach Oh, I see.

6:17:12 beach Let me think. I hadn't imagined it would be so heavily accessed.

6:17:42 drmeister In the compile-file-parallel - I put a loop around the hir transformations - so each child thread now runs my-hir-transformations (a bunch of hir transformations) 100x. The main thread slows down about 20x.

6:17:54 beach I guess it can happen when lots of DEFCLASS forms or lots of DEFMETHOD forms are accessed.

6:18:55 drmeister I sample the process and I see lots and lots of read locks being acquired for the class database. This is within map-instructions-xxx

6:19:28 beach I have no answer in real time, but I will give it some thought. Maybe some insight could be had by realizing that (setf (find-class...) nil) would be rare, i.e. few things would ever be deleted from the class table.

6:20:18 drmeister https://usercontent.irccloud-cdn.com/file/PwKZSam8/image.png

6:20:29 beach Monday mornings are crazy around here, and I need to leave. But I'll give it some thought during the day.

6:21:26 drmeister No problem - I was asking because maybe you thought about this already.

6:21:45 beach No, but I will.

6:22:05 drmeister This looks interesting - reading... https://en.wikipedia.org/wiki/Read-copy-update

6:22:18 beach I never considered parallel compilations.

6:24:02 drmeister The other big red flag is now I've got 40 threads grinding out HIR transformations and the process never uses more than %150 CPU

6:28:02 drmeister And almost every thread is calling find-class

6:34:16 beach Wait as sec...

6:34:23 beach Is this AST-to-HIR?

6:34:57 beach Typically, no classes are added to the database during that phase, so a single-writer-multiple-reader thing would work.

6:52:26 drmeister It is AST-to-HIR - and no classes are added to the database at that phase.

6:53:20 drmeister I just ran an experiment - I launched 10 threads that all loop and call (find-call 'double-float).

6:53:27 drmeister It never goes over 133% CPU.

6:53:48 drmeister And sampling the stack - it looks exactly like compile-file-parallel.

6:54:31 drmeister https://usercontent.irccloud-cdn.com/file/Jx2grtA6/image.png

6:55:21 drmeister https://www.irccloud.com/pastebin/izpbgVjm/

6:57:23 beach So does my suggestion sound right then?

6:57:54 drmeister I'm watching a lecture on Read-Copy-Update.

6:58:20 drmeister I have a multiple reader, single writer lock already - is that the "so a single-writer-multiple-reader thing would work."?

6:58:35 beach I see.

6:58:50 beach So the lock itself is the problem then?

6:59:08 beach That doesn't sound right to me, unless your locks are very slow.

6:59:17 drmeister i think so - according to this lecture the multiple-reader/single-writer solution is very limited.

6:59:54 drmeister I'm using pthreads - I get what I get.

7:00:49 beach OK, I'll think about better solutions.

7:01:40 drmeister I'm doing some research here myself - this is a known problem. The linux kernel uses something called Read-Copy-Update to speed up reads at the expense of writes.

7:02:29 beach OK, good luck.

7:03:21 beach One thing to do is to figure out why find-class is called during compilation.

7:03:38 drmeister That I can answer...

7:03:38 beach I can't figure out why that would be the case.

7:03:54 beach OK.

7:04:19 drmeister https://usercontent.irccloud-cdn.com/file/QZfqj4Ul/image.png

7:05:11 beach I don't know what I am looking at.

7:05:51 drmeister https://github.com/Bike/SICL/blob/master/Code/Cleavir/HIR-transformations/eliminate-catches.lisp#L7

7:05:58 drmeister That's essentially a backtrace.

7:06:13 drmeister eliminate-catches calls TYPEP

7:06:53 beach So change that to invoke catch-instruction-p, defined as a generic function.

7:07:11 drmeister Good idea.

7:08:42 beach Maybe one day we will implement typep with constant type descriptors as a generic function, but we haven't done that yet.

7:17:36 drmeister There are more of these - I'm changing them to predicates...

7:17:37 drmeister https://github.com/Bike/SICL/blob/master/Code/Cleavir/Intermediate-representation/map-instructions.lisp#L32

7:17:48 drmeister Now I know what I'm looking for.

7:17:55 beach Great!

7:19:37 drmeister This is what I add - right?

7:19:39 drmeister https://www.irccloud.com/pastebin/YmdFJyBP/

7:19:47 drmeister Whoah - enclose-instruction-p

7:19:58 beach yes, -p

7:20:14 beach Or you can use the :method option for defgeneric here.

7:21:19 drmeister Weren't you and Bike talking about this a while ago? Replacing typep calls with predicates?

7:21:42 beach Yes, I think so. And doing it automatically.

7:21:51 beach Same idea as with MAKE-INSTANCE.

7:42:51 drmeister There are quite a few of these.

7:43:00 drmeister But they are easy to spot now.

8:09:47 drmeister Yeah - the read-many/write-one lock is totally inadequate here. Multiple CPU's trying to grab the read lock is bad. The memory that represents the read lock can only be held by one core at a time and so it bounces between the CPU's.

8:10:06 drmeister https://www.youtube.com/watch?v=BcAED2f3z0I

8:10:22 drmeister The interesting part starts at about 20 min

8:12:42 drmeister The problem gets worse the more CPU's there are fighting for the lock.

8:14:08 beach I see.

15:38:11 drmeister Hello everyone.

15:38:21 drmeister I had an epiphany last night.

15:39:23 drmeister Avoid using TYPEP in code that you want to run in parallel.

15:40:44 drmeister After converting about a dozen invocations of TYPEP to xxxx-p predicates in the AST->HIR->native code, compile-file-parallel gets up to using 450% CPU.

15:41:05 beach I think it is a problem only if the types in question are standard classes.

15:41:25 beach Maybe some other cases as well.

15:43:44 drmeister You mean - because TYPEP uses predicates internally for many classes?

15:44:07 drmeister That makes sense - and yes, it's all (TYPEP x 'cleavir-class)

15:44:14 beach Yes, like character, fixnum, etc.

15:44:45 drmeister Right.

15:45:27 beach Also, like I said, TYPEP with a constant type descriptor can be implemented efficiently without calling FIND-CLASS.

15:45:41 beach It would take some trickery.

15:46:00 drmeister Using TYPEQ right?

15:46:36 beach No, TYPEQ works efficiently only for a small number of types.

15:46:57 drmeister What is the trickery?

15:47:25 beach Secretly creating a generic function for some types, much like the ones you created manually.

15:47:34 Bike i still don't think find-class should cause contention

15:48:13 beach Bike: It appears to be the lock itself used in the traditional solution to multiple-reader/single-writer problems.

15:48:28 Bike yeah, but then there shouldn't be a problem, right? it's not like we're writing

15:49:17 Bike also, did you see what i said about the hir stuff in #sicl?

15:49:25 drmeister Yeah. I watched a lecture on it last night. Acquiring a read lock involves writing to shared memory. That forces exclusive access on that memory for one CPU and invalidates caches in all the other CPU's.

15:49:29 Bike also did everyone in new england survive the cold snap because fucking hell it's cold

15:49:32 beach Right, but there must be a short lock to make sure there are no writers. That one appears to be the problem.

15:49:39 Bike oh. that's a bummer.

15:49:45 drmeister So the exclusive access to the memory for the read lock ping-pongs around the different CPU's.

15:50:13 Bike so we can't actually have multiple simultaneous readers.

15:50:17 Bike that really blows.

15:50:31 beach Bike: Yes, I saw it. I think we are in agreement.

15:50:34 Bike ok, great.

15:50:47 Bike it's the calls i'm worried about, that's a fairly deep change

15:51:26 beach Bike: You are right that there may be no control path from the first instruction of a TAGBODY to the last one, but if you don't include some special kind of edge, aren't we going to have problems with ownership calculations again?

15:51:38 drmeister We can have multiple simultaneous readers.

15:52:00 beach I am still surprised those locks take that long.

15:52:28 beach I wonder whether a solution based on CAS or SLE could work.

15:52:32 drmeister There something called Read-Copy-Update that they start to describe here: https://www.youtube.com/watch?v=BcAED2f3z0I

15:52:59 beach Bike: The calls?

15:53:08 drmeister It's used in the Linux kernel for this purpose - lots of readers, occasional updating and you want reading to be very fast.

15:53:09 Bike beach: I mean say we have (progn (tagbody loop ... (go loop)) (foo ...)) - the foo call is never actually reached so we don't have to worry about who owns it cos it's deleted. but in (progn (tagbody ... end) (foo ...)) the end tag will just have the foo call as a normal successor. so everything should be fine.

15:53:28 drmeister That lecture I just posted illustrates the problem with multiple-reader/writer locks.

15:53:30 Bike beach: passing the dynamic environment in calls

15:54:02 beach There might be a lot to change, sure.