libera/#clasp - IRC Chatlog
Search
13:02:29
yitzi
drmeister: It doesn't seem like anything special is required during the build or invocation of Cando for this. Sounds like the main issue is making sure the host and container have compatible MPI implementations. If they do it could be that you could just skip building a custom MPI in the container.
13:03:39
yitzi
To start I would look to see what MPI version/implementation is on the cluster. A minimal build would just be adding :mpi t to the config.sexp and seeing if it works.
13:34:34
drmeister
I could create a vector of say 20 entries and fill it with futures and check them every 100 milliseconds to see what future is complete and then put a new one in there.
13:35:54
drmeister
If I were doing this myself I would use bordeaux threads and a queue and write my own thread pool that keeps taking jobs out of the queue and running them. That I understand.
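A minimal sketch of that pattern (assuming bordeaux-threads and lparallel.queue are loaded; RUN-JOBS and the job representation are hypothetical):

```lisp
(defun run-jobs (jobs n-workers work-fn)
  (let ((queue (lparallel.queue:make-queue)))
    ;; Enqueue all jobs, then one shutdown message per worker.
    (dolist (job jobs)
      (lparallel.queue:push-queue job queue))
    (dotimes (i n-workers)
      (lparallel.queue:push-queue :shutdown queue))
    ;; Each worker blocks on the queue until it sees :shutdown.
    (let ((workers
            (loop repeat n-workers
                  collect (bt:make-thread
                           (lambda ()
                             (loop for item = (lparallel.queue:pop-queue queue)
                                   until (eq item :shutdown)
                                   do (funcall work-fn item)))))))
      (mapc #'bt:join-thread workers))))
```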
13:39:08
drmeister
Hmm, maybe I can treat a future like a worker in a thread pool. I use a queue and each future checks the queue and does whatever work it gets.
13:40:30
drmeister
I tried to do load balancing by sorting the jobs in order of size, largest to smallest.
13:42:57
drmeister
I haven't dug into it too deeply but it looked like pmap was ignoring my sorted order.
13:47:42
yitzi
Are you sure you really need futures? Those are more for building a calculation expression in which the pieces are "in process" ... guess I'd have to know what you are doing in more detail.
13:51:07
yitzi
And "checking the future to see which one is complete" ... isn't that what lparallel's worker pool is supposed to do?
13:52:30
drmeister
I'm not sure at all. I'm musing aloud and interested in your and Bike's thoughts.
13:53:31
drmeister
In compile-file-parallel I have a thread pool and a write-one/read-many queue - I understand that pattern and I've used it many times.
13:53:56
yitzi
Yeah, lparallel has a thread pool and you submit tasks to it over a channel. That is what is happening underneath pmap
13:54:11
drmeister
Each worker blocks on the queue and gets a piece of work or a message to shut down.
13:54:57
drmeister
The queue manager puts pieces of work into the queue and, when there are no more, one shutdown message for each worker.
13:58:42
drmeister
I set up 12 nodes each with 28 threads and at the start each node has a list of longish jobs and I just use `lparallel:pmap` to map over them.
13:59:14
drmeister
I tried sorting the jobs in each list - but that seems to be thwarted by something `pmap` does
14:01:03
yitzi
I'm looking at the lparallel docs....you may need to submit the tasks yourself to preserve order.
14:02:08
yitzi
Yeah, pmap chunks the sequence based on the number of worker threads....that is definitely not what you want.
14:09:11
yitzi
You have a 4 core processor. That is the default number of parts it breaks stuff up into. When you do PMAP over (j1 j2 j3 j4 j5 j6 j7) then C1 gets (j1 j2), C2 gets (j3 j4), C3 gets (j5 j6) and C4 gets (j7) ....
14:10:13
yitzi
If you do :parts 7 then C1 gets (j1), C2 gets (j2), C3 gets (j3), C4 gets (j4), in the queue goes (j5), (j6), and (j7)
14:14:23
drmeister
The default for :parts is the number of workers. Saying `:parts (length x)` would give one job to each worker if there were `(length x)` workers.
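A hedged sketch of forcing one job per part so lparallel queues the remainder instead of chunking (assumes `lparallel:*kernel*` is already set up; JOBS, JOB-SIZE, and RUN-ONE-JOB are placeholder names):

```lisp
;; Sort largest-first, then hand pmap one job per part so the
;; workers pull remaining parts from the queue as they finish.
(let ((jobs (sort (copy-list jobs) #'> :key #'job-size)))
  (lparallel:pmap 'vector #'run-one-job :parts (length jobs) jobs))
```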
14:15:20
yitzi
Yes, I am assuming that if the length is greater than the number of workers then lparallel will queue them
14:17:11
drmeister
I have a problem in the search that gets smaller and smaller the more searching I do.
14:17:43
drmeister
It's a bit difficult to describe but imagine I'm generating puzzle pieces that must connect to other puzzle pieces.
14:17:47
yitzi
I seem to recall that is how the "channels" in lparallel work. I think that there is a bug in lparallel in that the next job in the queue won't start if you don't retrieve the result waiting on the channel. But that shouldn't be a problem for PMAP. I ran into this issue for the TIRUN app when we were sketching the ligands.
14:18:27
drmeister
With a short search - of say 20 - about 1% of the puzzle pieces don't fit a following piece.
14:18:53
drmeister
With a search of 200 - it's about 0.4% of the puzzle pieces don't fit a following piece.
14:26:45
drmeister
The measurement for each node is not very good - or lparallel is doing crazy things.
14:27:56
drmeister
I am assuming I need to watch the trend - and the trend looks like there is still a long tail.
14:41:02
yitzi
There are some examples of using futures in https://github.com/cando-developers/cando/blob/0fc1fa09ee22521403bd46e1b8298f82ae2d94f5/src/lisp/cando-widgets/molecule-select.lisp
14:42:59
yitzi
drmeister: yes...that was me. You could also keep it simple https://lparallel.org/kernel/
14:46:26
yitzi
drmeister: I am pretty sure you just add all the tasks and then just idle while waiting for the results ... which are just indicative that the job completed.
15:24:16
drmeister
So you just open a channel and submit-task's to it and they automatically go to the *kernel* and then you call receive-result for each task?
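That usage, sketched from the kernel API at https://lparallel.org/kernel/ (JOBS and RUN-ONE-JOB are placeholder names; the kernel size of 28 matches the per-node thread count mentioned above):

```lisp
(setf lparallel:*kernel* (lparallel:make-kernel 28))
(let ((channel (lparallel:make-channel)))
  ;; Submit every task up front; the kernel's workers drain them.
  (dolist (job jobs)
    (lparallel:submit-task channel #'run-one-job job))
  ;; Then collect one result per submitted task.
  (loop repeat (length jobs)
        collect (lparallel:receive-result channel)))
```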
16:30:58
drmeister
I added per-node/per-thread logging and my attempt at load balancing is absolute shite.
16:31:37
drmeister
I was sorting the jobs based on the number of atoms - figuring more atoms take more time.
16:32:34
drmeister
That's not at all the case - the amount of time varies hugely. Now I suspect that some non-linear optimizations are getting trapped and I'm letting them wander too long.
19:37:10
stassats
i would have made a queue of jobs from which each thread repeatedly gets a job (or a batch of jobs, if each individual one is very small)
20:04:50
drmeister
It's not a burning issue - it looks like I can push MPI into the future a bit because I think I solved the issue with the tail. I had an almost infinite loop of error/error-handling.
20:24:06
yitzi
If it is not already in the container then just add that to the apt-get install in the def file
22:18:46
drmeister
It was a handler that recognized 3 or 4 linear atoms (a problem for non-linear optimization) and that caught the error and tried to shake up the 3 or 4 linear atoms. It doesn't work very well probably because the rest of the structure forces the atoms back into a linear arrangement.
22:19:28
drmeister
There was a potential infinite loop of handling the error and then restarting the calculation and it generating the error again. It would very occasionally knock itself out of that cycle.