libera/#clasp - IRC Chatlog

14:16:07 drmeister It's running now. We will get a graph in 10 min or so.

14:16:34 drmeister It would be very convenient if this works. Then I wouldn't need MPI yet.

14:17:11 drmeister I have a problem in the search that gets smaller and smaller the more searching I do.

14:17:43 drmeister It's a bit difficult to describe but imagine I'm generating puzzle pieces that must connect to other puzzle pieces.

14:17:47 yitzi I seem to recall that is how the "channels" in lparallel work. I think that there is a bug in lparallel in that the next job in the queue won't start if you don't retrieve the result waiting on the channel. But that shouldn't be a problem for PMAP. I ran into this issue for the TIRUN app when we were sketching the ligands.

14:17:57 drmeister I'm searching small combinations of puzzle pieces.

14:18:27 drmeister With a short search - of say 20 - about 1% of the puzzle pieces don't fit a following piece.

14:18:53 drmeister With a search of 200 - about 0.4% of the puzzle pieces don't fit a following piece.

14:22:28 drmeister I don't know how much I need to search to drive that to zero - zero would be best.

14:22:58 drmeister It takes about 4 hours on 12 nodes each with 28 cores to search 200.

14:23:50 drmeister I thought it would be best to address that long tail before I do anything else.

14:25:12 yitzi makes sense.

14:25:42 drmeister Here it is 10 min in...

14:25:43 drmeister https://usercontent.irccloud-cdn.com/file/RDTWpo7x/image.png

14:26:45 drmeister The measurement for each node is not very good - or lparallel is doing crazy things.

14:26:49 drmeister Here's one of them...

14:26:50 drmeister https://usercontent.irccloud-cdn.com/file/pIva8N0a/image.png

14:27:27 drmeister That drop at 9:20am - I gotta believe it isn't real.

14:27:56 drmeister I am assuming I need to watch the trend - and the trend looks like there is still a long tail.

14:28:10 drmeister https://usercontent.irccloud-cdn.com/file/hhFi4HkW/image.png

14:36:33 drmeister Yep - not good

14:36:34 drmeister https://usercontent.irccloud-cdn.com/file/a6065YmP/image.png

14:38:13 stassats looks colorful though

14:39:09 yitzi Well, either submit the jobs to kernel yourself....or write your own threadpool?

14:41:02 yitzi There are some examples of using futures in https://github.com/cando-developers/cando/blob/0fc1fa09ee22521403bd46e1b8298f82ae2d94f5/src/lisp/cando-widgets/molecule-select.lisp

14:41:59 drmeister Did you do that?

14:42:04 drmeister So you use `eval`

14:42:11 stassats lparallel seems to be... kinda abandoned

14:42:37 drmeister Or finished?

14:42:59 yitzi drmeister: yes...that was me. You could also keep it simple https://lparallel.org/kernel/

14:43:00 stassats parallel? too complicated to ever be

14:43:25 yitzi Just submit the tasks and make sure to eventually read the results.

14:43:36 yitzi The "futures" are a bit weird, IMHO.

14:43:41 stassats i only used the queues from lparallel and then did my own stuff

14:44:49 drmeister yitzi: I'll read up on the kernel API.

14:45:12 drmeister There is no doubt - the tail is still there...

14:45:13 drmeister https://usercontent.irccloud-cdn.com/file/ioIlpkoQ/image.png

14:46:26 yitzi drmeister: I am pretty sure you just add all the tasks and then just idle while waiting for the results ... which are just indicative that the job completed.

15:24:16 drmeister So you just open a channel and submit-task's to it and they automatically go to the *kernel* and then you call receive-result for each task?

15:27:21 yitzi Think so
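
[Editor's sketch] The submit-then-collect pattern described above (hand every task to a pool, then read back one result per task) can be illustrated roughly as follows. This is Python's standard `concurrent.futures`, used as a stand-in; it is not the lparallel API, and `run_job` and `run_all` are hypothetical names invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_job(n):
    """Hypothetical stand-in for one independent search job."""
    return n * n


def run_all(jobs, workers=4):
    """Submit every job up front, then collect exactly one result per job."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submitting returns immediately; the pool's threads pick the tasks up.
        futures = {pool.submit(run_job, job): job for job in jobs}
        # Read back every result, in whatever order the jobs finish.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(run_all(range(5)))
```

The point of the sketch is only the shape of the loop: all tasks are queued first, and every queued task is eventually matched by one read of its result.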

16:30:58 drmeister I added per-node/per-thread logging and my attempt at load balancing is absolute shite.

16:31:14 yitzi oh?

16:31:37 drmeister I was sorting the jobs based on the number of atoms - figuring more atoms take more time.

16:32:34 drmeister That's not at all the case - the amount of time varies hugely. Now I suspect that some non-linear optimizations are getting trapped and I'm letting them wander too long.

16:32:47 drmeister Digging deeper.

16:34:21 drmeister Some worker threads can finish 12 jobs in the time that one takes for one job.

16:34:55 yitzi So maybe not the fault of lparallel. Hmm....

16:44:27 drmeister Right

19:35:58 stassats are the jobs independent?

19:37:10 stassats i would have made a queue of jobs from which each thread repeatedly gets a job (or a batch of jobs, if each individual one is very small)
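
[Editor's sketch] A shared job queue of the kind stassats describes might look like the following, assuming independent jobs. It is written in Python with the standard `queue` and `threading` modules purely as an illustration; `process`, `worker`, and `run_queue` are hypothetical names, and this is not how the Clasp/Cando code is actually structured.

```python
import queue
import threading


def process(job):
    """Hypothetical stand-in for one independent job."""
    return job * 2


def worker(jobs, results):
    """Repeatedly take the next job until the queue is empty."""
    while True:
        try:
            index, job = jobs.get_nowait()
        except queue.Empty:
            return  # nothing left to do, so this thread exits
        # Each index is handled by exactly one thread, so no lock is needed.
        results[index] = process(job)


def run_queue(job_list, n_threads=4):
    jobs = queue.Queue()
    for item in enumerate(job_list):
        jobs.put(item)  # all jobs are enqueued before any worker starts
    results = [None] * len(job_list)
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_queue([3, 1, 2], n_threads=2))
```

Because a thread only asks for more work when it has finished its current job, a thread stuck on one slow job does not hold up jobs that would otherwise have been pre-assigned to it.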

20:03:58 drmeister The jobs are independent yes.

20:04:10 drmeister yitzi: I get this when I try to build apptainer with `:mpi t`

20:04:12 drmeister https://www.irccloud.com/pastebin/lbxd7r31/

20:04:50 drmeister It's not a burning issue - it looks like I can push MPI into the future a bit because I think I solved the issue with the tail. I had an almost infinite loop of error/error handling.

20:06:11 yitzi If mpic++ isn't in an obvious place you can specify the path with `:mpicxx <path>`

20:20:22 drmeister But where would it be? Is this in the apptainer?

20:20:47 drmeister It's on the host at `/usr/bin/mpic++`

20:22:10 yitzi No, you need it in the container. We may need to install debian packages.

20:23:35 yitzi Looks like it is libopenmpi-dev

20:24:06 yitzi If it is not already in the container then just add that to the apt-get install in the def file

20:36:17 drmeister Trying that.

22:16:02 drmeister No more long tail...

22:16:03 drmeister https://usercontent.irccloud-cdn.com/file/qGJE6dRZ/image.png

22:18:46 drmeister It was a handler that recognized 3 or 4 linear atoms (a problem for non-linear optimization) and that caught the error and tried to shake up the 3 or 4 linear atoms. It doesn't work very well probably because the rest of the structure forces the atoms back into a linear arrangement.

22:19:28 drmeister There was a potential infinite loop of handling the error and then restarting the calculation and it generating the error again. It would very occasionally knock itself out of that cycle.

22:19:39 drmeister I set it up so it only tries 3 times and then gives up.
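
[Editor's sketch] The fix described above, capping the handle-and-restart cycle at three tries, can be sketched as a bounded retry loop. This is a Python illustration only; `LinearAtomsError`, `optimize`, `perturb`, and `optimize_with_retries` are hypothetical names, and the real handler lives in the Lisp code.

```python
class LinearAtomsError(Exception):
    """Hypothetical error: the optimizer hit 3 or 4 atoms in a line."""


def optimize_with_retries(optimize, perturb, structure, max_attempts=3):
    """Try the optimization at most max_attempts times, then give up.

    optimize and perturb are caller-supplied (hypothetical) functions.
    Returns the optimized result, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            return optimize(structure)
        except LinearAtomsError:
            # Shake up the structure only if another attempt will follow.
            if attempt + 1 < max_attempts:
                structure = perturb(structure)
    return None  # bounded: no endless error/restart cycle


if __name__ == "__main__":
    def always_linear(s):
        raise LinearAtomsError()

    print(optimize_with_retries(always_linear, lambda s: s, "structure"))
```

With the cap in place, a structure that keeps snapping back to a linear arrangement costs at most three optimization attempts instead of occupying a worker indefinitely.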

22:20:24 drmeister I have an MPI build in apptainer. I'm not sure how to test it though.

23:15:58 drmeister Things are going well now with load balancing.

23:15:59 drmeister https://usercontent.irccloud-cdn.com/file/OG8k6vIX/image.png

23:16:12 drmeister 76% utilization.

23:17:32 drmeister I don't know why there are so many ups and downs