freenode/#clasp - IRC Chatlog

15:14:17 drmeister Hello everyone

15:15:17 drmeister I'm trying to improve cando's subgraph isomorphism matching - has anyone done anything like that before

15:15:19 drmeister ?

15:16:00 beach That smells like an intractable problem.

15:16:38 drmeister Ha - I do intractable problems before breakfast!

15:16:50 drmeister Literally - I just got up and I haven't had breakfast

15:17:20 drmeister It's something we do in chemistry - compare subgraphs to graphs.

15:17:21 drmeister https://ieeexplore.ieee.org/document/1323804

15:17:56 drmeister I can't understand these papers - the notation they use is unfamiliar to me.

15:18:12 beach https://en.wikipedia.org/wiki/Subgraph_isomorphism_problem

15:18:24 drmeister On the other hand I have a partial implementation of the algorithm that bails out after the first hit and I need it to find all the solutions.

15:18:35 beach Though perhaps your problem is different from the one from theoretical CS.

15:19:08 Bike it's the same problem, though there might be some practical improvements possible by relying on the peculiar graph structure of molecules

15:19:31 beach But, yeah, intractability has never prevented us from doing our jobs.

15:20:00 drmeister Onward algorithmic soldiers!

15:20:24 drmeister I read that wiki page several times last night.

15:20:28 drmeister What does this mean?

15:20:39 drmeister https://usercontent.irccloud-cdn.com/file/JLVwsJ4j/image.png

15:20:59 Bike intersection with the cartesian square of the set V_0

15:21:03 beach In the article "Beyond worst-case analysis" in this month's CACM, the author thinks that many problems that are known to be intractable (in the CS sense of the word) are hard only for instances that don't matter.

15:21:26 Bike let me translate this bit into english

15:21:39 drmeister This is set notation - right?

15:22:08 Bike yes

15:22:19 Bike i mean that whole definition there is just defining what a subgraph is

15:22:25 Bike but you already know what a subgraph is

15:23:13 Bike it says a subgraph G0 has all vertices that are vertices of the greater graph, and edges that are all edges of the greater graph between vertices that are in the subgraph

15:24:07 Bike then the rest of the paragraph defines the subgraph isomorphism problem as finding bijections between the search graph's and the subgraph's vertices and edges.

15:24:11 Bike so more stuff you already know

15:25:06 drmeister Bike - you know our smarts code? Well, it's not ready for smirnoff because it doesn't enumerate all the solutions when you search from a particular atom.

15:25:18 Bike yeah, you said so before.

15:25:19 drmeister It's hard coded to find the first solution and then return.

15:25:37 drmeister I'm trying to figure out how to fix it so it enumerates the solutions.

15:25:49 Bike though i'm not sure what you mean "search from a particular atom"

15:25:59 Bike like, constrain it to solutions where the subgraph includes that atom?

15:26:34 drmeister The smarts code is reduced to a decision tree that you test by starting on a particular atom.

15:28:31 Bike ok...

15:30:31 Bike smarts is probably a slightly harder problem because you also care about what elements the atoms are and so on.

15:30:42 Bike though maybe that actually makes it easier since you can rule out possibilities faster.

15:32:59 drmeister https://usercontent.irccloud-cdn.com/file/oZ5KmVnd/image.png

15:33:00 Bike harder in the sense of programming it, easier in the sense of speed

15:33:21 drmeister This is helping to remind me how this all works.

15:33:36 drmeister So that picture has two structures and a SMARTS string.

15:33:46 Bike maybe i could work on it? i have enough CS education to read the math and enough chemistry to read that.

15:33:57 drmeister The SMARTS string gets parsed into the matcher below.

15:34:19 drmeister It does work - I'm 90% there - it's been working for years.

15:34:39 drmeister My problem is it doesn't iterate through all solutions, it finds the first match and returns.

15:34:47 drmeister So bear with me.

15:35:12 drmeister I mean - I'm walking through my thinking here...

15:35:44 drmeister The Chain(...) thing, you start that on one atom at a time.

15:36:11 drmeister If you start it on N2 it will bail out immediately because the head of Chain(C,...) is a C and not an N.

15:36:30 drmeister That is what I mean by search from a particular atom.

15:37:11 Bike okay, well, that's just the implementation then.

15:37:33 drmeister Right - Chain(head,tail) . head is an atom test. tail is a Chain or a Branch or NIL

15:38:00 drmeister Branch(left,right) . left and right are each either a Chain or a Branch
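
The Chain/Branch shape described above can be sketched as two small record types. This is only an illustration in Python, not cando's C++ classes: the atom test is reduced to a bare element symbol, and `chain_from_elements` is a hypothetical helper that handles nothing beyond a linear string of single-letter elements.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Chain:
    head: str                      # atom test, reduced here to an element symbol
    tail: Optional["Node"] = None  # a Chain, a Branch, or None (NIL)


@dataclass(frozen=True)
class Branch:
    left: "Node"   # a Chain or a Branch
    right: "Node"  # a Chain or a Branch


Node = Union[Chain, Branch]


def chain_from_elements(symbols: str) -> Optional[Chain]:
    """Hypothetical helper: build a nested Chain from single-letter elements."""
    node = None
    # Build from the last atom backwards so each Chain wraps its tail.
    for symbol in reversed(symbols):
        node = Chain(symbol, node)
    return node


ncc = chain_from_elements("NCC")
print(ncc)

# A branched matcher has the same shape, with a Branch as the tail of a Chain.
branched = Chain("C", Branch(Chain("N"), Chain("O")))
print(branched)
```

Keeping the nodes immutable means a matcher can be shared between searches without any search state leaking into the pattern itself.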

15:42:50 drmeister If I had coroutines this would be easy.

15:44:04 drmeister The current code does a search with backtracking - it's fine for saying - yes - this atom is in a subgraph that matches the pattern - that's fine for atom type assignment.

15:44:42 drmeister But once it returns, the stack is unwound and where it is in the search is lost.

15:44:59 drmeister I need to change the way the search goes so that I can iterate through the solutions...

15:46:27 drmeister Hmm, right now the contains the state of the search. I need to get the state of the search into a vector that stores the values that are currently on the stack.

15:46:50 drmeister Hmm, right now the STACK contains the state of the search.

15:46:59 drmeister Damn this keyboard
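
One way to keep the search position in a vector instead of on the call stack is sketched here. It is a Python illustration under stated assumptions, not the code in chemInfo.cc: the pattern is a plain string of element symbols (chains only, no branches), and `MOLECULE` and `ChainSearch` are invented for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy molecule: atom name -> (element, bonded atom names).
MOLECULE: Dict[str, Tuple[str, List[str]]] = {
    "N1": ("N", ["C1", "C3"]),
    "C1": ("C", ["N1", "C2"]),
    "C2": ("C", ["C1"]),
    "C3": ("C", ["N1", "C4"]),
    "C4": ("C", ["C3"]),
}


class ChainSearch:
    """Resumable depth-first search for a linear chain of element tests.

    The position in the search lives in self.stack, so it survives
    between calls to next_match().
    """

    def __init__(self, molecule, pattern: str, start: str):
        self.molecule = molecule
        self.pattern = pattern
        # Each frame is [atom, index of the next neighbour to try].
        self.stack: List[list] = []
        if pattern and molecule[start][0] == pattern[0]:
            self.stack.append([start, 0])

    def next_match(self) -> Optional[List[str]]:
        while self.stack:
            depth = len(self.stack)
            if depth == len(self.pattern):
                match = [frame[0] for frame in self.stack]
                self.stack.pop()  # backtrack so the next call resumes here
                return match
            atom, i = self.stack[-1]
            neighbours = self.molecule[atom][1]
            if i >= len(neighbours):
                self.stack.pop()  # this atom has no untried neighbours left
                continue
            self.stack[-1][1] = i + 1
            candidate = neighbours[i]
            used = [frame[0] for frame in self.stack]
            if (candidate not in used
                    and self.molecule[candidate][0] == self.pattern[depth]):
                self.stack.append([candidate, 0])
        return None  # search exhausted


search = ChainSearch(MOLECULE, "NCC", "N1")
while (found := search.next_match()) is not None:
    print("".join(found))
```

Each call returns one match and leaves the frames in place, so the caller can stop after the first hit (the current behaviour) or keep calling until it gets None.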

15:48:42 drmeister I'm going to focus on chains first - they are simpler.

15:48:55 drmeister If I had the chain NCC

15:53:10 drmeister https://usercontent.irccloud-cdn.com/file/eshrXXFz/image.png

15:54:19 drmeister I switched to S-expressions, rather than mathematical notation.

15:54:26 drmeister ACTION hates commas

15:55:29 drmeister So test #2 starting on N2 should return N2C4C5, N2C6C7 and N2C8C9 and starting on N3 return N3C2C3 and everything else should fail

15:56:21 drmeister Right now, test#2 on N3 will return N3C2C3 and on N2 will arbitrarily return one of the solutions N2C4C5, N2C6C7 or N2C8C9 and false starting on any other atom.
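
The desired behaviour for the NCC chain can be sketched with a Python generator, which is close to the coroutine idea mentioned earlier. The bonding in `MOLECULE` is invented so that the enumeration reproduces the matches listed above; it is not taken from the linked image, and the function names are hypothetical.

```python
# Hypothetical toy structure: atom name -> (element, bonded atom names).
MOLECULE = {
    "N2": ("N", ["C4", "C6", "C8"]),
    "C4": ("C", ["N2", "C5"]),
    "C5": ("C", ["C4"]),
    "C6": ("C", ["N2", "C7"]),
    "C7": ("C", ["C6"]),
    "C8": ("C", ["N2", "C9"]),
    "C9": ("C", ["C8"]),
    "N3": ("N", ["C2"]),
    "C2": ("C", ["N3", "C3"]),
    "C3": ("C", ["C2"]),
}


def chain(symbols):
    """Build a nested (element, tail) pattern; tail is None at the end."""
    node = None
    for symbol in reversed(symbols):
        node = (symbol, node)
    return node


def matches(molecule, pattern, atom, used=()):
    """Yield every path from `atom` that satisfies `pattern`, one at a time."""
    element, tail = pattern
    if atom in used or molecule[atom][0] != element:
        return  # head test failed: bail out immediately
    path = used + (atom,)
    if tail is None:
        yield path
        return
    for neighbour in molecule[atom][1]:
        # The generator keeps its place in this loop between matches.
        yield from matches(molecule, tail, neighbour, path)


def all_matches(start):
    return ["".join(path) for path in matches(MOLECULE, chain("NCC"), start)]


print(all_matches("N2"))  # ['N2C4C5', 'N2C6C7', 'N2C8C9']
print(all_matches("N3"))  # ['N3C2C3']
print(all_matches("C4"))  # []
```

Taking only the first item from `matches(...)` gives the existing yes/no behaviour, so atom type assignment would not need a separate code path.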

15:58:07 drmeister Not that you need to look at it but the code that does this is in chemInfo.cc https://github.com/drmeister/cando/blob/dev/src/chem/chemInfo.cc#L1558 and https://github.com/drmeister/cando/blob/dev/src/chem/chemInfo.cc#L1621

15:58:37 drmeister Then I have these papers that sound like gobbledygook and I can't make sense of their algorithms.

16:00:40 drmeister Ha - I just found one with pictures - that looks interesting.

16:00:55 drmeister https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3586954/#B12

16:04:03 drmeister I'm pretty sure I'm doing a brute force, depth first search

16:47:02 drmeister The "VF2" algorithm seems to be the best.

16:47:04 drmeister http://depth-first.com/articles/2008/11/13/one-of-these-things-is-not-like-the-other/

16:47:16 drmeister https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3586954/#B12

16:47:33 drmeister https://stackoverflow.com/questions/6743894/any-working-example-of-vf2-algorithm/6744603#6744603

16:49:14 drmeister These descriptions are all so shitty.

16:51:11 Bike "There's just one problem: the Ullmann algorithm detects edge-induced isomorphisms. This means, for example, that if your query molecule is propane and your test molecule is cyclopropane, you won't find a match with an Ullmann-backed tool." why is that bad

16:51:23 Bike they're not isomorphic

16:52:50 drmeister I'm wracking my brains why it would be bad not to match cyclopropane (C1CC1) given propane (CCC)

16:53:05 drmeister I can't come up with a reason.

16:53:51 Bike i mean if you just want a carbon bonded to a carbon bonded to a carbon you should be able to specify that too but propane is more specific
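
The two readings being contrasted here can be made concrete with a brute-force check. This is a sketch on carbon skeletons only (hydrogens ignored), and the `induced` flag is just an illustrative name for "extra bonds between matched atoms are forbidden"; it does not settle which terminology the quoted article intends.

```python
from itertools import permutations


def edge_set(edges):
    """Store undirected bonds as frozensets so (a, b) equals (b, a)."""
    return {frozenset(edge) for edge in edges}


def embeddings(p_nodes, p_edges, t_nodes, t_edges, induced):
    """Brute force: try every injective assignment of pattern atoms to target atoms."""
    p_set, t_set = edge_set(p_edges), edge_set(t_edges)
    found = []
    for image in permutations(t_nodes, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        ok = True
        for i, a in enumerate(p_nodes):
            for b in p_nodes[i + 1:]:
                in_pattern = frozenset((a, b)) in p_set
                in_target = frozenset((mapping[a], mapping[b])) in t_set
                if in_pattern and not in_target:
                    ok = False  # a pattern bond is missing in the target
                if induced and in_target and not in_pattern:
                    ok = False  # the target has an extra bond between matched atoms
        if ok:
            found.append(mapping)
    return found


# Carbon skeletons: propane is a 3-atom path, cyclopropane a 3-atom ring.
propane = (["a", "b", "c"], [("a", "b"), ("b", "c")])
cyclopropane = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])

print(len(embeddings(*propane, *cyclopropane, induced=False)))
print(len(embeddings(*propane, *cyclopropane, induced=True)))
```

With `induced=False` every placement of the path on the ring is accepted; with `induced=True` the ring-closing bond rules them all out, which is the "they're not isomorphic" position.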

16:53:57 drmeister I focused on the Ehrlich paper that says that VF2 is faster than Ullmann

16:54:40 drmeister The VF2 algorithm is described here - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.5342&rep=rep1&type=pdf

16:54:47 drmeister I'm trying to make heads or tails of it.

16:55:08 drmeister To map it into what I know of the problem. I recognize a lot of it but I still don't get it (sigh).

16:55:14 Bike oh, this isn't well written.

16:56:19 drmeister https://www.youtube.com/watch?v=tO5sxLapAts

16:56:39 Bike there are typos, even

16:58:46 Bike well the high level description looks pretty much like what cando already has, except it appends solutions into a list instead of returning immediately.

16:59:12 Bike the feasibility function and P might be different, i guess
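
The high-level shape described here (try candidate pairs, test feasibility, append each complete mapping to a list instead of returning) can be sketched as plain backtracking. This is not VF2 itself: the candidate ordering and the feasibility test are deliberately naive, and all names and the toy graphs are assumptions made for the example.

```python
def make_feasible(pattern, target, p_labels, t_labels):
    """Return a feasibility test for mapping pattern atom p onto target atom t."""
    def feasible(p, t, mapping):
        if p_labels[p] != t_labels[t]:
            return False
        # Every already-mapped pattern neighbour of p must land on a neighbour of t.
        return all(mapping[q] in target[t] for q in pattern[p] if q in mapping)
    return feasible


def find_all(pattern, target, feasible):
    """pattern/target: dict atom -> set of bonded atoms. Returns every mapping."""
    order = list(pattern)
    solutions = []

    def extend(mapping):
        if len(mapping) == len(order):
            solutions.append(dict(mapping))  # record it and keep searching
            return
        p = order[len(mapping)]
        for t in target:  # candidate pairs (p, t)
            if t in mapping.values():
                continue
            if feasible(p, t, mapping):
                mapping[p] = t
                extend(mapping)
                del mapping[p]  # backtrack

    extend({})
    return solutions


# Pattern N-C searched in a C-N-C skeleton.
pattern = {"n": {"c"}, "c": {"n"}}
p_labels = {"n": "N", "c": "C"}
target = {"C1": {"N1"}, "N1": {"C1", "C2"}, "C2": {"N1"}}
t_labels = {"C1": "C", "N1": "N", "C2": "C"}

feasible = make_feasible(pattern, target, p_labels, t_labels)
for solution in find_all(pattern, target, feasible):
    print(solution)
```

Swapping `solutions.append(...)` for an early return is the only difference between "find all" and "find first" in this form.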

16:59:23 drmeister I'm getting a lot more out of this paper that compares Ullmann to VF2

16:59:24 drmeister https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3586954/#B12

16:59:32 drmeister I don't know if you can see the full text and pictures.

16:59:49 Bike pubmed is full text.

16:59:55 drmeister Cool

17:00:04 Bike hooray for the federal government

17:03:38 drmeister To begin with I'm still struggling with Figure 2 and Figure 3 that illustrate the Ullmann algorithm.

17:04:27 drmeister I like that this has nice pictures and a clear example - it makes me hopeful that I can figure this damn thing out and then find out how far my own implementation is from it.

17:04:44 Bike does cando's implementation involve a bit matrix?

17:04:50 drmeister No

17:04:55 Bike probably pretty far then.

17:05:10 drmeister VF2 doesn't appear to use a bit matrix either.

17:05:49 Bike the pictures are nice but this isn't a full explanation of either algorithm.

17:05:55 drmeister I have all of the code to do atom and bond matching. I generate a tree from smarts code. I have enough of the elements to find the first match.

17:06:24 drmeister Agreed - I'm crawling towards understanding.

17:06:50 Bike the ullmann paper is 1976 and ACM which makes it annoying to get

17:07:13 drmeister For instance, in the first box in Figure 2 - the '*' row are all 1 because every atom in heptanoic acid will match '*' (wildcard)

17:07:14 Bike it doesn't seem to involve anything like a first atom, though

17:08:30 drmeister Ullmann's paper is bleh (IMHO)

17:09:33 drmeister Pseudo code like this:

17:09:34 drmeister https://usercontent.irccloud-cdn.com/file/CbeJM3Pm/image.png

17:09:47 Bike 70s computer science, woo

17:09:50 drmeister And useful figures like this...

17:10:00 drmeister https://usercontent.irccloud-cdn.com/file/uNAsq6mb/image.png

17:10:09 Bike oh, at least it's actually specifically chemically oriented

17:11:03 drmeister From a pragmatic perspective - understanding Ullmann doesn't appear necessary to understand VF2.

17:11:15 Bike maybe not

17:11:55 drmeister But these highfalutin computer science professors waving their fancy graph theory around annoy me.

17:12:16 Bike well, you'll probably have to figure it out, given that it is the actual problem

17:12:24 Bike do you know stuff like what an adjacency matrix is?

17:23:19 drmeister I think so. I'm asking myself "how would I create the first panel here..."

17:23:29 drmeister https://usercontent.irccloud-cdn.com/file/n0YBh1zV/image.png

17:23:48 Bike well that's something different

17:23:50 Bike that's the M' matrix in the paper

17:24:19 Bike or well, M0 i think.

17:25:04 drmeister Well, this is Ullmann we are talking about - does it have M0

17:25:38 Bike ...yeah? that's what i mean. It's the matrix called M0 in section 2 of his paper, i believe

17:27:12 drmeister F*ck that paper is hard to read.

17:32:24 Bike the M' matrices tell you which nodes could possibly match just by degree, and then you refine with the actual structure

17:32:27 Bike i think
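
The degree-based starting matrix can be sketched as follows: entry (i, j) is 1 when target atom j has at least as many bonds as pattern atom i, so it could still host that atom. This is a minimal Python illustration of that one step under the reading given above, with invented graphs; the refinement step that prunes the matrix afterwards is not shown.

```python
def initial_candidates(pattern, target):
    """pattern/target: dict atom -> set of bonded atoms.

    Returns a matrix with one row per pattern atom and one column per
    target atom; 1 means "not ruled out by degree alone".
    """
    p_nodes, t_nodes = list(pattern), list(target)
    return [
        [1 if len(target[t]) >= len(pattern[p]) else 0 for t in t_nodes]
        for p in p_nodes
    ]


# Pattern: a 3-atom path a-b-c. Target: a 4-atom path 1-2-3-4.
pattern = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
target = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

for row in initial_candidates(pattern, target):
    print(row)
```

The middle pattern atom needs two bonds, so the end atoms of the target are excluded for it before any search starts. An atom-type test (element, wildcard) could be folded into the same condition.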

17:33:09 drmeister What I have are bonds on atoms - I know from that what is adjacent to what.

17:33:54 Bike right, an adjacency matrix encodes that information into a matrix instead of a graph data structure

17:33:58 Bike that's different from the M' matrix though

17:34:55 drmeister It's a matrix where every row and column represent an atom and a 1 at (i,j) represents there is a bond between atom i and atom j - right?

17:35:01 Bike yeah.
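
Building such a matrix from a list of bonds is short. A minimal sketch in Python, with hypothetical atom names and a plain list of lists standing in for whatever matrix type the real code would use:

```python
def adjacency_matrix(atoms, bonds):
    """Return a symmetric 0/1 matrix; entry (i, j) is 1 if atoms i and j are bonded."""
    index = {atom: i for i, atom in enumerate(atoms)}
    matrix = [[0] * len(atoms) for _ in atoms]
    for a, b in bonds:
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = 1  # bonds are undirected
    return matrix


atoms = ["C1", "C2", "O1"]
bonds = [("C1", "C2"), ("C2", "O1")]

for row in adjacency_matrix(atoms, bonds):
    print(row)
```

Row sums give each atom's degree, which is the quantity the degree-based candidate matrix is built from.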

17:35:15 drmeister I can build one - no problem.

17:35:41 Bike sure. i was just wondering if you were familiar with fundamental concepts like that.

17:36:10 drmeister Oh - yeah - I just don't see where it fits in here - other than - yes I need to know what is adjacent to what.

17:36:34 Bike just cos it's one of the things ullmann talks about without explaining.

17:39:21 drmeister Ah - I see where you are going with that. I'm jumping around the papers trying to find some insight while avoiding tackling Ullmann's paper head on - it's nasty.

17:44:33 drmeister Ullmann has a much more recent paper on the algorithm

18:06:57 drmeister Hmm, there is a vf2 implementation in boost::graph

19:24:26 selwyn hi all

19:44:46 selwyn drmeister: did you see this paper? linked to from the wiki page https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3633016/ i found it well-written and it has a summary of the state of the art. the authors propose an algorithm 'RI' which always outperforms vf2

19:45:53 selwyn they maintain an implementation here https://github.com/InfOmics/RI-DS

21:57:05 selwyn i still get 'out of memory errors' when building cclasp in parallel.. it looks like serial is working. does anyone else still have problems?

21:57:26 drmeister How much memory do you have?

21:57:32 drmeister 16GB?

21:58:09 selwyn yes 16gb

22:00:46 drmeister When you are building in parallel you can control the number of parallel processes that are run at the same time.

22:00:58 drmeister Try ./waf build_cboehm -j4

22:01:07 drmeister It may need a space between j and 4

22:01:17 selwyn ah had not thought of building with fewer *sigh*

23:25:47 selwyn it worked thanks very much. first time i started it up it complained Compile-error The variable LITERAL::*CONSTANT-DATUM-TO-LITERAL-NODE-CREATOR* is unbound. now it's fine..

23:26:40 drmeister I've been working with Martin all day on the distributor and fixing problems with the TI calculations - I think I got it working now.

23:27:09 drmeister We are running calculations on a combination of AWS spot instances and my desktop GPU card. It's working nicely now.

23:27:29 drmeister It involves 132 GPU accelerated jobs to run one calculation.

23:27:52 drmeister https://usercontent.irccloud-cdn.com/file/CvMA0ZvB/graph.dot.pdf

23:28:19 drmeister It's running all of the yellow ellipses - most of them are GPU accelerated Amber.

23:28:39 drmeister I also got boost::graph hooked into Cando.

23:28:57 drmeister So I can punt understanding the VF2 algorithm and just use the boost::graph implementation.

23:29:12 drmeister There is a bunch of other useful stuff in boost::graph that I have wanted to use for a while.

23:32:58 selwyn the cl-vulkan example works!

23:33:28 drmeister Cool!

23:34:05 selwyn is there an obvious reason why building should require more memory after these recent changes?

23:34:19 drmeister No.

23:34:29 drmeister How recent?