freenode/#clasp - IRC Chatlog
Search
15:15:17
drmeister
I'm trying to improve cando's subgraph isomorphism matching - has anyone done anything like that before
15:18:24
drmeister
On the other hand I have a partial implementation of the algorithm that bails out after the first hit and I need it to find all the solutions.
15:19:08
Bike
it's the same problem, though there might be some practical improvements possible by relying on the peculiar graph structure of molecules
15:21:03
beach
In the article "Beyond worst-case analysis" in this month's CACM, the author thinks that many problem that are known to be intractable (in the CS sense of the word) are hard only for instances that don't matter.
15:23:13
Bike
it says a subgraph G0 has all vertices that are vertices of the greater graph, and edges that are all edges of the greater graph between vertices that are in the subgraph
15:24:07
Bike
then the rest of the paragraph defines the subgraph isomorphism problem as finding bijections between the search graph's and the subgraph's vertices and edges.
15:25:06
drmeister
Bike - you know our smarts code? Well, it's not ready for smirnoff because it doesn't enumerate all the solutions when you search from a particular atom.
15:26:34
drmeister
The smarts code is reduced to a decision tree that you test by starting on a particular atom.
15:30:31
Bike
smarts is probably a slightly harder problem because you also care about what elements the atoms are and so on.
15:30:42
Bike
though maybe that actually makes it easier since you can rule out possibilities faster.
15:33:46
Bike
maybe i could work on it? i have enough CS education to read the math and enough chemistry to read that.
15:34:39
drmeister
My problem is it doesn't iterate through all solutions, it finds the first match and returns.
15:36:11
drmeister
If you start it on N2 it will bail out immediately because the head of Chain(C,...) is a C and not an N.
15:37:33
drmeister
Right - Chain(head,tail) . head is an atom test. tail is a Chain or a Branch or NIL
15:44:04
drmeister
The current code does a search with backtracking - it's fine for saying - yes - this atom is in a subgraph that matches the pattern - that's fine for atom type assignment.
15:44:59
drmeister
I need to change the way the search goes so that I can iterate through the solutions...
15:46:27
drmeister
Hmm, right now the contains the state of the search. I need to get the state of the search into a vector that stores the values that are currently on the stack.
15:55:29
drmeister
So test #2 starting on N2 should return N2C4C5, N2C6C7 and N2C8C9 and starting on N3 return N3C2C3 and everything else should fail
15:56:21
drmeister
Right now, test#2 on N3 will return N3C2C3 and on N2 will arbitrarily return one of the solutions N2C4C5, N2C6C7 or N2C8C9 and false starting on any other atom.
15:58:07
drmeister
Not that you need to look at it but the code that does this is in chemInfo.cc https://github.com/drmeister/cando/blob/dev/src/chem/chemInfo.cc#L1558 and https://github.com/drmeister/cando/blob/dev/src/chem/chemInfo.cc#L1621
15:58:37
drmeister
Then I have these papers that sound like gobbledy gook and I can't make sense of their algorithms.
16:47:04
drmeister
http://depth-first.com/articles/2008/11/13/one-of-these-things-is-not-like-the-other/
16:47:33
drmeister
https://stackoverflow.com/questions/6743894/any-working-example-of-vf2-algorithm/6744603#6744603
16:51:11
Bike
"There's just one problem: the Ullmann algorithm detects edge-induced isomporphisms. This means, for example, that if your query molecule is propane and your test molecule is cyclopropane, you won't find a match with an Ullmann-backed tool. " why is that bad
16:52:50
drmeister
I'm wracking my brains why it would be bad not to match cyclopropane (C1CCC1) given propane (CCC)
16:53:51
Bike
i mean if you just want a carbon bonded to a carbon bonded to a carbon you should be able to specify that too but propane is more specific
16:54:40
drmeister
The VF2 algorithm is described here - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.5342&rep=rep1&type=pdf
16:55:08
drmeister
To map it into what I know of the problem. I recognize a lot of it but I still don't get it (sigh).
16:58:46
Bike
well the high level description looks pretty much like what cando already has, except it appends solutions into a list instead of returning immediately.
17:03:38
drmeister
To begin with I'm still struggling with Figure 2 and Figure 3 that illustrate the Ullman algorithm.
17:04:27
drmeister
I like that this has nice pictures and a clear example - it makes me hopeful that I can figure this damn thing out and then find out how far my own implementation is from it.
17:05:55
drmeister
I have all of the code to do atom and bond matching. I generate a tree from smarts code. I have enough of the elements to find the first match.
17:07:13
drmeister
For instance, in the first box in Figure 2 - the '*' row are all 1 because every atom in heptanoic acid will match '*' (wildcard)
17:11:03
drmeister
From a pragmatic perspective - understanding Ullman doesn't appear necessary to understand VF2.
17:11:55
drmeister
But these highfalutin computer science professors waving their fancy graph theory around annoys me.
17:25:38
Bike
...yeah? that's what i mean. It's the matrix called M0 in section 2 of his paper, i believe
17:32:24
Bike
the M' matricies tell you which nodes could possibly match just by degree, and then you refine with the actual structure
17:33:54
Bike
right, an adjacency matrix encodes that information into a matrix instead of a graph data structure
17:34:55
drmeister
It's a matrix where every row and column represent an atom and a 1 at (i,j) represents there is a bond between atom i and atom j - right?
17:36:10
drmeister
Oh - yeah - I just don't see where it fits in here - other than - yes I need to know what is adjacent to what.
17:39:21
drmeister
Ah - I see where you are going with that. I'm jumping around the papers trying to find some insight while avoiding tackling Ullman's paper head on - it's nasty.
19:44:46
selwyn
drmeister: did you see this paper? linked to from the wiki page https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3633016/ i found it well-written and it has a summary of the state of the art. the authors propose an algorithm 'RI' which always outperforms vf2
21:57:05
selwyn
i still get 'out of memory errors' when building cclasp in parallel.. it looks like serial is working. does anyone else still have problems?
22:00:46
drmeister
When you are building in parallel you can control the number of parallel processes that are run at the same time.
23:25:47
selwyn
it worked thanks very much. first time i started it up it complained Compile-error The variable LITERAL::*CONSTANT-DATUM-TO-LITERAL-NODE-CREATOR* is unbound. now it's fine..
23:26:40
drmeister
I've been working with Martin all day on the distributor and fixing problems with the TI calculations - I think I got it working now.
23:27:09
drmeister
We are running calculations on a combination of AWS spot instances and my desktop GPU card. It's working nicely now.
23:28:19
drmeister
It's running all of the yellow ellipses - most of them are GPU accelerated Amber.
23:28:57
drmeister
So I can punt understanding the VF2 algorithm and just use the boost::graph implementation.
23:29:12
drmeister
There is a bunch of other useful stuff in boost::graph that I have wanted to use for a while.
23:34:05
selwyn
is there an obvious reason why building should require more memory after these recent changes?