Monday, 11 November 2013

Improved Substructure Matching


Since it was written the UniversalIsomorphismTester (UIT) has been the goto class in the Chemistry Development Kit (CDK) to check structure identity, substructure and overlap matching. The utility works really well but derives the substructure from the more expensive MCS computation. There has been several efforts to bring faster structure matching through Ullmann [1] and the Vento-Foggia (VF) [2] algorithms. Most notably VF was used for structure matching in OrChem [3] and SMSD [4]. The original SMSD code was added to CDK but since then SMSD has been completely rewritten and optimised. I recently tried reintegrating the latest version but have had several failed attempts. 

Substructure and identity checking is fundamental and I felt it needed a more centralised home in the code base. This weekend I wrote the first parts of the API and implementations which I will discuss here. 

Lazy Matching

For brevity I won't go over how substructure matching works, Rich Apodaca has two excellent introductions on One of These Things is Not Like The Others and Substructure Search From Scratch which gives the key ideas. In the first post it is mentioned that recursion is at the heart of any implementation. Generally recursive implementations are one-or-all, that is you can get the first match or all matches.

As recursion is simply storing frames on the call stack for matching you can easily replace the recursion with a non-recursive stack which only stores each candidate matchings. Providing front end access to this allows a lazy substructure matcher which improves memory and has another benefit we'll see later.

So far I've added three matchers VentoFoggia (substructure/identity) and Ullmann (substructure) with scope for others [5].

API

The new structure matching paradigm is similar to Java regular expressions - a Pattern is created for a query and then that pattern can be tested on multiple targets. Whether the pattern is matching SMARTS queries, identity or substructure doesn't matter. Personally I've found my self parsing the query and target molecules in the wrong order with the existing matchers and using the pattern makes it impossible to get the order wrong.

IAtomContainer query   = ...;
Pattern        pattern = Ullmann.findSubstructure(query);
for (IAtomContainer target : targets) 
    if (pattern.matches(target))
        hits++;
With language features due in Java 8 the matching hits can even be done in one line.
List<IAtomContainer> hits = targets.filter(Ullmann.findSubstructure(query)::matches);
The first atom-mapping can be obtained with match() and all atom mappings with matchAll() (which is lazy).
for (IAtomContainer target : targets) 
    for (int[] mapping : pattern.matchAll(target))
        hits++;

Counting Self Matches 

To test the performance I first looked at counting self matches. You'd generally want to do this with a partition refinement procedure (see. Using the CDK's group module) but it serves as a good test for the matchers. To compare the new matchers I've used the CDK's UniversalIsomorphismTester (UIT) and the latest SMSD's Substructure (VF implementation). It should be noted that SMSD does go a lot faster with an API change (see. Benchmarking Substructure Search) but this has not yet been added.
ferrocene as 200 isomorphisms to it's self
The average (r=10) time taken to match and count isomorphisms (n) was tested on several symetric structures. All structures were normalised to only have sigma bonds. The SMSD implementation did not get the correct number of matches which I think is a bug propagated from the original ChemKit implementation. If any matcher took longer than 10 minutes I considered it as timed out (shown as n/a).

Structure Matched UIT SMSD Ullmann VentoFoggia
nt (s) t (s) t (s) t (s)
ferrocene 2000.281 0.166 (n=180?) 0.007 0.001
fullerene C60 120279 7.5 (n=1) 1.2 0.017
fullerene C70 60n/a 9.8 (n=1) 2 0.058
fullerene C320 120n/a n/an/a 0.527

VentoFoggia only took half a second to find all matches in fullerene C320. I was curious to see how far it could go. The new matchers allow disconnected structures and matching multiple cyclopropanes we quickly reach a huge number of matches.

C1CC1.C1CC1.C1CC1.C1CC1.C1CC1.C1CC1 has more than 33 million isomorphisms
Structure Matched Ullmann VentoFoggia
(SMILES) nt (s) t (s)
C1CC1 6 0 0
C1CC1.C1CC1 72 0.002 0.002
C1CC1.C1CC1.C1CC1 1,296 0.002 0.005
C1CC1.C1CC1.C1CC1.C1CC1 31,104 0.168 0.037
C1CC1.C1CC1.C1CC1.C1CC1.C1CC1 933,120 6.5 1.1
C1CC1.C1CC1.C1CC1.C1CC1.C1CC1.C1CC1 33,592,320 270.5 36.3

Dataset Searching

As a more general gauge of performance the NCI Aug 2000 (~250,000 structures) was searched for structures containing O=C1C=CCC=C1. It's a single isolated test so conclusions can't really be drawn but it provides a sense for the timings. 
Query structure matched against ~250,000 structures in the NCI Aug 2000 data set.
The SMSD uses heuristics internally whilst the others were run without heuristics on every structure (even if the query was larger than the target). How long the matching took for each implementation is shown below. When finding hits we can find the first match in each structure (3,623 total) or find all the hits (12,706 total). The SMSD matcher is fast when finding the first matching in each structure but much slower to find all matches.

Implementationt firstt all
UIT43.4 s43.6 s
SMSD28.6 s4 m 40 s
Ullmann6.3 s6.5 s
VentoFoggia5.1 s4.9 s

Stereochemistry Matching

The new matchers also check the tetrahedral and geometric (double-bond) stereo configurations are consistent. Technically it's more efficient to check stereo configurations as the vertices are mapped but I decided to go for a post-matching check. This is more modular and also means we can use the same stereo consistency checking for other matchers (the existing UIT). This is where the lazy matching is really useful as we can get the first match which is stereochemically valid without finding all matchings.The stereo checking also seems to be fast enough and I've left it on by default (including all times in this post).

Using the VentoFoggia implementation, it takes ~38 seconds to find all 4,306,056 matches for C(C)(C)(C)(C) in ChEMBL 17. Matching one with a stereo configuration ([C@](C)(C)(C)(C)) is actually just as quick, taking ~37 seconds to find 1,007,652 matches.

Finally I'll leave you with a sneak peak at the Smiles class I have planned which makes loading and matching molecules from SMILES concise and elegant.
Pattern pattern = VentoFoggia.findSubstructure(Smiles.fromString("[C@](C)(C)(C)(C)"));
for (IAtomContainer target : Smiles.fromPath("/db/chembl/chembl_17.smi.gz")) {
    hits += Iterables.size(pattern.matchAll(target)); // Guava utils
}

Code

The matcher code is currently awaiting review on the branch feature/isomorphisms. If the branch is missing then it will be in the package org.openscience.cdk.isomorphism.

References

1. Ullmann JR. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery. 2004 
2. Cordella LP et al. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. 2004
3. Rijnbeek M and Steinbeck C. OrChem - An open source chemistry search engine for Oracle. Journal of Cheminformatics. 2009
4. Rahman SA et al. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics. 2009
5. Gouda K and Hassaan, M. A fast algorithm for subgraph search problem. Informatics and Systems (INFOS). 2012