Efficient Bits: 2013

Wednesday 18 December 2013

New SMILES behaviour - parsing (CDK 1.5.4)

It's been a while since the last CDK development release (over 4 months). The CDK 1.5.4 will be released in the next few days and has a lot of improved functionality. This is the first of several posts expanding on the functionality and detailing some design decisions.

One major overhaul has been the handling of the SMILES format. Mainly, tetrahedral and double-bond stereo-chemistry is read and written but they'll be more on that in the release note. In general the SMILES parsing API is the same as older versions but there is some behaviour changes which are worth highlighting to avoid possible "gotchas" with SMILES.

The previous parser behaviour was to create a structure, perform atom typing and add hydrogens using the CDK atom types. Bonds in aromatic SMILES were left as single bonds with aromatic flags. Aromaticity was optionally re-perceived. Leaving aromatic bonds with correct bond orders caused problems elsewhere in the toolkit.

Atom-typing / Implicit hydrogen counts

Atom-typing is no longer performed, this mirrors the behaviour of the other CDK readers and avoids atom-typing being invoked twice. As a significant overhead the atom-typer should be optional (although it is often required). If you read SMILES and use a procedure that requires the CDK atom-types you must now explicitly invoke the atom-type configuration.

Although atom-types are not automatically set, the implicit hydrogen count is. The hydrogen count follows the Daylight SMILES specification allowing the formula to be precisely determined. If the atom-typing / hydrogen adding is redone with the CDK you may lose information. As is shown below, removing the automatic atom-typing allows better round-tripping.

IChemObjectBuilder     builder = SilentChemObjectBuilder.getInstance();
SmilesParser           sp      = new SmilesParser(builder);
SmilesGenerator        sg      = new SmilesGenerator();
InChIGeneratorFactory  igf     = InChIGeneratorFactory.getInstance();
        
IAtomContainer m = sp.parseSmiles("[O]");
System.out.println(sg.create(m));                         // [O]
System.out.println(igf.getInChIGenerator(m).getInchi());  // InChI=1S/O
        
// configure atom types
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(m);
CDKHydrogenAdder.getInstance(builder).addImplicitHydrogens(m);

System.out.println(sg.create(m));                         // O ([OH2])
System.out.println(igf.getInChIGenerator(m).getInchi());  // InChI=1S/H2O/h1H2

Kekulé representation

Reading an aromatic SMILES will now provide a structure with defined bond orders. The kekulisation is efficient and correctly handles some tough cases. Example inputs and depictions are shown below. There will be a future post describing the algorithm in detail.

Input	Depiction
`c1(c)c(c)c(c)c(c)c(c)c1(c)`
`c1cccc1c1cccc1`
`oc1ccc(o)cc1`
`oc1ccocc1`
`oc(cc)co`

If a structure could not be kekulised it is considered invalid and a checked exception is thrown. The most common cause of errors is omitting a hydrogen present on an aromatic nitrogen (or phosphorus).

Input	Depiction
`n1cccc1`	Error - could not kekulise
`[nH]1cccc1`

Why are invalid SMILES not "fixed"?

The decision to consider such SMILES invalid may be controversial. The primary issue is there is no guaranteed way to unambiguously handle the input such the intended structure was obtained.

To kekulise the pyrrole example, n1cccc1, we would need to change the formula of the structure by adding a hydrogen ([nH]1cccc1) or assign a negative charge to the nitrogen ([n-]1cccc1). Both alter the SMILES string by the same number of edits, such decisions can lead to ambiguities.

In the case of pyrrole the decision seems obvious and we would arrive at the correct representation. Most would favour adding a hydrogen rather than modifying the charge (although positive charges are frequently assigned to 4 valent nitrogens in normalisation). In general the placement of the hydrogen is not always as trivial.

Fixarom is a program available in the Daylight Contrib that attempts to modify SMILES to allow parsing (kekulisation). In effect it tries every possible combination of hydrogen placements and is how most resolutions would work. Apart from the obvious inefficiency the structure we end up with will be different depending on the input order. This is because SMILES is tautomer specific, that is, different tautomers have different SMILES.

Input	Depiction
`CC(C)C1=CC=CC2=C1NC=N2` or `CC(C)c1cccc2c1[nH]cn2`
`CC(C)C1=CC=CC2=C1N=CN2` or `CC(C)c1cccc2c1nc[nH]2`

I'm a huge fan and regularly use the Open Babel ready-to-use programs, recently I found that the obabel utility will automatically add hydrogens to aromatic nitrogens. As seen below, the input only differs in the atom order. We generate two different canonical SMILES because the output structures are different tautomers.

obabel -:'CC(C)c1cccc2ncnc12' -ocan
CC(c1cccc2c1nc[nH]2)C 
1 molecule converted
obabel -:'CC(c1cccc2c1ncn2)C' -ocan
CC(c1cccc2c1[nH]cn2)C
1 molecule converted

It would be possible to assign a single representation by choosing the canonical tautomer. There is however no guarantee that the representation is the same as was used to generate the original SMILES.

Rather than the choice of hydrogen placement it may be that there really should be no hydrogen placed. CHEBI:32730 which currently has the SMILES NC(Cc1c[n]c2ccccc12)C(O)=O serves as a good example. Following our decision on pyrrole we would incorrectly add a hydrogen to the aromatic nitrogen when it was actually a radical. We could use the brackets as a trigger to indicate that there really should be no hydrogen but then we still can't assign bond orders. I believe this error in ChEBI is actually due to the incorrect use of the 'loose' instead of 'general' aromaticity option in Marvin. Again, we can obtain two different structures depending on our interpretation. In this case it is the less likely one which is correct.

Input	Depiction
`NC(Cc1c[n]c2ccccc12)C(O)=O` (interpreted as `NC(Cc1c[nH]c2ccccc12)C(O)=O`)
`NC(Cc1c[n]c2ccccc12)C(O)=O` (interpreted as `NC(Cc1c[N]c2ccccc12)C(O)=O`)

As you can see resolving missing hydrogens in aromatic SMILES is ambiguous. In practise providing the SMILES were generated correctly the number of invalid strucutres is small. There are fewer than 40 invalid inputs between ChEBI and ChEMBL and generally they involved organometallics incorrectly represented in SMILES with non-covalent bonds specified as covalent.

If you still really want to parse these structures then 'kekulise' can be set to false but this is not recommended.

IChemObjectBuilder builder   = SilentChemObjectBuilder.getInstance();
SmilesParser       sp        = new SmilesParser(builder);
IAtomContainer     container = sp.parseSmiles("n1cccc1");  // throws exception
sp.kekulise(false);                                        // turn off kekulisation
IAtomContainer     container = sp.parseSmiles("n1cccc1");  // creates an AtomContainer

Aromaticity re-perception

Previously the option setPreservingAromaticity could be set to return atoms and bonds with the aromaticity flags matching the input. By default the option was off and the CDKHueckelAromaticityDetector was automatically invoked. The new default behaviour is to preserve the aromatic specification in the input (even if the input is not aromatic). With the bond orders now set we can safely apply a new aromaticity model if desired or simply remove the aromatic flags. The functionality followed from the removal of automatic atom-typing. The option, setPreservingAromaticity, has been deprecated.

IChemObjectBuilder builder   = SilentChemObjectBuilder.getInstance();
SmilesParser       sp        = new SmilesParser(builder);

// cyclobuta-1,3-diene is not aromatic but has been specified as such
IAtomContainer container = sp.parseSmiles("c1ccc1");

// remove flags -> molecule is still valid thanks to kekulisation
for (IAtom a : container.atoms()) 
    a.setFlag(CDKConstants.ISAROMATIC, false);
for (IBond b : container.bonds()) 
    b.setFlag(CDKConstants.ISAROMATIC, false);

Ring closure bond type

A SMILES ring closure may optionally have an explicit bond type. As recommended by Open SMILES an explicit symbol should only be written on either the opening or the closure. Writing the symbol on both the ring opening and closure is ambiguous.

A violation of this is sometimes found with up/down directional labels in double-bond stereo-chemistry. The bond labels are relative and reverse sense on ring closures. Such specification obviously occurs in rings but we can mock an example to show that C/1.C/C=C\1 is correct C/1.C/C=C/1 is incorrect. The new SMILES parser will not accept the second input and highlights where the problem was.

InvalidSMILESException: could not parse 'C/1.C/C=C/1', Ring closure bonds
did not match. Ring was opened with '/' and closed with '\'. Note -
directional bonds ('/','\') are relative.
C/1.C/C=C/1
         ^

I decided to highlight this as during testing I encountered a couple of structures with this issue. CHEMBL1315255 currently has the SMILES shown below, the incorrect labels on the ring closures are highlighted.

Cl.Cl.Cc1cccc2\C(=C/3\\Nc4ccccc4/C/3=N\OCCN5CCNCC5)\C(=O)Nc12 1315255

Generally if a parser accepts this input it may be choosing either to use the ring opening or ring closure. A simple test to determine if the opening or closing is use is to parse C=1CCCC#1 and C#1CCCC=1. The choice of whether to prefer the open or close label differs between toolkits and I don't think there is a correct answer.

I would like to thank the valuable input and discussions from the CDK developers, EBI colleagues and CCN attendees. I have previously notified the ChEBI and ChEMBL curators of the problematic SMILES identified.

I'll leave you with a fun and seasonal example of valid SMILES input.

HoHoHo Merry Christmas (Daylight DEPICT)

Monday 11 November 2013

Improved Substructure Matching

Since it was written the UniversalIsomorphismTester (UIT) has been the goto class in the Chemistry Development Kit (CDK) to check structure identity, substructure and overlap matching. The utility works really well but derives the substructure from the more expensive MCS computation. There has been several efforts to bring faster structure matching through Ullmann [1] and the Vento-Foggia (VF) [2] algorithms. Most notably VF was used for structure matching in OrChem [3] and SMSD [4]. The original SMSD code was added to CDK but since then SMSD has been completely rewritten and optimised. I recently tried reintegrating the latest version but have had several failed attempts.

Substructure and identity checking is fundamental and I felt it needed a more centralised home in the code base. This weekend I wrote the first parts of the API and implementations which I will discuss here.

Lazy Matching

For brevity I won't go over how substructure matching works, Rich Apodaca has two excellent introductions on One of These Things is Not Like The Others and Substructure Search From Scratch which gives the key ideas. In the first post it is mentioned that recursion is at the heart of any implementation. Generally recursive implementations are one-or-all, that is you can get the first match or all matches.

As recursion is simply storing frames on the call stack for matching you can easily replace the recursion with a non-recursive stack which only stores each candidate matchings. Providing front end access to this allows a lazy substructure matcher which improves memory and has another benefit we'll see later.

So far I've added three matchers VentoFoggia (substructure/identity) and Ullmann (substructure) with scope for others [5].

API

The new structure matching paradigm is similar to Java regular expressions - a Pattern is created for a query and then that pattern can be tested on multiple targets. Whether the pattern is matching SMARTS queries, identity or substructure doesn't matter. Personally I've found my self parsing the query and target molecules in the wrong order with the existing matchers and using the pattern makes it impossible to get the order wrong.

Code 1 - Matching API predicate

IAtomContainer query   = ...;
Pattern        pattern = Ullmann.findSubstructure(query);
for (IAtomContainer target : targets) 
    if (pattern.matches(target))
        hits++;

With language features due in Java 8 the matching hits can even be done in one line.

List<IAtomContainer> hits = targets.filter(Ullmann.findSubstructure(query)::matches);

The first atom-mapping can be obtained with match() and all atom mappings with matchAll() (which is lazy).

Code 1 - Matching API, lazy loop over all matches

for (IAtomContainer target : targets) 
    for (int[] mapping : pattern.matchAll(target))
        hits++;

Counting Self Matches

To test the performance I first looked at counting self matches. You'd generally want to do this with a partition refinement procedure (see. Using the CDK's group module) but it serves as a good test for the matchers. To compare the new matchers I've used the CDK's UniversalIsomorphismTester (UIT) and the latest SMSD's Substructure (VF implementation). It should be noted that SMSD does go a lot faster with an API change (see. Benchmarking Substructure Search) but this has not yet been added.

ferrocene as 200 isomorphisms to it's self

The average (r=10) time taken to match and count isomorphisms (n) was tested on several symetric structures. All structures were normalised to only have sigma bonds. The SMSD implementation did not get the correct number of matches which I think is a bug propagated from the original ChemKit implementation. If any matcher took longer than 10 minutes I considered it as timed out (shown as n/a).

Structure	Matched	UIT	SMSD	Ullmann	VentoFoggia
Structure	n	t (s)	t (s)	t (s)	t (s)
ferrocene	200	0.281	0.166 (n=180?)	0.007	0.001
fullerene C60	120	279	7.5 (n=1)	1.2	0.017
fullerene C70	60	n/a	9.8 (n=1)	2	0.058
fullerene C320	120	n/a	n/a	n/a	0.527

VentoFoggia only took half a second to find all matches in fullerene C320. I was curious to see how far it could go. The new matchers allow disconnected structures and matching multiple cyclopropanes we quickly reach a huge number of matches.

C1CC1.C1CC1.C1CC1.C1CC1.C1CC1.C1CC1 has more than 33 million isomorphisms

Structure	Matched	Ullmann	VentoFoggia
(SMILES)	n	t (s)	t (s)
`C1CC1`	6	0	0
`C1CC1.C1CC1`	72	0.002	0.002
`C1CC1.C1CC1.C1CC1`	1,296	0.002	0.005
`C1CC1.C1CC1.C1CC1.C1CC1`	31,104	0.168	0.037
`C1CC1.C1CC1.C1CC1.C1CC1.C1CC1`	933,120	6.5	1.1
`C1CC1.C1CC1.C1CC1.C1CC1.C1CC1.C1CC1`	33,592,320	270.5	36.3

Dataset Searching

As a more general gauge of performance the NCI Aug 2000 (~250,000 structures) was searched for structures containing O=C1C=CCC=C1. It's a single isolated test so conclusions can't really be drawn but it provides a sense for the timings.

Query structure matched against ~250,000 structures in the NCI Aug 2000 data set.

The SMSD uses heuristics internally whilst the others were run without heuristics on every structure (even if the query was larger than the target). How long the matching took for each implementation is shown below. When finding hits we can find the first match in each structure (3,623 total) or find all the hits (12,706 total). The SMSD matcher is fast when finding the first matching in each structure but much slower to find all matches.

Implementation	t first	t all
UIT	43.4 s	43.6 s
SMSD	28.6 s	4 m 40 s
Ullmann	6.3 s	6.5 s
VentoFoggia	5.1 s	4.9 s

Stereochemistry Matching

The new matchers also check the tetrahedral and geometric (double-bond) stereo configurations are consistent. Technically it's more efficient to check stereo configurations as the vertices are mapped but I decided to go for a post-matching check. This is more modular and also means we can use the same stereo consistency checking for other matchers (the existing UIT). This is where the lazy matching is really useful as we can get the first match which is stereochemically valid without finding all matchings.The stereo checking also seems to be fast enough and I've left it on by default (including all times in this post).

Using the VentoFoggia implementation, it takes ~38 seconds to find all 4,306,056 matches for C(C)(C)(C)(C) in ChEMBL 17. Matching one with a stereo configuration ([C@](C)(C)(C)(C)) is actually just as quick, taking ~37 seconds to find 1,007,652 matches.

Finally I'll leave you with a sneak peak at the Smiles class I have planned which makes loading and matching molecules from SMILES concise and elegant.

Pattern pattern = VentoFoggia.findSubstructure(Smiles.fromString("[C@](C)(C)(C)(C)"));
for (IAtomContainer target : Smiles.fromPath("/db/chembl/chembl_17.smi.gz")) {
    hits += Iterables.size(pattern.matchAll(target)); // Guava utils
}

Code

The matcher code is currently awaiting review on the branch feature/isomorphisms. If the branch is missing then it will be in the package org.openscience.cdk.isomorphism.

References

1. Ullmann JR. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery. 2004
2. Cordella LP et al. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. 2004
3. Rijnbeek M and Steinbeck C. OrChem - An open source chemistry search engine for Oracle. Journal of Cheminformatics. 2009
4. Rahman SA et al. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics. 2009
5. Gouda K and Hassaan, M. A fast algorithm for subgraph search problem. Informatics and Systems (INFOS). 2012

Friday 25 October 2013

All The Small Things

The SMILES Arbitrary Target Specification (SMARTS) is a succinct way of expressing and matching structural attributes of molecules. SMARTS is ubiquitous within cheminformatics and involves first interpreting an expression into a query and matching that query against one or more target molecules. When performing a SMARTS match one may presume a bottleneck is the required subgraph isomorphism. In general this is true but whilst playing around with the Chemistry Development Kit (CDK) yesterday I discovered how some small changes could make a noticeable difference.

Benchmarking is difficult, benchmarking between toolkits, even more so. It's often hard to isolate the functionality and remove parts we don't really want to measure. Utilising a use case only provides an approximate measure for comparison but is easy to relate to.

Use case

A use case I've wanted to test for a while was the "time taken to find O=[C,N]aa[N,O;!H0] hits in 250,251 SMILES of the NCI August 2000 data". This is taken from a presentation by NextMove Software on Efficient matching of multiple chemical subgraphs (see slide 6). Edit: I now realise that is actually an example from the Daylight Chemistry Cartridge - Intra-molecular hydrogen bonds.

The CDK, did not finish and I was generally curious to see how long it would take. I think the DNF was actually due to some exceptions being thrown but I'll make those clear in the results. Running on different hardware means we can't draw concrete comparison between the times but they should give a ball park figure. One should remember: "Measurement is not prophesy" - The Computer Language Benchmarks Game.

The NCI August 2000 data (available: cactus.nci.nih.gov) has several desirable properties.

The data is publicly available
The SMILES are all valid (actually quite rare)
The number of molecules isn't too large
The SMILES are a kekulé representation and aromaticity must be perceived

The results where measured using the following program. The times shown are from t0 and t1 includes IO and are from a cold start (no warmup through repeats). The VM startup time was not included but is generally minimal.

Code 1 - SMARTS matching a file

static final IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();

public static void main(String[] args) throws IOException {

    String path = "/databases/nci/nci_aug00.smi.gz";

    IteratingSMILESReader smis  = new IteratingSMILESReader(gunzip(path), bldr);
    SMARTSQueryTool       query = new SMARTSQueryTool("O=[C,N]aa[N,O;!H0]", bldr);  
    int nHits = 0, nFail = 0;
        
    long t0 = System.nanoTime();
    while (smis.hasNext()) {
        try {
            if (query.matches(smis.next()))
                nHits++;
        } catch (Exception e) {
            nFail++;
        }
    }
    long t1 = System.nanoTime();

    System.out.println(nHits + " hits (" + nFail + " in error) in "
                             + toSeconds(t1 - t0));
        
    smis.close();
}
  
static String toSeconds(long t) {
    long s  = TimeUnit.NANOSECONDS.toSeconds(t);
    long ms = TimeUnit.NANOSECONDS.toMillis(t - TimeUnit.SECONDS.toNanos(s));
    return s + " s " + ms + " ms";
}

static Reader gunzip(String path) throws IOException {
    // ensure GZIP buffer lines up with BufferedReader (SMILES Iterator)
    return new InputStreamReader(new GZIPInputStream(new FileInputStream(path), 8192));
}

Results

The above program was run on several versions of CDK. The 1.4.x branches are the stable releases whilst the 1.5.x are the unstable developer releases. The 1.5.4.git is the current (unreleased) master. The 1.5.4.git+ and 1.5.4.git++ include some additional tweaks.

Version	Matched	Errors	Time
1.4.15	11,145	64	819 s 860 ms
1.5.3	11,145	37	279 s 725 ms
1.5.4.git	11,146	37	225 s 639 ms
1.5.4.git+	10,136	37	160 s 713 ms
1.5.4.git++	10,136	0	16 s 18 ms

1.4.15 to 1.5.3

The difference seen between 1.4.15 and 1.5.3 is primarily due to the AllRingsFinder. This is also the cause of the errors. The 1.5.3 version includes the optimisations discussed previously (AllRingsFinder, Sport Edition) and accounts for the majority of the improvement. The AllRingsFinder was being used in SMARTS matches for setting the ring invariants, although this isn't needed at all it only takes ~5 seconds and isn't a huge factor when it's removed later.

1.5.3 to 1.5.4.git

The 1.5.4.git version includes several changes such as a new SMILES parser and faster atom typing using the new RingSearch (Scaling Up - Faster Ring Detection in CDK). The new SMILES parser behaviour is a little different, kekulisation is now automatic and hydrogen counts are set on all atoms. Atom typing and aromaticity perception is not done automatically by the parser but is still done by the SMARTSQueryTool and so removal from the parser has little impact on performance. The number of matched molecules changed between 1.5.3 and 1.5.4.git. This was due to a bug in the aromaticity perception. Allowing removal of items from a collection whilst iterating can result in some being skipped over (fixed: f03170a).

Erroneous aromaticity fixed in 1.5.4.git means one structure is no longer matched and two previously unmatched are now found. Matched atoms are highlighted.

1.5.4.git to 1.5.4.git+

The 1.5.4.git+ is faster than 1.5.4.git as no atom typing is required and faster ring perception is used with a new aromaticity utility. Providing implicit hydrogen counts are set atom typing is only needed if the CDK aromaticity model or orbital hybridisation queries (an extension to SMARTS) is used. The new Aromaticity implementation allows us to define different models of electron contribution and choose which ring sets we apply that model to. Using a new model which is more similar to Daylight's requires only bond orders and hydrogen counts and not the atom types. There are differences in perceived aromaticity and this results in a different number of matches. From the differences tested the 1.5.4.git+ matches agree with DEPICTMATCH. I've included all the matched SMILES at the end of the post.

Some of the different matches (highlighted) due to aromaticity models

1.5.4.git+ to 1.5.4.git++

Perhaps most surprisingly was the small change between 1.5.4.git+ and 1.5.4.git++. It turns out that the majority of 160 seconds was spent in initializeMolecule() rather than the actual subgraph matching. The initialisation computes several invariants on the target molecule required for some query atoms to match. The invariants are the values such as; ring connectivity, ring number, degree, total hydrogen count and valence. These can be provided lazily or precomputed. Lazy properties would require each atom knowing the molecule to which it belongs. This is not how the CDK API is structured and so these are precomputed before a SMARTS target is matched.

As mentioned earlier AllRingsFinder is not needed and was removed. You may want it as an extension and although it's now fast enough it certainly shouldn't be a default option. The SSSRFinder was replaced with a faster implementation and is now only computed if there are ring queries. Finally the other invariants are incrementally built up using a data structures optimised for access. Egon and I were actually looking at using this technique to also speed up atom typing. There are no fancy tricks and for brevity I'll omit a full description but it should be included in the code base soon enough. The 16 seconds shown earlier is the current time taken using the new default options on my patch.

Could it go even faster?

There is currently some overhead going between different data structures. The SMILES are loaded with Beam and then converted to the CDK IAtomContainers. CDK takes 3,327 ms to load the NCI molecules whilst Beam takes only 1,500 ms. To read an apply the aromaticity model in CDK takes 9,297 ms whilst in Beam it takes 5,492 ms.

It takes about a 1 second to convert the 250,000 CDK IAtomContainers to a graph representation optimised for quick traversal and this has to be done 3-4 times. Once for computing the invariants, aromaticity and isomorphism test. The forth time depends on whether ring information is required - the ring info isn't persisted on the IAtomContainer despite having already been done for the aromaticity. Improvements here would require a large restructuring of the API and domain objects.

Finally you may have noticed I didn't touch the UniversalIsomorphismTester. It's is actually pretty fast but the subgraph isomorphism is a by-product of an MCS computation. Using an optimised Ullmann/VF2 implementation could see further improvements. I might look in to patching the matcher from Ambit-SMARTS for further gains. I did try the subgraph implementation from the latest SMSD but for this particularly use case it was slower than the existing UniversalIsomorphismTester.

Matches

Different SMILES matched between the versions

Edit:

To obtain the input, download the 2D SDF file and extract the SMILES field.

~: grep -A1 SMILES NCI_aug00_2D.sdz | grep -ve 'SMILES\|^--$' > nci_aug00.smi

Thursday 19 September 2013

SMILES Implicit Valence of Aromatic Atoms

The last couple of months I've been reworking the Chemistry Development Kit (CDK) handling of Simplified molecular-input line-entry system (SMILES). In this process there has been a couple of discoveries which may be interesting to share. Something previously not provided by the CDK was correct handling of implicit valence.

Both the Daylight Theory Manual and OpenSMILES Specification explain the concept for the organic subset. The specifications provide a more detailed explanation but as a simple example with the SMILES '[CH2]=[CH][CH3]' we can omit hydrogens counts and just write 'C=CC'. However, with '[CH]=[CH][CH3]' we must include the count on the first atom '[CH]=CC'. We include the labelled count if the implied count does not match.

The OpenSMILES specification explains how we can determine the implied hydrogen count.

"The implicit hydrogen count is determined by summing the bond orders of the bonds connected to the atom. If that sum is equal to a known valence for the element or is greater than any known valence then the implicit hydrogen count is 0. Otherwise the implicit hydrogen count is the difference between that sum and the next highest known valence." - OpenSMILES

This approach is simple and elegant for the aliphatic subset with non-aromatic bonds but what about the aromatic subset? One could speculate we have to assign explicit bond orders (kekulizé) a molecule to obtain the correct counts. Indeed the OpenSMILES specification seems to suggest that.

"...the SMILES software must verify that electrons can be assigned without violating the valence rules..."

Personally, I do not believe this to be the case and having the correct hydrogen count provides an additional validation if we do assign a kekulé structure.

Perhaps how we get the count for the aromatic subset is obvious and doesn't need including in the specification. However as I will show there are a couple approaches used by the different (open source) toolkits.

Disclaimer: I may be completely wrong but given the lack of documented explanation on the issue I believe it's important to at least collect some ideas on how it is currently done and how it could be done.

Parsing in Toolkits

It appears that many toolkits use their own internal valence model rather than one specific to SMILES. It is difficult to know all the intricacies of each toolkit but here's what I see from a high level view. Any maintainers of the respective toolkits please let me know of any mistakes and I'll add a correction.

The CDK uses it's own valence model via a decision tree based atom typer. The atom typer assigns a formal neighbour count which is then used to infer the number of hydrogens. This piece of code is from the end of the current SmilesParser.java:258-276. The decision tree in CDK is a large if else cascade checking possibly exceptions.

Code 1 - CDK atom typing

// CDK org/openscience/cdk/smiles/SmilesParser.java lines 258-276.
 
IAtomType type = matcher.findMatchingAtomType(molecule, atom);
boolean isAromatic = atom.getFlag(CDKConstants.ISAROMATIC);
AtomTypeManipulator.configure(atom, type);
atom.setFlag(CDKConstants.ISAROMATIC, isAromatic);

Open Babel also appears to use it's own valence model. There are actually two parsers - one of them uses the OBAtomTyper to assign implicit valence (smileyformat.cpp:287-288). The OBAtomTyper uses electron counting and SMARTS patterns. I'll come back to how aromatic bonds are treated later.

Code 2 - OpenBabel atom typing

// Open Babel src/formats/smileyformat.cpp lines 278-288
 
OBAtomTyper typer;
typer.AssignImplicitValence(*pmol);

The other parser assigns bond orders to aromatic bonds after parsing. The code snippet below is at the end of the parser (smilesformat.cpp:516-520). With set bond orders the counting is the same as with the aliphatic subset.

Code 3 - OpenBabel atom typing additionals

// Open Babel src/formats/smilesformat.cpp lines 516-520
 
    //set aromatic bond orders
    mol.SetAromaticPerceived();
    FindAromaticBonds(mol);
    FindOrphanAromaticAtoms(mol);// CM 18 Sept 2003
    mol.AssignSpinMultiplicity();
    mol.UnsetAromaticPerceived();

RDKit again uses it's own internal valence model. This model is invoked when parsing SMILES if sanitise is set (SmilesParse.cpp:184-198). Interestingly the RDKit internal model uses a similar approach to Open Babel and my first implementation. Frowns also uses the same approach and we will come back to this later.

Code 4 - RDKit Sanitize

// RDKit Code/GraphMol/SmilesParse/SmilesParse.cpp lines 184-198
 
    if(sanitize && res){
      // we're going to remove explicit Hs from the graph,
      // this triggers a sanitization, so we do not need to
      // worry about doing one here:
      try {
        ROMol *tmp = MolOps::removeHs(*res,false,false);
        delete res;
        res = static_cast<RWMol *>(tmp);
        // figure out stereochemistry:
        MolOps::assignStereochemistry(*res,true);
      } catch (...) {
        delete res;
        throw;
      }
    }

The recent parser written in Scala, chemf, uses a concise decision tree of what aromatic combinations "make sense" (SmilesAtom.scala:38).

Code 5 - Chemf

/* Snippet from chemf chemf.parser.SmilesAtom.scala lines 45 - 62*/
 
    bs count (Aromatic ==) match {
      case 1 ⇒ default (2 + (bs foldMap (_.valence)))
      case 0 ⇒ default (bs foldMap (_.valence))
      case _ ⇒ bs.toList sortBy (_.valence) match {
        case Aromatic::Aromatic::Aromatic::Nil ⇒
          if (e == C || e == N || e == P || e == B) 0.success else fail
        case Aromatic::Aromatic::Single::Nil   ⇒ 
          if (e == C || e == N || e == P || e == B) 0.success else fail
        case Aromatic::Aromatic::Double::Nil   ⇒ 
          if (e == C || e == S) 0.success else fail
        case Aromatic::Aromatic::Nil           ⇒
          if (e == C || e == B) 1.success
          else if (e == N || e == P || e == O || e == S) 0.success
          else fail
        //other combos with 2 or more aromatic bonds don't make sense
        case _ ⇒ fail
      }
    }

OUCH also uses it's own valence model. The primary function used is fillValenceAtIndex. Here is the section which computes the implicit hydrogen count - some comments have been added to explain the variables. Essentially the neighbours are counted and the method is recursively invoked adding hydrogens (and lone pairs) until a default valence is reached.

Code 6 - OUCH

-- snippet from Ouch/Structure/Molecule.hs lines 380-391
 
                     -- n neighbors is larger than the first valence + second
                     -- valence => identity (do nothing)
        outputH    | nb >= ((fst val) + (abs(snd val))) = m     
 
                     -- There is one or more bonds to a radical and the atom 
                     -- is aromatic => recurse with a molecule that has electrons added (mEl)
                   | nbrB && Set.member AromaticAtom (atomMarkerSet a) = fillValenceAtIndex mEL i
 
                     -- Fill empty valences with hydrogen => recurse with a molecule that has 
                     -- an extra hydrogen on this atom index (mH)
                   | nba < fst val                      = fillValenceAtIndex mH i
 
                     -- Then, fill lone-pairs if needed => recruse with a molecule that has an
                     -- extra lone pair on this atom index (mLp)
                   | nb < ((fst val) + (abs(snd val)))  = fillValenceAtIndex mLP i
 
                     -- Pattern completion (default) => identity
                   | otherwise = m

In the above function the bond order sum is used. Interestingly aromatic bonds contribute '2' to the sum (Atom.hs:136-148).

Code 7 - OUCH snippet 2

-- Snippet from Ouch Ouch/Structure/Atom.hs lines 136-148 
 
occupiedValence :: Atom -> Integer
occupiedValence a = output
  where output = Set.fold (\b acc -> acc + bondOrder b) 0 (atomBondSet a)
        bondOrder b = case b of
          Sigma    {}   -> 1
          Pi       {}   -> 2
          PiPi     {}   -> 3
          Aromatic {} -> 2
          Delta    {} -> 4
          Hbond    {} -> 0
          Ionic    {} -> 0
          Antibond {} -> 1
          Any      {} -> 0

PerlMol has a valence model specific to SMILES. The parser doesn't have the aromatic bond token ':' and uses a bond order of '1' for an implicit bond. The hydrogen count is computed as per the aliphatic subset - if the atom is an aromatic carbon or nitrogen the hydrogen count is decreased.

Code 8 - PerlMol

# Snippet from PerlMol Chemistry/File/SMILES.pm lines 412-423
 
# returns the number of hydrogens for an atom, assuming it has
# no charge or radical (because those require an explicit H-count anyway)
sub calc_implicit_hydrogens {
    my ($self, $atom) = @_;
    no warnings 'uninitialized';
    my $h_count = $ORGANIC_ELEMS{$atom->symbol} - $atom->valence;
    if ($atom->attr("smiles/aromatic") and $atom->symbol =~ /^[CN]$/) {
        $h_count--;
    }
    $h_count = 0 if $h_count < 0;
    $h_count;
}

It's clear there are different approaches and it's tempting to speculate there are still some more differences in closed source toolkits.

Generation

Of course the implicit hydrogen count of the aromatic subset also maters when generating SMILES. One well known example is of aromatic nitrogens which contribute a lone pair - for example pyrrole.

1H-pyrrole

The CDK, Open Babel and RDKit handle this as a special exception when generating the SMILES string. The CDK code refers to this as a special nitrogen (SmilesGenerator.java:1652-1657).

Code 9 - CDK's special nitrogen

  // we put in a special check for N.planar3 cases such
            // as for indole and pyrrole, which require an explicit
            // H on the nitrogen. However this only makes sense when
            // the connectivity is not 3 - so for a case such as n1ncn(c1)CC
            // the PLANAR3 N already has 3 bonds, so don't add a H for this case
    boolean isSpecialNitrogen =
    a.getSymbol().equals("N") &&
    a.getHybridization() == IAtomType.Hybridization.PLANAR3 &&
    container.getConnectedAtomsList(a).size() != 3 &&
    (a.getFormalCharge() == null || a.getFormalCharge() == 0);
   brackets = brackets | isSpecialNitrogen;

Similar to CDK, Open Babel has the method CorrectAromaticAmineCharge (smilesformat.cpp:2569-2592)

Code 10 - OpenBabel CorrectAromaticAmineCharge

void OBMol2Cansmi::CorrectAromaticAmineCharge(OBMol &mol)
  {
    OBAtom *atom;
    vector::iterator i;
 
    _aromNH.clear();
    _aromNH.resize(mol.NumAtoms()+1);
 
    for (atom = mol.BeginAtom(i);atom;atom = mol.NextAtom(i))
      if (atom->IsNitrogen() && atom->IsAromatic())
        if (atom->GetHvyValence() == 2)
          {
            if (atom->GetValence() == 3 || atom->GetImplicitValence() == 3)
              _aromNH[atom->GetIdx()] = true;
          }
  }

These hardcoded exceptions work, unless you miss an exception...

1H-phosphole

Neither CDK or the default Open Babel parser will include the required hydrogen. Note - obabel v 2.3.1 (latest on Mac), this might have been fixed in v 2.3.2..

Code 11 - OpenBabel aromatic phosphorus

[intrepid ~]: obabel -:"[nH]1cccc1" -osmi
[nH]1cccc1  
1 molecule converted
[intrepid ~]: obabel -:"[pH]1cccc1" -osmi
P1CCCC1 
1 molecule converted

Adding another conditional (as in RDKit) for aromatic phosphate solves this problem (SmilesWirte.cpp:75-69).

Code 12 - RDKit non-standard handling

// another type of "nonstandard" valence is an aromatic N or P with
// explicit Hs indicated:
if((num==7||num==15) && atom->getIsAromatic() && atom->getNumExplicitHs()){
    nonStandard=true;
}

Although this will handle most (possibly all) cases if there is a way to assign implicit hydrogen counts no special treatment is needed. As with the aliphatic subset if the implicit hydrogen count doesn't match what we have stored we must include the hydrogen counts.

Fractional Bond Orders

As mentioned earlier, my first approach was the same as OpenBabel, RDKit's and Frowns (something I didn't realise until collecting the information for this post). The approach seems obvious - aromatic bonds have bond order '1.5'. This is of course not true but we will see how this model fits examples.

c1ccccc1

Benzene, we find that the bond order sum of each carbon is 3 (1.5 + 1.5). The difference between carbon's next highest valence (4, as per the specification) is 1. So each carbon has 1 implied hydrogen

c1ccc2ccccc2c1

Naphthalene, has two carbons with three aromatic bonds and each with a bond order sum of 4.5 (1.5 + 1.5 + 1.5). This is higher than any specified valence and thus there are no hydrogens.

c1ccc2ocnc2c1

1,3-benzoxazole, has an aromatic nitrogen and oxygen. The nitrogen as two aromatic bonds and thus a bond order sum of 3 (1.5 + 1.5). This is a default valence for nitrogen and so there is no implicit hydrogen. The oxygen also has the bond order sum of 3 which exceeds it's maximum specified valence (2) and again there are no hydrogens.

c1n([H])ccc1

1H-pyrrole, with the labelled hydrogen (c1[nH]ccc1) will work fine - the brackets mean we don't have to look up the implicit valence. However, what happens if we make that an explicit hydrogen? If we apply our rule we hit a problem. The bond order sum of that nitrogen is now 4 (1.5 + 1.5 + 1). The next highest specified valence is 5 and so we have and additional implicit hydrogen (two hydrogens total)?

c1cn2ccccc2n1

imidazo[1,2-α]pyridine, one nitrogen has 3 aromatic bonds and thus using this model we determine it has a bond order sum of 4.5 (1.5 + 1.5 + 1.5). Depending on whether we choose to round up or down that nitrogen will have either a single hydrogen or none.

To fix the problematic examples we need an additional rule. Only check the default valence for aromatic atoms. A justification for this could be that elements which have more than one valence cannot go to a higher valence state as the electrons which would bond those hydrogens are contributed to the pi bonding system. We can see this is also the approach used by RDKit - although RDKit doesn't allow 5 valent nitrogen so this particular example won't occur. Also note that the partial bond order sum is rounded up.

Code 12 - RDKit valence calculation

  /* RDKit/Code/GraphMol/Atom.cpp lines 193-232*/

  // check accum is greater than the default valence
  unsigned int dv = PeriodicTable::getTable()->getDefaultValence(d_atomicNum);
  int chr = getFormalCharge();
  if(isEarlyAtom(d_atomicNum)) chr*=-1;  // <- the usual correction for early atoms
  if (accum > (dv + chr) && this->getIsAromatic()){
    // this needs some explanation : if the atom is aromatic and
    // accum > (dv + chr) we assume that no hydrogen can be added
    // to this atom.  We set x = (v + chr) such that x is the
    // closest possible integer to "accum" but less than
    // "accum".
    //
    // "v" here is one of the allowed valences. For example:
    //    sulfur here : O=c1ccs(=O)cc1
    //    nitrogen here : c1cccn1C
    
    int pval = dv + chr;
    const INT_VECT &valens = PeriodicTable::getTable()->getValenceList(d_atomicNum);
    for (INT_VECT_CI vi = valens.begin(); vi != valens.end() && *vi!=-1; ++vi) {
      int val = (*vi) + chr;
      if (val > accum) {
        break;
      } else {
        pval = val;
      }
    }
    accum = pval;
  }
  // despite promising to not to blame it on him - this a trick Greg
  // came up with: if we have a bond order sum of x.5 (i.e. 1.5, 2.5
  // etc) we would like it to round to the higher integer value -- 
  // 2.5 to 3 instead of 2 -- so we will add 0.1 to accum.
  // this plays a role in the number of hydrogen that are implicitly
  // added. This will only happen when the accum is a non-integer
  // value and less than the default valence (otherwise the above if
  // statement should have caught it). An example of where this can
  // happen is the following smiles:
  //    C1ccccC1
  // Daylight accepts this smiles and we should be able to Kekulize
  // correctly.
  accum += 0.1;

A similar approach is used in Frowns library. The snippet below shows the use of a 1.5 bond order from Atom.py:82-92.

Code 14 - FROWNS sumBondOrders

     def sumBondOrders(self):
         result = 0
         for x in self.bonds:
             if x.bondtype == 4:
                 # XXX FIX ME
                 # this is a hack to fix bad conjugation
                 # this will be fixed soon
                 result += 1.5
             else:
                 result += x.bondorder
         return result

Open Babel also uses a bond order of '1.5'. Although in their case they actually count electrons and then divided by 2 (atom.cpp:1000-1015). This avoids fractional bond orders - at least until the division where the fractional part is truncated (5/2 = 2).

Code 15 - OpenBabel valence

// Open Babel src/atom.cpp lines 1000-1015
 
for (bond = ((OBAtom*)this)->BeginBond(i);bond;bond = ((OBAtom*)this)->NextBond(i))
{
    bo = bond->GetBO();
    bosum += (bo < 5) ? 2*bo : 3;
}
 
bosum /= 2;
return(bosum);

Using this model will work very well and give use correct results. There are however still some questions, in particular the ambiguity and having to round bond order sums is troublesome.

Also we need to know which bonds are aromatic - 'cc' is an aromatic bond and 'CC' is a single bond if we have these as an implicit bond as suggested by SMILES we need to check the atoms in each case to determine if the order is '1' or '1.5'. Additionally I found a couple of examples in ChEMBL where up ('/') and down ('\') bonds are part of an aromatic system (contrived example below). Again we need to check the atoms of those bonds to see if they are aromatic or single. These problems only occur if we have bond labels specifically for handling SMILES but it would be nice to have something more elegant.

C\C=c1/ccc(=C)cn1

Atom-based

After using the fractional bond order and testing on some datasets I was relatively happy it solved the problem. Using a bond order of 1.5 will certainly work - pragmatic, yes - elegant, no.

Despite this I couldn't shake the feeling that "Dijkstra would not have liked this". I think this is also evident from the comments in the RDKit and Frowns source code (see above).

What I now believe is the correct way, or at least closer to the Daylight way is that instead of modifying the bond orders we modify the starting value of our bond order sum. With aromatic atoms, instead of starting at 0 we start at 1.

The point of the 1.5 bond order is to increase the bond order sum to be higher for aromatic atoms. We can achieve the same thing by simply changing the starting value of the sum. This also fits in that aromaticity in SMILES is primarily defined by atoms and not bonds. There is the ':' aromatic bond but it is not required (and rarely/never used).

In practice this means implicit, single, aromatic, up and down bonds now all contribute 1 to the bond order sum. Of course this is completely bogus but we're not saying that aromatic bond is the same as a single bond - rather just that it contributes a bond order of '1'. This also means that the implicit bond between aliphatic and aromatic atoms 'cc', 'CC' and 'cC' would all contribute the same to the bond order and we can treat them the same.

This is particularly useful for cases like '[Te]1cccc1' (ChemSpider:119908). Using a bond order of 1.5 we would determine the bond order sum of each carbon adjacent to the '[Te]' to be '2.5' or '2' (5/2). Rounding up or down we get '1' (correct) or '2' hydrogens.

It would seem that PerlMol was nearly there - the implicit bond order in PerlMol is '1' and the hydrogen count was decreased for aromatic atoms. This was only done for carbon and nitrogen though and should be applied to all aromatic atoms.

Using this method cleans up the calculation and removes the fractional results. As before we must still only use the default (lowest) valence for multivalent elements. Perhaps there is a more elegant way to avoid that that but it has so far eluded me.

Is it Correct?

I can not say for certain whether this model is correct but it is certainly closer than using a fractional bond order.

The well known Depict service from Daylight displays the structure diagram for a given SMILES string. It will also report problems with SMILES. For example it reports that [C]C has unusual valence 1 (normal 2).

Now what is really interesting is if we give it some aromatic atoms. It turns out that a single aromatic atom [c] also has unusual valence 1 (normal 2). But wait - there aren't any bonds to that atom but it has valence of 1?

Aromatic carbon has no neighbours but has a valence of 1?

Let's add one neighbour [c]c - no warning (2 valent carbon is considered okay). Okay let's try something else, c[c]cc has low valence 3 (normal 4). Again there are only two neighbours but it has a valence of 3. The same is also true for nitrogen, phosphate and the atoms in the aromatic subset.

This also explains why c is '[CH3]'.

Using this approach also allows us to get the correct count for some strange examples. One example could be an acyclic aromatic atom (rejected by some toolkits). This can happen for example with an exocyclic oxygen (oc1ccocc1). Using this approach we reach the correct answer that there are no hydrogens on that sprouting oxygen. This can then be used in the assignment of a correct Kekulé structure. One can see how this SMILES string is interpreted 4 different ways by various toolkits - oc1ccocc1 (Ambit).

With this model it also means that when we generate the SMILES we can now check how many implicit hydrogens aromatic atoms should have. Doing so we would say that for pyrrole the implied hydrogen count would be '0'. This does not match our stored value and so we need to include the bracket atom with the labelled hydrogen count.

Complications

There are of course still some complications - which seem to be when we mix aromatic and aliphatic parts. The fact that the first case here is read as cyclohexane further shows there isn't an aromatic bond order used.

1. C1:C:C:C:C:C1 (daylight reads as cyclohexane)
2. C=1:C=C:C=C:C1
3. c=1c=cc=cc1 (problematic)
4. c-1c-cc-cc1

With the described model we can't reach the correct answer for (3.). You're unlikely to encounter such a case in a clean dataset but we can modify our model slightly to get the correct results.

Instead of starting at '1' we add '1' to the bond order sum of aromatic atoms if the bond order sum is the same as the number of explicit neighbours. Indeed this agrees with Depict that 'c[c]cc' has a valance of 3 but '[c]=c' has a valence of 2 (if we started at '1' it would also be '3').

Hopefully this has provided some insight or at least show how the implicit hydrogens can be counted in aromatic SMILES.

Supplementary

Re-reading back through the mailing list it seems this has of course be discussed (at least partially) - no decision was made. As usual the sourceforge mailing archives are a pain to find stuff in but the discussion [1] is summarised on the OpenSMILES repository [2].

As I see it there are three models which can be used.

Use bond order of 1.5 for aromatic bonds.
Reduce the hydrogen count for aromatic atoms - also attempts to allow this to specify radicals (C1CCcCC1). This has many problems but that would need another post.
Increase the valence by 1 if there are no double/triple bonds.

Mailing List Links

http://sourceforge.net/mailarchive/forum.php?thread_name=60825b0f0709302037g2d68f2eamdb5ebecf3baea6d1%40mail.gmail.com&forum_name=blueobelisk-smiles
https://github.com/timvdm/OpenSMILES/blob/master/discussion_summary.txt

Other links

OpenSMILES and aromaticity - Andrew Dalke
Smiles and Aromaticity: Broken? - Rich Apodaca

Thursday 27 June 2013

Implementing Hanser's Path Graph Reduction - All Simple Cycles

In the previous post (AllRingsFinder, Sport Edition) I presented runtime results for an algorithm to find all simple cycles. This post explains in detail that implementation - there's a fair bit of code but I tried to keep it short. Hanser's algorithm is remarkably elegant and was a joy to program.

The Algorithm

The set of all cycles can be found in a number of ways. We could trace the cycles by walking every simple path (very inefficient) or perhaps combine all cycles from a cycle basis (requires basis computation first). Hanser et al, 96 describe an alternative approach - it allows us to find all the simple cycles directly.

The process is based on progressive reduction (collapsing) of a path graph. Opposed to edges (bonds) which consist of just the two endpoints, the path graph has edges which describes a walk between the endpoints. To begin all edges are simply their end points, then by removing each vertex the edges are concatenated in to path edges describing a walk. When an edge forms a loop (endpoints are the same vertex) a cycle has been found.

The animation below demonstrates the process of finding the three cycles in a naphthalene like graph. The vertex to be removed is first highlighted - and then the edges that will be combined. It's a little hard to follow when f is reached but there are three edges and three possible ways to combining two edges - from {1,2,3} we can have {1,2}, {2,3} and {1,2}. At the end the three cycles have been found as the paths on the loops.

Removing vertices in a path graph

Implementation

The pseudo code below describes the removal of a vertex x, from the set of vertices V and edges E. First, find all edges which contain x, then for every possible combination (E²), check that the only intersect is x. This avoids creating non-simple paths. If the two edges only intersect at x, then concatenate them to a new reduced edge. Add the new edge to the edges set, E. Once all new edges have been created remove, x and it's old edges from E and V.

REMOVE (x, V, E)
for each couple of paths (P_xy, P_xz) ∈ E²
if P_xy ∩ P_xz = {x} then
P_yz ← P_xy ⊕ P_xz
E ← E ∪ P_yz
for each path P_xy ∈ E
E ← E - {P_xy}
V ← V - {x}

The exact implementation is not discussed in the paper but notes the requirement of dynamic lists (List) of the edges and binary sets (BitSet) for the labels on the edges.

As expected the AllRingsFinder.remove() uses dynamic lists to store the vertices (atoms). The intersection does not use binary sets and instead takes quadratic time, O(n²) - Path.getIntersectionSize(). An alternate implementation (HanserRingFinder) reduces the number of operations for each removal PathGraph.remove(). The intersection check (PathEdge.isRealPath()), was also improved but is still quadratic.

Observations

So, can we do better and how?

We are actually going to do the opposite from the previous implementations, we will use the binary sets and not use a dynamic lists.

One key realisation is that the path graph is no different from any other graph. We can use different representations of the path graph depending on what operations we need to perform (see. The Right Representation for the Job). As we only reduce edges incident to x we can use an adjacency/incident list to store the path graph. A summary of these and other changes is shown below:

Use binary sets - intersection check is now constant time, O(1).
Use an incidence list data structure and index edges by their endpoints. With a modification discussed later, we can actually make this constant time.
Only check that edges are disjoint except for their endpoints. This is faster than checking if the sets make a singleton of just {x}. As the edges are now indexed by x, all the edges will have x in common. If the other end point is also common then the two edges will form a loop (cycle) when combined. This can be allowed by moving the check for new cycles.
Check the degree of a vertex before removal, this provides a fail-fast approach and avoids inconsistencies of a timeout. This is possible as the number of new edges is bounded by the degree of x.
Use a recursive data structure and store undirected path edges. This avoids the the reversing and copying of the dynamic lists for each concatenation. The directed path can then be reconstructed when needed - only when a cycle was found. This has minimal influence on speed on small molecules but does makes the implementation more concise.

With these observations in mind, here is what the data structures look like.

ArrayBuilder

To begin, a helper class is needed for the path reconstruction. This class simply allows us to sequentially fill up an int[] array by appending items to the end. It also provides access to the previous element and returns a self-reference on append which will allow us to the chain method calls.

Code 1 - ArrayBuilder

class ArrayBuilder {
    
    int[] xs;
    int   i = 0;
 
    ArrayBuilder(int n) {
        xs = new int[n];
    }
 
    ArrayBuilder append(int x) {
        xs[i++] = x;
        return this;
    }
 
    int prev() {
        return xs[i - 1];
    }
}

PathEdge

There are two types of PathEdge, one is a simple non-reduced edge and one for a reduced edge (a path). We can actually share most of the operations between these and so we first define the API in an abstract superclass.

Unsurprisingly the PathEdge API is very similar to a normal undirected edge in that there is a either() and other(int) for querying the endpoints. Unlike a normal edge, the bits of 'xs' member variable stores which other vertices in the graph lie in the path between the endpoints. We'll come back to order in a bit. Binary set intersection is checked using the '&' operator and for two edges to be disjoint - the intersection should be empty. The path() provides the ordered walk though the edge between the endpoints - we need to start from either end point and then reconstruct the path (see subclasses). It should be noted we use a BitSet for graphs with more than 64 vertices.

Loading ....

The first implementation, the simple edge, is just the two end points. The length is two and there are no reduced vertices. We can therefore always pass an empty set to the superclass. The path through this edge is just the endpoints. The previous vertex in the ArrayBuilder is used to query and append the other endpoint giving a path of length two.

Code 2 - SimpleEdge

final class SimpleEdge extends PathEdge {
    
    // Two endpoints and no reduced vertices (empty set).
    SimpleEdge(int u, int v) {
        super(u, v, 0);
    }
    
    // Given the previous vertex, add the other endpoint.
    ArrayBuilder reconstruct(ArrayBuilder ab) {
        return ab.append(other(ab.prev()));
    }
    
    // we only have the endpoints.
    int len() {
        return 2;
    }
}

The reduced edge is a little more complicated. It is created by concatenating two existing edges which share a common vertex. The new endpoints are the other() endpoint of each edge with respect to x. The set of reduced vertices is the union of the two edges reduced vertices and the new common vertex, x. The length of the path through this edge is the endpoints and number of reduced vertices - stored as bits so the bit count is added. Depending on the end point the path reconstruction proceeds through one edge and then through the other.

Code 3 - ReducedEdge

final class ReducedEdge extends PathEdge {
    
    final PathEdge e, f;
    
    //Two edges with a common endpoint (x). The reduced set 'xs' is the union of
    // the two edge's 'xs' sets and the common endpoint 'x'.
    ReducedEdge(PathEdge e, PathEdge f, int x) {
        super(e.other(x), f.other(x), e.xs | f.xs | 1L << x);
        this.e = e;
        this.f = f;
    }
    
    // Reconstruction delegates to reduced edges.
    ArrayBuilder reconstruct(ArrayBuilder ab) {
        // append the paths of the two edges, order depends on previous vertex 
        return u == ab.prev() ? f.reconstruct(e.reconstruct(ab))
                              : e.reconstruct(f.reconstruct(ab));
    }
    
    // Two endpoints and the number of reduced vertieces.
    int len() {
        return 2 + Long.bitCount(xs); // or: e.len() + f.len() - 1
    }
}

PathGraph

With our path edges defined we now need the graph to store them in. As noted, storing the edges in an incident list allows quick lookup. The add(PathEdge)includes a new edge in the graph and remove(int) removes and returns all edges incident to x. The reduce() method does the combination of the edges which are disjoint. Finally the remove() method, gets the incident edges, reduces them and then either adds them back to the graph or stores the newly discovered cycle. I've have omitted the degree() method for brevity but this is just the size of each list.

Code 4 - PathGraph

class PathGraph {
   
    // edges indexed by thier end points
    List[] g;
    
    PathGraph(int n) {        
        // initalise 'g'       
    }
    
    // adds an edge to the graph.
    void add(PathEdge e) {
        int u = e.either();
        int v = e.other(u);
        g[u].add(e);
        g[v].add(f);
    }
    
    // Remove (and get) all edges incident to 'x'.
    List remove(int x) {
        List es = g[x];
        for (PathEdge e : es)
            g[e.other(x)].remove(e);
        g[x] = Collections.emptyList();
        return es;
    }
    
    // combined all edges 'es' which are disjoint and share the 
    // endpoint 'x'
    List reduce(int x, List es) {
        List reduced = new ArrayList();
        
        for (int i = 0; i < es.size(); i++) {
            for (int j = i + 1; j < es.size(); j++) {
                
                PathEdge e = es.get(i);
                PathEdge f = es.get(j);
                
                if (e.disjoint(f)) 
                    reduced.add(new ReducedPathEdge(e, f, x));
            }
        }
        return reduced;
    }
    
    // Remove 'x' from the graph by reducing all incident edges. Any
    // newly discovered cycles are added to 'cs'.
    void remove(int x, List cs) {
        
        List es      = remove(x);
        List reduced = reduce(x, es);
        
        // for each new edge, either store the cycle or
        // add it back into the graph
        for (PathEdge e : reduced) {
            if(e.loop()) {
                cs.add(e.path());
            } else {
                add(e);
            }
        } 
    }    
}

Although the edge access is more efficient, as the graphs becomes more dense (during reduction) we need to modify more of the incident lists on each removal. We can avoid this by by imposing a predefined order on how vertices are removed. As recommended in the original publication, processing by degree is a good approach. With the ordering predefined, edges need only be indexed by a single endpoint.

Code 5 - Ranked removal

    // indicates the order which vertices will be removed
    int[] rank;
    
    // adds an edge to the graph on the least rank vertex.
    void add(PathEdge e) {
        int u = e.either();
        int v = e.other(u);
        if (rank[u] < rank[v])
            g[u].add(e);
        else 
            g[v].add(f);
    }
    
    // Remove (and get) all edges incident to 'x'.
    List remove(int x) {
        List es = g[x];
        g[x] = Collections.emptyList();
        return es;
    }

All these changes contribute (some more than others) to the performance improvements seen in the previous post (AllRingsFinder, Sport Edition). To run the algorithm we simply build the path graph then successively remove all the vertices.

What's really nice about this implementation is the PathEdge is a persistant data structure and can easily be implemented in a purely functional language. I actually managed to write a version in Scala and Haskell - one issue is that PathGraph also needs to be persistant. There are of course good persistant random access array but they're simply not as fast. Switching back to the edge list representation actually allows you to use cons-lists (singly linked list) but at the cost of performance.