Sunday, 9 August 2015

MMFF Partial Charges Improvements in CDK

Some time last year Mark Williamson brought to my attention discrepancies in CDK's MMFF partial charge calculation. Investigating further it seemed to mainly be a problem with atom typing. There were two existing classes that could assigned MMFF atom types using a combination of a decision tree and string matching hose codes. The 761 molecules from the MMFF94 Validation Suite provided by Paul Kersey were used to give a more comprehensive overview then our current tests.

The results showed reasonable precision per-atom in the validation suite but were less favourable per-molecule, the best implementation assigned types to <90% of the molecules with <16% assigned correctly.

Assigned Types
Correct Types
Assigned Types
Correct Types
ForceFieldConfigurator 15576 90.1% 12932 74.8% 678 89.1% 118 15.5%
MMFF94AtomTypeMatcher 17120 99.1% 12309 71.2% 659 86.6% 75 9.9%
MmffAtomTypeMatcher 17279 100.0% 17279 100.0% 761 100.0% 761 100.0%

I wasn't keen to hard code the atom typing procedure but was delighted to find Robert Hanson of JMol had some SMARTS patterns that could be used as a starting point. After about a month of tweaking I managed to simplify the SMARTS patterns and achieve 100% precision on the validation suite. You can find the SMARTS patterns here: /org/openscience/cdk/forcefield/mmff/MMFFSYMB.sma.

Apart from improving atom type assignments the charge assignment also needed updating to include charge sharing and bond class differences. This wasn't quite as simple as I first thought as the parameter set parsing also needed reworking. After many months of analysis paralysis I decided last week to just rewrite what was needed and delegate calls from the existing implementation.

Now the patch is finished, charge assignments are much better. Notice that in the previous version (labelled CDK 1.5.10) equivalent terminal oxygens and the nitrogens in imidazole anion have different values. The overall charge was also inconsistent with the formal charges.

Improved charge assignment

Roger Sayle noted to me this week that MMFF charges should not be affected by representation, for example, charge separated pi bonds in nitro groups or phosphates.

Charges are independant of representation

Many thanks to Mark and Alison Choy for reporting the problem and adding patches for debugging and testing.

Thursday, 29 January 2015

PhD Thesis Now Available

I'm please to announce that my PhD thesis is now available from the Cambridge DSpace repository: One thing potentially of note is the description of fast Kekulisation that I originally intended to write as a blog post. Also following up from NextMove Software's recent post by Daniel on Cahn-Ingold-Prelog (CIP), the results of Chapter 6 contains some more CIP madness.

Tuesday, 30 December 2014

CDK Release 1.5.10


CDK 1.5.10 has been released and is available from sourceforge (download here) and the Maven central repository (XML 1).

This release follows very shortly after 1.5.9 and is the first release available from the central maven repository. This means there is now no need to include a custom repo when using the library in downstream projects (XML 1)

The short release notes (1.5.10-Release-Notes) summarise and detail the changes. Other than the availability in the central repository the release includes a new MolecularFormulaGenerator contributed by Tomáš Pluskal that provide mass to formula generation in a fraction of the time of the old MassToFormulaTool.

XML 1 - Maven POM configuration

Wednesday, 24 December 2014

CDK Release 1.5.9


CDK 1.5.9 has been released and is available from sourceforge (download here) and the EBI maven repo (XML 1).

This is the first release to be built using Java 7 and will require the Java SE Runtime 7 to execute. The previous release (1.5.8) will be the last to work with Java SE 6.

The full release notes (1.5.9-Release-Notes) summarise and detail the changes. One of the new features is the recognition of perspective projection stereochemistry.

Stereochemistry recognition
XML 1 - Maven POM configuration

Monday, 1 December 2014

Memory Mapped Fingerprint Index - Part II

This post follows up on the previous to report some timings. I've checked all the code into GitHub (johnmay/efficient-bits/fp-idx) and it has some stand alone programs that can be run from the command line.

Currently there are a few limitations that we'll get out the way:

  • Only generation of the CDK ECFP4 is supported and at a folded length of 1024, this should give a close approximation to what Matt used in MongoDB (RDKit Morgan FP). Other fingerprints and foldings could be used but the generation time of path based fingerprints in the CDK is currently (painfully) slow.
  • Building the index is done in memory, since 1,000,000x1024 bit fingerprints is only 122 MiB you can easily build indexes with less than 10 million on modern hardware.
  • During index searching the entire index is memory mapped, setting the chunks system property (see the GitHub readme) will avoid this at a slight performance cost.
  • Results return the id in the index (indirection) and to get the original Id one would need to resolve it with another file (generated by mkidx). 
  • Index update operations are not supported without rebuilding it.
These are all pretty trivial to resolve and I've simply omitted them due to time. With that done, here's a quick synopsis of making the index, there is more in the GitHub readme.
Code 1 - Synopsis
$ ./smi2fps /data/chembl_19.smi chembl_19.fps # ~5 mins
$ ./mkidx chembl_19.fps chembl_19.idx # seconds

The fpsscan does a linear search computing all Tanimoto's and outputting the lines that are above a certain threshold. The simmer and toper utils use the index, they either filter for similarity or the top k results. They can take multiple SMILES via the command line or from a file.

Code 2 - Running queries
$ ./fpsscan /data/chembl_19.fps 'c1cc(c(cc1CCN)O)O' 0.7 # ~ 1 second
$ ./simmer chembl_19.idx 0.7 'c1cc(c(cc1CCN)O)O' # < 1 second
$ ./toper chembl_19.idx 50 'c1cc(c(cc1CCN)O)O' # < 1 second (top 50)

Using the same queries from the MongoDB search I get the following distribution of search times for different thresholds.

Some median search times are as follows.
Threshold Median time (ms)
0.90 14
0.80 31
0.70 46
0.60 53
In the box plot above the same (first) query is always the slowest, this is likely due to JIT.
It's interesting to see that the times seem to flatten out. By plotting how many fingerprints the search had to check we observe that below a certain threshold we are essentially checking the entire dataset.

The reason for this is potentially due to the sparse circular fingerprints. Examining the result file (see the github README) we can estimate that on average we're calculating 23,556,103 Tanimoto's a second. This also means that retrieving the top k queries isn't bad either. For example 10,000 gives a median time (Code 3) of 72 ms.

Code 3 - Top 10,000 hits for queries (same as before)
$ ./toper chembl_19.idx 10000 queries.smi

Next I'll look at some like-for-like comparisons.