Tuesday, 8 September 2015

Bringing Molfile Sgroups to the CDK - Demo

Despite its flaws, the molfile has been a de facto standard for chemical representation for several decades. The core format (atom and bond block) is well supported in many toolkits, but the more advanced features (dark corners) of the property block may be skipped.

At this year's Fall ACS (Boston '15) I bumped into an old colleague from ChEBI who told me they (ChEBI) couldn't use CDK because they wanted to display repeating brackets on records and CDK didn't do that.

Polymer representation (more precisely Structural Repeat Unit) used by ChEBI falls under the category of a Ctab Sgroup. I'd wanted to add support for Sgroups for some time and now had motivation to do so.

Substructure (or Substance) Groups

Over the years there seems to have been a shift in definition. The original literature[1] uses the term "substructure groups" but more recent materials use "substance groups"[2,3]. Personally I prefer "substructure" since it concisely summarises what they really are about.

Essentially an Sgroup annotates some part of the connection table (a substructure) with meta-information (data). There are several types of Sgroup that formalise the types of annotation present:

  • Display Shortcuts
    • Abbreviations
    • Multiple Groups
  • Polymers
    • Structural Repeat Unit (SRU)
    • Monomer
    • Copolymer (alternating, block, or random)
    • Mer
    • Crosslink
    • Graft
    • Modified
    • Any
  • Mixtures
    • Unordered Mixture
    • Ordered Mixture (formulation)
    • Component
  • Generic
  • Data
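To make the idea concrete, here is a minimal sketch of how an Sgroup could be modelled: a type, the atoms it covers, some key/value annotations, and an optional parent for nesting. The names `SketchSgroupType` and `SketchSgroup` are hypothetical and only cover a few of the types listed above; this is an illustration of the concept, not the CDK API.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical subset of the Sgroup types listed above. */
enum SketchSgroupType {
    ABBREVIATION, MULTIPLE_GROUP, STRUCTURAL_REPEAT_UNIT, MONOMER, COPOLYMER, DATA
}

/** Hypothetical minimal Sgroup: a typed set of atom indices plus annotations. */
final class SketchSgroup {
    private final SketchSgroupType type;
    private final Set<Integer> atoms = new LinkedHashSet<>();
    private final Map<String, String> attributes = new LinkedHashMap<>();
    private SketchSgroup parent; // null when the Sgroup is not nested

    SketchSgroup(SketchSgroupType type, int... atomIndices) {
        this.type = type;
        for (int idx : atomIndices) {
            atoms.add(idx);
        }
    }

    SketchSgroupType type() { return type; }

    /** The substructure (atom indices) this Sgroup annotates. */
    Set<Integer> atoms() { return atoms; }

    /** Attach a piece of meta-information, e.g. a subscript or label. */
    SketchSgroup put(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    String get(String key) { return attributes.get(key); }

    /** Record that this Sgroup is nested inside another one. */
    SketchSgroup nestIn(SketchSgroup outer) {
        this.parent = outer;
        return this;
    }

    SketchSgroup parent() { return parent; }
}
```

With this, an SRU over atoms 1 to 3 with subscript "n" would be `new SketchSgroup(SketchSgroupType.STRUCTURAL_REPEAT_UNIT, 1, 2, 3).put("subscript", "n")`.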

Example ChEBI Depictions

Egon reviewed the first patch (pull/149) last week that focussed on representation and molfile round tripping. The second patch enhances the rendering code to handle more than basic SRUs (e.g. >2 brackets) and display shortcuts.

As of ChEBI 131 there are 809 entries with at least one Sgroup. Generating the depictions of these from an SDfile took < 3 seconds, then a further 11 seconds to actually write the files to disk. The rest of this post demonstrates some examples of those depictions.

Display Shortcut, Abbreviations

Previously referred to as "superatoms", parts of a structure can be abbreviated to a more concise name (e.g. Ph for a phenyl substituent). The full structure is present but is only displayed when the expansion flag is set.

CHEBI:29441 CHEBI:7725

Display Shortcut, Multiple Group

Multiple groups allow structures with fixed repeating parts to be drawn more concisely. Similar to abbreviations, all the atoms and bonds are present but are hidden from display. They're actually all overlaid on one another with duplicated coordinates, but for rendering you still want to omit them from display.

CHEBI:1233 CHEBI:79399
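As a rough sketch of the "omit from display" step for multiple groups: if we know all the atoms in the group and the subset that forms the one displayed copy (the parent atom list in the molfile), everything else can be hidden. The class and method names below are hypothetical, and the consistency check assumes the group holds exactly multiplier × parent atoms.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of the CDK. */
final class MultipleGroupSketch {

    /** Atoms to omit: every group atom that is not in the displayed (parent) copy. */
    static Set<Integer> hiddenAtoms(List<Integer> groupAtoms, List<Integer> parentAtoms) {
        Set<Integer> hidden = new LinkedHashSet<>(groupAtoms);
        hidden.removeAll(parentAtoms);
        return hidden;
    }

    /** Assumed sanity check: the group should hold multiplier copies of the parent atoms. */
    static boolean isConsistent(List<Integer> groupAtoms, List<Integer> parentAtoms, int multiplier) {
        return multiplier > 0 && groupAtoms.size() == parentAtoms.size() * multiplier;
    }
}
```

For a group of atoms 1 to 6 where atoms 1 and 2 are the displayed copy and the multiplier is 3, the hidden atoms are 3, 4, 5, and 6.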

Polymer, SRUs

The most common Sgroup used in ChEBI is the Structural Repeat Unit (SRU), which defines a repeat unit of variable length. The brackets do not necessarily come in pairs, lie parallel, or point towards each other.

CHEBI:16838 CHEBI:4294
CHEBI:53422 CHEBI:59342

Polymer, Others

A few entries encode copolymers and source-based representations (monomer).

CHEBI:59599 CHEBI:3814 (overlap in original)


A structure can have more than one Sgroup and they can be nested. Here we see a multiple group within an SRU. There is also a data Sgroup attached to the Zn-N bond marking it as a coordination bond for Marvin. I've not decided whether to render those yet, but we have the information there.


Additional Reading

  1. Gushurst et al. The substance module: the representation, storage, and searching of complex structures. J. Chem. Inf. Comput. Sci. (1991)
  2. Blanke G. Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases. Online - StructurePendium Technologies GmbH
  3. Accelrys Chemical Representation
  4. CTfile Formats Specification

Sunday, 9 August 2015

MMFF Partial Charges Improvements in CDK

Some time last year Mark Williamson brought to my attention discrepancies in CDK's MMFF partial charge calculation. Investigating further, it seemed to mainly be a problem with atom typing. There were two existing classes that could assign MMFF atom types using a combination of a decision tree and string matching of HOSE codes. The 761 molecules from the MMFF94 Validation Suite provided by Paul Kersey were used to give a more comprehensive overview than our current tests.

The results showed reasonable per-atom precision on the validation suite but were less favourable per molecule: the best existing implementation assigned types to <90% of the molecules, with <16% assigned correctly.

                          Per atom                          Per molecule
Implementation            Assigned Types   Correct Types    Assigned Types   Correct Types
ForceFieldConfigurator    15576 (90.1%)    12932 (74.8%)    678 (89.1%)      118 (15.5%)
MMFF94AtomTypeMatcher     17120 (99.1%)    12309 (71.2%)    659 (86.6%)      75 (9.9%)
MmffAtomTypeMatcher       17279 (100.0%)   17279 (100.0%)   761 (100.0%)     761 (100.0%)
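The gap between the per-atom and per-molecule numbers follows from how a molecule is scored: I count it as correct only when every one of its atoms gets the right type, so a single wrong atom fails the whole molecule. Here is a small sketch of that scoring, with hypothetical class and method names, assuming the expected and assigned type arrays have the same length and that an unassigned atom is stored as `null`.

```java
/** Hypothetical scoring helper, not the actual validation code. */
final class TypingScoreSketch {

    /** Number of atoms whose assigned type matches the expected type. */
    static int correctAtoms(String[] expected, String[] assigned) {
        int n = 0;
        for (int i = 0; i < expected.length; i++) {
            // a null (unassigned) type never equals the expected symbol
            if (expected[i].equals(assigned[i])) {
                n++;
            }
        }
        return n;
    }

    /** A molecule is correct only when all of its atoms are correct. */
    static boolean moleculeCorrect(String[] expected, String[] assigned) {
        return correctAtoms(expected, assigned) == expected.length;
    }

    /** Percentage of part in total, e.g. correct atoms out of all atoms. */
    static double percent(int part, int total) {
        return 100.0 * part / total;
    }
}
```

For example, a three-atom molecule with expected types `CR`, `HC`, `HC` and assigned types `CR`, `HC`, `HOR` scores two correct atoms out of three but zero correct molecules. The table's percentages come out the same way: `percent(12932, 17279)` is about 74.8 and `percent(118, 761)` is about 15.5.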

I wasn't keen to hard code the atom typing procedure but was delighted to find Robert Hanson of JMol had some SMARTS patterns that could be used as a starting point. After about a month of tweaking I managed to simplify the SMARTS patterns and achieve 100% precision on the validation suite. You can find the SMARTS patterns here: /org/openscience/cdk/forcefield/mmff/MMFFSYMB.sma.

Apart from improving atom type assignments, the charge assignment also needed updating to include charge sharing and bond class differences. This wasn't quite as simple as I first thought, as the parameter set parsing also needed reworking. After many months of analysis paralysis I decided last week to just rewrite what was needed and delegate calls from the existing implementation.

Now the patch is finished, charge assignments are much better. Notice that in the previous version (labelled CDK 1.5.10) equivalent terminal oxygens and the nitrogens in imidazole anion have different values. The overall charge was also inconsistent with the formal charges.

Improved charge assignment

Roger Sayle noted to me this week that MMFF charges should not be affected by representation, for example, charge separated pi bonds in nitro groups or phosphates.

Charges are independent of representation

Many thanks to Mark and Alison Choy for reporting the problem and adding patches for debugging and testing.

Thursday, 29 January 2015

PhD Thesis Now Available

I'm pleased to announce that my PhD thesis is now available from the Cambridge DSpace repository: https://www.repository.cam.ac.uk/handle/1810/246652. One thing potentially of note is the description of fast Kekulisation that I originally intended to write as a blog post. Also, following up from NextMove Software's recent post by Daniel on Cahn-Ingold-Prelog (CIP), the results of Chapter 6 contain some more CIP madness.

Tuesday, 30 December 2014

CDK Release 1.5.10


CDK 1.5.10 has been released and is available from sourceforge (download here) and the Maven central repository (XML 1).

This release follows very shortly after 1.5.9 and is the first release available from the central Maven repository. This means there is now no need to include a custom repo when using the library in downstream projects (XML 1).

The short release notes (1.5.10-Release-Notes) summarise and detail the changes. Other than the availability in the central repository, the release includes a new MolecularFormulaGenerator contributed by Tomáš Pluskal that provides mass to formula generation in a fraction of the time of the old MassToFormulaTool.

XML 1 - Maven POM configuration

Wednesday, 24 December 2014

CDK Release 1.5.9


CDK 1.5.9 has been released and is available from sourceforge (download here) and the EBI maven repo (XML 1).

This is the first release to be built using Java 7 and will require the Java SE Runtime 7 to execute. The previous release (1.5.8) will be the last to work with Java SE 6.

The full release notes (1.5.9-Release-Notes) summarise and detail the changes. One of the new features is the recognition of perspective projection stereochemistry.

Stereochemistry recognition
XML 1 - Maven POM configuration