Efficient Bits: February 2014

At 13 years and 4 months the Chemistry Development Kit (CDK) is reasonably mature for a software project. Over the years there have been many changes in development practices as the code base evolved. This post is a departure for the usual algorithms and performance tests and looks at a recent and major change in the CDK development process.

On Monday, Egon, Nina and I made the final alterations that changed the build system from Ant to Maven. This change has been in the works for a long time and has been suggested multiple times. The actually migration has taken about a years worth of planning.

If you want to have a play with the new build system yourself I've put together a brief guide that also describes how to import the project into several popular IDEs - Building CDK. The project README also summarises the command line usage.

I download CDK releases and use it my project, what has changed?

If you are using the CDK as a dependancy you should not notice any difference. The library and bundled dependencies will still be distributed at each release. If you are also using maven then CDK module artefacts have been deployed for last few releases. These are by far the easiest way to use the library as dependency versioning is managed maven and newer releases can be automatically downloaded. Please see the project README for repository details.

I build the CDK source and use it my project, what has changed?

The source code is now built with maven - the README summarises the steps. As with releases, SNAPSHOT artefacts will be deployed to a remote repository (currently EBI).

I have modifications to the CDK that I apply, what has changed?

If your patches are Git commits then these can still be applied. Git will sort out and use the correct file locations to any modified files. If the patch creates new files these will need to be moved manually to the correct location.

CDK Modules in Dec 2013 - Egon W, Bits of Blah

Existing project structure

Prior to Monday the project code was organised under a single root folder. The Ant build would then read instructions in the source code and assemble the modules during compilation. This approach allowed progressive partition the code into modules over an extended period. Without this system we would not have be able to convert to maven at all.

This system was customised and specific to the CDK which, in my opinion, made it a significant barrier to contributions. I know that personally I struggled to understand what was going on at compile time. A highly customised build process makes it not only difficult for a human to comprehend but also any automated tooling (Integrated Development Environments, IDEs). Superficial support has been provided for Eclipse and Netbeans editors but neither correctly interrupted the modules and relationships between them.

Separate source trees

The most noticeable difference in the project is each module now has a separate source tree. This allows easier reasoning about the contents of module and provides a visual cue about the modular structure. Below we can see the 'cip' (Cahn-Ingold-Prelog) module source tree only contains the classes relevant to the module. Separate source trees are not specific to maven and we could still use Ant. The main benefit is that Maven supports and encourages this kind of structuring by default.

Source code in the CIP module

There is still more work to do on the module organisation, for example, CMLWritier is the the 'libiocml' module whilst the CMLReader is the 'io' module. The modules are mainly organised by their dependancies but in future it may be beneficial to organise by function. Normally classes with similar dependencies have similar function but this isn't always the case. An example of this is seen with the LINGOFingerprinter and SignatureFingerprinter in the 'smiles' and 'signatures' modules rather than the 'fingerprint' module.

Super modules

The Maven build also allows us to define groupings of the existing modules. These intermediate modules group the code base in a few digestible sections. You can see these groupings at the root level in the repository - https://github.com/cdk/cdk/. There are five groups and an additional misc/ module for the left overs. I'm planning to write a more in depth guide for the wiki but here is a quick overview of what is present in each.

base/ - API and implementations of domain objects and central algorithms to handling chemical information
descriptor/ - fingerprinters, qsars and signatures for describing and characterising attributes of a compound.
storage/ - reading and writing of chemical compounds from multiple file formats
tool/ - structure diagram generation, smarts, smsd, hashcodes, tautomer and tools that either answer a question directly, manipulate input or compute intrinsic properties
display/ - rendering of 2D depictions

Effectively understanding an algorithm with only a description is difficult. Reading source code, possibly more so. Although these approaches explain the finer details, invariants and proofs, a higher level view offers clarity. A great example of this is seen at McKay's and Piperno's canonical labelling site, The Search Tree.

Tautomers are constitutional isomers of organic compounds that rapidly interconvert (Wikipedia). The most common form, is the relocation of a proton. Many computer representations are tautomer specific and distinguish different tautomeric forms of a compound. They would for example not have the same unique SMILES.

There are several approaches and algorithms for handling tautomers (Warr W, 2010). At the Daylight EuroMUG99, Roger Sayle and Jack Delany presented an algorithm for enumerating and assigning a unique tautomer (Sayle R and Delany J, 1999). From slide 20 in that presentation, there is a very nice step wise example on guanine.

This afternoon I had fun creating a animation for algorithm using an implementation I wrote last year. I've used a more complicated example that emphasises the backtracking when an incorrect assignment is made.

Being a large compound, I didn't include the keto-enol types. Also I had to modify the order (normally the nitrogens would be assigned first) to allow it be watchable in reasonable time. Each proton is placed and the hetroatoms become either a donor (green) or acceptor (blue). At several points it attempts to place two protons in each of the five membered rings. After updating adjacent atoms the mistake is identified and the assignment backtracks setting one as an proton acceptor.

Be sure to set the HD option.

Warr W (2010) shows an example labelled as "hidden tautomers". This compound presents an interesting challenge and a nice demonstration. In total there are 68 tautomers generated.

References

Warr WA. Tautomerism in chemical information management systems. J Comput Aided Mol Des. 24(6-7):497-520. 2010
Also presented at the ChemAxon 2010 UGM - http://www.youtube.com/watch?v=1C-RTD4DAJ8

Sayle RA and Delany JJ. Canonicalization and enumeration of tauomers. Presented at EuroMUG99, Cambridge, UK, 28-29 Oct 1999

Efficient Bits

Wednesday 19 February 2014

CDK now built using Maven