Wednesday, 26 November 2014

Memory Mapped Fingerprint Index - Part I

I attended an interesting talk this afternoon (CCNM) by Matt Swain on using MongoDB for chemical similarity searching (code: github/mcs07/mongodb-chemistry).

The similarity searching partially uses the "Baldi" algorithm with some extra tweaks based on checking rare bits. The Baldi method is nicely summarised along with others by Tim Vandermeersch in his post on Fingerprint Searching Using Various Indexing Methods. As is noted by Tim, it can be improved upon.

Anyways, I had an implementation of a memory mapped Baldi index lying around, there is also one in the OrChem database cartridge.  I prototyped the implementation back in April and was/is part of a "nfp" (new fingerprint) module for CDK. I've now put the code on a GitHub project (github/johnmay/efficient-bits/fp-idx) and will do some benchmarking to see how it does.

My feeling is that the very simple (it's about 100 lines) memory mapped index can give competitive performance on small datasets (<5 million entries).

Sunday, 16 November 2014

Fun (and abuse) of implicit methods

Earlier this year I wrote up some Chemical Toolkit Rosetta examples of using the CDK in Scala (github/cdk/cdk-scala-examples). When I was writing this it sprung to mind that it would be cool to (ab)use one feature for interoperability between cheminformatics toolkits.

Scala is a statically typed functional language that runs on the Java Virtual Machine. It has some nice features and syntax that can produce some very concise code. One thing particular neat is the ability to define implicit methods. Essentially these are methods that define how to convert between types, they are implicit because the compiler can insert them automatically.

Implicit methods are very similar to auto(un)boxing that was introduced in Java 5 to simplify conversion of primitives and their instance wrappers (Code 1).

Code 1 - Autoboxing and autounboxing in Java
Integer x = 5; // ~ Integer x = Integer.valueOf(5);
int y = x;     // ~ int y = x.intValue();
x = y;         // ~ x = Integer.valueOf(y);

if (x == y) {  // ~ x.intValue() == y 

Much like it is possible in some programming languages to define custom operators, Scala makes it possible to define custom conversions that are inserted at compile time. The main advantage is it allows APIs to be extended to accept different types without introducing extra methods.

Conversion from line notations

Line notations are a concise means of encoding a chemical structure as sequence of characters (String). Common examples include SMILES, InChI*, WLN, SLN, and systematic nomenclature. Conversion to and from these formats isn't too computationally expensive but probably not something you want to do on-the-fly. However, just for fun, let's see what an implicit method for converting from strings can do. First we need the specified methods for loading from a known string type. We'll use the CDK for SMILES and InChI with Opsin for nomenclature.

Code 2a - Parsing of linear notations
val bldr = SilentChemObjectBuilder.getInstance
val sp   = new SmilesParser(bldr)
val igf  = InChIGeneratorFactory.getInstance

def inchipar(inchi: String) = 
  igf.getInChIToStructure(inchi, bldr).getAtomContainer

def cdksmipar(smi: String) = 

def nompar(nom: String) = 

def cansmi(ac: IAtomContainer) =
// Universal SMILES (see. O'Boyle N, 2012**)
def unismi(ac: IAtomContainer) = 
Code 2b - Implicit conversion from a String to an IAtomContainer
implicit def autoParseCDK(str: String): IAtomContainer = {
    if (str.startsWith("InChI=")) { 
    } else if (str.startsWith("1S/")) {
      inchipar("InChI=" + str)
    } else {
      try {
      } catch {
        case _: InvalidSmilesException => nompar(str)

Now the implicit method has been defined, any method in the CDK API that accepts an IAtomContainer can now behave as though it accepts a linear notation. Code 3 shows how we can get the same Universal SMILES for different representations of caffeine and compute the ECFP4 fingerprint for porphyrin

Code 3 - Using implicit methods
val fp = new CircularFingerprinter(CLASS_ECFP4).getCountFingerprint("porphyrin")

Conversion between toolkits

It is also possible to add in implicit methods to auto-convert between toolkit types. To convert between the CDK and RDKit (Java bindings) I'll go via SMILES. This conversion is lossy without an auxiliary data vector but serves as a proof of concept. I've lifted the Java bindings from the RDKit lucene project (github/rdkit/org.rdkit.lucene/) as the shared library works out the box for me. We can also add in the from string implicit conversions.

Code 4 shows the implicit method definitions. The additional autoParseRDKit allows us to bootstrap the RDKit API to also accept linear notations on all methods that expect an RWMol (or ROMol).

Code 4 - Implicit methods for conversion between CDK and RDKit
implicit def cdk2rdkit(ac : IAtomContainer) : RWMol = 

// XXX: better to use non-canon SMILES
implicit def rdkit2cdk(rwmol : RWMol) : IAtomContainer = 

implicit def autoParseRDKit(str: String): RWMol = 

Now we can obtain the Avalon fingerprint of caffeine from it's name and pass an RWMol to the CDK's PubchemFingerprinter (Code 5).

Code 5 - Using the RDKit API
val fp = new ExplicitBitVect(512)
RDKFuncs.getAvalonFP("caffeine", fp2)

val caffeine : RWMol = "caffeine"
new PubchemFingerprinter(bldr).getBitFingerprint(caffeine)

Given that auto(un)boxing primitives in Java can sting you in performance critical code, the above examples should be used sparingly. They do serve as a fun example of what is possible and I've put together the working code example in a Scala project for others to try github/johnmay/efficient-bits/impl-conversion.