Wednesday, 26 November 2014

Memory Mapped Fingerprint Index - Part I

I attended an interesting talk this afternoon (CCNM) by Matt Swain on using MongoDB for chemical similarity searching (code: github/mcs07/mongodb-chemistry).

The similarity searching partially uses the "Baldi" algorithm with some extra tweaks based on checking rare bits. The Baldi method is nicely summarised along with others by Tim Vandermeersch in his post on Fingerprint Searching Using Various Indexing Methods. As is noted by Tim, it can be improved upon.

Anyways, I had an implementation of a memory mapped Baldi index lying around, there is also one in the OrChem database cartridge.  I prototyped the implementation back in April and was/is part of a "nfp" (new fingerprint) module for CDK. I've now put the code on a GitHub project (github/johnmay/efficient-bits/fp-idx) and will do some benchmarking to see how it does.

My feeling is that the very simple (it's about 100 lines) memory mapped index can give competitive performance on small datasets (<5 million entries).