Saturday, April 14, 2007

GSoC: towards a chemical semantic desktop

Now that I am officially a Google Summer of Code mentor for KDE's participation, it was high time to get my KDE4 install up to date. Meanwhile, Jos' Strigi toolkit is already well integrated, and Jerome has updated the chemical kfile plugins to the new Strigi-based architecture.

I was talking to Phreedom on IRC about the ontologies used by Strigi, and added one for chemistry. It currently has the fields chemistry.inchi, chemistry.molecular_formula, chemistry.molecular_weight, chemistry.pdbid, and chemistry.xray_resolution, but more are expected to be added. I have already updated kfile_chemical to make use of these fields, as well as a few fields from the more generic ontologies in Strigi.

Extracted metadata
Strigi currently focusses on extracted metadata only, as do the kfile_chemical plugins: they extract metadata from the file, and do not generally create metadata based on the file (although Strigi does calculate sha1 hashes). These are typically fields like molecular formula, title, X-ray resolution (in the case of PDB files), identifiers (e.g. InChI, PDB id), etc. However, there can be a lot more interesting information in those files, which requires some more thought. For example, PDB files cite one or more publications, which might be present on one's hard disk too. The idea here is that Strigi actually links the PDF of the publication with the PDB file. This is where Nepomuk comes in, and where Strigi is currently disabled. Similarly, any general organic chemistry publication will mention many molecules, each of which might have other publications discussing them, or even have 3D coordinates or other properties defined.
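
To make the idea of extracted metadata concrete, here is a minimal Python sketch of pulling two of the chemistry fields out of a PDB file's header records. It is not the kfile_chemical or Strigi code: the function name extract_pdb_metadata and the sample records (PDB id 1ABC, resolution 2.00) are made up for illustration, and it only assumes the fixed-column HEADER record and the "REMARK 2 RESOLUTION." line of the PDB format.

```python
def extract_pdb_metadata(lines):
    """Map PDB header records to chemistry.* field names (illustrative only)."""
    metadata = {}
    for line in lines:
        tokens = line.split()
        if line.startswith("HEADER") and len(line) >= 66:
            # Columns 63-66 (1-based) of the HEADER record hold the PDB id.
            pdbid = line[62:66].strip()
            if pdbid:
                metadata["chemistry.pdbid"] = pdbid
        elif tokens[:3] == ["REMARK", "2", "RESOLUTION."]:
            # e.g. "REMARK   2 RESOLUTION.    2.00 ANGSTROMS."
            try:
                metadata["chemistry.xray_resolution"] = float(tokens[3])
            except (IndexError, ValueError):
                pass  # e.g. "NOT APPLICABLE" for non-crystallographic entries
        elif line.startswith("ATOM"):
            break  # coordinates start here; the header records are done
    return metadata


# Hypothetical sample records, built so the id lands in columns 63-66.
sample = [
    "HEADER".ljust(10) + "HYDROLASE".ljust(40) + "01-JAN-00" + "   " + "1ABC",
    "REMARK   2 RESOLUTION.    2.00 ANGSTROMS.",
]
print(extract_pdb_metadata(sample))
```

Nothing here is computed: every value is read straight from the file, which is the distinction with the created metadata discussed next.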

Created metadata
Another interesting thing one can do for chemical documents is to calculate metadata: for example, calculating InChIs for mol/xyz/hin/... files using OpenBabel, or Rule-of-Five properties, e.g. using the CDK. This is where the GSoC project that I am mentoring comes in, on which Alexandr (a former CUBIC student) is going to work.
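
As a rough illustration of created metadata, the following Python sketch derives a molecular formula and molecular weight from the atoms in an XYZ file. It is a stand-in using only the standard library, not the OpenBabel or CDK route mentioned above and not the GSoC project's design; an InChI in particular needs a real toolkit. The function names, the small atomic weight table, and the water coordinates are assumptions for the example.

```python
from collections import Counter

# Standard atomic weights (g/mol) for a few common elements; extend as needed.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                  "S": 32.06, "Cl": 35.45}


def element_counts_from_xyz(text):
    """Count element symbols in XYZ text: atom count, comment line, atom lines."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    atom_lines = lines[2:2 + n_atoms]  # skip the count and the comment line
    return Counter(line.split()[0] for line in atom_lines)


def hill_formula(counts):
    """Hill order: C, then H, then the rest alphabetically (all alphabetical if no C)."""
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


def created_metadata(text):
    """Return calculated values under the chemistry.* field names."""
    counts = element_counts_from_xyz(text)
    weight = sum(ATOMIC_WEIGHTS[e] * n for e, n in counts.items())
    return {
        "chemistry.molecular_formula": hill_formula(counts),
        "chemistry.molecular_weight": round(weight, 3),
    }


# Illustrative water molecule; the coordinates are approximate.
water_xyz = """3
water
O  0.000  0.000  0.117
H  0.000  0.757 -0.469
H  0.000 -0.757 -0.469
"""
print(created_metadata(water_xyz))
```

Unlike the extracted fields, neither value appears anywhere in the file itself; both are calculated from its contents.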

Oh, and like most desktop search tools, it can simply work on your HTML cache too, so that all these cool things will work on the webpages you search too. That should trigger some more ideas :) It does for me at least.


Geoff H said...

I can say that if the performance of Strigi and Apple's Spotlight for indexing is at all similar, using Open Babel is quite fast.

The real trick is debugging. I've found all sorts of bugs, both in the ChemSpotlight indexing code and in Open Babel from indexing a huge pile of files. It's one thing when you index a folder of PubChem files. It's another when you index your whole drive which has all sorts of strange things. :-)

I also know that many people have liked having residue sequences for PDB and Mol2 files.

So there's plenty to brainstorm!

Egon Willighagen said...

Strigi comes with a few command line tools, which make debugging quite easy. Look for the xmlindexer.

About those sequences... so you suggest putting them in as one word, and using a substring match?