Friday, April 27, 2007

GSoC Meeting with Alexandr

Yesterday I met with Alexandr to discuss his GSoC project, including the time schedule. During the mentors' meeting to work out the final rankings, one fellow mentor argued that this project is too specialized for KDE. We therefore discussed how to maximize its effect on the rest of the KDE project, and the ideas that came up include a dedicated query tool for complex data (such as chemical data). Anyway, this will be discussed in our blogs soon.

Meanwhile, I have registered with the new Planet SoC, which was announced on the Summer of Code Blog.

Saturday, April 14, 2007

GSoC: towards a chemical semantic desktop

Now that I am officially a Google Summer of Code mentor for KDE's participation, it was high time to get my KDE4 install up to date. Meanwhile, Jos' Strigi toolkit is already well integrated, and Jerome has updated the chemical kfile plugins to the new Strigi-based architecture.

I was talking to Phreedom on IRC about ontologies used by Strigi, and added one for chemistry. It currently has the fields chemistry.inchi, chemistry.molecular_formula, chemistry.molecular_weight, chemistry.pdbid, and chemistry.xray_resolution, but more are expected to be added. I already updated kfile_chemical to make use of these fields, and updated it for a few fields from the more generic ontologies in Strigi.
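To make the field names concrete, here is a minimal sketch of what a metadata record using these fields might look like. It is plain Python and not Strigi's actual API: the `make_record` helper and the dict layout are hypothetical, and only the five field names come from the ontology described above.

```python
# The five field names come from the chemistry ontology described in the post.
CHEMISTRY_FIELDS = (
    "chemistry.inchi",
    "chemistry.molecular_formula",
    "chemistry.molecular_weight",
    "chemistry.pdbid",
    "chemistry.xray_resolution",
)


def make_record(**values):
    """Build a metadata dict, accepting only known chemistry fields.

    Hypothetical helper for illustration; Strigi itself is written in C++
    and stores its fields differently.
    """
    record = {}
    for short_name, value in values.items():
        key = "chemistry." + short_name
        if key not in CHEMISTRY_FIELDS:
            raise KeyError("unknown chemistry field: " + key)
        record[key] = value
    return record


# Example record for methane.
methane = make_record(
    inchi="InChI=1S/CH4/h1H4",
    molecular_formula="CH4",
    molecular_weight=16.04,
)
```

Rejecting unknown keys is a design choice of this sketch: it keeps records in line with the ontology while that ontology is still growing.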

Extracted metadata
Strigi currently focusses on metadata only, as do the kfile_chemical plugins: they extract metadata from the file, and do not generally create metadata based on the file (although Strigi does calculate sha1 hashes). These are typically fields like molecular formula, title, X-ray resolution (in the case of PDB files), identifiers (e.g. InChI, PDB id), etc. However, there can be a lot more interesting information in those files, which requires some more thought. For example, PDB files cite one or more publications, which might be present on one's hard disk too. The idea here is that Strigi actually links the PDF of the publication with the PDB file. This is where Nepomuk comes in, and where Strigi is currently limited. Similarly, any general organic chemistry publication will mention many molecules, each of which might have other publications discussing them, or even have 3D coordinates or other properties defined.
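As an illustration of this kind of extraction, the following sketch pulls the PDB id, title and X-ray resolution out of PDB-format text. It is not the kfile_chemical code; the function name and the plain `title` key are my own assumptions, and the sample entry (id `9ZZZ`) is invented for the example.

```python
import re


def extract_pdb_metadata(text):
    """Pull a few metadata fields out of PDB-format text (illustrative only)."""
    meta = {}
    title_parts = []
    for line in text.splitlines():
        record = line[:6].strip()
        if record == "HEADER":
            # The four-character PDB id occupies columns 63-66.
            pdb_id = line[62:66].strip()
            if pdb_id:
                meta["chemistry.pdbid"] = pdb_id
        elif record == "TITLE":
            # The title text starts at column 11 and may span several records.
            title_parts.append(line[10:].strip())
        elif record == "REMARK" and line[6:10].strip() == "2":
            # REMARK 2 carries the resolution, e.g. "RESOLUTION.    1.50 ANGSTROMS."
            match = re.search(r"RESOLUTION\.\s+([0-9.]+)\s+ANGSTROM", line)
            if match:
                meta["chemistry.xray_resolution"] = float(match.group(1))
    if title_parts:
        meta["title"] = " ".join(title_parts)
    return meta


# Invented sample entry; the header is assembled so the id lands in columns 63-66.
SAMPLE = "\n".join([
    "HEADER    " + "EXAMPLE PROTEIN".ljust(40) + "01-JAN-07" + "   " + "9ZZZ",
    "TITLE     AN INVENTED STRUCTURE FOR ILLUSTRATION",
    "REMARK   2 RESOLUTION.    1.50 ANGSTROMS.",
])

metadata = extract_pdb_metadata(SAMPLE)
```

Linking a PDB file to the publications it cites would need more than this, since the citation records would have to be matched against documents elsewhere on disk.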

Created metadata
Another interesting thing one can do for chemical documents is calculate metadata: for example, calculate InChIs for mol/xyz/hin/... files, using OpenBabel. Or Rule-of-Five properties, e.g. using the CDK. This is where the GSoC project comes in which I am mentoring, and on which Alexandr (a former CUBIC student) is going to work.
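A small sketch of what a calculated Rule-of-Five property could look like follows. It assumes the four descriptors have already been computed by a toolkit such as the CDK; the function name and the example values are hypothetical, and the thresholds are the usual Lipinski ones (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 acceptors).

```python
def rule_of_five_violations(molecular_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Return the list of Lipinski Rule-of-Five criteria a molecule violates.

    The descriptor values are assumed to come from a cheminformatics toolkit;
    this function only applies the thresholds.
    """
    violations = []
    if molecular_weight > 500:
        violations.append("molecular weight > 500")
    if log_p > 5:
        violations.append("logP > 5")
    if h_bond_donors > 5:
        violations.append("more than 5 H-bond donors")
    if h_bond_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


# Illustrative descriptor values for a small, drug-like molecule.
small_molecule = rule_of_five_violations(180.2, 1.2, 1, 4)

# Illustrative values for a large, greasy molecule that breaks two criteria.
large_molecule = rule_of_five_violations(812.0, 6.3, 3, 9)
```

The resulting list, or simply its length, could then be stored as one more metadata field next to the extracted ones.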

Oh, and like most desktop search tools, it can simply work on your HTML cache too, so that all these cool things will work on the webpages you search too. That should trigger some more ideas :) It does for me at least.

Saturday, April 07, 2007

A Chemical KDE desktop: the Google SoC

Two Google Summer of Code ideas have been written up for the KDE project, and students wrote 10 applications based on those. Today is an important day, as the final ranking will be determined and sent off to Google. See my bits on this some days ago, and earlier in this blog. Both ideas have a reasonable chance of getting one student accepted, but the final decisions will not be clear and made public before 11 April.