Now that I am officially a Google Summer of Code mentor for KDE's participation, it was high time to get my KDE4 install up to date. Meanwhile, Jos' Strigi toolkit is already well integrated, and Jerome has updated the chemical kfile plugins to the new Strigi-based architecture.
I was talking to Phreedom on IRC about the ontologies used by Strigi, and added one for chemistry. It currently has the fields chemistry.inchi, chemistry.molecular_formula, chemistry.molecular_weight, chemistry.pdbid, and chemistry.xray_resolution, but more are expected to be added. I have already updated kfile_chemical to make use of these fields, and also to use a few fields from the more generic ontologies in Strigi.
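To make the field list a bit more concrete, here is a minimal Python sketch of what a record using these field names could look like. This is not how kfile_chemical or Strigi do it; the function names `molecular_weight` and `chemistry_fields` are made up for illustration, the formula parser only handles flat formulas without parentheses, and the table of atomic weights is deliberately tiny.

```python
import re

# Rounded standard atomic weights for a few common elements (illustrative subset).
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                  "P": 30.974, "S": 32.06, "Cl": 35.45}


def molecular_weight(formula):
    """Sum atomic weights for a flat formula such as 'C6H6' (no parentheses).

    Raises KeyError for elements missing from the small table above.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


def chemistry_fields(inchi, formula):
    """Build a hypothetical metadata record keyed by the ontology field names."""
    return {
        "chemistry.inchi": inchi,
        "chemistry.molecular_formula": formula,
        "chemistry.molecular_weight": round(molecular_weight(formula), 3),
    }


# Benzene as an example molecule.
record = chemistry_fields("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H", "C6H6")
```

The remaining two fields, chemistry.pdbid and chemistry.xray_resolution, only make sense for PDB files, so they are left out of this record.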
Extracted metadata
Strigi currently focuses on metadata only, as do the kfile_chemical plugins: they extract metadata from the file, and do not generally create metadata based on the file (although Strigi does calculate SHA-1 hashes). These are typically fields like molecular formula, title, X-ray resolution (in the case of PDB files), identifiers (e.g. InChI, PDB id), etc. However, there can be a lot more interesting information in those files, which requires some more thought. For example, PDB files cite one or more publications, which might be present on one's hard disk too. The idea here is that Strigi actually links the PDF of the publication with the PDB file. This is where Nepomuk comes in, and where Strigi is currently disabled. Similarly, any general organic chemistry publication will mention many molecules, each of which might have other publications discussing them, or even have 3D coordinates or other properties defined.
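The plain extraction step described above can be sketched roughly as follows for PDB files. This is only an illustration in Python, not the actual kfile_chemical code: the function name `extract_pdb_metadata` and the generic `title` key are my own placeholders, and the sample entry (id 1ABC) is invented. It assumes the usual PDB layout, with the id code in columns 63-66 of the HEADER record and the resolution on the REMARK 2 line.

```python
import re


def extract_pdb_metadata(lines):
    """Pull the PDB id, title, and X-ray resolution out of PDB header records."""
    meta = {}
    title_parts = []
    for line in lines:
        if line.startswith("HEADER"):
            # The four-character id code sits in columns 63-66 of HEADER.
            meta["chemistry.pdbid"] = line[62:66].strip()
        elif line.startswith("TITLE"):
            # Title text starts at column 11; continuation lines are joined.
            title_parts.append(line[10:].strip())
        elif line.startswith("REMARK   2"):
            match = re.search(r"RESOLUTION\.\s+([\d.]+)\s+ANGSTROMS", line)
            if match:
                meta["chemistry.xray_resolution"] = float(match.group(1))
    if title_parts:
        # "title" is a placeholder key, not a confirmed Strigi field name.
        meta["title"] = " ".join(title_parts)
    return meta


# An invented header, built up piece by piece so the columns line up.
sample = [
    "HEADER    " + "HYDROLASE".ljust(40) + "01-JAN-07" + "   " + "1ABC",
    "TITLE     EXAMPLE ENTRY",
    "REMARK   2 RESOLUTION.    2.00 ANGSTROMS.",
]
meta = extract_pdb_metadata(sample)
```

Linking the cited publications to files on disk is a different kind of problem, and is not attempted here.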
Created metadata
Another interesting thing one can do for chemical documents is calculate metadata: for example, calculate InChIs for mol/xyz/hin/... files using OpenBabel, or Rule-of-Five properties, e.g. using the CDK. This is where the GSoC project that I am mentoring comes in, on which Alexandr (a former CUBIC student) is going to work.
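As a rough idea of what a Rule-of-Five calculation boils down to, here is a small Python sketch. It assumes the four descriptors have already been computed by some toolkit (that is the hard part, and the job of something like the CDK), and it uses the commonly quoted thresholds: molecular weight at most 500, logP at most 5, at most 5 hydrogen bond donors, and at most 10 hydrogen bond acceptors. The function names are made up for this example and do not come from the CDK or OpenBabel.

```python
def rule_of_five_violations(molecular_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Count how many of the four Rule-of-Five thresholds are exceeded."""
    checks = [
        molecular_weight > 500,   # weight in daltons
        log_p > 5,                # octanol-water partition coefficient
        h_bond_donors > 5,
        h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(**descriptors):
    """Commonly stated criterion: no more than one violation."""
    return rule_of_five_violations(**descriptors) <= 1


# Illustrative descriptor values, not measured data.
small = dict(molecular_weight=180.2, log_p=1.2, h_bond_donors=1, h_bond_acceptors=4)
large = dict(molecular_weight=900.0, log_p=6.5, h_bond_donors=2, h_bond_acceptors=4)
```

With these made-up values, `passes_rule_of_five(**small)` is true and `passes_rule_of_five(**large)` is false, since the second set exceeds both the weight and the logP threshold.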
Oh, and like most desktop search tools, it can simply work on your HTML cache as well, so that all these cool things will work on the webpages you search too. That should trigger some more ideas :) It does for me at least.