When thinking about operating systems recently, the subject of versions came up. This is clearly going to be a very important attribute that we will need to capture in our ontology. All software has versions: rarely only one (perhaps with a rolling release, like Gentoo Linux, which brings its own complications…), often a handful, and sometimes many (the Google Chrome browser, for example, is updated more or less continuously). New versions may bring tweaks to the UI, bug fixes, algorithmic modifications, feature additions, etc.
In the context of data preservation it will sometimes be important to know which version was used. A good example of this was given by Paul Wheatley from the British Library at the JISC preservation tools strand meeting in February. He described how a bug in the Kakadu JPEG 2000 encoding software caused problems with some of their scanned data (if I remember correctly, the output files did not contain metadata about the image size). This problem was fixed in a later version of Kakadu, so it did not affect all their data, but the only way of identifying images that needed correction was to check for the effect of the bug (the missing metadata). In this situation, associating the data with information about the software used to generate it would only have been useful if the software version had been captured.
Considering our preliminary model of software, we can see that several of the software properties that we intend to capture may vary according to the version:
- Input and output data options may vary: a new version might provide greater import flexibility, for example, or the output file format might change.
- Different versions may employ additional or different algorithms.
- Software licensing may change with a new version (indeed, the version of the license might be different).
- For operating systems, different versions of an OS may use different versions of a kernel, and even a single version of an OS may be used with different versions of a kernel.
So we can see that capturing the version of software used can be important for data preservation; how then can we capture this information?
New releases of software come in different types, and appear to fall into a loose hierarchy, with major versions, minor versions, patches, service packs and builds all subtly changing the characteristics of an application. The decision about whether a change warrants a new version number or just a new build is made by the developers/publishers, and is based on a loose convention. Unfortunately for us, no release type is guaranteed to be free of the kinds of changes that we are interested in (even a new build might fix a bug that affects output data), so our ontology will need the flexibility to describe a version change at even the lowest level.
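To make the hierarchy concrete, here is a minimal Python sketch of a release identifier that keeps every level down to the build. It assumes a purely numeric, dot-separated scheme of up to four parts, which many real products do not follow (dates, letters and service pack names are all common). The names `ReleaseId`, `parse_release` and `change_level` are made up for illustration and are not part of our model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, order=True)
class ReleaseId:
    """Hypothetical identifier: major.minor.patch.build, compared field by field."""
    major: int
    minor: int = 0
    patch: int = 0
    build: int = 0


def parse_release(text: str) -> ReleaseId:
    """Parse a dotted numeric string such as '7.2.1.45'; missing parts default to 0."""
    parts = [int(p) for p in text.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"expected 1 to 4 numeric parts, got {text!r}")
    return ReleaseId(*parts)


def change_level(a: ReleaseId, b: ReleaseId) -> Optional[str]:
    """Return the highest level at which two releases differ, or None if identical."""
    for name in ("major", "minor", "patch", "build"):
        if getattr(a, name) != getattr(b, name):
            return name
    return None


old = parse_release("7.2.1.45")
new = parse_release("7.2.1.46")
print(change_level(old, new))  # build
print(old < new)               # True
```

The point of keeping the build number as a first-class field is that two releases differing only at that level are still distinguishable, which is what the bug-fix example above requires.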
Modelling this is problematic: capturing each build as a class and duplicating the software properties as necessary does not seem to be an elegant or efficient way of doing this. We need to capture both the differences between versions and the similarities that make them the same application. Not only this, but we also want to be able to handle generalities: if we do not know the version at all, we still want to be able to say which software was used. We will also want to use the sequential nature of versions, for example to infer a range of possible versions given a date. I have a feeling that capturing this sequencing (albeit possibly forked) is probably the key to modelling this effectively. Any thoughts?
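As a starting point for discussion, here is a rough Python sketch of the sequencing idea, not a proposal for the ontology itself. It assumes that each release records its predecessor (so forks are simply two releases sharing a predecessor), that shared properties live on the software and each release stores only what changed, and that release dates are known. Every name here (`Software`, `Release`, `ExampleEncoder`, the property keys, the dates) is invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Release:
    label: str
    released: date
    predecessor: Optional["Release"] = None
    changes: dict = field(default_factory=dict)  # only what differs from the predecessor

    def lineage(self) -> list:
        """Releases from the first ancestor down to this one, oldest first."""
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.predecessor
        return chain[::-1]


@dataclass
class Software:
    name: str
    properties: dict = field(default_factory=dict)  # what all versions share
    releases: list = field(default_factory=list)

    def add(self, label, released, predecessor=None, **changes) -> Release:
        release = Release(label, released, predecessor, changes)
        self.releases.append(release)
        return release

    def properties_of(self, release: Optional[Release] = None) -> dict:
        """Shared properties, overlaid with changes along the release's lineage.
        With no release given (version unknown), only the shared properties apply."""
        merged = dict(self.properties)
        if release is not None:
            for step in release.lineage():
                merged.update(step.changes)
        return merged

    def candidates_on(self, when: date) -> list:
        """Labels of releases that already existed on the given date, oldest first."""
        existing = [r for r in self.releases if r.released <= when]
        return [r.label for r in sorted(existing, key=lambda r: r.released)]


# Invented example data: a fork after 1.0 into a 2.0 line and a 1.1 maintenance line.
app = Software("ExampleEncoder", {"license": "GPL-2.0", "writes_size_metadata": False})
v1_0 = app.add("1.0", date(2009, 1, 10))
v2_0 = app.add("2.0", date(2010, 3, 1), v1_0, writes_size_metadata=True)
v1_1 = app.add("1.1", date(2010, 6, 15), v1_0, license="GPL-3.0")

print(app.candidates_on(date(2010, 4, 1)))  # ['1.0', '2.0']
print(app.properties_of(v1_1))
```

Storing only the changes on each release avoids duplicating properties per build, and a file dated 1 April 2010 can be narrowed to the releases that existed by then even when the exact version was never recorded.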