Bug with ‘swo2’ and ‘swo’ prefix fixed

While examining the code today, I noticed that swo_core had incorrect ontology prefixes ‘swo’ and ‘swo2’ set to ‘http://www.ebi.ac.uk/swo/http://www.ebi.ac.uk/efo/swo/’. This error was only present in an obsolete term, SWO_0000399 (I must have introduced the bug when I made that term obsolete). All instances of this incorrect prefix were removed. Additionally, SWO_0000399 was incorrectly written as SWO__0000399 (with two underscores), and this was also fixed. These fixes were committed at revision 258 of the SourceForge SVN repository.


SWO imports edam.owl rather than edam sub-modules

When the original merge between SWO and EDAM occurred, EDAM in OWL was stored in separate modules: edam_core, edam_data, edam_obsoletes, edam_operations, and edam_topics. These have now been superseded by a single file to import, edam.owl. As such, all module files have been deleted (via “svn delete”) from the subversion repository, although of course they can be recovered at any time by checking out a revision prior to 257.


Bug with ‘data’ prefix fixed

While examining the code today, I noticed that there were both ‘data’ and ‘data2’ prefixes in swo_core.owl and swo_data.owl. For some reason, both swo_core and swo_data had incorrect ontology prefixes ‘data’ and ‘data2’ set to “http://www.ebi.ac.uk/swo/data/http://www.ebi.ac.uk/swo/data/”, which is a weird duplication of the correct URL. All instances of this incorrect prefix were removed from both files. swo_data remains consistent according to the HermiT reasoner, run from within Protege 4. This fix was committed at revision 256 of the SourceForge SVN repository.
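
A check for this kind of doubled URL could be scripted rather than spotted by eye. The following is only a minimal sketch: it assumes the prefixes appear in the RDF/XML file either as xmlns attributes or as ENTITY declarations, and the function name and sample text are made up for illustration.

```python
import re

# Matches a prefix-style declaration: a name followed by a quoted IRI.
# Both shapes (xmlns:NAME="..." and <!ENTITY NAME "...">) are assumptions
# about how the prefixes are written in the OWL files.
DECLARATION = re.compile(r'(?:xmlns:|<!ENTITY\s+)(\w+)\s*=?\s*"([^"]*)"')


def find_doubled_prefixes(owl_text):
    """Return (prefix, iri) pairs whose IRI contains more than one URL scheme."""
    suspicious = []
    for name, iri in DECLARATION.findall(owl_text):
        if len(re.findall(r'https?://', iri)) > 1:
            suspicious.append((name, iri))
    return suspicious


# Illustrative sample only, not copied from swo_data.owl.
sample = '''
<!ENTITY data "http://www.ebi.ac.uk/swo/data/" >
<!ENTITY data2 "http://www.ebi.ac.uk/swo/data/http://www.ebi.ac.uk/swo/data/" >
'''
print(find_doubled_prefixes(sample))
```

Running something like this over each module before committing would flag any prefix whose value is two URLs glued together.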


SWO-EDAM Merge: Issues Arising from Data Merge

  1. In the longer term, Parameter and Report may be roles rather than classes within Data. For now, however, they remain in their original location within SWO.
  2. Additionally, in the near-to-medium future, the Core data class may become obsolete, and all children of Core data would be moved to be children of Data directly. This fits with what Robert thinks should be done.
  3. There is a serious question about SWO data classes such as AP-MS data. This data class describes a very specific subset of data, and Helen has concerns about whether classes like this should be present. Similarly, classes such as CSV data set may not be required, as this could equally well be described with an anonymous data class which has a format of CSV. Indeed, many pieces of data could be described in this way, requiring a very clear definition of when it is appropriate for data classes to be created. I had imagined that broad classes of data (which could have many formats) are what belongs in a data hierarchy. Examples of this would include microarray data and image data. The result of better defining what should be a data class should be added to the ontology in a comment label of the data class, and perhaps a blog post about it too (the comment could just reference the blog post).
  4. Although HTML report does seem to be a report rather than a member of data, I have left it where it is in the hierarchy until the EDAM hierarchy (and how Reports are to be modelled) has been decided.
  5. Meta data is modelled within data for SWO, and as a child of Report within EDAM. Therefore, for the same reasons that no reconciliation of HTML report is being performed, until the new modelling of the Report hierarchy has been decided, I see no reason to consider moving the SWO class.

SWO-EDAM Merge: merging data hierarchies

There are fewer classes within the SWO data hierarchy than within the SWO data format specification hierarchy, and therefore this merge step was simpler. A list of issues arising is available in a separate post. The following changes were made:

  • Made EDAM:Data a child of “Information content entity” (via subclass axiom stored within SWO_core.owl)
  • Made EDAM:Data equivalent to IAO:0000027(data) – (via equivalent class statement)
  • SWO classes obsoleted due to equivalence with EDAM:
    • Image (efo/swo/SWO_0000580) with Image (data_2968).
    • Heatmap (efo/swo/SWO_0000571) with Heat map (data_1636). As the EDAM class is a child of Microarray image, the subclass axiom relating the SWO class to data has been removed as it is now unnecessary.
    • Microarray data (efo/swo/SWO_0000624) with Microarray data (data_2603).
    • ontology (swo/ontology) with Ontology (data_0582).

SWO-EDAM Merge: General issues arising

This post will be updated as work on the SWO-EDAM merge progresses, and describes issues which need to be addressed but which don’t belong to any one particular hierarchy. Some of the points raised here are just my opinion, and may not be the right decision in the end.

  • Version information is referenced as a Report in EDAM, and differently in SWO. These should be resolved.
  • There are many duplicated taxonomy hierarchies within the Topic hierarchy. The same is true of lots of other mini-hierarchies, like the ones for UniProt IDs. This creates difficulties in maintenance (both in ensuring all hierarchies are updated equally, and in readability) and in front-facing use (lots of classes with identical labels and different meanings and purposes). For instance, the UniProt ID mini-hierarchy is present in at least 5 different places within the data hierarchy. Another example is the “completely unambiguous pure” mini-hierarchy, which is present at least 3 times in the format hierarchy.
  • There are a number of formats referenced within SWO via, e.g., X ‘has specified input’ some ‘formatY’. This is odd, as it seems (though it isn’t axiomatized anywhere) that software should only be related to format via has format specification and perhaps a data type. However, as there are no restrictions on how has specified input (or output) is used, we get multiple different axioms for linking software and data formats together. This needs to be more rigorously defined.
  • has format specification (from SWO) and has format (from EDAM) should be aligned and merged, once the operation/software issue has been resolved.
  • In my opinion, there should only be a single asserted hierarchy within EDAM, and in many places (e.g. Format) there is a deliberate placing of all child classes within a broader, biological context (e.g. Format (typed)) and a more basic format (e.g. Textual format). I believe that it should be relatively straightforward to make the classes within Format (typed) defined classes, and use the reasoner to place all formats into the appropriate biological format type.
  • Within EDAM, there are many instances where two different classes have the same label/name. One example is Ontology the data class (data_0582), and Ontology the topic (topic_0089). This is a Bad Thing, and should be fixed in all cases.
  • When the inferred hierarchy is calculated in Protege, the EDAM Data and EDAM Operation classes are inferred to be equivalent. This is definitely not true, and has the knock-on effect that the SWO data class is also inferred to be equivalent to Operation (because EDAM Data and SWO data are asserted to be equivalent). This needs to be fixed as early as possible.
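
The duplicate-label problem mentioned in the list above is easy to check mechanically once class IDs and labels have been extracted from the ontology. The sketch below assumes that extraction has already happened and works on plain (ID, label) pairs; the function name is hypothetical and the Image entry is only there as a non-clashing example.

```python
from collections import defaultdict


def find_shared_labels(labelled_classes):
    """Map each label (compared case-insensitively) to the class IDs using it,
    keeping only labels that are used by more than one class."""
    by_label = defaultdict(list)
    for class_id, label in labelled_classes:
        by_label[label.strip().lower()].append(class_id)
    return {label: ids for label, ids in by_label.items() if len(ids) > 1}


classes = [
    ("data_0582", "Ontology"),   # Ontology the data class
    ("topic_0089", "Ontology"),  # Ontology the topic
    ("data_2968", "Image"),
]
print(find_shared_labels(classes))
```

The comparison is case-insensitive on purpose, so that pairs such as “Heatmap” and “heatmap” would also be reported; whether that is desirable is a judgment call.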

SWO-EDAM Merge: Issues arising from Format Merge

Until the EDAM OWL files become the master versions of the ontology, it is not useful to obsolete EDAM classes and replace them with the appropriate SWO classes. Instead, a number of equivalence statements were used. Once the EDAM OWL files are the master versions of the ontology (rather than the OBO files), then the following changes will be made to get rid of these equivalence statements:

  1. Format (EDAM) will be obsoleted and replaced with Data format specification (SWO). Both labels will be retained, with “Format” as an alternative term.
  2. Textual format (EDAM) will be obsoleted and replaced with Text file format (SWO) – here, the EDAM annotation label may be the preferred label.
  3. XML (EDAM) will be obsoleted and replaced with XML (SWO).
  4. MAGE-TAB (EDAM) will drop the subclass statement for Textual format, and will instead only be a subclass of Tab delimited file format (and, for now, Format (typed)).

The following classes within SWO should either be further classified deeper in the data format specification hierarchy or, if we remain unsure what formats they actually are, should be obsoleted:

  1. .data format (I am unsure how to further classify it without knowing exactly which format is being described);
  2. .rma format (as it seems to be an algorithm for affy data rather than a format);
  3. Xba.CQV and Xba.regions (some kind of internal format to the RLMM Bioconductor package I think?);
  4. chamber slide format (used for splots Bioconductor package);
  5. covdesc (described in relation to Bioconductor on one or two websites, but no clear description of what it is: http://permalink.gmane.org/gmane.science.biology.informatics.conductor/34506);
  6. design file and pair file (both part of the HELP system – you try searching for that!);
  7. gmt format (there are no usages in SWO of this class, and you get the time zone as a result when searching);
  8. log file and pedigree file (i.e. patient information) seem just wrong as formats and perhaps should be done away with, except perhaps as roles;
  9. logicFS dataset belongs in data, if it belongs anywhere (currently in data format specification) – even the axiom referencing it doesn’t seem right;
  10. sproc (which could mean a stored procedure, or perhaps a custom file within Bioconductor, but it’s hard to tell, and even harder to tell the format if it is a Bioconductor package input);
  11. sqlite is a program / piece of software, not a format, but it could be that there is an sqlite data dump that is a specific format?

Additionally, within SWO I think it might be better to make the outlines (specifically, the class Outline document format) a role rather than a format, as its child classes (e.g. OPML) should instead be stored within their format type hierarchy. In the case of OPML, this would be as a child of XML. This may or may not also be appropriate for Document exchange format and its hierarchy.

Within EDAM, BioPAX is subclassed as a child of OWL within the Format hierarchy. This is not particularly useful, as OWL can have multiple serializations in different formats. I believe that the EDAM BioPAX should be obsoleted in favor of one of the various BioPAX formats described in SWO (either the Manchester OWL or the RDF/XML versions, which have IDs of http://www.ebi.ac.uk/swo/data/SWO_3000056 and http://www.ebi.ac.uk/swo/data/SWO_3000055).

Also, a broader decision needs to be made on the placement of the HTML class within EDAM; within SWO it is a child of Web page specification, which itself is a child of data format specification. Within EDAM, it is a direct child of Format. EDAM may need to add the Web page specification class so that, ultimately, the final merging (rather than the current temporary method of just marking the two classes as equivalent) of the format hierarchies won’t end up with HTML in two different places within the final data format hierarchy.


SWO-EDAM Merge: Merging format hierarchies

Import of EDAM:Format into SWO:Data format specification

The following steps have been performed and committed within the development subdirectory of the SWO subversion repository. This merging is not complete (due to the EDAM OWL file not being the master version of the file yet), and there were a number of issues resulting from these steps. Both of these topics are itemized in an additional blog post.

  • Make EDAM:Format a child of “Information content entity” (via subclass statement)
  • Make EDAM:Format equivalent to IAO:0000098 (data format specification) – (via equivalent class statement)
  • SWO data format specification classes moved to other locations within SWO:
    • Renamed Tab delimited file to Tab delimited format
    • Split CDF into its two possible formats: CDF ASCII format and CDF binary format
    • Classes moved to be children of XML: MAGE-ML files (also renamed to MAGE-ML); KGML files (also renamed to KGML); ARR; gxl format.
    • Classes moved to be children of Text file format: SBMLR format (renamed from SBMLR file), which does not have a definition but seems to be the R files containing SBML data as specified by the SBMLR package for R; SDF format; Rnw (a type of Sweave document); R data frame (a bit like a table in R); GEO Matrix Series File (also replaced “file” with “format”); FCS; FASTA; CDF ASCII format; BED file (also renamed to BED format); cls; dcf; gff format; mas5 format; rda.
    • Classes moved to be children of Tab delimited format: MAGE tab format; cdt; gct; gpr format; gtr.
    • Classes moved to be children of Programming language format: Matlab m file (also renamed to Matlab .m file).
    • Moved under EDAM:format_2333 (Binary format): CEL; CHP file; CDF Binary format; BPMAP; lma.
  • Refactoring to remove EDAM:SWO duplicates – by obsoletion of EDAM or SWO concept as appropriate:
    • RDF: Present in both EDAM (format_2376) and SWO (SWO_3000006). The SWO term is retained, as EDAM is intended as a more bioinformatics-specific ontology, and the RDF hierarchy within SWO is more complex. All annotations have been moved across to the SWO RDF class and the EDAM RDF class has been marked as obsolete.
    • Document exchange format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • Image format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • Outline document format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is. However, it is recommended that outlining become a role, as classes like OPML should instead be stored within the XML hierarchy.
    • Programming language format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • Spreadsheet and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • Text file format (SWO_3000041) and Textual format (format_2330) are equivalent, and are marked with equivalence statements. Each has a hierarchy of children which must be taken into account. Obsolete cases are itemized below, while all others remain in their original hierarchy for now. Please note that the obsoleting of terms occurs by refactoring the URI of the concept to be obsoleted to the URI of the class it is being merged with. Until the two classes Text file format and Textual format are properly merged/obsoleted, the (slightly messy) dual hierarchy will remain. However, the inferred hierarchy will show the complete set of children (from both ontologies) under both classes.
      • BED format (efo/swo/SWO_0000051) and BED (format_3003) are equivalent, and the SWO class has been obsoleted and all its axioms assigned to the EDAM class. This is because the EDAM class has more annotations.
      • FASTA (efo/swo/SWO_0000142) and FASTA format (format_1929). The SWO class has been obsoleted and all its axioms assigned to the EDAM class. This is because the EDAM class has more annotations and is present within a more complete FASTA hierarchy.
      • OBO Flat File Format (swo/data/SWO_3000040) and OBO (format_2549). The EDAM class will remain and the SWO class will be obsoleted; the SWO label (which is highly descriptive) will be retained.
      • gff format (efo/swo/SWO_0000559) and GFF (format_2305). The SWO class has been obsoleted and all its axioms assigned to the EDAM class. This is because the EDAM class has more annotations and is present within a more complete GFF hierarchy.
      • newick (efo/swo/SWO_0000634) and newick (format_1910). The SWO class has been obsoleted and all its axioms assigned to the EDAM class. This is because the EDAM class has more annotations.
      • MAGE tab format (swo/data/SWO_3000045) and MAGE-TAB (format_3162). The SWO class has been obsoleted and all its axioms assigned to the EDAM class. This creates an extra parent class for MAGE-TAB (Tab delimited file format, which is the parent class for the SWO MAGE tab format). Once the EDAM OWL files become the master version of the files, we can fix this by removing the Textual Format subclass statement.
    • Web page specification in SWO has as its only child the EDAM HTML hierarchy. As such, no changes need to be made within SWO.
    • Word processing document format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • XML (swo/data/SWO_3000005) and XML (format_2332) are equivalent, and are marked with equivalence statements. Each has a hierarchy of children which must be taken into account. Obsolete cases are itemized below, while all others remain in their original hierarchy for now. Please note that the obsoleting of terms occurs by refactoring the class URI to the URI of the class it is being merged with. Until the two XML classes are properly merged/obsoleted, the (slightly messy) dual hierarchy will remain. However, the inferred hierarchy will show the complete set of children (from both ontologies) under both classes.
      • MAGE-ML (efo/swo/SWO_0000268) and MAGE-ML (format_3161) are equivalent, and the SWO class has been obsoleted and all its axioms assigned to the EDAM class.
      • PSI-MI format (swo/data/SWO_3000048) and PSI MI XML (MIF) (format_3158) are equivalent, and the SWO class has been obsoleted and all its axioms assigned to the EDAM class. Retained SWO label as an alternative term.
      • SBML (swo/data/SWO_3000037) and SBML (format_2585) are equivalent, and the SWO class has been obsoleted and all its axioms assigned to the EDAM class.
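
To make the “obsoletion by refactoring the URI” idea used in the list above more concrete, here is a minimal sketch. It is not how Protege does it internally: axioms are modelled as plain tuples of strings, the function name is hypothetical, and the shortened identifiers stand in for full URIs.

```python
def refactor_iri(axioms, old_iri, new_iri):
    """Return a copy of the axioms with every use of old_iri replaced by new_iri.

    Duplicate axioms that arise from the merge are dropped, keeping first-seen order.
    """
    rewritten = [
        tuple(new_iri if term == old_iri else term for term in axiom)
        for axiom in axioms
    ]
    return list(dict.fromkeys(rewritten))


# BED format (SWO) is merged into BED (EDAM); identifiers are abbreviated.
SWO_BED = "efo/swo/SWO_0000051"
EDAM_BED = "format_3003"
TEXT_FILE_FORMAT = "SWO_3000041"

axioms = [
    ("SubClassOf", SWO_BED, TEXT_FILE_FORMAT),
    ("SubClassOf", EDAM_BED, TEXT_FILE_FORMAT),
]
print(refactor_iri(axioms, SWO_BED, EDAM_BED))
```

After the rewrite, every axiom that mentioned the SWO class now hangs off the EDAM class, which is why the merged class can end up with parents from both ontologies (as happened with MAGE-TAB).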

SWO-EDAM Merge: Modifying EDAM in OWL

While on the whole the conversion process between OBO and OWL (in either direction) is relatively good these days using Protege 4.1 (which makes use of the OWLAPI underneath), there were some manual changes required. This post describes all changes made to the automatically-generated EDAM OWL file prior to importing into SWO.

  1. Change the (initially anonymous) ontology URI to http://edamontology.org within the EDAM OWL file.
  2. Replace the erroneous URI created by the Protege conversion for DeprecatedClass with the correct OWL URI of http://www.w3.org/2002/07/owl#DeprecatedClass
  3. The conversion process also creates an incorrect “owl2” namespace for all attributes of properties (e.g. reflexive, transitive, see also the point directly above) as follows:
    <!ENTITY owl2 "http://purl.obolibrary.org/obo/http_//purl.org/obo/owl:" >
    This was fixed manually by removing the incorrect namespace and replacing all instances of “owl2” with “owl”. However, when the file is saved again in Protege, even though the owl2 namespace isn’t used at all any more, it is still re-created. It was likely due to the following lines in the OWL file, which have now had the incorrect string “http_//purl.org/obo/owl:” removed:
    <owl:AnnotationProperty rdf:about="&obo;http_//purl.org/obo/owl:is_symmetric"/>
    <owl:AnnotationProperty rdf:about="&obo;http_//purl.org/obo/owl:is_reflexive"/>
  4. Replaced incorrect URI for definition_editor with the correct one: http://www.ebi.ac.uk/efo/definition_editor.
  5. All of the inverse, range and domain statements are stored as annotations by default when the conversion process happens, rather than being stored as OWL descriptions. For cases such as this:
    <obo:inverse_of>EDAM:is_format_of</obo:inverse_of>
    additional statements such as the following were added:
    <owl:inverseOf rdf:resource="&obo;EDAM_is_format_of"/>
    Nothing was deleted so that conversion back to OBO would be OK.
  6. The URI of all EDAM classes needed to be fixed after conversion from OBO, as the default URI for all classes is the purl.obolibrary.org-based URI. The global search and replaces performed are:
    • Remove “obo” namespace, as these are all incorrect, and should instead be using the edam default namespace. In practice, this means deleting all instances of the strings “obo:” and “&obo;”, as well as the declaration of the associated namespace at the top of the file.
    • Remove the string “EDAM_” from anywhere in the OWL file, as this is now unnecessary, and if it remains, then the URIs for the class names are incorrect.
  7. Separated out all modules (data (containing both data and format hierarchies), operation, topic), into their own files. These files are named edam_data.owl, edam_operations.owl, edam_obsoletes.owl and edam_topics.owl. While modularizing edam_obsoletes.owl, one leftover axiom was identified on one of the obsoleted classes. At Jon Ison’s request, I have deleted this axiom: Genotype and phenotype annotation format (an obsolete class) SubClassOf ‘is format of’ some Genotype/Phenotype annotation (not obsolete).
  8. An additional edam_core.owl file was made to easily view EDAM on its own and to store all cross-file / cross-module axioms. By storing cross-file axioms in the edam_core.owl, the individual module files are much cleaner. The edam_core.owl file is then imported into SWO. If in future we wish to move all cross-file axioms out of edam_core.owl, then this can be done, and the three module files could be imported directly.
  9. Import edam_core.owl into SWO. This simplistically-merged ontology reasons with no inconsistencies (although there are some interesting features of EDAM in OWL that Jon will want to take a look at, particularly the inference that Data and Operation are equivalent).
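
The global search and replace from step 6 can be sketched as a small script. This is an illustration, not the procedure actually used: the function name is hypothetical, and the shapes of the “obo” declarations (an ENTITY line and an xmlns attribute) are assumptions about the generated file.

```python
import re


def strip_obo_namespace(owl_text):
    """Remove the "obo" namespace and the "EDAM_" fragment from RDF/XML text."""
    # Drop the declarations of the "obo" namespace (entity and xmlns forms).
    owl_text = re.sub(r'[ \t]*<!ENTITY\s+obo\s+"[^"]*"\s*>\n?', '', owl_text)
    owl_text = re.sub(r'\s+xmlns:obo="[^"]*"', '', owl_text)
    # Delete every remaining use of the prefix, then the "EDAM_" fragment.
    for fragment in ('&obo;', 'obo:', 'EDAM_'):
        owl_text = owl_text.replace(fragment, '')
    return owl_text


sample = (
    '<!ENTITY obo "http://purl.obolibrary.org/obo/" >\n'
    '<owl:inverseOf rdf:resource="&obo;EDAM_is_format_of"/>\n'
)
print(strip_obo_namespace(sample))
```

Like the manual edit, this is a blunt textual replacement: it would also alter any annotation text that happens to contain “obo:” or “EDAM_”, so the result is worth diffing before it is committed.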

SWO-EDAM Merge: Converting EDAM from OBO to OWL

The majority of the work described in this post was performed by Jon Ison.

One-time changes made in the EDAM OBO file

  • Various subsets of terms within EDAM were defined using the http://edamontology.org#subsetannotation type. The following subsets are defined (with the last four in the list already defined in EDAM):
    • edam
    • bioinformatics
    • data
    • operations
    • formats
    • topics
  • The http://www.ebi.ac.uk/efo/definition_editor annotation property was added to all EDAM classes/properties that have a definition. This provides authorship in a manner in line with the method already employed within SWO.
  • Created the same obsolete parent class within EDAM as is used within SWO (http://www.w3.org/2002/07/owl#DeprecatedClass).
  • The OWLAPI library (and therefore Protege 4.1) does not currently convert the OBO trailing modifier to OWL (see Protein interaction network rendering). The solution is to convert the trailing modifiers to new custom annotation fields. While {since} and {note} can be converted to custom annotation fields, {min_cardinality} and {max_cardinality} can’t, but these should just be deprecated or deleted as appropriate, as they have been of limited value. In some instances, {since} can be directly replaced with the EFO annotation label http://www.ebi.ac.uk/efo/obsoleted_in_version. If possible, something like created_in_version (if it exists) should be used for the other instances of {since}. {note} cannot be replaced by the standard rdfs comment annotation label, as this label is supposed to be reserved for comments about the term or definition itself.
  • The “EDAM” idspace has been explicitly added to all properties (and usages of those properties) within the OBO file prior to conversion using a global search and replace. This is because, although the following two lines exist to describe the default namespace for properties within EDAM, it seems the conversion to OWL doesn’t pick this up and creates an incorrect OWL-based namespace for them all.
    idspace: EDAM http://edamontology.org/ "EDAM Relations and Concept Attributes"
    default-relationship-id-prefix: EDAM
  • Convert the EDAM OBO file into a single OWL file using Protege 4.1
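
The idspace fix described in the list above was done with a global search and replace; a scripted version might look roughly like the sketch below. It is only an illustration under several assumptions: properties are declared in [Typedef] stanzas, they are used in the tags listed in PROPERTY_TAGS, and the term IDs and the inverse_of line in the sample are invented.

```python
import re

# Tags whose values may mention a property ID (an assumption, not an exhaustive list).
PROPERTY_TAGS = {"id", "relationship", "inverse_of", "is_a", "transitive_over"}


def add_idspace(obo_text, idspace="EDAM"):
    """Prefix every unprefixed [Typedef] ID, and its usages, with the idspace."""
    lines = obo_text.splitlines(keepends=True)
    # First pass: collect IDs declared inside [Typedef] stanzas.
    property_ids, in_typedef = set(), False
    for line in lines:
        if line.startswith("["):
            in_typedef = line.strip() == "[Typedef]"
        elif in_typedef and line.startswith("id: "):
            property_ids.add(line[4:].strip())
    unprefixed = sorted(pid for pid in property_ids if ":" not in pid)
    if not unprefixed:
        return obo_text
    # Match a property ID only as a whole token that has no prefix yet.
    pattern = re.compile(
        r"(?<![\w:])(" + "|".join(map(re.escape, unprefixed)) + r")(?![\w:])"
    )
    # Second pass: rewrite the value part of the relevant tag lines.
    result = []
    for line in lines:
        tag, sep, value = line.partition(": ")
        if sep and tag in PROPERTY_TAGS:
            line = tag + sep + pattern.sub(idspace + r":\1", value)
        result.append(line)
    return "".join(result)


sample = (
    "[Typedef]\n"
    "id: is_format_of\n"
    "name: is format of\n"
    "inverse_of: has_format\n"
    "\n"
    "[Typedef]\n"
    "id: has_format\n"
    "\n"
    "[Term]\n"
    "id: EDAM:0001915\n"
    "relationship: is_format_of EDAM:0000006\n"
)
print(add_idspace(sample))
```

Restricting the rewrite to a fixed set of tags keeps free-text fields such as name and def untouched, and running the function a second time leaves the file unchanged because the IDs are already prefixed.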