Making SWO Satisfiable

Over the past few days I have been working to fix the approximately 170 unsatisfiable classes that were present in SWO. The vast majority of these were due to improper linking of software classes directly to a data format specification, rather than to a data class. One change is detailed below, and the rest of the classes are just named, although the process to fix them was similar.

Unsatisfiable classes: Data is disjoint with “data format specification”, but an axiom on the class uses “has specified data input” (which has a range of data) with a class from “data format specification”.

Affxparser: ‘has specified data input’ some (BPMAP or ‘CDF binary format’ or CEL)
In order to fix this, I created a defined class called Affymetrix-Compliant Data which included any data classes whose format specification is published by Affymetrix, and set all of (CDF binary format, CDF ASCII format, CEL binary format, CEL ASCII format, CHP binary format, BPMAP, BAR) as published by Affymetrix. Resulting modifications to swo_data are:

Class: 'Affymetrix-compliant data'
    Annotations:
        label "Affymetrix-compliant data"@en,
        IAO_0000115 "Affymetrix-compliant data is data produced in a format compatible with Affymetrix software. This is a defined class where other data classes will be inferred to be members if they have a data format specification which has been published by Affymetrix.",
        creator "Allyson Lister"
    EquivalentTo:
        SWO_0004002 some
            ('data format specification'
             and (SWO_0004004 value SWO_0000023))
    SubClassOf:
        data

And I added the axiom “SWO_0004004 value SWO_0000023” (is published by value Affymetrix) to the following classes: CDF binary format, CDF ASCII format, CEL binary format, CEL ASCII format, CHP binary format, BPMAP, BAR.
Then, back in swo_core, the final linkup to affxparser made modifications to that class as follows:

'has specified data input' only 'Affymetrix-compliant data'
'has specified data input' some 'Affymetrix-compliant data'
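
To make the cause of the unsatisfiability concrete, here is a minimal Python sketch of the single pattern described above: an existential restriction whose fillers all fall under a class that is disjoint with the property's range. It is a toy check rather than a reasoner. The dictionary layout and the function name are hypothetical; the class placements, the range and the disjointness are taken from this post.

```python
# Toy model of one unsatisfiability pattern; a real reasoner (e.g. HermiT) does far more.
# Disjointness and range as stated in the post.
DISJOINT = {frozenset({"data", "data format specification"})}
RANGE = {"has specified data input": "data"}

# Hypothetical lookup: the top-level class each filler sits under.
TOP_LEVEL = {
    "BPMAP": "data format specification",
    "CDF binary format": "data format specification",
    "CEL": "data format specification",
    "Affymetrix-compliant data": "data",
}


def restriction_is_unsatisfiable(prop, fillers):
    """True if `prop some (f1 or f2 or ...)` cannot have any instances.

    That is the case when every filler is disjoint with the range of prop.
    """
    prop_range = RANGE[prop]
    return all(
        frozenset({TOP_LEVEL[filler], prop_range}) in DISJOINT
        for filler in fillers
    )


# The original axiom points at format specifications, so it clashes with the range.
before = restriction_is_unsatisfiable(
    "has specified data input", ["BPMAP", "CDF binary format", "CEL"]
)
# The replacement axiom points at a data class, so the clash disappears.
after = restriction_is_unsatisfiable(
    "has specified data input", ["Affymetrix-compliant data"]
)
```

In this sketch `before` evaluates to True and `after` to False, mirroring the effect of replacing the format classes with the new defined data class.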

Domains and ranges were added to is published by as follows:

Domain:
    'data format specification'
    or software
Range:
    organization

If it turns out that anything else needs to be added to the domain of is published by, then we can either 1) remove the constraint on the domain, 2) add that class to the domain constraint, or 3) specialise to sub-properties with single-class domains and leave the top-level is published by property without a domain.

Similar changes as those above were made to the following classes: affy, affyContam, affyio, affylmGUI, affyPara, affypdnn, affyPLM, affyQCReport, affyTiling, altcdfenvs, annaffy, annotationTools, aroma.light, arrayMvout, arrayQualityMetrics, BAC, betr, bgx, bgafun, bioDist, biomaRT, Category, cghMCR, convert, copa, cosmo, cosmoGUI, crimm, ctc, daMA, dyebias, ecolitk, edd, exonmap, factDesign, fbat, fdrame, flagme, flowQ, flowClust, flowCore, flowFlowJo, flowStats, flowUtils, flowViz, gaga, gcrma, gene2pathway, genefilter, geneRecommender, GeneticsBase, GeneticsPed, genomeIntervals, globaltest, goTools, gpls, graph, hexbin, hypergraph, idiogram, iterativeBMAsurv, keggorth, lapmix, limma, limmaGUI, logicFS, logitT, lumi, maCorrPlot, maDB, makecdfenv, makePlatformDesign, marray, matchprobes, metaArray, metahdep, microRNA, miRNApath, multtest, nnNorm, nudge, occugene, oligo, oneChannelGUI, ontoTools, pamr, panp, parody, pcaMethods, pcot2, pdInfoBuilder, pdmclass, pgUtils, pickgene, pkgDepTools, plgem, plw, ppiStats, prada, preprocessCore, puma, qpgraph, rama, RankProd, rbsurv, Rdbi, RdbiPgSQL, Rdisop, rflowcyt, Rgraphviz, Ringo, Rintact, RLMM, RMAExpress, RMAExpress 2.0, RMAExpress quantification, RMAGEML, rMAT, ROC, RpsiXML, Rredland, rsbml, rtracklayer, Rtreemix, Ruuid, RwebServices, safe, sagenhaft, SAGx, SBMLR, ScISI, seqLogo, ShortRead, simpleaffy, sizepower, SLGI, SLqPCR, SMAP, snapCGH, SNPchip, snpMatrix, SPIA, splicegear, splots, spotSegmentation, sscore, ssize, SSPA, TargetSearch, tilingArray, timecourse, topGO, tspair, twilight, TypeInfo, VanillaICE, vbmp, weaver, webbioc, xcms, xmapbridge, xps, XDE.

Inferred to be a member of a disjoint class
Affymetrix Software

 ('is specified data output of' some
 ('Software publishing process'
 and ('has participant' value Affymetrix)))
 or ('is specified data output of' some
 ('Software development process'
 and ('has participant' value Affymetrix)))

This class is asserted as a member of software, but inferred to be a member of Data because of the use of is specified data output of. This was fixed by changing the entire axiom to:

software and 'is published by' value Affymetrix

Bioconductor Software

 ('is specified data output of' some
 ('Software publishing process'
 and ('has participant' value Bioconductor)))
 or ('is specified data output of' some
 ('Software development process'
 and ('has participant' value Bioconductor)))

This class is asserted as a member of software, but inferred to be a member of Data because of the use of is specified data output of. This was fixed by changing the entire axiom to:

software and 'is published by' value Bioconductor
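
The domain clash behind both of these fixes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: that software and data are disjoint (as the heading above implies), and that the domain of is specified data output of is data (which is what the inference described above suggests). The function name and data layout are hypothetical.

```python
# Assumption: software and data are disjoint, as implied by the heading above.
DISJOINT = {frozenset({"software", "data"})}

# Domains as read from this post; the first entry is an assumption
# based on the inference the reasoner made.
DOMAIN = {
    "is specified data output of": {"data"},
    "is published by": {"data format specification", "software"},
}


def clashing_properties(asserted_class, properties_used):
    """Return the properties whose every domain class is disjoint with asserted_class."""
    clashes = []
    for prop in properties_used:
        if all(frozenset({asserted_class, d}) in DISJOINT for d in DOMAIN[prop]):
            clashes.append(prop)
    return clashes


# Old definition: uses a property whose domain is data.
old_clashes = clashing_properties("software", ["is specified data output of"])
# New definition: uses a property whose domain includes software.
new_clashes = clashing_properties("software", ["is published by"])
```

Under these assumptions the old definition reports one clashing property and the new definition reports none.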

Reconciling Annotation Labels

I have also replaced all references to definition_citation (about 214 cases) and definition_editor (270 cases) in the various OWL files. These annotation labels were replaced with IAO 19 (definition source) and dc:creator (definition editor), respectively. This matches both the rest of the ontology and the design decision that was made a year ago. Many of the single quotes around existing single-word class labels were also removed, as they were unnecessary.

Obsoleting is_published_by

All references to is_published_by (SWO_0000395) were replaced with is published by (SWO_0004004), and SWO_0000395 was obsoleted.
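
A replacement like this can be scripted. The following Python sketch is not the procedure actually used for SWO (the post does not say how the replacement was done); it shows one cautious way to swap an identifier in the text of an OWL file while leaving longer identifiers that merely contain it untouched. The function name is hypothetical.

```python
import re


def replace_identifier(text, old_id, new_id):
    """Replace whole-token occurrences of old_id with new_id.

    Returns the new text and the number of replacements made. The
    lookarounds stop 'SWO_0000395' from matching inside a longer
    identifier such as 'SWO_00003950'.
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(old_id) + r"(?![A-Za-z0-9_])"
    )
    return pattern.subn(new_id, text)


sample = "SWO_0000395 value SWO_0000023 ; unrelated SWO_00003950"
updated, count = replace_identifier(sample, "SWO_0000395", "SWO_0004004")
```

Note that a blanket replacement would also rewrite the declaration of the obsoleted property itself, so that declaration would need to be excluded or restored by hand.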


Bug with ‘swo2’ and ‘swo’ prefix fixed

While examining the code today, I noticed that swo_core had incorrect ontology prefixes ‘swo’ and ‘swo2’ set to ‘http://www.ebi.ac.uk/swo/http://www.ebi.ac.uk/efo/swo/’. This error was only present in the obsolete term SWO_0000399 (I must have introduced a bug when I made that term obsolete). All instances of this incorrect prefix were removed. Additionally, SWO_0000399 was incorrectly written as SWO__0000399 (with two underscores), and this was also fixed. These fixes were committed at revision 258 of the SourceForge SVN repository.
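
Both kinds of error are easy to scan for. The Python sketch below is an assumption about how such a check could look, not the method used here (the errors were spotted by reading the file). It flags URLs that contain a second "http://" and normalises identifiers written with a double underscore; the function names are hypothetical.

```python
import re

# A URL that contains another URL scheme before it ends is almost certainly
# the result of two prefixes being concatenated.
DOUBLED_URL = re.compile(r"https?://[^\s\"'<>]*?https?://[^\s\"'<>]*")

# SWO identifiers accidentally written with two underscores.
DOUBLE_UNDERSCORE_ID = re.compile(r"\bSWO__(\d+)\b")


def find_doubled_urls(text):
    """Return every URL-like string that embeds a second 'http(s)://'."""
    return DOUBLED_URL.findall(text)


def fix_double_underscore(text):
    """Rewrite SWO__NNNNNNN as SWO_NNNNNNN."""
    return DOUBLE_UNDERSCORE_ID.sub(r"SWO_\1", text)


line = 'xmlns:swo2="http://www.ebi.ac.uk/swo/http://www.ebi.ac.uk/efo/swo/"'
suspects = find_doubled_urls(line)
fixed = fix_double_underscore("rdf:about=\"&swo;SWO__0000399\"")
```

The same URL scan would also catch the doubled ‘data’ prefix described in the post on the ‘data’ and ‘data2’ prefixes.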


SWO imports edam.owl rather than edam sub-modules

When the original merge between SWO and EDAM occurred, EDAM in OWL was stored in separate modules: edam_core, edam_data, edam_obsoletes, edam_operations, and edam_topics. These have now been superseded by a single file to import, edam.owl. As such, all module files have been deleted (via “svn delete”) from the subversion repository, although of course they can be recovered at any time by checking out a revision prior to 257.


Bug with ‘data’ prefix fixed

While examining the code today, I noticed that there were both ‘data’ and ‘data2’ prefixes in swo_core.owl and swo_data.owl. For some reason, both swo_core and swo_data had incorrect ontology prefixes ‘data’ and ‘data2’ set to “http://www.ebi.ac.uk/swo/data/http://www.ebi.ac.uk/swo/data/”, which is a duplication of the correct URL. All instances of this incorrect prefix were removed from both files. swo_data remains consistent according to the HermiT reasoner run from within Protege 4. This fix was committed at revision 256 of the SourceForge SVN repository.


SWO-EDAM Merge: Issues Arising from Data Merge

  1. In the longer term, Parameter and Report may be roles rather than classes within Data. For now, however, they remain in their original location within SWO.
  2. Additionally, in the near-to-medium future, the Core data class may become obsolete, and all children of Core data would be moved to be children of Data directly. This fits with what Robert thinks should be done.
  3. There is a serious question about SWO data classes such as AP-MS data. This data class describes a very specific subset of data, and Helen has concerns about whether classes like this should be present. Similarly, classes such as CSV data set may not be required, as this could equally well be described with an anonymous data class which has a format of CSV. Indeed, many pieces of data could be described in this way, requiring a very clear definition of when it is appropriate for data classes to be created. I had imagined that broad classes of data (which could have many formats) are what belongs in a data hierarchy. Examples of this would include microarray data and image data. The result of better defining what should be a data class should be added to the ontology in a comment label of the data class, and perhaps a blog post about it too (the comment could just reference the blog post).
  4. Although HTML report does seem to be a report rather than a member of data, I have left it where it is in the hierarchy until the EDAM hierarchy (and how Reports are to be modelled) has been decided.
  5. Meta data is modelled within data for SWO, and as a child of Report within EDAM. Therefore, for the same reasons that no reconciliation of HTML report is being performed, until the new modelling of the Report hierarchy has been decided, I see no reason to consider moving the SWO class.

SWO-EDAM Merge: merging data hierarchies

There are fewer classes within the SWO data hierarchy than within the SWO data format specification hierarchy, and therefore this merge step was simpler. A list of issues arising is available in a separate post. The following changes were made:

  • Made EDAM:Data a child of “Information content entity” (via subclass axiom stored within SWO_core.owl)
  • Made EDAM:Data equivalent to IAO:0000027(data) – (via equivalent class statement)
  • SWO classes obsoleted due to equivalence with EDAM:
    • Image (efo/swo/SWO_0000580) with Image (data_2968).
    • Heatmap (efo/swo/SWO_0000571) with Heat map (data_1636). As the EDAM class is a child of Microarray image, the subclass axiom relating the SWO class to data has been removed as it is now unnecessary.
    • Microarray data (efo/swo/SWO_0000624) with Microarray data (data_2603).
    • ontology (swo/ontology) with Ontology (data_0582).

SWO-EDAM Merge: General issues arising

This post will be updated as work on the SWO-EDAM merge progresses, and describes issues which need to be addressed but which don’t belong to any one particular hierarchy. Some of the points raised here are just my opinion, and may not be the right decision in the end.

  • Version information is referenced as a Report in EDAM, and differently in SWO. These should be resolved.
  • There are many duplicated taxonomy hierarchies within the Topic hierarchy. The same is true of lots of other mini-hierarchies, like the ones for UniProt IDs. This creates difficulties in maintenance (both in ensuring all hierarchies are updated equally, and in readability) and in front-facing use (lots of classes with identical labels and different meanings and purposes). For instance, the UniProt ID mini-hierarchy is present in at least 5 different places within the data hierarchy. Another example is the “completely unambiguous pure” mini-hierarchy, which is present at least 3 times in the format hierarchy.
  • There are a number of formats referenced within SWO via, e.g., X ‘has specified input’ some ‘formatY’. This is odd, as it seems (though it isn’t axiomised anywhere) that software should only be related to format via has format specification and perhaps a data type. However, as there are no restrictions on how has specified input (or output) is used, we get multiple different axioms for linking software and data formats together. This needs to be more rigorously defined.
  • has format specification (from SWO) and has format (from EDAM) should be aligned and merged, once the operation/software issue has been resolved.
  • In my opinion, there should only be a single asserted hierarchy within EDAM, and in many places (e.g. Format) there is a deliberate placing of all child classes within a broader, biological context (e.g. Format (typed)) and a more basic format (e.g. Textual format). I believe that it should be relatively straightforward to make the classes within Format (typed) defined classes, and use the reasoner to place all formats into the appropriate biological format type.
  • Within EDAM, there are many instances where two different classes have the same label/name. One example is Ontology the data class (data_0582), and Ontology the topic (topic_0089). This is a Bad Thing, and should be fixed in all cases.
  • When the inferred hierarchy is calculated in Protege, the EDAM Data and EDAM Operation classes are inferred to be equivalent. This is definitely not true, and has the knock-on effect that the SWO data class is also inferred to be equivalent to Operation (because EDAM Data and SWO data are asserted to be equivalent). This needs to be fixed as early as possible.
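
The shared-label problem raised in the list above lends itself to an automated check. This is a minimal Python sketch, assuming the class identifiers and labels have already been extracted from the OWL file into (id, label) pairs; the extraction step and the function name are hypothetical.

```python
from collections import defaultdict


def find_shared_labels(classes):
    """Group class IDs by label (ignoring case and surrounding whitespace)
    and return only the labels used by more than one class."""
    by_label = defaultdict(list)
    for class_id, label in classes:
        by_label[label.strip().casefold()].append(class_id)
    return {label: ids for label, ids in by_label.items() if len(ids) > 1}


# Identifiers and labels taken from the examples mentioned in these posts.
classes = [
    ("data_0582", "Ontology"),
    ("topic_0089", "Ontology"),
    ("data_2968", "Image"),
]
shared = find_shared_labels(classes)
```

Here `shared` contains a single entry mapping "ontology" to the two class IDs that use that label.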

SWO-EDAM Merge: Issues arising from Format Merge

Until the EDAM OWL files become the master versions of the ontology, it is not useful to obsolete EDAM classes and replace them with the appropriate SWO classes. Instead, a number of equivalence statements were used. Once the EDAM OWL files are the master versions of the ontology (rather than the OBO files), then the following changes will be made to get rid of these equivalence statements:

  1. Format (EDAM) will be obsoleted and replaced with Data format specification (SWO). Both labels will be retained, with “Format” as an alternative term.
  2. Textual format (EDAM) will be obsoleted and replaced with Text file format (SWO) – here, the EDAM annotation label may be the preferred label.
  3. XML (EDAM) will be obsoleted and replaced with XML (SWO).
  4. MAGE-TAB (EDAM) will drop the subclass statement for Textual format, and will instead only be a subclass of Tab delimited file format (and, for now, Format (typed)).

The following classes within SWO should either be further classified deeper in the data format specification hierarchy or, if we remain unsure what formats they actually are, should be obsoleted:

  1. .data format (I am unsure how to further classify it without knowing exactly which format is being described);
  2. .rma format (as it seems to be an algorithm for affy data rather than a format);
  3. Xba.CQV and Xba.regions (some kind of internal format to the RLMM Bioconductor package I think?);
  4. chamber slide format (used for splots Bioconductor package);
  5. covdesc (described in relation to Bioconductor on one or two websites, but no clear description of what it is: http://permalink.gmane.org/gmane.science.biology.informatics.conductor/34506);
  6. design file and pair file (both part of the HELP system – you try searching for that!);
  7. gmt format (there are no usages in SWO of this class, and you get the time zone as a result when searching);
  8. log file and pedigree file (i.e. patient information) seem just wrong as formats and perhaps should be done away with, except perhaps as roles;
  9. logicFS dataset belongs in data, if it belongs anywhere (currently in data format specification) – even the axiom referencing it doesn’t seem right;
  10. sproc (which could mean a stored procedure, or perhaps a custom file within Bioconductor, but it’s hard to tell, and even harder to tell the format if it is a Bioconductor package input);
  11. sqlite is a program / piece of software, not a format, but it could be that there is an sqlite data dump that is a specific format?

Additionally, within SWO I think it might be better to make the outlines (specifically, the class Outline document format) a role rather than a format, as its child classes (e.g. OPML) should instead be stored within their format type hierarchy. In the case of OPML, this would be as a child of XML. This may or may not also be appropriate for Document exchange format and its hierarchy.

Within EDAM, BioPAX is subclassed as a child of OWL within the Format hierarchy. This is not particularly useful, as OWL can have multiple serializations in different formats. I believe that the EDAM BioPAX should be obsoleted in favor of one of the various BioPAX formats described in SWO (either the Manchester OWL or the RDF/XML versions, which have IDs of http://www.ebi.ac.uk/swo/data/SWO_3000056 and http://www.ebi.ac.uk/swo/data/SWO_3000055).

Also, a broader decision needs to be made on the placement of the HTML class within EDAM; within SWO it is a child of Web page specification, which itself is a child of data format specification. Within EDAM, it is a direct child of Format. EDAM may need to add the Web page specification class so that, ultimately, the final merging (rather than the current temporary method of just marking the two classes as equivalent) of the format hierarchies won’t end up with HTML in two different places within the final data format hierarchy.


SWO-EDAM Merge: Merging format hierarchies

Import of EDAM:Format into SWO:Data format specification

The following steps have been performed and committed within the development subdirectory of the SWO subversion repository. This merging is not complete (due to the EDAM OWL file not being the master version of the file yet), and there were a number of issues resulting from these steps. Both of these topics are itemized in an additional blog post.

  • Make EDAM:Format a child of “Information content entity” (via subclass statement)
  • Make EDAM:Format equivalent to IAO:0000098 (data format specification) – (via equivalent class statement)
  • SWO data format specification classes moved to other locations within SWO:
    • Renamed Tab delimited file to Tab delimited format
    • Split CDF into its two possible formats: CDF ASCII format and CDF binary format
    • Classes moved to be children of XML: MAGE-ML files (also renamed to MAGE-ML); KGML files (also renamed to KGML); ARR; gxl format.
    • Classes moved to be children of Text file format: SBMLR format (renamed from SBMLR file), which does not have a definition but seems to be the R files containing SBML data as specified by the SBMLR package for R; SDF format; Rnw (a type of Sweave document); R data frame (a bit like a table in R); GEO Matrix Series File (also replaced “file” with “format”); FCS; FASTA; CDF ASCII format; BED file (also renamed to BED format); cls; dcf; gff format; mas5 format; rda.
    • Classes moved to be children of Tab delimited format: MAGE tab format; cdt; gct; gpr format; gtr.
    • Classes moved to be children of Programming language format: Matlab m file (also renamed to Matlab .m file).
    • Moved under EDAM:format_2333 (Binary format): CEL; CHP file; CDF Binary format; BPMAP; lma.
  • Refactoring to remove EDAM:SWO duplicates – by obsoletion of EDAM or SWO concept as appropriate:
    • RDF: Present in both EDAM (format_2376) and SWO (SWO_3000006). The SWO term is retained, as EDAM is intended as a more bioinformatics-specific ontology, and the RDF hierarchy within SWO is more complex. All annotations have been moved across to the SWO RDF class and the EDAM RDF class has been marked as obsolete.
    • Document exchange format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • Image format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • Outline document format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is. However, it is recommended that outlining become a role, as classes like OPML should instead be stored within the XML hierarchy.
    • Programming language format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • Spreadsheet and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • Text file format (SWO_3000041) and Textual format (format_2330) are equivalent, and are marked with equivalence statements. Each has a hierarchy of children which must be taken into account. Obsolete cases are itemized below, while all others remain in their original hierarchy for now. Please note that the obsoleting of terms occurs by refactoring the URI of the concept to be obsoleted to the URI of the class it is being merged with. Until the two classes Text file format and Textual format are properly merged/obsoleted, the (slightly messy) dual hierarchy will remain. However, the inferred hierarchy will show the complete set of children (from both ontologies) under both classes.
      • BED format (efo/swo/SWO_0000051) and BED (format_3003) are equivalent, and the SWO class has been obsoleted and all its axioms assigned to the EDAM class. This is because the EDAM class has more annotations.
      • FASTA (efo/swo/SWO_0000142) and FASTA format (format_1929). The SWO class has been obsoleted and all its axioms assigned to the EDAM class. This is because the EDAM class has more annotations and is present within a more complete FASTA hierarchy.
      • OBO Flat File Format (swo/data/SWO_3000040) and OBO (format_2549). The EDAM class will remain and the SWO class will be obsoleted; the SWO label (which is highly descriptive) will be retained.
      • gff format (efo/swo/SWO_0000559) and GFF (format_2305). The SWO class has been obsoleted and all its axioms assigned to the EDAM class. This is because the EDAM class has more annotations and is present within a more complete GFF hierarchy.
      • newick (efo/swo/SWO_0000634) and newick (format_1910). The SWO class has been obsoleted and all its axioms assigned to the EDAM class. This is because the EDAM class has more annotations.
      • MAGE tab format (swo/data/SWO_3000045) and MAGE-TAB (format_3162). The SWO class has been obsoleted and all its axioms assigned to the EDAM class. This creates an extra parent class for MAGE-TAB (Tab delimited file format, which is the parent class for the SWO MAGE tab format). Once the EDAM OWL files become the master version of the files, we can fix this by removing the Textual Format subclass statement.
    • Web page specification in SWO has as its only child the EDAM HTML hierarchy. As such, no changes need to be made within SWO.
    • Word processing document format and its children have no equivalent classes within EDAM, and therefore were retained within SWO as-is.
    • XML (swo/data/SWO_3000005) and XML (format_2332) are equivalent, and are marked with equivalence statements. Each has a hierarchy of children which must be taken into account. Obsolete cases are itemized below, while all others remain in their original hierarchy for now. Please note that the obsoleting of terms occurs by refactoring the class URI to the URI of the class it is being merged with. Until the two XML classes are properly merged/obsoleted, the (slightly messy) dual hierarchy will remain. However, the inferred hierarchy will show the complete set of children (from both ontologies) under both classes.
      • MAGE-ML (efo/swo/SWO_0000268) and MAGE-ML (format_3161) are equivalent, and the SWO class has been obsoleted and all its axioms assigned to the EDAM class.
      • PSI-MI format (swo/data/SWO_3000048) and PSI MI XML (MIF) (format_3158) are equivalent, and the SWO class has been obsoleted and all its axioms assigned to the EDAM class. Retained SWO label as an alternative term.
      • SBML (swo/data/SWO_3000037) and SBML (format_2585) are equivalent, and the SWO class has been obsoleted and all its axioms assigned to the EDAM class.

SWO-EDAM Merge: Modifying EDAM in OWL

While on the whole the conversion process between OBO and OWL (in either direction) is relatively good these days using Protege 4.1 (which makes use of the OWLAPI underneath), there were some manual changes required. This post describes all changes made to the automatically-generated EDAM OWL file prior to importing into SWO.

  1. Change the (initially anonymous) ontology URI to http://edamontology.org within the EDAM OWL file.
  2. Replace the erroneous URI created by the Protege conversion for DeprecatedClass with the correct OWL URI of http://www.w3.org/2002/07/owl#DeprecatedClass
  3. The conversion process also creates an incorrect “owl2” namespace for all attributes of properties (e.g. reflexive, transitive, see also the point directly above) as follows:
    <!ENTITY owl2 "http://purl.obolibrary.org/obo/http_//purl.org/obo/owl:" >
    This was fixed manually by removing the incorrect namespace and replacing all instances of “owl2” with “owl”. However, when the file is saved again in Protege, even though the owl2 namespace isn’t used at all any more, it is still re-created. It was likely due to the following lines in the OWL file, which have now had the incorrect string “http_//purl.org/obo/owl:” removed:
    <owl:AnnotationProperty rdf:about="&obo;http_//purl.org/obo/owl:is_symmetric"/>
    <owl:AnnotationProperty rdf:about="&obo;http_//purl.org/obo/owl:is_reflexive"/>
  4. Replaced incorrect URI for definition_editor with the correct one: http://www.ebi.ac.uk/efo/definition_editor.
  5. All of the inverse, range and domain statements are stored as annotations by default when the conversion process happens, rather than being stored as OWL descriptions. For cases such as this:
    <obo:inverse_of>EDAM:is_format_of</obo:inverse_of>
    additional statements such as the following were added:
    <owl:inverseOf rdf:resource="&obo;EDAM_is_format_of"/>
    Nothing was deleted so that conversion back to OBO would be OK.
  6. The URI of all EDAM classes needed to be fixed after conversion from OBO, as the default URI for all classes is the purl.obolibrary.org-based URI. The global search and replaces performed are:
    • Remove “obo” namespace, as these are all incorrect, and should instead be using the edam default namespace. In practice, this means deleting all instances of the strings “obo:” and “&obo;”, as well as the declaration of the associated namespace at the top of the file.
    • Remove the string “EDAM_” from anywhere in the OWL file, as this is now unnecessary, and if it remains, then the URIs for the class names are incorrect.
  7. Separated out all modules (data (containing both data and format hierarchies), operation, topic), into their own files. These files are named edam_data.owl, edam_operations.owl, edam_obsoletes.owl and edam_topics.owl. While modularizing edam_obsoletes.owl, one leftover axiom was identified on one of the obsoleted classes. At Jon Ison’s request, I have deleted this axiom: Genotype and phenotype annotation format (an obsolete class) SubClassOf ‘is format of’ some Genotype/Phenotype annotation (not obsolete).
  8. An additional edam_core.owl file was made to easily view EDAM on its own and to store all cross-file / cross-module axioms. By storing cross-file axioms in the edam_core.owl, the individual module files are much cleaner. The edam_core.owl file is then imported into SWO. If in future we wish to move all cross-file axioms out of edam_core.owl, then this can be done, and the three module files could be imported directly.
  9. Import edam_core.owl into SWO. This simplistically-merged ontology reasons with no inconsistencies (although there are some interesting features of EDAM in OWL that Jon will want to take a look at, particularly the inference that Data and Operation are equivalent).
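
Steps 5 and 6 above were done by hand and with global search and replace, but they can be sketched in Python. The code below is an illustration under assumptions, not the actual procedure: it treats the OWL file as plain text, assumes the annotations look exactly like the example in step 5, and assumes the obo namespace is declared either as an ENTITY or as an xmlns:obo attribute. The function names are hypothetical.

```python
import re

# Step 5: annotations of the form <obo:inverse_of>EDAM:name</obo:inverse_of>.
INVERSE_ANNOTATION = re.compile(r"<obo:inverse_of>EDAM:(\w+)</obo:inverse_of>")


def add_inverse_axioms(owl_text):
    """Keep each obo:inverse_of annotation and append a real owl:inverseOf
    statement after it, so that conversion back to OBO still works."""
    def expand(match):
        name = match.group(1)
        axiom = '<owl:inverseOf rdf:resource="&obo;EDAM_' + name + '"/>'
        return match.group(0) + "\n" + axiom

    return INVERSE_ANNOTATION.sub(expand, owl_text)


def strip_obo_namespace(owl_text):
    """Step 6: drop the obo namespace declaration, then delete every
    'obo:', '&obo;' and 'EDAM_' string.

    This is a blunt textual replacement, as described in the post; it would
    also alter any unrelated text that happens to contain those strings.
    """
    owl_text = re.sub(r'<!ENTITY\s+obo\s+"[^"]*"\s*>\s*', "", owl_text)
    owl_text = re.sub(r'\s*xmlns:obo="[^"]*"', "", owl_text)
    for token in ("&obo;", "obo:", "EDAM_"):
        owl_text = owl_text.replace(token, "")
    return owl_text


annotated = add_inverse_axioms("<obo:inverse_of>EDAM:is_format_of</obo:inverse_of>")
cleaned = strip_obo_namespace('<owl:inverseOf rdf:resource="&obo;EDAM_is_format_of"/>')
```

Running step 6 after step 5 leaves the resource as a bare "is_format_of", which then resolves against the file's default (EDAM) namespace.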