Format Documentation

This is not intended to be an exhaustive description of the various file formats in which the eVOC ontologies are available. Rather, it is meant to be used as a guide while exploring the eVOC files themselves. If you have any questions about the formats used by eVOC, send them to the eVOC mailing list, evoc@sanbi.ac.za. Deprecated formats are no longer a part of the current release but are described for those making use of the archived eVOC versions.

Table of Contents

  1. eVOC's Native Format (deprecated)
  2. GO Flat File Format
  3. OBO and OBOC Formats
  4. OWL Format (deprecated)
  5. XSPAN OWL Format
  6. RDF Format
  7. HTML Format (deprecated)

eVOC's Native Format (deprecated)

eVOC's native format presents an ontology as a tab-indented hierarchy of nodes (which have attached terms). The nodes are laid out in a text tree. The ontology's root node is not indented. The children of the root node are indented by one tab. The children's children are indented by two tabs, and so on. The children of each node are laid out directly beneath their parent in the text file.

Each node entry line consists of a term name followed by an optional comma-separated list of synonyms in curly braces ({}). Excluding the initial tabs, each line has the form:

<term name> {<synonym>,<synonym>,...}

There may be any number of synonyms in the comma-separated list. If there are no synonyms, the curly braces are omitted. The <term name> contains no curly braces, and a <synonym> contains neither curly braces nor commas.
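The node entry syntax above can be read with a short parser. The following is a minimal sketch, not the eVOC project's own code: the function names parse_node_line and parse_tree are hypothetical, and it assumes the file contains only node entries (no cDNA library lines).

```python
import re

# Matches "<term name> {<synonym>,<synonym>,...}" with the braces optional.
NODE_RE = re.compile(r"^(?P<name>[^{}]+?)\s*(?:\{(?P<synonyms>[^{}]*)\})?\s*$")


def parse_node_line(line):
    """Return (depth, term name, synonyms) for one node entry line."""
    text = line.rstrip("\n")
    body = text.lstrip("\t")
    depth = len(text) - len(body)  # number of leading tabs
    match = NODE_RE.match(body)
    if match is None:
        raise ValueError(f"not a node entry line: {line!r}")
    raw = match.group("synonyms")
    synonyms = [s.strip() for s in raw.split(",")] if raw else []
    return depth, match.group("name").strip(), synonyms


def parse_tree(lines):
    """Return (name, synonyms, parent name) triples for a whole file."""
    ancestors = []  # ancestors[d] is the most recent term seen at depth d
    nodes = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        depth, name, synonyms = parse_node_line(line)
        del ancestors[depth:]  # drop terms that are not ancestors of this one
        parent = ancestors[-1] if ancestors else None
        nodes.append((name, synonyms, parent))
        ancestors.append(name)
    return nodes
```

Tracking the most recent term at each depth is enough to recover the parent of every node, because children are always laid out directly beneath their parent.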

Other than the node entries, the only lines which may occur in the current eVOC files are cDNA library entries. These entries list cDNA clone libraries which have been mapped to a particular term. The clone library entries occur immediately below the corresponding term before any sub-terms and are indented one level more than the term itself. The format of the clone library entries in eVOC version 2.1 and later is:

cDNA library[Name:<library name>][dbEST Library ID:<id number>]

In versions prior to 1.9 the phrase Clone Library was used instead of cDNA library, while prior to 2.1 dbEST ID was used instead of dbEST Library ID. The <library name> is the name of the cDNA library and does not contain square brackets. The <id number> is the unique ID number assigned to the cDNA library by the dbEST division of GenBank, http://www.ncbi.nlm.nih.gov/Genbank/.
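A regular expression can recognise cDNA library entries in both the current and the older wording. This is a rough sketch under two assumptions that the text above does not state explicitly: that the <id number> consists only of digits, and that there is no whitespace around the colons. The name parse_library_line is hypothetical.

```python
import re

# Accepts the current wording and the older "Clone Library" / "dbEST ID" variants.
LIBRARY_RE = re.compile(
    r"^\s*(?:cDNA library|Clone Library)"
    r"\[Name:(?P<name>[^\[\]]*)\]"
    r"\[dbEST (?:Library )?ID:(?P<id>\d+)\]\s*$"
)


def parse_library_line(line):
    """Return (library name, dbEST library ID), or None for any other line."""
    match = LIBRARY_RE.match(line)
    if match is None:
        return None
    return match.group("name").strip(), int(match.group("id"))
```

Returning None for non-matching lines lets a caller try this function first and fall back to treating the line as a node entry.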

Additional information on the dbEST libraries can be retrieved by looking up the dbEST Library ID at NCBI:

http://www.ncbi.nlm.nih.gov/UniGene/library.cgi?ORG=Hs&LID=<id number>

Note that you should replace <id number> with the dbEST Library ID that you want to look up.

Usually, there are two files generated for each eVOC ontology. One file (referred to as the ontology file) contains only the ontology (i.e. the structured terms and synonyms). The other (the annotations file) contains the ontology and any data which has been mapped to the ontology (e.g. cDNA libraries).

GO Flat File Format

The GO file format was developed by the Gene Ontology (GO) Consortium. More complete documentation is available from the GO website at http://www.geneontology.org/. The GO flat file format is in the process of being replaced by the OBO format (see the OBO section).

GO files start with a number of header lines each of which looks like:

!<variable name>: <value>

Variable names used by the GO flat file version of eVOC include:

  • autogenerated-by
  • saved-by
  • version
  • date
  • type

The first three all accept arbitrary text strings as a <value>. DAG-Edit (the ontology editor distributed by the GO Consortium) is able to parse only a limited set of values for date. In the eVOC GO files the date format used is the same as that used by the GO Consortium:

Fri Apr 30 11:02:47 SAST 2004
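A date value in this layout can be turned into a Python datetime as sketched below. The function name parse_go_date is hypothetical. The time zone abbreviation is split off by position rather than parsed, since strptime's %Z directive does not reliably accept abbreviations such as SAST; the sketch also assumes English day and month names.

```python
from datetime import datetime


def parse_go_date(value):
    """Parse a date such as 'Fri Apr 30 11:02:47 SAST 2004'.

    Returns (naive datetime, time zone abbreviation).
    """
    parts = value.split()
    if len(parts) != 6:
        raise ValueError(f"unexpected date value: {value!r}")
    zone = parts.pop(4)  # the fifth token is the time zone abbreviation
    # %a and %b assume English day and month names (the default C locale).
    stamp = datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")
    return stamp, zone
```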

The GO header may contain multiple type variable lines. Each type line defines one of the relationship symbols used within the GO file. The syntax is:

!type: <symbol> <relationship> <description>

eVOC files use only the ISA <relationship> which specifies that one term is the child (sub-class) of another. The relationship is assigned the % <symbol>. This symbol assignment is the standard one for the ISA relationship.
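The header lines described above can be collected as follows. This is a minimal sketch; parse_go_header is a hypothetical name, and the sketch assumes the header ends at the first line that does not begin with an exclamation mark.

```python
def parse_go_header(lines):
    """Collect '!<variable name>: <value>' lines into (variables, types)."""
    variables = {}
    types = {}  # symbol -> (relationship, description)
    for line in lines:
        if not line.startswith("!"):
            break  # assumed end of the header
        name, sep, value = line[1:].partition(":")
        if not sep:
            continue  # a '!' line without a colon is skipped in this sketch
        name, value = name.strip(), value.strip()
        if name == "type":
            # "!type: <symbol> <relationship> <description>"
            fields = (value.split(None, 2) + ["", "", ""])[:3]
            symbol, relationship, description = fields
            types[symbol] = (relationship, description)
        else:
            variables[name] = value
    return variables, types
```

Splitting on the first colon only matters for the date variable, whose value itself contains colons.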

The header is followed by a series of term entries which are laid out in much the same fashion as those in the native eVOC format (see the native eVOC section) except that spaces, rather than tabs, are used to indent the term hierarchy.

The syntax of the term lines used in the eVOC GO flat files is:

% <term name> ; EV:<id> ; synonym:<synonym> ; synonym:<synonym>

The <id> is a seven digit ID number, left-padded with zeroes if necessary. The two letter code EV stands for 'eVOC' and specifies the namespace to which the <id> belongs.

Each line may have any number of synonyms. If there are no synonyms the semi-colon after the id is omitted.
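A term line in this syntax might be parsed as sketched below. The name parse_go_term_line is hypothetical. The sketch assumes that term names and synonyms contain no semicolons (the full GO format has escaping rules that are ignored here) and rejects any line that does not start with the % symbol.

```python
import re

ID_RE = re.compile(r"^EV:\d{7}$")  # "EV:" followed by a seven digit number


def parse_go_term_line(line):
    """Return (depth, term name, eVOC ID, synonyms) for one term line."""
    text = line.rstrip("\n")
    body = text.lstrip(" ")
    depth = len(text) - len(body)  # number of leading spaces
    if not body.startswith("%"):
        raise ValueError(f"not an ISA term line: {line!r}")
    fields = [field.strip() for field in body[1:].split(";")]
    if len(fields) < 2 or not ID_RE.match(fields[1]):
        raise ValueError(f"missing or malformed eVOC ID: {line!r}")
    prefix = "synonym:"
    synonyms = [f[len(prefix):].strip() for f in fields[2:] if f.startswith(prefix)]
    return depth, fields[0], fields[1], synonyms
```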

The full GO file format allows for richer syntax than currently used by eVOC. Consult the GO file format documentation available on the GO Consortium web site for more information.

OBO and OBOC Formats

The Open Biological Ontologies (OBO) project, http://obo.sourceforge.net/, is an attempt to standardise the numerous biological ontologies and combine them into a single resource. The OBO format was developed by the GO Consortium as a replacement for the older GO flat file format (see the GO section). Documentation for both formats is available from the GO Consortium web site, http://www.geneontology.org/. The OBOC format is an extension of the OBO format to allow linking terms to cDNA libraries.

An OBO file consists of one header section followed by any number of stanza sections. Sections of both types consist of tag lines:

<tag name>: <value>

These assign the given value to the named tag. Some tags may occur multiple times, even within the same stanza. In these cases the tag is considered to have a list of values. The only other type of line is the stanza header:

[<stanza type>]

which determines the type of the stanza which follows. The eVOC OBO files use only the Term stanza type.
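The header-plus-stanzas layout can be read with a small loop. The following is a sketch only: parse_obo is a hypothetical name, and OBO features not described in this document (comments, escapes, trailing modifiers) are not handled.

```python
from collections import defaultdict


def parse_obo(lines):
    """Return (header tags, list of (stanza type, tags)).

    Every tag maps to a list of values, because tags may occur multiple times.
    """
    header = defaultdict(list)
    stanzas = []
    current = header  # tag lines before the first stanza belong to the header
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = defaultdict(list)
            stanzas.append((line[1:-1], current))
            continue
        tag, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a tag line: {raw!r}")
        current[tag.strip()].append(value.strip())
    return header, stanzas
```

Because the line is split on the first colon only, a value such as EV:0000001 keeps its own colon intact.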

The header section consists only of tag lines and must begin with a line specifying the format-version tag. The OBO format version used in the eVOC OBO files is GO_1.0. Other tags which may be set in the header section include:

  • version (the ontology version)
  • date (the date)
  • saved-by (who or what saved the file)
  • remarks (general comments)

Each stanza consists of a stanza header line followed by a group of tag lines. The term stanzas used in the eVOC OBO files contain only six types of tags:

  • id - gives the eVOC ID of the term. The ID consists of two letters identifying the id namespace (EV in the case of eVOC ontologies) followed by a colon (:) and then a seven digit ID number, left-padded with zeroes if necessary.

  • name - contains the name of the term.

  • is_a - specifies the eVOC ID of the term's parent. The root term has no parent and hence no is_a tag.

  • synonym - may occur multiple times and each occurrence gives a synonym of the term being described. The format of a synonym value is "<synonym>" [<external reference>], where <synonym> gives the synonym name and <external reference> is a list of database references which may be related to this synonym in some way. The external reference list is not used by the eVOC ontologies and is therefore left blank.

  • xref_unk (no longer used) - links the term to outside objects. The eVOC ontologies once used this to link terms to cDNA libraries. The value of the tag used the syntax Name: <clone library name> "DB EST <id number>". The <clone library name> is a string containing the library name. The <id number> is the dbEST ID assigned to the clone library by the dbEST division of GenBank, http://www.ncbi.nlm.nih.gov/Genbank/.

  • cdnalib (used only by OBOC) - links terms to cDNA libraries. The value of the tag has the syntax <clone library name> <dbEST library id>. Note that <clone library name> may contain spaces, so parsers should find the <dbEST library id> by taking the last word, not the second word.
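The synonym and cdnalib values described in the list above can be split as sketched here. The function names are hypothetical, and the synonym pattern assumes the synonym text contains no double quotes.

```python
import re

# "<synonym>" [<external reference>]
SYNONYM_RE = re.compile(r'^"(?P<synonym>[^"]*)"\s*\[(?P<refs>[^\]]*)\]$')


def parse_synonym(value):
    """Return (synonym, external reference text) from a synonym tag value."""
    match = SYNONYM_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unexpected synonym value: {value!r}")
    return match.group("synonym"), match.group("refs").strip()


def parse_cdnalib(value):
    """Return (clone library name, dbEST library id) from a cdnalib tag value.

    The ID is the last whitespace-separated word, since names may hold spaces.
    """
    parts = value.rsplit(None, 1)
    if len(parts) != 2:
        raise ValueError(f"unexpected cdnalib value: {value!r}")
    return parts[0], parts[1]
```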

The OBO documentation on the GO Consortium web site recommends that xref_unk tags not be used and allows for the possibility that parsers may report the use of such tags as parse errors. As an alternative, the OBO format allows OBO files to contain user-defined tags. A parser which does not understand these additional tags should ignore them. Thus OBO parsers which comply with the specifications should be able to read OBOC files. Currently DAG-Edit parses OBOC files without problems.

OWL Format (deprecated)

OWL is the Web Ontology Language (the name is from A. A. Milne's Winnie the Pooh, in which the dyslexic character Owl spells his name Wol) and is intended for use in describing and reasoning about ontologies from a wide range of fields. OWL is an extension of RDF (Resource Description Framework, http://www.w3.org/RDF/) and as such all OWL documents are also XML (Extensible Markup Language, http://www.w3.org/XML/). The OWL documentation can be found on the W3C Web-Ontology Working Group site, http://www.w3.org/2001/sw/WebOnt/. If you are new to OWL, XML and RDF I suggest starting with the OWL Language Guide, http://www.w3.org/TR/owl-guide/.

OWL comes in three flavours: OWL Lite, OWL DL and OWL Full. The eVOC ontologies use only the simplest of the three, OWL Lite.

An eVOC OWL document begins by specifying the XML version and the document type (or DOC-TYPE). In the case of OWL the DOC-TYPE is RDF. The DOC-TYPE definition may also include a number of entity tags which define constants for use later in the OWL (or XML) document. For example

<!ENTITY owl "http://www.w3.org/2002/07/owl#">

defines an XML entity &owl; with the value http://www.w3.org/2002/07/owl# .

The document type is followed by the opening RDF tag. This tag also sets up aliases for the XML namespaces which will be used with the OWL document. For instance the attribute

xmlns:owl = "&owl;"

informs the XML parser that all XML identifiers (names) which start with owl: are defined in the namespace given by &owl;. Since we have previously defined &owl; to be http://www.w3.org/2002/07/owl# it points the XML parser to the OWL namespace. This allows us to refer to XML entities defined by the OWL schema using, for example, owl:Class rather than http://www.w3.org/2002/07/owl#Class .

With all the XML and RDF housekeeping complete, we can now move on to the eVOC ontology itself. The ontology begins with two AnnotationProperty tags which declare that we will be using creator and date tags as annotations in the ontology. These are followed by an Ontology tag which collects together information about the ontology as a whole. Currently the eVOC ontology tag includes: a human readable name for the ontology (in the RDF-Schema label tag), the ontology version (in the OWL versionInfo tag), the creator (in the AnnotationProperty creator tag), the date (in the AnnotationProperty date tag) and a note about the ontology's origin (in the RDF-Schema comment tag).

The actual description of the eVOC ontology is accomplished using a series of class, thing and property tags. Each class tag corresponds to a term in the ontology. Each thing tag represents something which can be a member of a class. In the current eVOC ontologies the only objects described by thing tags are the clone libraries. Property tags specify relationships between objects and other objects or between objects and data.

At present the eVOC ontologies specify only one type of property, namely a DatatypeProperty called dbEST which is used to link clone libraries (things) to dbEST IDs (positive integers).

Class tags require unique IDs so that they can be referred to later in the OWL document. These are specified using the ID attribute (from the RDF namespace) of the class. eVOC classes are identified by their term's eVOC ID. However the colon (:) in the id is replaced by a dot (.) since colons are not allowed in XML names (they are reserved for use in specifying the namespace to which a name belongs).
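The conversion between an eVOC ID and its XML-safe form is a one-character substitution, sketched here with hypothetical function names:

```python
def evoc_id_to_xml_id(term_id):
    """Turn an eVOC ID such as 'EV:0100018' into the XML ID 'EV.0100018'."""
    namespace, sep, number = term_id.partition(":")
    if not sep:
        raise ValueError(f"not an eVOC ID: {term_id!r}")
    return f"{namespace}.{number}"


def xml_id_to_evoc_id(xml_id):
    """Reverse the conversion: 'EV.0100018' becomes 'EV:0100018'."""
    namespace, sep, number = xml_id.partition(".")
    if not sep:
        raise ValueError(f"not an eVOC XML ID: {xml_id!r}")
    return f"{namespace}:{number}"
```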

The eVOC Class tags use two sub-tags:

  • label - contains the term name
  • subClassOf - gives the ID of the parent term (class)

The root term of the ontology does not have a subClassOf tag.
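Class tags of this kind could be read with Python's standard xml.etree.ElementTree module, as in the rough sketch below. Several details are assumptions rather than facts taken from the eVOC files: that Class tags are direct children of the RDF root element, that label and subClassOf come from the RDF-Schema namespace, and that subClassOf names its parent through an rdf:resource attribute of the form "#<ID>". The function name read_classes is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF = "{%s}" % NS["rdf"]  # ElementTree spells qualified names as {uri}local


def read_classes(xml_text):
    """Map each class ID to (label, parent ID or None)."""
    root = ET.fromstring(xml_text)
    classes = {}
    for cls in root.findall("owl:Class", NS):
        class_id = cls.get(RDF + "ID")
        label = cls.findtext("rdfs:label", default="", namespaces=NS)
        parent = cls.find("rdfs:subClassOf", NS)
        parent_id = None  # the root term has no subClassOf tag
        if parent is not None:
            # Assumed reference style: rdf:resource="#EV.0000001"
            parent_id = parent.get(RDF + "resource", "").lstrip("#") or None
        classes[class_id] = (label, parent_id)
    return classes
```

The class whose parent ID is None is the root term of the ontology.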

The Thing (cDNA library) tags on the other hand use three sub-tags:

  • label - holds the clone library name
  • type - gives the ID of the term (class) to which the library belongs
  • dbEST - links the clone library to a dbEST ID

Note that a cDNA library may belong to multiple classes (and thus have multiple type tags) since a particular cDNA library may contain data relevant to multiple terms in an ontology.

XSPAN OWL Format

All of the general discussion about OWL in the section above also applies to the XSPAN OWL format and you should read that first if you're unfamiliar with OWL. XSPAN relies on a custom RDF Schema which can be found at http://www.xspan.org/obo.owl. We will refer to this schema as obowl (which is also the abbreviation XSPAN seem to use internally).

In the XSPAN OWL format terms have type obowl:term and each term is associated with its name by the obowl:name tag and with its accession by the obowl:accession tag. In addition, each term is also connected to its name by an rdfs:label tag. This allows generic RDF parsers to attach a human readable name to a term.

A term is connected to its parent using an rdfs:subClassOf tag. Root terms have no subClassOf tag.

Some discussion of matters relating to the XSPAN OWL format is available in Aitken, J.S., Webber, B.L. and Bard J.B.L. (2004b) Part-of Relations in Anatomy Ontologies: A Proposal for RDFS and OWL Formalisations. Proc PSB 9 :166-177.

RDF Format

OWL is an extension of RDF specifically tailored for ontology development. However, OWL can require quite sophisticated inference from parsers, and much of the extra functionality this provides is not needed by simple biological ontologies like eVOC. As a result there seems to be a growing trend for ontology work to be done directly in RDF using the basic sub-classing properties provided by RDFS and avoiding the more advanced functionality provided by OWL.

Currently the eVOC RDF format is still undergoing development and is likely to change frequently so only a broad outline will be provided here.

Each ontology is distributed as two RDF files. The first contains only a description of the ontology terms and their relationships. The second details which cDNA libraries are annotated to which terms. This allows those who wish to deal only with the ontologies to ignore the cDNA library annotations entirely.

Every term is described by an RDF node which has the type Class (from RDFS) and is related to its parents by the property subClassOf (also from RDFS). Term names are associated with nodes using the label property. Term accessions are used to form a unique URI for the term (in fact the URI may be considered a globally unique accession).

A more complete description will follow as soon as the format stabilises.

HTML Format (deprecated)

The HTML version of eVOC is intended for viewing in a web browser. It is not designed to allow easy retrieval of the ontology terms using a machine parser. If you wish to have a program parse the eVOC ontologies, use one of the previously described formats instead.

The HTML format conforms to the XHTML 1.0 Transitional standard. It makes use of CSS to lay out the terms and JavaScript to allow terms to be folded and unfolded (expanded).

The body of the HTML document starts with a short header which gives the date on which the file was generated and which ontology and eVOC version the file represents. The rest of the document contains a series of heavily nested div tags which describe the ontology terms.

Most terms have two div tags. The first is used when the term is unexpanded and contains just the term name and some formatting tags. The second is used when the term is expanded and contains any synonyms, clone libraries and/or sub-terms the term possesses. If a term has none of these then it will have only the first div tag.


Page last modified on November 21, 2006, at 01:20 PM