
SDD Part 0: Introduction and Primer to the SDD Standard

Abstract

SDD Part 0 is a non-normative introduction to the Taxonomic Databases Working Group SDD (Structure of Descriptive Data) Standard. It aims to provide background, an introduction and a primer to the SDD Standard, with examples. Since the SDD Standard is a work in progress, this document will be updated from time to time.

Status of this document

Version: 3 Dec. 2003

Edited: Kevin Thiele (Centre for Biological Information Technology, University of Queensland), with financial support from the Gordon and Betty Moore Foundation (www.moore.org).

This document has not yet been reviewed by the SDD Working Group of TDWG. It is intended to facilitate discussion.

To contribute to the discussion on the SDD Standard and to comment on this document, please join the SDD discussion list by emailing the SDD List Server or contribute to the SDD Wiki.

Complete documentation of the SDD Schema is available on the SDD web site.

Relationship between SDD and other TDWG standards

TDWG maintains other standards that relate to the SDD standard, particularly the (Names) standard. Wherever possible, element names conform across standards.


1.0 Introduction

1.1 Background to the TDWG-SDD Subgroup

In September 1998 the Taxonomic Databases Working Group (TDWG) of the International Union of Biological Sciences (IUBS) established the Structure of Descriptive Data (SDD) subgroup. TDWG’s role is to facilitate and manage the development of international standards in the taxonomic domain. The SDD subgroup was established to develop an international XML-based standard for capturing and managing descriptive data for organisms.

Development of the SDD standard was initiated in response to recognition that the existing standard previously endorsed by TDWG – the DELTA data standard developed at CSIRO in Canberra from 1971 and adopted by TDWG as a descriptive data standard in 1991 – had become inadequate (FAQ: Why not continue to use DELTA?).

The SDD subgroup began discussing and scoping a standard through an email discussion group in November 1999 (see the SDD email list archives). Considerable progress has been made at face-to-face meetings amongst a small group of core contributors, in Nov. 2001 (Canberra), Oct. 2002 (Sao Paulo), Feb. 2003 (Paris) and October 2003 (Lisbon).

Version 0.9 of the SDD standard and Version 0.9 of this document were released on the TDWG website in December 2003.

1.2 The nature of descriptive data in taxonomy

In taxonomy, descriptions of taxa are one of the primary stores of both raw and highly processed data. Virtually all known organisms have published descriptions in some form - indeed, it is a requirement under the International Codes of Botanical and Zoological Nomenclature that valid publication of a new taxon must include a diagnostic description. Descriptions of taxa form the core of biological monographs and of Floras and other field guides.

Descriptions in taxonomy take several forms. The most common and least tractable is the natural-language description (Box 1.2.1). A natural-language description is a semi-structured, semi-formalised description of an organism or (more usually) a taxon. Natural-language descriptions may be simple, short and written in plain language (as in a popular field guide), or long, highly formal and couched in specialised terminology (as in a taxonomic monograph or other treatment).

Box 1.2.1 - Typical natural language descriptions

Calidris canutus (Red Knot)
Stout wader with bill same length as head, crown unstreaked, narrow white bar in wing, pale rump with grey barring, shortish olive legs. Non-breeding: grey above with narrow pale edging to feathers, pale eyebrow, smudged sides to neck with faint spotting. Juvenile: feathers of back edged white with dark subterminal bar, breast more heavily spotted pale buff and flanks barred, crown faintly streaked. Breeding: rufous underparts, feathers of back rufous patterned with black. Voice: 'knut-knut', 'nyui', high-pitched 'toowit-wit'.

from Slater, P., Slater, P. & Slater, R. (2001) The Slater Field Guide to Australian Birds (Reed New Holland: Sydney)

Tithorea harmonia Godman & Salvin
Antennae orange, forewing short with pointed tip, white checks on wing edges, spots on ventral hindwing margin paired, black bar on lower edge of forewing discal cell, black bar above hindwing discal cell, discal bar reduced to a spot or absent.

from www.cs.umb.edu/~whaber/Monte/Ithomid/Tith-harm.html

Discaria pubescens (Brongn.) Druce
Rigid, spreading shrub to c. 1 m high and wide; stems glabrous. Leaves soon deciduous, c. oblong, to 10 mm long, 3 mm wide, obtuse or minutely mucronate within an apical notch, margins minutely toothed, surfaces glabrous or a few hairs present near tip; stipules dark reddish-brown, c. 1 mm long, often shallowly joined around the node, pubescent on inner face; spines stout, 1.5-4 cm long. Flowers white, solitary or in few-flowered axillary cymes, sometimes congested on short apical shoots; pedicels 2-3 mm long; hypanthium c. 1.5 mm long; sepals somewhat spreading, 1-1.5 mm long; petals attached at throat of hypanthium, c. 1 mm long; stamens subequal to and weakly hooded by petals; disc prominent, lining base of hypanthium, obscurely 5-angled; style minute. Capsule prominently 3-lobed, 4-5 mm diam., the valves separating incompletely at maturity and splitting dorsally and medially.

from Walsh, N.G. (1999) Rhamnaceae, in N.G.Walsh & T.J.Entwisle, Flora of Victoria Volume 4, Dicotyledons, Cornaceae to Asteraceae (Inkata Press: Melbourne)

 

A relatively small number of descriptions comprise fully structured data, such as Lucid LIF files (Box 1.2.2), DELTA descriptions (Box 1.2.3) and NEXUS data files.

Box 1.2.2 - a simple Lucid Interchange Format (LIF) file

#Lucid Interchange Format File v. 2.1

[..Character List..]
Distribution by region
  Tropical North
  Subtropical and Temperate East and South
  South West
  Arid & Semi-arid (Central)
  Island Territories
General habit
  tree
  shrub
  climber (woody or herbaceous)
  herb
  grass- or sedge-like plant
Seasonal longevity
  annual, biennial or ephemeral
  perennial

[..Taxon List..]
Acanthaceae
Aceraceae
Actinidiaceae
Agavaceae
Aizoaceae
Akaniaceae
Alangiaceae
Alismataceae
Aloaceae
Alseuosmiaceae

[..Main Data (txs)..]
101101111111
100100000101
101000000010
011110111111
101111111111
100100000011
101101000011
011111011111
011100100111
101100000010

Box 1.2.3 - a simple DELTA file

*SHOW: Gentianella - character list. Last revised 16 April 1997.

*CHARACTER LIST

#1. plants/
1. monocarpic/
2. polycarpic/

#2. <plants lifecycle>/
1. annual/
2. biennial/
3. perennial/

#3. height in flower/
<> cm/

#4. caudex/
1. unbranched/
2. branched/

*ITEM DESCRIPTIONS

# Gentianella amabilis/
1,2 2,3 3,3-13 4,1

# Gentianella antarctica/
1,1 2,1<Godley 1982> 3,1.6-22.0<Godley 1982> 4,1

# Gentianella antipoda/
1,1<Godley 1982> 2,2 3,3.5-9.8-24 4,1/2<depends on size of plant>

# Gentianella astonii/
1,2 2,3 3,15 4,2

# Gentianella cerina/
1,2 2,3 3,9-17 4,1/2

# Gentianella concinna/
1,1 2,1 3,2.7-15.0 4,1
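
To make the structure of the item descriptions in Box 1.2.3 concrete, the following Python sketch splits one attribute line into character number and value pairs. It is an illustration under stated assumptions rather than a DELTA implementation: the function name `parse_item` is hypothetical, only the syntax visible in the box is handled, and angle-bracket comments are assumed not to be nested.

```python
import re

# Sketch only: handles the attribute syntax seen in Box 1.2.3, not full DELTA.
# Assumption: angle-bracket comments are not nested.
COMMENT = re.compile(r"<[^<>]*>")


def parse_item(attributes):
    """Parse a line such as '1,2 2,3 3,3-13 4,1' into {character: value}.

    Comments are dropped; values are kept as raw strings ('3-13', '1/2')
    because their interpretation depends on the character type.
    """
    cleaned = COMMENT.sub("", attributes)
    parsed = {}
    for token in cleaned.split():
        number, _, value = token.partition(",")
        parsed[int(number)] = value
    return parsed


item = parse_item("1,1<Godley 1982> 2,2 3,3.5-9.8-24 4,1/2<depends on size of plant>")
print(item)
```

Note that the values only become meaningful when read against the character list: the "2" in "1,2" is a state number, whereas the "3-13" in "3,3-13" is a range in centimetres.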

 

 

Most descriptions of organisms are natural-language descriptions, which are devoid of data markup and almost entirely intractable for processing by analytical engines and data-mining routines. Structured descriptions, including those in DELTA, Lucid and NEXUS formats, use more or less proprietary formats that are intimately tied to one or a small number of software implementations; in general, the software platform and the format evolve in tandem. Where packages provide tools to translate between formats (e.g. the Lucid-DELTA Translator and the DELTA CONFOR programs), translation is usually lossy (because the different formats maintain different data structures), and the translation programs are difficult to maintain (since they must track changes made to proprietary formats on both sides of the translation). Further, if a software platform loses support, the data stored in its proprietary format become legacy data and cannot easily be maintained.

1.3 Goals of SDD

The SDD subgroup consider that an independent, international standard for descriptive data is important. Such a standard is crucial to enabling lossless porting of data between existing and future software platforms including identification, data-mining and analysis tools, and federated databases. The absence of such a standard is a major impediment to the greater use of digitised descriptive data, and brings substantial inefficiencies to taxonomy as a whole.

The SDD Standard intends to:

SDD will be XML-based, and will provide a schema for validation of documents.

SDD seeks to facilitate:

2.0 Basic structure of a simple SDD instance document

The simplest possible description comprises a single descriptive statement about an organism, taxon or object. An example of such a description is given in Box 2.0.1, and its SDD representation in Example 2.0.1.

Box 2.0.1 - A simple description

Viola hederacea Labill.
Leaves simple

 

Example 2.0.1 - Description in Box 2.0.1 represented in SDD

<?xml version="1.0" encoding="UTF-8"?>

<Document xmlns="http://www.tdwg.org/2003/SDD_09" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.tdwg.org/2003/SDD_09 C:\DOCUME~1\KEVINT~1\Desktop\SDDBET~1\SDD_09.xsd">

  <GenerationMetadata TimeStamp="2002-11-08T10:00:00" GeneratorName="n/a, handcrafted instance document" GeneratorVersion="n/a"/>

  <ProjectDefinition>
    <Version>
      <Major>1</Major>
    </Version>
    <RevisionData>
      <Authors>
        <Agent ref="1"/>
      </Authors>
      <InitiationDate>1999-08-13T00:00:00</InitiationDate>
      <LastRevisionDate>2003-11-05T00:00:00</LastRevisionDate>
    </RevisionData>
    <AudienceSpecificData>
      <Representation audience="en5">
        <Title>The Genus Viola</Title>
        <Rights>
          <CopyrightStatement>(c) 2003 Centre for Occasional Botany</CopyrightStatement>
        </Rights>
      </Representation>
    </AudienceSpecificData>
    <Audiences defaultaudience="en5">
      <Audience audiencekey="en5" lang="en" ExpertiseLevel="5">
        <LabelText>Experts</LabelText>
      </Audience>
    </Audiences>
  </ProjectDefinition>

  <Terminology>
    <Characters>
      <Character key="1">
        <Label>
          <Representation audience="en5">
            <Text>Leaf complexity</Text>
          </Representation>
        </Label>
        <Type>nominal</Type>
        <Categorical>
          <States>
            <StateDefinition key="1">
              <Label>
                <Representation audience="en5">
                  <Text>Simple</Text>
                </Representation>
              </Label>
            </StateDefinition>
            <StateDefinition key="2">
              <Label>
                <Representation audience="en5">
                  <Text>Compound</Text>
                </Representation>
              </Label>
            </StateDefinition>
          </States>
        </Categorical>
      </Character>
    </Characters>
  </Terminology>

  <Entities>
    <Classes>
      <Class key="1">
        <FreeFormDescription>Viola hederacea Labill.</FreeFormDescription>
      </Class>
    </Classes>
  </Entities>

  <Resources>
    <Agents>
      <Agent key="1">
        <FreeFormDescription>Kevin Thiele</FreeFormDescription>
        <LastName>Thiele</LastName>
      </Agent>
    </Agents>
  </Resources>

  <Descriptions>
    <CodedDescription key="101">
      <Class ref="1"/>
      <RevisionData>
        <Authors>
          <Agent ref="1"/>
        </Authors>
        <InitiationDate>2003-08-13T10:23:11</InitiationDate>
        <LastRevisionDate>2003-08-13T10:23:11</LastRevisionDate>
      </RevisionData>
      <CharacterData>
        <Character ref="1">
          <State ref="1"/>
        </Character>
      </CharacterData>
    </CodedDescription>
  </Descriptions>

</Document>
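
The coded description in Example 2.0.1 carries no text of its own; it points at the class, character and state definitions through `ref` attributes that match `key` attributes elsewhere in the document. The Python sketch below follows those links using the standard library's `xml.etree.ElementTree`. It is a reading aid, not a reference implementation: the embedded document is abridged from Example 2.0.1, the helper names `label` and `describe` are hypothetical, and only categorical characters are handled.

```python
import xml.etree.ElementTree as ET

NS = {"sdd": "http://www.tdwg.org/2003/SDD_09"}

# Abridged from Example 2.0.1: metadata sections and the second state are omitted.
SDD = """<Document xmlns="http://www.tdwg.org/2003/SDD_09">
  <Terminology><Characters>
    <Character key="1">
      <Label><Representation audience="en5"><Text>Leaf complexity</Text></Representation></Label>
      <Categorical><States>
        <StateDefinition key="1">
          <Label><Representation audience="en5"><Text>Simple</Text></Representation></Label>
        </StateDefinition>
      </States></Categorical>
    </Character>
  </Characters></Terminology>
  <Entities><Classes>
    <Class key="1"><FreeFormDescription>Viola hederacea Labill.</FreeFormDescription></Class>
  </Classes></Entities>
  <Descriptions>
    <CodedDescription key="101">
      <Class ref="1"/>
      <CharacterData><Character ref="1"><State ref="1"/></Character></CharacterData>
    </CodedDescription>
  </Descriptions>
</Document>"""


def label(element):
    """Return the first label text found under a character or state."""
    return element.find("sdd:Label/sdd:Representation/sdd:Text", NS).text


def describe(root):
    """Yield (entity, character, state) triples by following ref -> key links."""
    classes = {c.get("key"): c
               for c in root.iterfind("sdd:Entities/sdd:Classes/sdd:Class", NS)}
    chars = {c.get("key"): c
             for c in root.iterfind("sdd:Terminology/sdd:Characters/sdd:Character", NS)}
    for desc in root.iterfind("sdd:Descriptions/sdd:CodedDescription", NS):
        entity = classes[desc.find("sdd:Class", NS).get("ref")]
        name = entity.find("sdd:FreeFormDescription", NS).text
        for coded in desc.iterfind("sdd:CharacterData/sdd:Character", NS):
            char = chars[coded.get("ref")]
            states = {s.get("key"): s for s in char.iterfind(
                "sdd:Categorical/sdd:States/sdd:StateDefinition", NS)}
            for state in coded.iterfind("sdd:State", NS):
                yield name, label(char), label(states[state.get("ref")])


rows = list(describe(ET.fromstring(SDD)))
print(rows)
```

For the abridged document this yields a single triple that pairs Viola hederacea Labill. with the character "Leaf complexity" and the state "Simple", which is the statement made in Box 2.0.1.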

SDD documents are structured using seven high-level XML elements. Four of these (<Document>, <GenerationMetadata>, <ProjectDefinition> and <Resources>) are mandatory, while the remaining three are optional.

<Document> is the root of an SDD document, and encloses all other elements.

The <GenerationMetadata> element is used to specify metadata about the process (application or script) that created the current SDD document or data stream, such as name of the generating application and date and time at which the document was created.

The <ProjectDefinition> element is used to capture metadata about the project from which the document data are sourced, including details of authors and contributors to the project, the project status, publication and revision dates, sources of data etc.

The <Terminology> element defines a list of characters and their states used to describe the entities described in the document.

The <Entities> element defines a list of entities (such as taxa and specimens) for which descriptions are provided in the document.

The <Resources> element provides for definitions of resources (images, notes, contributors etc) referred to elsewhere in the document.

The <Descriptions> element contains descriptions (either coded or marked-up natural language) of the document's entities.

A valid SDD document must include <Document>, <GenerationMetadata>, <ProjectDefinition> and <Resources> elements. In addition, it may contain a <Terminology> section alone (if used to provide character and state resources from a project), an <Entities> and <Descriptions> section alone (if used to carry natural language descriptions with no markup), or <Terminology>, <Entities> and <Descriptions> elements (in which case it may carry coded or marked-up natural language descriptions).
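
The requirement for mandatory sections can be checked quickly before handing a document to a full schema validator. The Python sketch below reports which mandatory top-level sections are absent. It is a minimal sketch and no substitute for validation against the SDD schema: the function name `missing_sections` is hypothetical, the namespace URI is taken from Example 2.0.1, and the content of each section is not inspected.

```python
import xml.etree.ElementTree as ET

# Namespace URI as used in Example 2.0.1.
NS_URI = "http://www.tdwg.org/2003/SDD_09"
# Mandatory children of <Document>, as listed in the text.
REQUIRED = ("GenerationMetadata", "ProjectDefinition", "Resources")


def missing_sections(xml_text):
    """Return the mandatory top-level sections absent from an SDD document."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{NS_URI}}}Document":
        raise ValueError("root element is not an SDD <Document>")
    # ElementTree reports namespaced tags as '{uri}Name'; keep the local name.
    present = {child.tag.split("}", 1)[-1] for child in root}
    return [name for name in REQUIRED if name not in present]


doc = f'<Document xmlns="{NS_URI}"><GenerationMetadata/><Terminology/></Document>'
print(missing_sections(doc))
```

Here the sample document is reported as lacking <ProjectDefinition> and <Resources>; an empty list means all mandatory sections are present.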

FAQ: Why are SDD documents so verbose and complex?

3.0 Beyond the simple instance...

Example 2.0.1 describes only the most basic of SDD structures. There are two ways to go further: either use the links at left below for more information about specific SDD tasks, or click on an element in the Schema diagram at right below to navigate to information specific to that element.

KRT Last Edit: 31 Dec 03