TDWG working group: Structure of Descriptive Data (SDD)
See also the latest version
As discussed in "Categorical data types", character states have two types of sequence or ordering relations: a sequence in which they shall be displayed in reports, and a sequence that is derived from the biological semantics of the states. A display sequence is defined for nominal and ordinal categorical data types, the inner ordering only for ordinal data. Ordinal data express the knowledge that, e. g., for states 1-2-3 the distance from 1 to 3 is longer than that from 1 to 2, even though the distances themselves are undefined (compare the cardinal scale (= integer data) with equal distances and the interval scale (= real numeric data) with known distances).
However, the sequence of states in the terminology is not necessarily identical with the sequence in specific object or class descriptions. For example, a character with the ordered states:
1. very short bristles
2. short bristles
3. long bristles
4. very long bristles
may appear in a specific object description as "4 or rarely 3" as well as "3 or perhaps 4". To distinguish the latter situation from the state order defined in the terminology, it is called the "scoring sequence" in SDD documents.
Ideally an explicit "scoring sequence" should not be necessary in a descriptive data standard and all state ordering should be achieved using a set of rules.
In a report (e. g. natural language description), multiple states within a single character in a description can be ordered by:
These rules would already handle the examples above appropriately. It is, however, difficult to capture all the information in an analytically accessible way. Firstly, in many cases there is no explicit indication whether the state order in a description is significant or not. Most biologists will assume that "round or obovate" implies an unequal (but unspecified) frequency of round and obovate and is therefore different from "obovate or round". One may view this as a "bad habit" in biology, but it is probably unwise to force people to abandon such habits, and in the case of existing descriptions that are being recoded it may be impossible to provide analytically accessible information like frequency values.
Furthermore, even where additional information is present, this information may not be analytically accessible. For example, the rules do not cover "flower blue, or violet (at the base)" or "flower violet changing to red when mature". The modifier phrases "at the base", "changing to", and "when mature" are general modifiers for which no adequate machine-readable semantics are defined in the current version of SDD. In some cases a rule "2b" (order states without modifiers before states with modifiers) may work, but this would need extensive testing. Especially in characters with a large number of states that may be concurrently present, this rule is likely to produce undesirable results.
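The tentative rule "2b" mentioned above can be sketched as a two-level sort. This is only an illustration under stated assumptions: the class ScoredState, its fields, and the function order_rule_2b are hypothetical names and not part of SDD, and the modifier is treated as an opaque text string because no machine-readable semantics are defined for it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScoredState:
    """Hypothetical record for one state scored in a description."""
    state: str                      # state label, must occur in the terminology
    modifier: Optional[str] = None  # opaque modifier text, e.g. "at the base"


def order_rule_2b(scored: List[ScoredState], terminology: List[str]) -> List[ScoredState]:
    """Order states without modifier first, then by terminology position."""
    position = {label: i for i, label in enumerate(terminology)}
    # False sorts before True, so unmodified states come first.
    return sorted(scored, key=lambda s: (s.modifier is not None, position[s.state]))


# "flower blue, or violet (at the base)", entered here in the opposite order
terminology = ["blue", "violet", "red"]
scored = [ScoredState("violet", "at the base"), ScoredState("blue")]
ordered = order_rule_2b(scored, terminology)
print([s.state for s in ordered])  # ['blue', 'violet']
```

The sketch treats every modifier alike, which is exactly why the rule would need testing: a character with many concurrently present states and mixed modifiers would be split into two blocks regardless of what the modifiers mean.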
As a consequence, a method to influence the sequence of multiple states in an object description is frequently requested by biologists.
DELTA always defines the state sequence in each individual description (= item). This is disadvantageous if the terminology is revised and the order of states in the terminology is redefined. In many cases the scoring sequence is irrelevant and it would be more desirable to view the states in the order in which they are defined in the terminology.
Two solutions exist to achieve this:
The first solution has the advantage that no decisions are necessary during data entry. It is problematic if the object descriptions are federated across multiple providers and therefore difficult to resequence. Furthermore, resequencing can only be applied to characters where non-standard sequences have never been used.
The second solution has the advantage that changes in the terminology are dynamically and automatically reflected in reports (even in federated databases where the terminology may be changed independently of the descriptions). It makes data entry slightly less intuitive, since in rare situations the user needs to be aware of the difference between an explicitly defined state sequence and the default sequence. This is especially the case if the desired scoring sequence in a given description is identical to the sequence in the terminology, but should not change if the terminology is changed in the future. To some extent the application can try to foresee this situation, e. g. if the presence of special modifiers is detected.
SDD must make a choice among the following options:
1. The sequence of states is defined only by the sequence in the terminology. It is not possible to define a different sequence of states in specific descriptions. This option is ideal for distributed databases where the description of a single class may be obtained from several federated database nodes.
2. The sequence of states is defined separately in each description. The sequence of states in the terminology is ignored when reporting descriptions (it is used only when reporting the terminology). This option was chosen by the CSIRO Delta programs.
3. Each character in a description provides an optional "Sequence" element with the only possible values "terminology" and "description". If the element is missing, the default value is "terminology", signalling that the sequence of states within this character is not informative and that the sequence of states in the terminology shall be used instead. If the value is "description", the sequence of states in the description is used.
4. Each state in a description has an optional element "ScoringSequence" containing positive integer numbers. If the element is missing, no explicit scoring sequence is defined and the state sequence defined in the terminology is used for output purposes.
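How a report generator might resolve the state order under the last two options (the "Sequence" element per character and the "ScoringSequence" element per state) can be sketched as follows. The function names and the in-memory representation are hypothetical; only the element names and their values come from the text above. The text does not say what should happen when only some states of a character carry a "ScoringSequence"; the sketch assumes that numbered states are reported first and the remaining states follow in terminology order.

```python
from typing import List, Optional, Tuple

# State order as defined in the terminology (the bristle example above).
TERMINOLOGY = [
    "very short bristles",
    "short bristles",
    "long bristles",
    "very long bristles",
]


def order_by_sequence_element(
    scored: List[str], terminology: List[str], sequence: Optional[str] = None
) -> List[str]:
    """One optional "Sequence" value per character in a description."""
    mode = sequence if sequence is not None else "terminology"  # default if missing
    if mode == "terminology":
        return sorted(scored, key=terminology.index)
    if mode == "description":
        return list(scored)  # keep the order as entered
    raise ValueError(f"unexpected Sequence value: {mode!r}")


def order_by_scoring_sequence(
    scored: List[Tuple[str, Optional[int]]], terminology: List[str]
) -> List[str]:
    """One optional "ScoringSequence" integer per state in a description.

    Assumption (not in the source): numbered states come first in ascending
    order; states without a number follow in terminology order.
    """
    def key(item: Tuple[str, Optional[int]]):
        label, number = item
        return (number is None, number or 0, terminology.index(label))

    return [label for label, _ in sorted(scored, key=key)]


# "4 or rarely 3", entered in that order
as_entered = ["very long bristles", "long bristles"]
print(order_by_sequence_element(as_entered, TERMINOLOGY))
# ['long bristles', 'very long bristles']
print(order_by_sequence_element(as_entered, TERMINOLOGY, "description"))
# ['very long bristles', 'long bristles']
print(order_by_scoring_sequence(
    [("long bristles", 2), ("very long bristles", 1)], TERMINOLOGY))
# ['very long bristles', 'long bristles']
```

In both variants a description without explicit sequence information follows the terminology, so a later reordering of the terminology is reflected in reports without touching the descriptions.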
Please send your criticism or suggestions to the SDD mailing list or to the author. Note: The topic was briefly discussed in Brazil 2002, "Sequences of states".
Gregor Hagedorn; Vers. 1; 11. March 2003