SDD Proposal: Scoring sequence of states in descriptions

TDWG working group: Structure of Descriptive Data (SDD)

Introduction

As discussed in "Categorical data types", character states have two kinds of sequence or ordering relation: a sequence in which they shall be displayed in reports, and a sequence that is derived from the biological semantics of the states. A display sequence is defined for nominal and ordinal categorical data types, the inner ordering only for ordinal data. Ordinal data express the knowledge that, e. g. for states 1-2-3, the distance from 1 to 3 is longer than that from 1 to 2, even though the distances themselves are undefined (compare the cardinal scale (= integer data) with equal distances and the interval scale (= real numeric data) with known distances).

However, the sequence of states in the terminology is not necessarily identical with the sequence in specific object or class descriptions. For example, a character with the ordered states:
   1. very short bristles
   2. short bristles
   3. long bristles
   4. very long bristles
may appear in a specific object description as "4 or rarely 3" as well as "3 or perhaps 4". To distinguish the sequence used in a description from the state order defined in the terminology, the former is called the "scoring sequence" in SDD documents.

Usage in existing applications

Arguments for "scoring sequences" in object descriptions

Ideally an explicit "scoring sequence" should not be necessary in a descriptive data standard and all state ordering should be achieved using a set of rules.

In a report (e. g. natural language description), multiple states within a single character in a description can be ordered by:

These rules would already handle the examples above appropriately. It is, however, difficult to capture all the information in an analytically accessible way. Firstly, in many cases there is no explicit information on whether the state order in a description is significant or not. Most biologists will assume that "round or obovate" implies an unequal (but unspecified) frequency of round and obovate and is therefore different from "obovate or round". One may view this as a "bad habit" in biology, but it is probably unwise to force people to abandon such habits, and in the case of existing descriptions that are being recoded it may be impossible to provide analytically accessible information like frequency values.

Furthermore, even where additional information is present, this information may not be analytically accessible. For example, the rules do not cover "flower blue, or violet (at the base)" or "flower violet changing to red when mature". The modifier phrases "at the base", "changing to", and "when mature" are general modifiers for which no adequate machine-readable semantics are defined in the current version of SDD. In some cases a rule "2b" (order states without modifiers before states with modifiers) may work, but this would need extensive testing. Especially in characters with a large number of states that may be concurrently present, this rule is likely to produce undesirable results.

As a consequence, a method to influence the sequence of multiple states in an object description is frequently requested by biologists.

Problems with "scoring sequences" in descriptions

DELTA always defines the state sequence in each individual description (= item). This is disadvantageous if the terminology is revised and the order of states in the terminology redefined. Furthermore, if data are aggregated (e. g. descriptions of multiple specimens combined into a new species description), the scoring sequence may have to be ignored, unless the scoring sequences in the aggregated descriptions are compatible.

In many cases the scoring sequence is irrelevant and it would be more desirable to view the states in the order in which they are defined in the terminology. Two solutions exist to achieve this:

The first solution has the advantage that no decisions are necessary during data entry. It is problematic if the object descriptions are federated across multiple providers and therefore difficult to resequence. Furthermore, resequencing can only be applied to characters where non-standard sequences have never been used.

The second solution has the advantage that changes in the terminology are dynamically and automatically reflected in reports (even in federated databases where the terminology may be changed independently of the descriptions). It makes data entry slightly less intuitive, since in rare situations the user must be aware of the difference between an explicitly defined state sequence and the default sequence. This is especially the case if the desired scoring sequence in a given description is identical to the sequence in the terminology, but should not change if the terminology is changed in the future. To some extent the application can try to foresee this situation, e. g. if the presence of special modifiers is detected.

Summary of options to define state sequence in descriptions

SDD must make a choice among the following options:

Option 1 (terminology sequence only)

The sequence of states is defined only by the sequence in the terminology. It is not possible to define a different sequence of states in specific descriptions. This option is ideal for distributed databases where the description of a single class may be obtained from several federated database nodes.

Option 2 (description sequence only)

The sequence of states is defined separately in each description. The sequence of states in the terminology is ignored when reporting descriptions (it is used only when reporting the terminology). This option was chosen by the CSIRO DELTA programs.

Option 3 a (optionally description sequence, per character definition)

Each character in a description provides an optional "Sequence" element with the only possible values "terminology" and "description". If the element is missing, the default value is "terminology", signalling that the sequence of states within this character is not informative and that the sequence of states in the terminology shall be used instead. If the value is "description", the sequence of states in the description is used.
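
The semantics of Option 3 a can be illustrated with a minimal sketch. It is not part of the SDD schema: the function name display_order, its parameters, and the representation of states as plain strings are assumptions made for illustration only.

```python
def display_order(scored_states, terminology_order, sequence=None):
    """Return the states of one character in the order they should be reported.

    scored_states: state names in the order they were entered in the description.
    terminology_order: all state names in the order defined in the terminology.
    sequence: value of the optional "Sequence" element, or None if it is missing.
    """
    if sequence is None:
        # A missing element defaults to "terminology".
        sequence = "terminology"
    if sequence == "description":
        # Entry order is informative and is kept unchanged.
        return list(scored_states)
    if sequence == "terminology":
        # Entry order is not informative; re-sort by terminology position.
        rank = {state: i for i, state in enumerate(terminology_order)}
        return sorted(scored_states, key=lambda state: rank[state])
    raise ValueError("Sequence must be 'terminology' or 'description'")


terminology = ["very short bristles", "short bristles",
               "long bristles", "very long bristles"]
scored = ["very long bristles", "long bristles"]  # entered as "4 or rarely 3"

print(display_order(scored, terminology))
print(display_order(scored, terminology, "description"))
```

With the element missing, the report follows the terminology ("long bristles" before "very long bristles"); with "description", the entry order is preserved.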

Option 3 b (optionally description sequence, per state definitions)

Each state in a description has an optional element "ScoringSequence" containing positive integer numbers. If the element is missing, no explicit scoring sequence is defined and the state sequence defined in the terminology is used for output purposes.
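
A comparable sketch for Option 3 b follows. The text above does not say what happens when only some states of a character carry a ScoringSequence; the sketch assumes that the explicit sequence is used only if every scored state carries one, and that the terminology order is used otherwise. The class ScoredState and the function output_order are hypothetical names.

```python
from typing import NamedTuple, Optional


class ScoredState(NamedTuple):
    label: str
    terminology_position: int               # position of the state in the terminology
    scoring_sequence: Optional[int] = None  # optional "ScoringSequence" value


def output_order(states):
    """Return state labels of one character in output order."""
    # Assumption: an explicit sequence applies only if all states define it.
    if states and all(s.scoring_sequence is not None for s in states):
        ordered = sorted(states, key=lambda s: s.scoring_sequence)
    else:
        # No explicit scoring sequence: fall back to the terminology order.
        ordered = sorted(states, key=lambda s: s.terminology_position)
    return [s.label for s in ordered]


explicit = [ScoredState("long bristles", 3, 2),
            ScoredState("very long bristles", 4, 1)]
implicit = [ScoredState("very long bristles", 4),
            ScoredState("long bristles", 3)]

print(output_order(explicit))
print(output_order(implicit))
```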

Decision

The topic was first raised in Brazil 2002 ("Sequences of states"), after which the first version of this document was prepared. The topic was discussed again in Lisbon 2003, with the following conclusion:

The first version of SDD attempts to use a per-description, per-character data element "Sequence" (Option 3 a, optionally description sequence, per character definition). Experiences with this choice and arguments for and against it are welcome and will be added to this document.

Scoring sequences are only available to categorical states and coding status values. They are not supported for statistical parameters. Repeated observations in an observation set (see @@@@) are a higher level of ordering and not affected by ScoringSequence elements.

Conclusions based on the choice above

It remains the choice of the application whether it supports per-description state sequences always (like CSIRO DELTA), optionally based on a decision of the user, or never. If the application supports a user choice, it is the decision of the application developers which state sequence is considered the default: "terminology" (i. e. by default the sequence of scoring states is considered irrelevant, and the user must take action to record the entry sequence) or "description" (the entry sequence is stored, and the user must take action to ignore it and use the terminology default). However, writers of SDD-conforming applications are encouraged to respect the Sequence element in the following way:

Recommendations for application developers:

With Sequence = "terminology", the state sequence in descriptions should be based on the order of states in the terminology. However, if frequency or certainty modifiers are present in the scored states, these should override the terminology-based ordering in a description. The following ordering precedence is proposed:

Recommendation for database implementation

Rather than storing the character-level Sequence attribute, a relational database application may consider the following design: Each state in a description carries a required integer attribute ManualSequence with the following semantics:
  = 0 : use the sequence defined in the terminology (database default value)
  > 0 : define a description-specific state sequence
By default, all states added will remain at 0. If the sequence is manipulated by the user, ManualSequence permanently stores ascending numbers > 0.

The advantage of this approach is that sorting descriptions becomes a simple clause (using pseudo-fieldnames): "Order By CharacterSequence, ManualSequence, StateSequenceInDefinition". All fields can be indexed, and ordering will be much faster than ordering by a calculated value containing if/else logic that evaluates the Sequence attribute.
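
The ordering clause can be tried out with a small sketch that uses an in-memory SQLite database. The table name ScoredState, the Label column, and the second character with the states "round" and "obovate" are assumptions made for illustration; only the three sort fields are taken from the pseudo-fieldnames above.

```python
import sqlite3


def ordered_states(rows):
    """Return state labels sorted by the three-field Order By clause.

    rows: iterable of (CharacterSequence, ManualSequence,
                       StateSequenceInDefinition, Label) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ScoredState ("
        " CharacterSequence INTEGER NOT NULL,"
        " ManualSequence INTEGER NOT NULL DEFAULT 0,"
        " StateSequenceInDefinition INTEGER NOT NULL,"
        " Label TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO ScoredState VALUES (?, ?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT Label FROM ScoredState "
        "ORDER BY CharacterSequence, ManualSequence, StateSequenceInDefinition"
    )
    labels = [row[0] for row in cursor]
    conn.close()
    return labels


rows = [
    # Character 1: order never manipulated, ManualSequence stays at 0,
    # so the terminology order decides.
    (1, 0, 3, "long bristles"),
    (1, 0, 2, "short bristles"),
    # Character 2: user entered "obovate or round", stored as 1, 2,
    # although the (assumed) terminology lists round before obovate.
    (2, 1, 2, "obovate"),
    (2, 2, 1, "round"),
]
print(ordered_states(rows))
```

Because all states of an unmanipulated character tie at ManualSequence = 0, the third sort field decides their order; once ascending numbers > 0 are stored, the second field decides and the terminology position no longer matters.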

See also "Different collection types in the SDD schema" for a general overview of the occurrence of sequence and set collections in the SDD schema.

Request for discussion

Please send your criticism or suggestions to the SDD mailing list or to the author.

Gregor Hagedorn; Vers. 2; 31. October 2003
Earlier versions: Version 1

First published 2003-03-11, last update: 2003-10-31.