TDWG working group: Structure of Descriptive Data (SDD)
This is a fragment that was removed from the Special state/Missing data indicator document starting with version 2.0 of that document. The references to default states in that document are relevant; however, they are confusing because any decision about implied (etc.) states first requires a decision on how to support an item hierarchy.
DELTA defines an "implicit values" directive like: "*IMPLICIT VALUES 9,2 10,1 16,2 20,2 32,1 65,2 67,3 71-72,2 75,2". The states are defined globally for an entire DELTA file/Character definition. The directive serves two purposes:
A major problem that occurs with this is that the assumption of what is considered implicit depends on the taxonomic group. The DELTA approach to globally define default states for the entire project works well in small projects, but tends to fail in projects encompassing a wider taxonomic diversity.
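To make the global nature of the directive concrete, the following minimal sketch expands the example quoted above into a character-to-state mapping. It is an illustration only: the function name `parse_implicit_values` is hypothetical, and the parser handles just the two token forms that occur in the quoted example (`character,state` and `first-last,state`), not necessarily the full DELTA syntax.

```python
def parse_implicit_values(directive: str) -> dict[int, int]:
    """Expand an implicit values list into {character number: state number}.

    Hypothetical helper; only handles "c,s" and "c1-c2,s" tokens.
    """
    tokens = directive.split()
    # Drop the directive name if the caller passed the whole line.
    if tokens[:2] == ["*IMPLICIT", "VALUES"]:
        tokens = tokens[2:]
    implicit: dict[int, int] = {}
    for token in tokens:
        chars, state = token.split(",")
        first, _, last = chars.partition("-")
        # A single character number is treated as a range of length one.
        for number in range(int(first), int(last or first) + 1):
            implicit[number] = int(state)
    return implicit


implicit = parse_implicit_values(
    "*IMPLICIT VALUES 9,2 10,1 16,2 20,2 32,1 65,2 67,3 71-72,2 75,2"
)
```

The resulting mapping has a single entry per character and no notion of taxon, which is exactly why one set of assumptions has to hold for every item in the file.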
For various purposes, but especially for the purpose of managing the special "scope-out" state, the SDD standard provides an attribute "CharacterDefaultState" for character definitions. This default is applied to each character if a new description is created, or to all descriptions if a new character definition is created.
Gregor NEW: do we really need this? I almost believe we don't! It would be much more useful to have the default setting in the item tree and inherit certain values downward. Disadvantage: this is never global, i.e. the designer of a terminology has no influence of this.
The "CharacterDefaultState" can be set by the designer of the terminology to either a missing data indicator (missing values, scoped-in or out) or to a normal state. By default (if the designer makes no other decisions), the value of "CharacterDefaultState" is "not yet scored". [@ Schema may enforce this, needs checking in next review @] The default could also be a normal state, which is especially useful in the case of "pseudo-characters" or "management-characters" (Example of a "management character": Taxon is: 1. ready for release; 2. adequately revised but not yet to be released; 3. inadequately revised).
The mechanism proposed here is related to the DELTA "implicit states". However, it can only be defined for a single state per character. (Note that this is an arbitrary limitation in an attempt to simplify the design and ease the burden on implementers. It could be relaxed if sufficient arguments or data challenges are brought forward!)
The proposal goes beyond DELTA, however, in that it is planned to:
The DELTA "implicit states" directive is a global directive. It is very useful for improving data entry efficiency in small projects, but quickly becomes useless in larger projects, since few default assumptions hold over a larger group of taxa. Furthermore, it is not explicitly defined to operate on missing data indicators (although this may be possible in some DELTA-compatible applications).
Note that the mechanism to define taxonomic groups has not yet been decided on. This is a major hurdle in the SDD process! On the one hand we agree that taxonomic hierarchy, synonymy and nomenclatural data are best handled by a different linked data area; on the other hand we need hierarchy information for operations within descriptive data. This also applies to the problem of whether descriptive data should be inherited from above or below. We probably need an internal, coarse and operational hierarchy of the taxa that are described. However, ideally the mechanism should be smart enough to detect the necessary changes if, e. g., the name of a description is changed (which can happen automatically if a descriptive data application retrieves updated specimen identifications from a collection data provider).
Conclusion: The mechanism to bind default states to a taxonomic hierarchy cannot yet be defined. Default states should probably be inherited down the taxonomic tree, similar to normal states inherited from higher taxa ("inferred from above"). This would allow the default states for each character to be defined and redefined in the terminology at different levels. Perhaps a special type of abstract description object should be introduced: objects which are structurally identical to normal descriptions, but are part of the terminology and are used to define default (= implicit) states. A discussion about taxonomic hierarchies should be a priority at one of the following meetings.
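Since the binding mechanism is undecided, the following is only a sketch of what "inherited down the taxonomic tree" could mean operationally. All names (`TaxonNode`, `resolve_default`, the taxon and character labels) are hypothetical and not part of SDD; the coarse internal hierarchy is assumed to be a simple parent link.

```python
from dataclasses import dataclass, field
from typing import Optional

NOT_YET_SCORED = "not yet scored"  # assumed global fallback default


@dataclass
class TaxonNode:
    """One level of an assumed coarse, operational taxon hierarchy."""
    name: str
    parent: Optional["TaxonNode"] = None
    # character name -> default state (re)defined at this level
    defaults: dict = field(default_factory=dict)


def resolve_default(node, character, global_default=NOT_YET_SCORED):
    """Return the nearest default defined at or above `node`."""
    current = node
    while current is not None:
        if character in current.defaults:
            return current.defaults[character]
        current = current.parent
    return global_default


# Hypothetical hierarchy: the genus redefines the family-level default.
family = TaxonNode("Family A", defaults={"character 1": "state 2"})
genus = TaxonNode("Genus B", parent=family, defaults={"character 1": "state 1"})
other_genus = TaxonNode("Genus C", parent=family)
species = TaxonNode("Species b1", parent=genus)

species_default = resolve_default(species, "character 1")
other_default = resolve_default(other_genus, "character 1")
undefined_default = resolve_default(species, "character 2")
```

Here the nearest definition wins, so a default can be redefined at any level without affecting sibling groups.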
(This was not yet discussed in Brazil)
If descriptive information is not readily available, but it is reasonable to assume that the information is identical with that of a parent object (e. g. the family of a species currently being described), it is desirable that the assumption rather than the inherited data is recorded. In biology, feature inheritance is often incomplete or fuzzy, i. e. although most species in a family have the features considered characteristic for the family, some may not have them. The reasons for this may be incomplete knowledge about variation, or the presence of aberrant cases which are purposely ignored to keep the family description memorable. If the character state deduced from the family description were simply scored in each species, the descriptions could not be revised in the future, because it would be impossible to distinguish between knowledge about studied characters and assumptions deduced from the taxonomic hierarchy.
Two methods to record the fact of an assumption are possible:
The first method has the advantage that the result of the inheritance is already "cached" in the data, making it easier to query data. This should be of secondary importance in regard to the SDD document standard, however. The advantage is balanced by the disadvantage that changes in the parent are not automatically reflected in the children.
Both methods allow inherited states to be combined with newly added ones, provided the "inherit" missing data indicator can be combined with other states:
Modifier method:
States of char. 1:  1  2  3
Parent:             X  X  -
Child:              M  M  X
Grandchild:         M  M  M
Legend: X = scored, M = physically present state, marked by modifier as inherited
Missing data indicator method:
States of char. 1:  1  2  3  "inherit" (= missing data indicator)
Parent:             X  X  -  -
Child:              I  I  X  X
Grandchild:         I  I  I  X
Legend: X = scored, I = inherited, implicitly present because "inherit" missing data indicator is scored.
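The difference between the two methods can be sketched in a few lines. This is an illustration under assumptions, not a proposed data model: the names `INHERIT`, `copy_with_modifier` and `effective_states` are hypothetical, and the rule that data entered without the indicator replaces the parent's states is one possible reading of the tables above.

```python
INHERIT = "inherit"  # hypothetical name for the "inherit" missing data indicator


def copy_with_modifier(parent_states, own_states):
    """Modifier method: copy the parent's states physically, flagged as inherited."""
    result = {state: "inherited" for state in parent_states}
    result.update({state: "scored" for state in own_states})
    return result


def effective_states(lineage):
    """Missing data indicator method: resolve states at read time.

    `lineage` lists the scored state sets from the topmost parent down to
    the description of interest.
    """
    effective = set()
    for scored in lineage:
        own = scored - {INHERIT}
        # With the indicator, parent states are kept; without it, they are dropped.
        effective = (effective | own) if INHERIT in scored else own
    return effective


# Character 1 with states 1, 2, 3, as in the tables above.
parent = {1, 2}

# Modifier method: the copies are stored in the child and grandchild.
child_m = copy_with_modifier(parent, {3})
grandchild_m = copy_with_modifier(child_m, set())

# Missing data indicator method: only the indicator is stored.
child_i = {3, INHERIT}
grandchild_i = {INHERIT}
resolved = effective_states([parent, child_i, grandchild_i])

# A later change in the parent reaches the grandchild only in the second method.
revised_parent = {1}
resolved_after_change = effective_states([revised_parent, child_i, grandchild_i])
```

In this sketch `grandchild_m` still lists state 2 after the parent is revised, whereas `resolved_after_change` no longer does, which mirrors the caching trade-off described above.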
Open issues: You may want to inherit some information from the parent but not other information, because the character is too variable. You may also want to contradict the parent! If you enter data in a character, you probably want the inheritance process from the parent to stop: the new data reset the collection of states present in the entire character. However, is it necessary to be able to inherit most states in a character, but contradict selected states? This seems to be impossible with the missing data indicator method!
Conclusion: None yet. My feeling is that we may need both the missing data indicator (always inherit from parent) and the modifier method (this information was inherited). The missing data indicator method reports a general assumption that can be updated by a processor when the parent description changes; the modifier method reports the result of an active interpretation process. In the latter case information is compiled or inherited once through a method in the application, but then revised. Only the modifier method may allow comments/annotations to be added to inherited states. (This needs to be discussed, since it implies that the annotations of categorical states are not inherited, but always local!)
BioLink seems to implement both options (email S. Shattuck to SDD-list, March 13, 2003).
Note that the discussion above only refers to inheritance from above, not to compilation/aggregation/collation from children, which has similar, but slightly different problems.
The situation discussed here is that no work was done so far, i. e. the character in an object description has not yet been studied and not even the status (whether it will be possible to enter data, or whether this is desirable) has been evaluated.
Initially it seems that the case is well expressed by omitting any character data from the object description. This solution was chosen in the DELTA format, where "not yet evaluated" is expressed by not coding the character in the item description at all.
Omitting data is, in principle, an informative situation. It does, however, have special properties when a changing terminology ("schema evolution") is considered. If additional characters are introduced in the terminology, they will automatically have the value of the state that is expressed through omitting data. If another default for newly introduced characters is desired, it may be necessary to introduce a missing data indicator "not yet evaluated" that is explicitly present in the SDD documents for this case. This is especially relevant if characters are "scoped out" (i. e. a decision is made not to score even where this may be possible, see "Do not want to score" below).
(Note: much of the following discussion needs to be revised and moved into a document about character default states)
Scenario: A project defines a central terminology which is used by multiple object description data bases at various locations, communicating through SDD xml documents. From time to time the terminology is revised and updated. It would be desirable to structure SDD in a way that all necessary information is automatically contained in the SDD documents and no separate inter-process-communication is required to deal with implications of changing terminology. The descriptions should either be automatically valid under the new terminology, or the necessary clean-up operations should be clearly defined and as simple as possible.
Bob: This is a generic issue, not specific to missing data indicators. It probably doesn't deserve detailed discussion here. (b). I thought we agreed that project management problems were not part of SDD.
One frequent situation is that characters are added to the terminology. Possible solutions to achieve a synchronization between descriptions and a separate terminology in this case are:
The first option in the list should be considered only if it solves other problems as well. Comparing a full terminology and deciding which changes must be made is a relatively complex process that would be prone to conceptual and implementation errors.
The second option requires a duplication of character references already present in the description. It may be easier for some processors to process a separate and complete list of characters, but this seems to be of lesser importance. If only the missing characters are recorded, the second option becomes identical to the third option, except that the information may be placed in different structures.
The third option may cause larger files than the previous options. However, as descriptive projects mature, not many characters will be left "not yet evaluated"; rather a frequent occurrence of "ScopedOut" is more likely (esp. if the terminology is very large and covers diverse groups). Note that the "non-coding" (omission of a character in the description) here represents a different kind of information, i. e. "previously not present in the terminology".
Default states for characters (which may be defined in the terminology, see the chapter "Character default states" below) have not been discussed so far. Implicitly the previous options assume that if a default state is defined for a character, this is inserted as data at some point during the revision of the descriptions. In the case of new object descriptions it is reasonable to insert the default states during creation. If the default state in a character definition is changed, the question is whether existing descriptions should be changed or not. Note that it is erroneous to assume that default states should logically be inserted only where data are missing. "Not yet evaluated" is a state, whether it is expressed through omitting the character coding or through an explicit state code. Changing "not yet evaluated" to another special or normal state is therefore an overwriting or precedence rule that is independent of the coding issue, and several such rules could possibly be defined. If such rules are indeed defined, some attention should be given to the possibility that a default state definition is erroneous and must be reversed at a later time. Overwriting existing states by the changed default state should then either be an explicit user action, or perhaps it should be protected in a transaction to allow a reversal for some time (e. g. until the application session is terminated).
Default states for characters could be used to offer a valuable feature in descriptive data sets: A special character state ("use default") could indicate that the default state of the character should be treated as if it were scored in the description. If the default state definition is changed, all such states would dynamically be reassigned.
Bob: this is back to history mechanisms. A consumer of an SDD document may need the whole history of changes in the value of the default state, unless it is acceptable that all identifications previously affected by that value are suddenly changed.
The fourth option proposes such a solution. For example, a character could initially be defined with a default state of "not yet evaluated". Whereas some descriptions are explicitly scored using different states, most are left at "use default". As the project matures and the insight into the value of scoring certain characters increases, in many cases the character default state in the terminology could be changed to "scoped out". All descriptions containing the "use default" representation would then no longer be marked as "unfinished work". Again it should be noted that this solution is independent of whether the "omitted coding" representation is used for this purpose or not.
If the default state may be the "not yet evaluated" state, it is logical to conclude that the default state should be inserted into descriptions if characters are added to the terminology. Since the "omitted coding method" is the best choice for the state that should be added for newly created characters, the "use default" state should be expressed by this method.
From this analysis follows, however, that an additional explicit state code is needed for "not yet evaluated". The functions of "use default" and "not yet evaluated" remain different. Without such a code it would be impossible to declare in a character with a default state "scoped out" that the character in a given item should be scored, but hasn't been so far. Furthermore, an explicit state code is required for the declaration of the default state attribute of the character.
Summary: It is more appropriate to discuss the problem of "not yet explicitly coded" in terms of:
Given this more exhaustive list, any character must have a default state defined. The default state could be one of the above states, another missing data indicator, or a normal state.
General Note: "Use default" is not a pointer to data present elsewhere to speed up data entry or save storage space, but rather an explicit decision that information implied from higher taxa is appropriate!
(This was NOT yet discussed in Brazil:) The omission of character coding in a description should be used to express the "use default" situation. If each character must have a default state, assuming the default state is the only reasonable action possible for characters added to the terminology. If omission of character coding is used for "use default", a separate "character not yet present in terminology at time of last revision" is redundant, and no action at all needs to be taken in the description if new characters are added to the terminology. The "not yet evaluated" state must be declared as a separate explicit code. It will, however, rarely be inserted explicitly in descriptions; rather it will most often be expressed through inheritance from the default state.
See discussion of Character default states and ScopedOut below! Some duplication of discussion present at the moment, needs revision!
The definition for the proposed missing data indicator is:
Where this state is present in object descriptions, processors should treat the data as if the character default state defined in the terminology had been scored. However, this should be a dynamic assignment. Without an explicit request from the content authors, the processor may not insert the default states as explicit state codes. If the default state in the character definition is changed, all "use default" states inherit the new character default state.
No explicit code is defined for the "use default" state in SDD. Instead, it is expressed by not coding the character in an object description at all. This coding method has the desirable property that characters newly introduced in the terminology automatically assume their default state.
This state represents the true "not yet evaluated" state in that no active decisions have been taken yet, especially that not even the status (whether it will be possible to enter data, or whether this is desirable) has been evaluated. It differs from all other missing data indicators in that it has no other semantics than to inform the information consumer that no specific decision was taken. The analytical semantics to be used when processing data are always inherited from the character default state (which may then either express descriptive data or reasons why descriptive data are missing).
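The dynamic behaviour described in this definition can be sketched as follows. The representation is an assumption made for illustration only: the terminology is reduced to a mapping from character to default state, a description stores only explicitly coded characters, and the names `effective_state` and `unfinished` are hypothetical.

```python
NOT_YET_EVALUATED = "not yet evaluated"
SCOPED_OUT = "scoped out"

# Assumed minimal terminology: character -> character default state.
terminology = {"character A": NOT_YET_EVALUATED, "character B": "state 1"}

# A description stores explicit codings only; omission means "use default".
description = {"character B": "state 2"}


def effective_state(description, terminology, character):
    """Resolve "use default" at read time; nothing is written into the description."""
    return description.get(character, terminology[character])


def unfinished(description, terminology):
    """Characters that currently resolve to "not yet evaluated"."""
    return [
        character
        for character in terminology
        if effective_state(description, terminology, character) == NOT_YET_EVALUATED
    ]


unfinished_before = unfinished(description, terminology)

# The project matures: the default of character A becomes "scoped out".
terminology["character A"] = SCOPED_OUT
unfinished_after = unfinished(description, terminology)

# A character added to the terminology assumes its default automatically.
terminology["character C"] = "state 1"
new_character_state = effective_state(description, terminology, "character C")

# An explicit "not yet evaluated" still overrides a "scoped out" default.
description["character A"] = NOT_YET_EVALUATED
unfinished_explicit = unfinished(description, terminology)
```

Because the description is never modified by a change of default, reversing an erroneous default definition only requires changing the terminology back.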
Implementation notes:
1. DELTA data, which use omitted character coding for "not yet evaluated", are best imported by first importing all "item descriptions" under the assumption that "not yet evaluated" is the default state, and only afterwards setting the default state to the state defined as "implicit state" in DELTA.
2. The question whether to provide explicit state codes for some data entry status or to omit the character coding must be evaluated separately for the SDD document standard and for internal data storage in applications. For example, finding missing records in relational databases is a relatively expensive operation. It may be good practice to express a situation that is expressed through "omitted character coding" in the document with an actual value in the database. Such decisions should, however, have no bearing on the SDD standard.
3. The "use default" state, if expressed internally through some value, may be selectable by the user in object descriptions (to express "no, I no longer want to make an explicit statement here, return to the default"), but it may not be selectable as a character default state.
Note Bob Morris: It might easily be addressed if there is a history mechanism, e. g. each state in a description contains the date or rev of its scoring and each state definition in the terminology carries the date or rev of its addition to the terminology. This need not be onerous if the default for these attributes is the initial date or rev. Except in a very volatile Terminology, most would need nothing added. Applications that wished to do knowledge management would have to take care to notice when states have been updated from the initial terminology.
Please send your criticism or suggestions to the SDD mailing list or to the author.
Gregor Hagedorn; Vers. 2; 28. August 2003