SDD proposal: Special states declaring missing data

TDWG working group: Structure of Descriptive Data (SDD)

Note on the first version of this document

This text is based on the discussions on the last day in Brazil (SDD/TDWG 2002) but goes somewhat beyond them. I am not happy with the presentation, but I feel things are getting clearer, at least to me. I will keep this version as version 1, since it approximately documents the discussion in Brazil (which was not yet presented in detail in the minutes for lack of time and because the material discussed was so complex). In a way the document falls between two stools: it is partly minutes and partly an organised discussion.

The next version will be reorganized and rewritten!

I fully realize that the document needs to be reworked to

Bob revised the current document a first time, helping me with the logic and improving my English. You therefore already find some of his comments in the document.


Introduction

In descriptive data - as well as in all other data areas - data can be missing for a variety of reasons. Databases usually use a Null or Nothing value to indicate missing data or incompleteness [1]. For example, the Microsoft SQL Server 2000 documentation gives "absence", "unknown", "undefined", "not applicable", and "to be added later" as possible meanings of the Null value.

To distinguish between the various kinds of missing data, a richer terminology is proposed in this document. Such a terminology is especially important for large projects, where most communication between the collaborating scientists must be achieved through the descriptive data itself.

Information about missing data may be partly an internal management issue (e. g. records marked as incomplete so that they can be finished before a public release). To a large extent, however, it is knowledge in its own right and is informative to information consumers. Some types of data are definite: missing information cannot be supplied at a later time, and data are either valid or have to be rejected (e. g. banking transactions). Descriptive data in biology, in contrast, are potentially never complete. Both for abstract taxonomic objects like species and for physical specimens, science will continually detect new descriptive information and correct previous errors in the descriptions. Descriptive data sets accordingly can only achieve a "well revised" status, never a "finished" one. Documenting the lack of knowledge and the intensity of revision is essential for continuing the scientific process of recording descriptive data.

In fact, the entire process of biodiversity description can be seen as a global collaborative project across borders, languages, and generations. Each generation of researchers is building on the progress of the previous generation. Improving this intergenerational communication and the manageability of this global project is consequently an important aim of the SDD standard.

The SDD format is not only concerned with the final and definite result of a scientific process which occurs elsewhere and is hidden from the discussion. Rather, it must be able to adequately reflect the scientific process itself. Knowledge management of missing data is an essential part of this task.

Current situation

The following missing data values are defined in existing applications for descriptive data:

The basic DELTA special states either explain why the data should remain incomplete, or alert the author to revisit and revise the information. In contrast, the expressiveness added by LucID, DeltaAccess, and BioLink all modifies existing categorical or numerical values (see "Supported as modifier" below).

Selection of appropriate name for "missing data values" or "special states"

The following terms for indications of incompleteness, missing values, or data entry status were discussed at the SDD meeting in Brazil (2002):

No clear preference was found; the term "special states" was temporarily adopted [2]. In the first version of an SDD format these missing data indicators are most likely applicable to characters only. However, in principle the semantic information of some states may be applicable to other data as well.

Note: If you consider one of the above or any other states more appropriate, please make an argument for it on the mailing list!

Special states supported in SDD

In general, the status of data entry falls into 4 categories:

The first category does not need any special handling since the status is evident from the data. For the purpose of identification, the last three categories are subclasses of a general "unknown" or missing data class. They are discussed in "Supported special states" below.

At the SDD meeting in Brazil (2002) the following special states were discussed and accepted for the straw man version of the schema:

Overview:

All these special states apply to entire characters. SDD currently provides no mechanism to express partial knowledge within a character (e. g. to score one state as not applicable, but another as present). Theoretically it is possible that a character is only partly entered, but this is not likely in a well designed terminology (it does happen where character clusters are combined into pseudo-characters, but such a design is considered problematic).

1. Data have not yet been entered

Introduction

The situation discussed here is that no work was done so far, i. e. the character in an object description has not yet been studied and not even the status (whether it will be possible to enter data, or whether this is desirable) has been evaluated.

Initially it seems that the case is well expressed by omitting any character data from the object description. This solution was chosen in the DELTA format, where "not yet evaluated" is expressed by not coding the character in the item description at all.

Omitting data is, in principle, an informative situation. It does, however, have special properties when changes to the terminology ("schema evolution") are considered. If additional characters are introduced in the terminology, they will automatically have the value of the state that is expressed through omitting data. If another default for newly introduced characters is desired, it may be necessary to introduce a special state "not yet evaluated" that is explicitly present in the SDD documents for this case. This is especially relevant if characters are "scoped out" (i. e. a decision is made not to score even where this may be possible, see "Do not want to score" below).

(Note: much of the following discussion needs to be revised and moved into a document about character default states)

Scenario: A project defines a central terminology which is used by multiple object description data bases at various locations. From time to time the terminology is revised and updated. It would be desirable to structure the SDD standard in a way that all necessary information is automatically contained in the SDD documents and no separate inter-process-communication is required to deal with implications of changing terminology. The descriptions should either be automatically valid under the new terminology, or the necessary clean-up operations should be clearly defined and as simple as possible.

Bob: This is a generic issue, not specific to special states. It probably doesn't deserve detailed discussion here. (b). I thought we agreed that project management problems were not part of SDD.

One frequent situation is that characters are added to the terminology. Possible solutions to achieve a synchronization between descriptions and a separate terminology in this case are:

The first option in the list should be considered only if it solves other problems as well. Comparing a full terminology and deciding which changes must be made is a relatively complex process that would be prone to conceptual and implementation errors.

The second option requires a duplication of character references already present in the description. It may be easier for some processors to process a separate and complete list of characters, but this seems to be of lesser importance. If only the missing characters are recorded, the second option becomes identical to the third option, except that the information may be placed in different structures.

The third option may cause larger files than the previous options. However, as descriptive projects mature, not many characters will be left "not yet evaluated"; rather a frequent occurrence of "ScopedOut" is more likely (esp. if the terminology is very large and covers diverse groups). Note that the "non-coding" (omission of a character in the description) here represents a different kind of information, i. e. "previously not present in the terminology".

Default states for characters (which may be defined in the terminology, see the chapter "Character default states" below), have not been discussed so far. Implicitly the previous options assume that if a default state is defined for a character, this is inserted as data at some point during the revision of the descriptions. In the case of new object descriptions it is reasonable to insert the default states during creation. If the default state in a character definition is changed, the question is whether existing descriptions should be changed or not. Note that the assumption that it is logical that default states should be inserted only where data are missing is erroneous. "Not yet evaluated" is a state, whether it is expressed through omitting the character coding or through an explicit state code. Changing "not yet evaluated" to another special or normal state is therefore an overwriting or precedence rule that is irrespective of the coding issue, and several such rules could possibly be defined. If such rules are defined indeed, some attention should be given to the possibility that a default state definition is erroneous and must be reversed at a later time. Overwriting existing states by the changed default state should then either be an explicit user action, or perhaps it should be protected in a transaction to allow a reversal for some time (e. g. until the application session is terminated).

Default states for characters could be used to offer a valuable feature in descriptive data sets: A special character state ("use default") could indicate that the default state of the character should be treated as if it were scored in the description. If the default state definition is changed, all such states would dynamically be reassigned.

Bob: this is back to history mechanisms. A consumer of an SDD document may need the whole history of changes in the value of the default state, unless it is acceptable that all identifications previously affected by that value are suddenly changed.

The fourth option proposes such a solution. For example, a character could initially be defined with a default state of "not yet evaluated". Whereas some descriptions are explicitly scored using different states, most are left at "use default". As the project matures and the insight into the value of scoring certain characters increases, in many cases the character default state in the terminology could be changed to "scoped out". All descriptions containing the "use default" representation would then no longer be marked as "unfinished work". Again it should be noted that this solution is independent of whether the "omitted coding" representation is used for this purpose or not.

If the default state may be the "not yet evaluated" state, it is logical to conclude that the default state should be inserted into descriptions if characters are added to the terminology. Since the "omitted coding method" is the best choice for the state that should be added for newly created characters, the "use default" state should be expressed by this method.

From this analysis follows, however, that an additional explicit state code is needed for "not yet evaluated". The functions of "use default" and "not yet evaluated" remain different. Without such a code it would be impossible to declare in a character with a default state "scoped out" that the character in a given item should be scored, but hasn't been so far. Furthermore, an explicit state code is required for the declaration of the default state attribute of the character.

Summary: It is more appropriate to discuss the problem of "not yet explicitly coded" in terms of:

Given this more exhaustive list, any character must have a default state defined. The default state could be one of the above states, another special state, or a normal state.

Conclusion

(This was NOT yet discussed in Brazil:) The omission of character coding in a description should be used to express the "use default" state. If each character must have a default state, assuming the default state is the only reasonable action possible for characters added to the terminology. If omission of character coding is used for "use default", a separate "character not yet present in terminology at time of last revision" is redundant, and no action at all needs to be taken in the description if new characters are added to the terminology. The "not yet evaluated" state must be declared as a separate explicit code. It will, however, rarely be inserted explicitly in descriptions; rather it will most often be expressed through inheritance from the default state.

See discussion of Character default states and ScopedOut below! Some duplication of discussion present at the moment, needs revision!

The definitions for these two proposed special states are:

Use character default state

Where this state is present in object descriptions, processors should treat the data as if the character default state defined in the terminology had been scored. However, this should be a dynamic assignment. Without an explicit request from the content authors, the processor may not insert the default states as explicit state codes. If the default state in the character definition is changed, all "use default" states inherit the new character default state.

No explicit code is defined for the "use default" state in SDD. Instead, it is expressed by not coding the character in an object description at all. This coding method has the desirable property that characters newly introduced in the terminology automatically assume their default state.

This state represents the true "not yet evaluated" state in that no active decisions have been taken yet, especially that not even the status (whether it will be possible to enter data, or whether this is desirable) has been evaluated. It is different from all other special states in that it has no other semantics than to inform the information consumer that no specific decision was taken. The analytical semantics to be used when processing data are always inherited from the character default state (which may then either express descriptive data or reasons why descriptive data are missing).
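The dynamic assignment described above can be illustrated with a minimal sketch in Python. It is only an illustration and not part of the SDD proposal: the names SpecialState and effective_state, the character "petal colour", and the dictionary layout used for the terminology and the description are assumptions made for this example.

```python
from enum import Enum


class SpecialState(Enum):
    # Hypothetical codes for some of the special states discussed in the text.
    NOT_YET_EVALUATED = "not yet evaluated"
    SCOPED_OUT = "scoped out"
    NOT_APPLICABLE = "not applicable"
    UNKNOWN = "unknown"


def effective_state(description, terminology, character):
    """Return the explicitly scored state, or the current character default.

    A character that is absent from the description is read as "use default";
    the default is looked up at query time and never copied into the description.
    """
    if character in description:
        return description[character]
    return terminology[character]


# Assumed layout: the terminology maps each character to its default state.
terminology = {"petal colour": SpecialState.NOT_YET_EVALUATED}
description = {}  # character omitted, i. e. "use default"

before = effective_state(description, terminology, "petal colour")

# The project later decides that the character is not worth scoring.
terminology["petal colour"] = SpecialState.SCOPED_OUT
after = effective_state(description, terminology, "petal colour")
```

Because the description never stores the default, changing the default in the terminology from "not yet evaluated" to "scoped out" changes the effective state without touching the description itself.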

Implementation notes:

  1. DELTA data, which use omitted character coding for "not yet evaluated", are best imported in two steps: first import all "item descriptions" under the assumption that "not yet evaluated" is the default state, and only afterwards set the default state to the state defined as "implicit state" in DELTA.
  2. The question whether to provide explicit state codes for some data entry status or to omit the character coding must be evaluated separately for the SDD document standard and for internal data storage in applications. For example, finding missing records in relational databases is a relatively expensive operation. It may be good practice to represent a situation that is expressed through "omitted character coding" in the document with an actual value in the database. Such decisions should, however, have no bearing on the SDD standard.
  3. The "use default" state, if expressed internally through some value, may be selectable by the user in object descriptions (to express "no, I no longer want to make an explicit statement here, return to the default"), but it may not be selectable as a character default state.

Note Bob Morris: It might easily be addressed if there is a history mechanism, e. g. each state in a description contains the date or rev of its scoring and each state definition in the terminology carries the date or rev of its addition to the terminology. This need not be onerous if the default for these attributes is the initial date or rev. Except in a very volatile Terminology, most would need nothing added. Applications that wished to do knowledge management would have to take care to notice when states have been updated from the initial terminology.

Unfinished work

This special state implies that no work was done so far, except for a tentative first decision that the character should most likely be used in the description. This is the default value of the character default state.

2. Data cannot be entered

The three states proposed in this section are shown in the following overview:

Special state:     information:    collaborator may:                                                        likelihood of revision:
not applicable     cannot exist    detect that the logic why it should be inapplicable is wrong             very rare
unknown            does not exist  decide to research information or study the object herself or himself    medium
not interpretable  exists          be able to interpret the data                                            high

Not applicable (= DELTA state "-")

"Not applicable" is the strongest statement that can be made why data cannot be entered. It expresses the assumption that a character cannot, for logical reasons, contain any data (states, text, or numerical data). The concept and problems can best be discussed in the following examples:

Example 1: leaf margin is inapplicable if there are no leaves.

The inapplicability of the "leaf margin" character can and should be defined in a declarative character dependency. If this is done, no scoring of leaf margin will be necessary. This may lead to the situation that inapplicability is sometimes expressed through declarative character dependency, and sometimes through explicit scoring of "not applicable" where no character dependency has been defined yet.

However, some inapplicability situations cannot easily be expressed through declarative character dependency definitions:

Example 2: growth diameter of a microorganism in 90 mm diameter Petri dishes after 7, 14, 21 days. The result could be 42 mm after 7 days, 82 mm after 14 days and "not applicable" after 21 days. If the Petri dish is completely overgrown, the result cannot be assessed with this method.

Here the inapplicability depends on a combination of method and taxon studied. Although it is possible to devise a declarative mechanism able to express such knowledge, this would be very complicated and seems not desirable. Explicit "not applicable" states are not as expressive (they don't give any explanation why something is not applicable) but are simpler to manage. Furthermore, explicit "not applicable" states allow decoupling of terminology revisions from data entry processes (see the chapter "Relations between declarative character dependency and the "not applicable" special states" below).

Finally, an example where the current mechanisms in SDD are unsatisfactory:

Example 3: The character stipules presence is "inapplicable" in a monocot, since monocots never have any stipules.

This example can be handled in various ways. An obvious solution seems to be to introduce a character "number of cotyledons". However, this is erroneous, since being a dicot does not imply having cotyledons: some plants may lack cotyledons altogether (e. g. in Anisophylleaceae). The next solution could then be to introduce a pseudo-character "Monocot" or "Dicot" and define a character dependency based on this character. This solution is workable and may be chosen for practical reasons. It is inelegant, however, since "Monocot" or "Dicot" is not a directly observable feature but an inference deduced directly from the taxonomic hierarchy. This generates some problems, e. g. that such pseudo-characters may be undesirable in natural language descriptions or during interactive identification.

The stipules example is a specific case of a general situation that arises whenever a character depends on a part hierarchy that differs for various taxonomic groups. Multiple, taxon-dependent part hierarchies are not yet discussed in the SDD process!

Conclusion: SDD should support an explicit "not applicable" special state. See also the chapter "Relations between declarative character dependency and the "not applicable" special states" below.

Unknown = data not found

Data completion status and scrutiny (the effort made to revise the descriptive data) must be communicated not only between collaborators in a project for management purposes, but also to information consumers. It is therefore desirable to distinguish between

The first situation is expressed through the special state "unknown". This indicates that information was searched but could not be found. No scientist can ever verify that information is impossible to obtain, but if proper scientific scrutiny is applied, the "unknown" state of a character in a description does represent a statement about scientific knowledge. This state is equivalent to the DELTA state "U".

Note that the state "U" in DELTA is also used for cases where data are present, but the author is unable to interpret them in current terminology. In SDD this case is handled separately, see "cannot score" below. The distinction will create problems when legacy DELTA data are imported, because the new definition is more precise than the old one. This reinterpretation of existing data is, however, considered tolerable.

It would be possible to define more than one level of scrutiny:

Such a distinction adds expressiveness, but it also makes it more difficult to make a decision each time the scientist wishes to express that data are not known. Currently it is felt that a single state is sufficient. Please comment on this!

Bob: It seems to me that this distinction is outside the purposes of SDD. It is not describing the organism but rather that effort of the authors. It may be a meaningless distinction absent substantial knowledge of the authors themselves. Are they competent to make this distinction? Does it depend on their methodology and is that methodology sufficiently identified in the document?

Rendering of "unknown" in data collations: If two species descriptions are summarized/consolidated/collated (@@ decide on term! See separate discussion!):
  Species A: petals 10 mm long
  Species B: petal length unknown
should the collation be "10 mm long or unknown" or just "10 mm long"? Depending on the purpose, both collations make sense. The explicit collation may be especially useful for revision or critical review purposes, whereas the collation ignoring unknown states is more useful for identification purposes.
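The two renderings can be sketched as a single collation routine with a switch. This is a minimal illustration in Python; the function name collate, the parameter keep_unknown, and the representation of scored values as plain strings are assumptions made for this example, not part of the proposal.

```python
UNKNOWN = "unknown"  # hypothetical code for the "unknown" special state


def collate(values, keep_unknown):
    """Join the distinct values of several descriptions into one statement.

    keep_unknown=True  keeps the special state (revision or critical review).
    keep_unknown=False drops it (identification).
    """
    result = []
    for value in values:
        if value == UNKNOWN and not keep_unknown:
            continue
        if value not in result:
            result.append(value)
    return " or ".join(result)


# Species A: petals 10 mm long; Species B: petal length unknown.
petal_length = ["10 mm long", UNKNOWN]

for_review = collate(petal_length, keep_unknown=True)
for_identification = collate(petal_length, keep_unknown=False)
```

Here for_review evaluates to "10 mm long or unknown" and for_identification to "10 mm long", so the choice between the two collations becomes a property of the report rather than of the data.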

Not interpretable, "cannot score"

In this situation data are known to be present, but the scientist is unable to interpret either published data or the features of the object that is studied directly.

Open issue: Should a general "cannot score" (for any kind of reason) be differentiated into

Regarding the last point:

When interpreting published data the researcher frequently is in a situation that data are difficult to interpret because the terminology in the published source is insufficiently defined or understood, or because the current terminologies and the terminology of the published description are difficult to map on each other.

The options in these cases are:

A machine processor will currently only be able to interpret the situation when certainty modifiers or "cannot score" has been selected.

3. Data could have been entered, but a deliberate decision was made not to enter them

Do not want to score ("scoped out")

Whereas the data entry situations discussed so far are passive decisions ("cannot score"), this is an active decision not to score for the sake of effectiveness ("don't want to"). The most frequent motivation is that a rarely used feature is required to differentiate a small subset of objects.

Two situations can be distinguished:

For new LucID versions it is planned to have a feature that defines characters as being applicable only to defined taxa. This decision would be taken by the designer of the terminology. In most cases this feature is expected to be used not because the character would be inapplicable in the other taxa, but because it would be irrelevant and therefore a waste of time to code it in object descriptions or to use it during identification. To distinguish this intentional not-scoring from character dependency / logical inapplicability it is called character scoping. A character is "scoped out" if it is inapplicable by the scope rules for a given description.

Example: character "base of filaments swollen" is needed for three species in a genus. The distribution of the state in most species is, however, unknown and it is considered to be not relevant or not cost-effective to attempt to score the character for the entire matrix. Therefore, the character is "scoped" to be only applicable to these three species.

Scoping has implications during aggregation/collation: if multiple descriptions are collated that are only partly in the scope of a character (e. g. for a genus collation a character is "scoped-out" in some species whereas it is recorded in others), the collation could either list the recorded states and add "unknown" for descriptions where the character had been scoped out, or it could suppress any character that is not scoped for the entire range of objects being collated.

If a character could be "scoped-out" only by scoring a special state, it could only be applied to existing descriptions. Although this could be done efficiently in a batch operation, two problems remain: a) the decision about scoring is not documented in the terminology, and b) new descriptions would initially always be "scoped-in", and the scoping decisions would have to be repeated.

Which is more likely: Exclude some items from scope ("scope out"), or include some items in scope ("scope in")?

  1. If "scope-in" is more frequent new taxa should be included in the scope of the character.
    Example: "Please don't score the 100 deep sea fishes out of all my fish descriptions".
  2. If "scope-out" is more frequent (which seems likely) new taxa should be excluded from the scope of the character.
    Example: "Please only score the 100 deep sea fishes out of all my fish descriptions".

Both situations occur. One solution to document the scoping decision would be to introduce a boolean character attribute ("in scope"/"out of scope") to record the scoping decision.

However, another solution seems to be possible that integrates well with the existing mechanisms for special states. "Scoped out" is defined as a special state that a) can be explicitly scored in descriptions, and b) can be used as the value of the default state for a character. Setting the default state would document the scoping default in the terminology. Moreover, a general method is planned (@ but not yet formalized! @) in SDD to provide an inheritance tree of multiple default state definitions along the taxonomic hierarchy, which would be very useful for scoping decisions as well.

Conclusion: An implementation for character scoping should not introduce new methods and data items. The functionality can be realized using the mechanism of character default states and a new special state "scoped-out". Handling scoping through a special state has additional advantages during collation and aggregation. Instead of the "unknown" state mentioned above, the correct "scoped-out" state could be added to the collated genus description.
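The collation behaviour discussed for scoped characters can be sketched as follows. This is a hedged illustration in Python: the function collate_genus, the flag suppress_partial, and the representation of a species score as either a set of state names or the marker SCOPED_OUT are assumptions made for this example.

```python
SCOPED_OUT = "scoped-out"  # hypothetical code for the special state


def collate_genus(species_scores, suppress_partial=False):
    """Collate one character over several species descriptions.

    species_scores holds, per species, either a set of state names or the
    SCOPED_OUT marker. With suppress_partial=True the character is dropped
    (None is returned) unless it is in scope for every species.
    """
    any_scoped_out = any(score == SCOPED_OUT for score in species_scores)
    if suppress_partial and any_scoped_out:
        return None
    collated = set()
    for score in species_scores:
        if score == SCOPED_OUT:
            # Report the real reason instead of a misleading "unknown".
            collated.add(SCOPED_OUT)
        else:
            collated.update(score)
    return sorted(collated)


# "base of filaments swollen" scored in two species, scoped out in a third.
scores = [{"swollen"}, {"not swollen"}, SCOPED_OUT]

listed = collate_genus(scores)
suppressed = collate_genus(scores, suppress_partial=True)
```

In this sketch listed contains both recorded states plus "scoped-out", whereas suppressed is None because the character is not in scope for the entire range of objects being collated.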

Recommendation to application designers: In data recording, whether or not to display scoped-out characters may be optional. In the context of the identification, the scoped character should only appear once the scoped items are in the majority.

Do not need to score ("inherit from parent", "aggregate from children")

(This was not yet discussed in Brazil)

If descriptive information is not readily available, but it is reasonable to assume that the information is identical with a parent object (e. g. the family of a species currently being described), it is desirable that the assumption rather than the inherited data is recorded. Feature inheritance is often incomplete or fuzzy in biology, i. e. although most species in a family have the features considered characteristic for the family, some may not have them. The reasons for this may be incomplete knowledge about variation, or the presence of aberrant cases which are purposely ignored to keep the family description memorable. If the character state deduced from the family description were simply scored in each species, the descriptions could not be revised in the future, because it would be impossible to distinguish between knowledge about studied characters and assumptions deduced from the taxonomic hierarchy.

Two methods to record the fact of an assumption are possible:

  1. the inherited/deduced states are recorded in the description, and marked with modifiers identifying their origin from an assumption
  2. instead of states, a special state "inherit from parent" is recorded and it is the responsibility of the processor to treat the description as if the inherited states were present.

The first method has the advantage that the result of the inheritance is already "cached" in the data, making it easier to query data. This should be of secondary importance in regard to the SDD document standard, however. The advantage is balanced by the disadvantage that changes in the parent are not automatically reflected in the children.

Both methods allow inherited states to be combined with newly added ones, if the "inherit" special state can be combined with other states:

Modifier method:

States of char. 1:  1   2   3
Parent:             X   X   -
  Child:            M   M   X
    Grandchild:     M   M   M

Legend: X = scored, M = physically present state, marked by modifier as inherited


Special state method:

States of char. 1:  1   2   3  "inherit" (= special state)  
Parent:             X   X   -     -
  Child:            I   I   X     X
    Grandchild:     I   I   I     X

Legend: X = scored, I = inherited, implicitly present because "inherit" special state is scored.
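How a processor might treat a description "as if the inherited states were present" under the special state method can be sketched in Python. The function resolve, the marker INHERIT, and the two dictionaries are assumptions made for this illustration; the data reproduce the table above.

```python
INHERIT = "inherit"  # hypothetical code for the "inherit from parent" special state


def resolve(name, scored, parent_of):
    """Return the effective states of one character for the object `name`.

    States scored locally are kept. If the "inherit" special state is scored
    as well, the effective states of the parent are added recursively.
    """
    states = set(scored[name])
    if INHERIT in states:
        states.discard(INHERIT)
        parent = parent_of.get(name)
        if parent is not None:
            states |= resolve(parent, scored, parent_of)
    return states


# Data from the table: states 1 to 3 of character 1.
scored = {
    "Parent": {1, 2},
    "Child": {3, INHERIT},
    "Grandchild": {INHERIT},
}
parent_of = {"Child": "Parent", "Grandchild": "Child"}

child_states = resolve("Child", scored, parent_of)
grandchild_states = resolve("Grandchild", scored, parent_of)
```

Both child_states and grandchild_states evaluate to {1, 2, 3}. Since nothing is cached, a later change in the parent description is reflected automatically, which is the property that distinguishes this method from the modifier method.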

Open issues: Some information should be inherited from the parent, other information should not, because the character is too variable. It must also be possible to contradict the parent! If data are entered in a character, the inheritance process from the parent should probably stop, i. e. the new data reset the collection of states present in the entire character. However, is it necessary to inherit most states in a character, but contradict selected states? This seems to be impossible with the special state method!

Conclusion: None yet. My feeling is that we may need both, the special state (always inherit from parent) and the modifier method (this information was inherited). The special state method reports a general assumption that can be updated by a processor when the parent description changes; the modifier method reports the result of an active interpretation process. In the latter case information is compiled or inherited once through a method in the application, but then revised. Only the modifier method may allow adding comments/annotations to inherited states. (This needs to be discussed, since it implies that the annotations of categorical states are not inherited, but always local!)

BioLink seems to implement both options (email S. Shattuck to SDD-list, March 13, 2003).

Note that the discussion above only refers to inheritance from above, not to compilation/aggregation/collation from children, which has similar, but slightly different problems.


Additional special states supported in SDD for computed data

SDD is considering introducing computed characters (data generated automatically on the basis of terminological relationships and existing data). For this purpose the following special states can be defined in addition to the user-selectable ones:

Not supported as special state, but supported through modifiers

The following expressions are related to the issue of missing information, but differ in that they require a numeric or categorical statement to be applied (whereas special states are used instead of a state). They are supported through frequency, certainty, or other modifiers.

Special states not supported in SDD

The DELTA special state "Variable" (= "V") has been used in various ways in DELTA data sets. It is difficult to analyze what the intention of the author was. A score of "Variable" may mean that:

Regarding the first two cases: To express a known and studied polymorphism it is more appropriate to score all applicable states. This represents the knowledge most adequately and is relatively independent of changes which may occur in the terminology (e. g. further states may be added). Although it is not appropriate to output a list of all states in natural language descriptions, the wording "variable" can easily be achieved through an analysis of the descriptive data. If all states are present a natural language report can render such "saturated characters" as "variable". Saturation may be relative: in a character with 10 states it may be appropriate to render the character as "variable" already if more than 5 states are scored.

RFC: should an absolute (e. g. "> 5 states") or relative (e. g. "> 80%") saturation threshold be definable on a per-character basis? Should the wording for saturated be definable on a per-character basis, e. g. to be able to generate "wide-spread" for geographic distribution pseudo-characters?
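
To make the RFC concrete, the following sketch shows how a report generator might apply an absolute or a relative saturation threshold and a per-character wording. The function names and parameters (`is_saturated`, `render`, `abs_threshold`, `rel_threshold`, `wording`) are assumptions made for illustration. SDD does not define them.

```python
from typing import Optional


def is_saturated(n_scored: int, n_states: int,
                 abs_threshold: Optional[int] = None,
                 rel_threshold: Optional[float] = None) -> bool:
    """Decide whether a character counts as "saturated".

    With no threshold given, only a fully scored character is saturated.
    abs_threshold: saturated if more than this many states are scored.
    rel_threshold: saturated if more than this fraction of states is scored.
    """
    if n_states == 0:
        return False
    if abs_threshold is not None and n_scored > abs_threshold:
        return True
    if rel_threshold is not None and n_scored / n_states > rel_threshold:
        return True
    return n_scored == n_states


def render(scored: list, all_states: list, wording: str = "variable",
           **thresholds) -> str:
    """Render a character for a natural language report."""
    if is_saturated(len(scored), len(all_states), **thresholds):
        return wording
    return ", ".join(scored)
```

A geographic distribution pseudo-character could, for example, call `render(..., wording="wide-spread", rel_threshold=0.8)`.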

It would be possible to introduce a new "saturated" special state instead of the ambiguous "variable" state. Where all states would otherwise have to be scored, scoring the "saturated" state would be faster. However, translating it into all states is an additional burden that every analysis processor would have to bear before being able to interpret the data. Furthermore, a state (e. g. "4") may be added that was not originally present in the terminology. After this change "1/2/3" means "not 4", which was not the intention of the data set author (= general problem of schema evolution).

Regarding the third case: Using the "Scope-out" / "not to be scored" special state seems to be more appropriate.

Regarding the last case: SDD defines modifiers to express probability/uncertainty in ways that were not readily available in DELTA.

Conclusion (see Footnote 2): The DELTA state "V" is not supported in the SDD standard. Its semantics are difficult to define and its application is doubtful. This decision may be reversed if an analysis demonstrates that there is a need for this state and that its semantics can be adequately defined.

Recommendation for importing legacy data: When DELTA data sets are converted to SDD, the state V could be converted to a normal state (with a label saying "variable") which could be automatically added to the terminology by the import procedure.


Are special states exclusive?

Special states express knowledge about why data for a given character are missing in a description and thus make a statement about the entire character. This makes it likely that special states should not occur together with other special states or with categorical or numerical data within the same character in the same description. An alternative proposition is that special states are structurally similar to normal character data and it is possible to combine a special state with other states or numerical information.

If a single object is described, the special-states-exclusive model has the advantage that fewer coding errors are possible. If characters are well defined (and not assemblages of related statements like "enzyme tests"), a character cannot be scored in a single object as both having a defined state and being unknown or not applicable. However, descriptions are often descriptions of classes of objects, rather than of single objects.

Example: Two species A and B in a genus are scored as shown in the following table:

                Char. 1        Char. 2
Species A has:  leaf absent    leaf margin: not applicable (= "-")
Species B has:  leaf present   leaf margin: serrate

What is the result of a collation or aggregation method (e. g. DeltaAccess: "Summarize" method) to generate a description for this genus? It seems desirable to be able to express a mixture of special states and normal states. This is most obvious when looking at natural language reports based on a collation. For the special state "not applicable", two cases can be distinguished:

Natural language reporting could use a rule: "If 'not applicable' is the only state in a character, the output is suppressed. If 'not applicable' co-occurs with other states, it is output." It is recommended that applications allow the user to choose whether such a rule is applied or not. If a natural language report is generated for data proof-reading purposes, it may be desirable to output any "not applicable" states.
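
The collation of the two species and the reporting rule can be sketched as follows. This is a simplified illustration, not the DeltaAccess "Summarize" method: collation is assumed to be a plain set union per character, and the key "not applicable" and the function names `collate` and `report_states` are hypothetical.

```python
NOT_APPLICABLE = "not applicable"  # hypothetical key for the special state


def collate(descriptions: list) -> dict:
    """Merge per-character state sets of several descriptions by set union."""
    merged = {}
    for description in descriptions:
        for character, states in description.items():
            merged.setdefault(character, set()).update(states)
    return merged


def report_states(states: set, suppress_lone_na: bool = True) -> list:
    """Apply the rule: a lone "not applicable" is suppressed, a mixed one is output."""
    if suppress_lone_na and states == {NOT_APPLICABLE}:
        return []
    return sorted(states)


species_a = {"leaf": {"absent"}, "leaf margin": {NOT_APPLICABLE}}
species_b = {"leaf": {"present"}, "leaf margin": {"serrate"}}
genus = collate([species_a, species_b])
```

The `suppress_lone_na` switch corresponds to the user choice recommended above. For proof-reading reports it would be set to `False`.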

Note that the problem of mixed statements involving special states is not restricted to a collation of species descriptions into higher taxon descriptions, but may also occur where polymorphism exists within a species (e. g. in some insect groups some individuals may have wings whereas others are wingless).

Conclusion (see Footnote 2): Special states like "not applicable" must be treated in a way that allows them to co-occur with other special states, normal categorical states, and numerical data. They are not exclusive within a character. Special states can be treated by the same mechanisms as normal categorical states, but special rules do apply.


Should special states be implicitly present in each character definition?

Special states are primarily a generic feature, defining a common language understood by both human scientists and computer applications processing the data. However, the designer may wish to influence some aspects of the behavior of special states, viz.:

Both aspects could be defined on a project-wide level or for individual characters. Which mechanism should SDD provide?

DELTA does not define the special states in the "character list" directive; they are implicitly present in each character. Consequently, in DELTA the aspects mentioned above cannot be defined on the project or on a character-by-character level. In contrast, DeltaAccess allows the definition of availability and wording for individual characters. When a new character is created or DELTA data are imported, DeltaAccess automatically creates the full set of special states. However, the designer of the terminology can remove special states as desired, to constrain which special states can be scored in which character.

How important is an explicit definition in each character?

For the "not applicable" state this is probably unimportant or may even be undesirable. Explicit "not applicable" scores are useful to highlight overlooked issues to the designer of the terminology and it is therefore desirable to keep them enabled (see also "Relations between declarative character dependency and special states" below).

For the various unknown states it may be desirable to make a selection which cases of unfinished/unknown are supported and presented to scientists coding descriptions. The "ScopedOut" state will most frequently be excluded from certain characters by the designers of the terminology (the DELTA state "V" is not supported in SDD, otherwise it may be the one most important to exclude!).

Conclusion (see Footnote 2): Each character should make explicit references to globally defined special states. If the reference is missing, the special state is not available in a character. This simplifies the design of the system, since the same integrity rules apply to special states as to normal states.

Recommendation: When a new character is created, it should inherit the full set of special states. The designer should remove undesired special states, rather than having to add the desired ones.
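
A minimal sketch of this recommendation, assuming a hypothetical `CharacterDefinition` class and an illustrative subset of special state keys (neither is taken from the SDD schema): a new character starts with references to all global special states, and the designer removes the ones that are not wanted.

```python
from dataclasses import dataclass, field

# Illustrative subset of globally defined special states; the keys are assumptions.
GLOBAL_SPECIAL_STATES = ("not applicable", "unknown", "not yet scored", "scoped out")


@dataclass
class CharacterDefinition:
    label: str
    states: list
    # Explicit references to global special states; a new character gets all of them.
    special_states: set = field(default_factory=lambda: set(GLOBAL_SPECIAL_STATES))

    def can_score(self, key: str) -> bool:
        """Same integrity rule for normal and special states: the key must be referenced."""
        return key in self.states or key in self.special_states


leaf_margin = CharacterDefinition("leaf margin", ["entire", "serrate"])
leaf_margin.special_states.discard("scoped out")  # designer removes an undesired state
```

Since `can_score` treats both kinds of state alike, a special state whose reference has been removed is rejected in the same way as an undefined normal state.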

Implementation note: The structural decision does not limit applications in their design of the user interface; special states can be listed at the end of the normal state list, in a separate window, or enabled through an entirely different metaphor, e. g. as character properties through check boxes.


Relations between declarative character dependency and special states

(This should go later into a separate document on character dependency!)

As discussed above ("Not applicable" state) it seems desirable to support both declarative character dependency and explicit "not applicable" special states.

On the one hand, declarative character dependency has several advantages: a) it generally constrains the list of available characters in identification or data entry (thus increasing identification or data input efficiency), b) it validates data input (reducing errors), and c) it can be used to explain why something is inapplicable.

On the other hand, a) situations exist that cannot be expressed through character dependency, and b) it seems desirable to decouple the revision of the terminology (and the character dependency rules declared therein) from the recording of descriptive data. Understanding character dependency is a process that often continues during the entire project duration. Work flow management becomes considerably easier if scoring of data and refining the character dependency declarations are decoupled. The explicit "not applicable" states serve as a marker to bring necessary revisions of character dependency to the attention of the designer. This is especially important if it is not possible at all to immediately revise the terminology while entering data. This situation occurs especially in large collaborative projects, where the terminology can only be changed in a consensus process.

Providing both "not applicable" states and declarative dependency is similar to the support of free-form notes within descriptions:

Terminology defines:              Object descriptions allow in addition:
Character and state definitions   Reported notes (free-form text)
Character dependency rules        "Not applicable" special state

However, this also makes it necessary to explore the interactions between dependency rules and special states (especially the "not applicable" state). If a character is covered by a character dependency definition and existing data invoke the dependency, it is still possible that either normal or special states are present. Applications should report the presence of normal states in inapplicable characters as an error. Should special states be handled differently?

A statement that data are "unknown" or "not interpretable" clearly implies the potential presence of scorable data and therefore violates the character dependency rules just like normal states. The presence of these special states in inapplicable characters should be reported as an error.

However, explicit "not applicable" statements are redundant rather than logically inconsistent. The presence of such redundant "not applicable" states is currently possible at least in some DELTA-based applications.
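
The distinction drawn in the last three paragraphs can be sketched as a small checking routine. It assumes a flat, DELTA-like rule of the form (controlling character, controlling state, dependent character). The function name `check_inapplicable` and the severity labels are hypothetical.

```python
NOT_APPLICABLE = "not applicable"  # hypothetical special state key


def check_inapplicable(description: dict, rules: list) -> list:
    """Report states found in characters made inapplicable by a dependency rule.

    description maps character -> set of scored state keys.
    rules is a list of (controlling_character, controlling_state, dependent_character).
    Returns (severity, character, state) tuples.
    """
    findings = []
    for ctrl_char, ctrl_state, dependent in rules:
        if ctrl_state not in description.get(ctrl_char, set()):
            continue  # the rule is not invoked by the existing data
        for state in sorted(description.get(dependent, set())):
            # Explicit "not applicable" is merely redundant; anything else,
            # including "unknown", contradicts the dependency rule.
            severity = "redundant" if state == NOT_APPLICABLE else "error"
            findings.append((severity, dependent, state))
    return findings


rules = [("leaf", "absent", "leaf margin")]
desc = {"leaf": {"absent"}, "leaf margin": {NOT_APPLICABLE, "unknown", "serrate"}}
```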

Consider a character that is inapplicable through a character dependency definition and for which the controlling state is present in the description. Three options to deal with such duplication of character dependency and explicit "not applicable" states in the description are conceivable:

Conclusion (see Footnote 2): The explicit "not applicable" special state should be allowed for characters inapplicable through a character dependency definition. The rules and recommendations outlined above under "optionally permit explicit 'not applicable' scoring" should be followed.

Temporary Note 1: In the Paris schema I have added a section "CharacterDependencyRules" in the terminology and added a simple controlling state/dependent character logic. This needs further discussion, however! In particular, it is unclear whether inapplicability and applicability dependency can both be expressed as inapplicability, and whether a special mechanism to directly deduce dependency from character hierarchies should be added in addition to the DELTA-like flat model expressed here.

Temporary Note 2: The use of "inapplicable" as a taxon-specific default state and character dependency are parallel mechanisms to declare inapplicability. It is unclear at the moment whether structurally similar solutions are possible. This applies not to flat character dependency declarations, but to those dependent on some hierarchy (e. g. part hierarchy: if some part is missing, all dependent parts, and thus all characters depending on the entire tree, should be inapplicable as well). Part-hierarchy-driven character dependency is another open issue!


Responsibility for validation

Character dependency rules and the combination of explicit "not applicable" special states with other data are not validated by the SDD schema or by XSLT rules directly derived from it. The XML validator should allow ANY character state in inapplicable characters. The error may be due to either mis-scoring or an inappropriate declaration of character dependency. Requiring data to be always valid in this respect would, in the second case, require an immediate revision of the character dependency definition in the terminology, which is not always possible in collaborative projects.

However, generic XSLT code reporting violations of inapplicability in SDD data sets would be greatly appreciated.

Note: An exception to the rule that applications should report states in inapplicable characters as errors is the case of data marked with a modifier as "present by misinterpretation". In such data the states are truly absent (and are analyzed as such), but are coded to achieve some degree of error tolerance in identification. Such data will appear frequently in inapplicable characters. The exception must be made both for characters inapplicable through declarative character dependency and for characters containing "not applicable".
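
The exception can be sketched as a filter applied before states in an inapplicable character are reported. The modifier key and the function name `reportable_states` are assumptions for illustration, and each scored state is assumed to carry a set of modifier keys.

```python
MISINTERPRETATION = "present by misinterpretation"  # hypothetical modifier key


def reportable_states(scored: dict) -> list:
    """Return states in an inapplicable character that should be reported as errors.

    scored maps state key -> set of modifier keys attached to that state.
    States carrying the misinterpretation modifier are exempt from reporting.
    """
    return sorted(state for state, modifiers in scored.items()
                  if MISINTERPRETATION not in modifiers)
```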

Note: does this imply that "present by misinterpretation" must possibly be applicable to numeric characters, similar to probability? Should the set of "present by misinterpretation" modifiers be handled as "Certainty modifiers" rather than normal modifiers?


Character default states

For various purposes, but especially for the purpose of managing the special "scope-out" state, the SDD standard provides an attribute "CharacterDefaultState" for character definitions. This default is applied to each character if a new description is created, or to all descriptions if a new character definition is created.

The "CharacterDefaultState" can be set by the designer of the terminology to either a special state (missing values, scoped-in or out) or to a normal state. By default (if the designer makes no other decisions), the value of "CharacterDefaultState" is "not yet scored". [@ Schema may enforce this, needs checking in next review @] The default could also be a normal state, which is especially useful in the case of "pseudo-characters" or "management-characters" (Example of a "management character": Taxon is: 1. ready for release; 2. adequately revised but not yet to be released; 3. inadequately revised).
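
A minimal sketch of how an application might apply the default when a new description is created. The dictionary layout and the function `new_description` are assumptions, and "scoped out" stands in for whatever key the scope-out special state finally receives.

```python
NOT_YET_SCORED = "not yet scored"  # default when the designer makes no other decision


def new_description(character_defaults: dict) -> dict:
    """Create a description in which every character starts with its default state.

    character_defaults maps character -> CharacterDefaultState, or None if unset.
    """
    return {character: {default or NOT_YET_SCORED}
            for character, default in character_defaults.items()}


terminology = {
    "leaf margin": None,                        # falls back to "not yet scored"
    "revision status": "inadequately revised",  # management character, normal state
    "spore ornamentation": "scoped out",        # hypothetical key for the scope-out state
}
description = new_description(terminology)
```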

The mechanism proposed here is related to the DELTA "implicit states". However, it can only be defined for a single state per character. (Note that this is an arbitrary limitation in an attempt to simplify the design and ease the burden on implementers. It could be relaxed if sufficient arguments or data challenges are brought forward!)

The proposal goes beyond DELTA, however, in that it is planned to:

  1. extend the implicit state mechanism to special character states
  2. connect it to taxonomic groups, i. e. allow different default states for different taxonomic groups.

The DELTA "implicit states" directive is a global directive. It is very useful to improve data entry efficiency in small projects, but quickly becomes useless in larger projects, since few default assumptions hold over a larger group of taxa. Furthermore, it is not explicitly defined to operate on special states (although this may be possible in some DELTA-compatible applications).

Note that the mechanism to define taxonomic groups has not yet been decided on. This is a major hurdle in the SDD process! On the one side we agree that taxonomic hierarchy, synonymy and nomenclatural data are best handled by a different linked data area, on the other side we need hierarchy information for operations within descriptive data. This also applies to the problem where descriptive data should be inherited from above or below. We probably need an internal, coarse and operational hierarchy of taxa that are described. However, ideally the mechanism should be smart enough to detect the necessary changes if, e. g. the name of a description is changed (which can happen automatically if a descriptive data application retrieves updated specimen identifications from a collection data provider).

Conclusion: The mechanism to bind default states to a taxonomic hierarchy cannot yet be defined. Default states should probably be inherited down the taxonomic tree, similar to normal states inherited from higher taxa ("inferred from above"). This would allow the default states to be defined and redefined in the terminology for each character on a different level. Perhaps a special type of abstract description object should be introduced: objects which are structurally identical to normal descriptions, but are part of the terminology and are used to define default (= implicit) states. A discussion about taxonomic hierarchies should be a priority at one of the following meetings.
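
Since the mechanism is undecided, the following is only one conceivable reading of "inherited down the taxonomic tree": the default declared for the nearest taxon at or above the described taxon wins. The simple parent mapping, the function `default_state`, and all taxon names are hypothetical.

```python
from typing import Optional


def default_state(taxon: str, character: str, parent_of: dict, defaults: dict,
                  fallback: str = "not yet scored") -> str:
    """Find the default state declared for the nearest taxon at or above `taxon`.

    parent_of maps taxon -> parent taxon (root taxa are absent from the mapping).
    defaults maps (taxon, character) -> default state key.
    """
    current: Optional[str] = taxon
    while current is not None:
        if (current, character) in defaults:
            return defaults[(current, character)]
        current = parent_of.get(current)
    return fallback


parent_of = {"Species A": "Genus X", "Species B": "Genus X", "Genus X": "Family Y"}
defaults = {("Family Y", "wings"): "present",
            ("Species B", "wings"): "not applicable"}
```

The sketch assumes an acyclic internal hierarchy and does not address how that hierarchy is kept in step with external taxonomic data.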


XML coding for special states in the SDD model

In Brazil we discussed whether special states should be handled in the object description identically with normal states, or whether a different element name should be used. The following pseudo-code (not using the true element names, which are currently changing in the schema versions!) is intended to give an overview:

In Terminology:
StateDefinition key="x"
SpecialStateDefinition key="N/A"
  (for special states the standard may restrict the values in key!)
CharacterDefinition
  (here a selection of global states can be made, defining which special states are enabled for this character)

In Description:
  for each character a list of states is given, which could be either:

<State keyref="x"><Modifier ... /></State>
<SpecialState keyref="N/A" />

or:

<State keyref="x"><Modifier ... /></State>
<State keyref="N/A" />

Conclusion (see Footnote 2): In the object description, handle state and special state references in a structurally analogous way and use the same element name for the references.

Note that the treatment of special states as normal states in the SDD character definition does not preclude editor applications from presenting special states as character properties separately from the normal character states. No user interface recommendation to handle special states among the list of normal states is intended.

However: Note that special states may never be modified by any modifiers (e. g. frequency or probability modifiers)! If this is to be included in the SDD schema, it may be better to use a different element type for references to special states.
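
If the same element name is used for both kinds of reference, the restriction would have to be checked outside the element structure. The sketch below uses Python's standard `xml.etree.ElementTree` on a pseudo-SDD fragment. The element names follow the pseudo-code above rather than the real schema, and the `Character` wrapper, the `keyref` attribute on `Modifier`, and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

SPECIAL_KEYS = {"N/A"}  # keys of special states, as in the pseudo-code above

# Pseudo-SDD fragment; the second State illegally attaches a modifier to "N/A".
FRAGMENT = """
<Character keyref="leaf margin">
  <State keyref="x"><Modifier keyref="rarely" /></State>
  <State keyref="N/A"><Modifier keyref="rarely" /></State>
</Character>
"""


def modified_special_states(xml_text: str) -> list:
    """Return keys of special states that illegally carry a Modifier child."""
    root = ET.fromstring(xml_text)
    return [state.get("keyref") for state in root.iter("State")
            if state.get("keyref") in SPECIAL_KEYS
            and state.find("Modifier") is not None]
```

With a separate element type for special state references, the schema itself could forbid the `Modifier` child and such a check would be unnecessary.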


Open issues

Introduce a "TODO" state?

Is an observation of the type "this character does not occur here!" a special case of "not applicable"?

Need for "" = empty string (applies only to text, not sure whether makes sense here!)

Are there differences between MissingValues and NullValues (Liz Kolster defines ZeroValues as well ...)? Applicability to character or states?


Footnotes

Footnote 1: Note that Null values are also supported in the xml SOAP 1.1 protocol: "A NULL value or a default value may be represented by omission of the accessor element. A NULL value may also be indicated by an accessor element containing the attribute xsi:null with value '1' or possibly other application-dependent attributes and values." and "5.5 Default Values: An omitted accessor element implies either a default value or that no value is known. The specifics depend on the accessor, method, and its context. For example, an omitted accessor typically implies a Null value for polymorphic accessors (with the exact meaning of Null accessor-dependent). Likewise, an omitted Boolean accessor typically implies either a False value or that no value is known, and an omitted numeric accessor typically implies either that the value is zero or that no value is known."

Footnote 2: These conclusions were agreed upon on 17. Oct. 2002 at the TDWG-SDD meeting in Brazil. However, any conclusion at the current state is open to revision and new discussions!


Request for discussion

Please send your criticism or suggestions to the SDD mailing list or to the author.

Gregor Hagedorn; Vers. 1.1; 13. March 2003




First published 2003-03-07, last update: 2003-03-23.
