TDWG working group: Structure of Descriptive Data (SDD)
The state of a character in an object description can be known, unknown, or uncertainly known. Dallwitz 2000 ("A Comparison of Formats for Descriptive Data") describes uncertain knowledge as "guessed values". An example of a certainty statement is "probably winged" (i.e. believed to be winged, although this may be incorrect). The uncertainty can arise because the scientist creating the description:
In general, uncertainty arises whenever the observation process is flawed. The SDD descriptive data standard should provide a means to express such doubt in an analytically traceable way.
Recording the certainty of knowledge is especially valuable for preserving doubt when scoring specimen data. These doubt indicators may later be removed once an adequate sample of specimens has been studied.
The probability that a statement is generally true in all individuals of a taxon must be distinguished from the conditional probability that a statement applies to a given individual (i.e. the likelihood that an individual with a given description is encountered). The latter situation can arise in the case of polymorphic species, or in descriptions of higher-level taxa (e.g. genera with species that have different states). In the SDD proposal this is expressed through a frequency modifier statement. Certainty modifiers correspond to a Bayesian concept of probability ("Bayesianism" supports the use of degrees of belief as a basis for statistical practice), frequency modifiers to a relative frequency concept of probability ("frequentism", the "standard" logic used in statistical hypothesis testing).
Note that the statement "uncertain", applied to a character without any further state information, is equivalent to "unknown" (see "Coding Status").
Modifiers should be provided that express the probability that a statement is true. This is called the certainty of the statement. Examples are:
Certainty modifiers apply to states in an object description ("probably winged but uncertain"), not to characters as a whole ("uncertain flower color"). The latter situation is considered identical to a state of "unknown", which is covered by the coding status (= "missing data indicator", for example "unknown" or "not interpretable"). Coding status values always have the scope of an entire character.
Note that the above list is freely extensible. The modifier definition should provide an attribute that allows a probability estimate to be defined. The proposed name for this attribute is ProbabilityEstimate, ranging from 0 to 1, with a default of 1. For example, searching for modifiers with ProbabilityEstimate < 1 finds all statements that are expressed without full confidence.
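The search described above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the SDD proposal: the class names, the function name, and the numeric estimate 0.8 for "probably" are hypothetical; only the attribute ProbabilityEstimate, its range of 0 to 1, and its default of 1 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertaintyModifier:
    """A modifier definition carrying a ProbabilityEstimate attribute."""
    label: str
    probability_estimate: float = 1.0  # default 1 = full confidence

    def __post_init__(self):
        # ProbabilityEstimate is defined as ranging from 0 to 1.
        if not 0.0 <= self.probability_estimate <= 1.0:
            raise ValueError("ProbabilityEstimate must lie between 0 and 1")


@dataclass(frozen=True)
class StateStatement:
    """One scored state of a character, qualified by a certainty modifier."""
    character: str
    state: str
    modifier: CertaintyModifier


def statements_without_full_confidence(statements):
    """Return the statements whose modifier has ProbabilityEstimate < 1."""
    return [s for s in statements if s.modifier.probability_estimate < 1.0]


# Hypothetical modifier definitions; the numeric estimate is illustrative only.
CERTAIN = CertaintyModifier("certainly")
PROBABLY = CertaintyModifier("probably", 0.8)

description = [
    StateStatement("wings", "winged", PROBABLY),
    StateStatement("leg count", "six", CERTAIN),
]

for statement in statements_without_full_confidence(description):
    print(statement.character, statement.state, statement.modifier.label)
```

Because the estimate lives in the modifier definition rather than in each statement, a newly defined modifier takes part in such searches without any change to the search code.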
Misinterpretation modifiers are a special case of certainty modifiers. They may be handled as "certainly not (but true by misinterpretation)". A special boolean attribute IsTrueByMisinterpretation in the modifier definition signals this case; in addition, the ProbabilityRange should be set to 0..0. The advantage of handling misinterpretation modifiers together with certainty modifiers is that software designed to handle certainty ranges will automatically produce correct analysis results.
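The following Python sketch illustrates why a range of 0..0 lets range-aware software treat a misinterpretation modifier correctly without special handling. It is an assumption-laden illustration: the class and function names are hypothetical, the range is modelled as a (low, high) pair, and the second function shows only one conceivable use of the IsTrueByMisinterpretation flag (accepting the state during identification), which the proposal itself does not specify.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModifierDefinition:
    """Hypothetical modifier definition with a certainty range (low, high)."""
    label: str
    probability_range: Tuple[float, float] = (1.0, 1.0)
    is_true_by_misinterpretation: bool = False

    def __post_init__(self):
        low, high = self.probability_range
        if not 0.0 <= low <= high <= 1.0:
            raise ValueError("range must satisfy 0 <= low <= high <= 1")
        # A misinterpretation modifier means "certainly not", i.e. range 0..0.
        if self.is_true_by_misinterpretation and (low, high) != (0.0, 0.0):
            raise ValueError("misinterpretation modifiers require range 0..0")


def state_may_be_true(modifier: ModifierDefinition) -> bool:
    """Range-aware analysis: a state is possible only if the upper bound > 0."""
    return modifier.probability_range[1] > 0.0


def accepted_in_identification(modifier: ModifierDefinition) -> bool:
    """Assumed use of the flag: also accept states scored by misinterpretation."""
    return state_may_be_true(modifier) or modifier.is_true_by_misinterpretation


# Hypothetical labels and ranges, for illustration only.
probably = ModifierDefinition("probably", (0.6, 0.9))
misinterpreted = ModifierDefinition("by misinterpretation", (0.0, 0.0), True)

print(state_may_be_true(probably), state_may_be_true(misinterpreted))
print(accepted_in_identification(misinterpreted))
```

Software that inspects only the range never needs to know about the flag; software that cares about misinterpretation can consult it separately.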
Uncertainty modifiers express "uncertainty estimates", as opposed to the "uncertainty measurements" that may be obtained using statistical measures. In principle, uncertainty modifiers should thus apply only to values (= individual observations) and not to statistical measures. In practice, however, the application of statistical measures may be flawed, and it is desirable to express this doubt by applying an uncertainty modifier to a statistical measure such as a mean. Thus, in contrast to frequency modifiers, certainty modifiers are applicable both to numerical values and to statistical measures. They are not applicable to free-form text states.
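The applicability rule for certainty modifiers can be written down as a small validation check. This is a sketch only: the enumeration and its member names are hypothetical and chosen for illustration; the rule itself (states, numerical values, and statistical measures are allowed, free-form text is not) is taken from the text.

```python
from enum import Enum


class DataKind(Enum):
    """Hypothetical classification of what a modifier is attached to."""
    CATEGORICAL_STATE = "categorical state"
    NUMERICAL_VALUE = "numerical value"
    STATISTICAL_MEASURE = "statistical measure"
    FREE_TEXT = "free-form text"


# Certainty modifiers apply to states, values, and statistical measures,
# but not to free-form text states.
CERTAINTY_MODIFIER_TARGETS = frozenset({
    DataKind.CATEGORICAL_STATE,
    DataKind.NUMERICAL_VALUE,
    DataKind.STATISTICAL_MEASURE,
})


def accepts_certainty_modifier(kind: DataKind) -> bool:
    """Return True if a certainty modifier may be attached to this kind of data."""
    return kind in CERTAINTY_MODIFIER_TARGETS


for kind in DataKind:
    print(kind.value, accepts_certainty_modifier(kind))
```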
Is there any perceived need to allow, as an alternative, the direct expression of exact certainty values for statements in the description? Note that frequency modifiers provide such a direct mechanism. Please provide a detailed example showing which methodology produces such statements and how they are commonly reported.
Numerical values may be estimated. No probability needs to be given if they can be expressed as a range whose limits are reliable (30-100). However, are special probability modifiers necessary for "about 100" or "approximately 10"? Is this case comparable with "probably 10", or does it rather imply an unstated margin of error? See the separate document SDD proposal: Approximation modifiers for further discussion.
Please send your criticism or suggestions to the SDD mailing list or to the author.
Gregor Hagedorn; Vers. 2.2; 14. May 2004
Earlier versions: Version 1 (2002)