TDWG working group: Structure of Descriptive Data (SDD)
Occasionally it has been argued that free-form text is unnecessary for the new SDD standard. Although free-form text may be problematic or irrelevant in many analysis situations (including phylogenetic analysis and interactive identification), it is a valuable feature in many others. This document argues why free-form text should be supported and analyzes which free-form data elements are required. For a quick look at the results of the analysis, please go to "Proposal" at the end.
Note: This discussion is only concerned with free-form text that may be present in the object descriptions. Many additional label and wording elements are available in the section concerned with the definition of terminology.
Traditionally, all organism descriptions in biology have been prepared as free-form text. Most authors have realized the importance of consistent and well-defined descriptive terminology and have strived to create consistent and complete descriptions in free-form text. This is, however, difficult to achieve. Unsurprisingly, attempts to convert existing descriptions into structured databases usually reveal many shortcomings of current biodiversity descriptions.
The methods of numerical taxonomy, including distance clustering and phylogenetic methods (e. g. parsimony analysis), were important attempts to improve the objectivity and consistency of biodiversity descriptions. The great achievement of the DELTA format was to combine categorical information (nominal and ordinal scale) and numerical information (cardinal and interval scale) with free-format text elements. This made DELTA usable for general-purpose descriptions as well as for identification and analysis applications.
Some free-form text elements present in DELTA should be reconsidered in the new SDD standard. To improve interactive identification, data analysis, and translation into multiple languages, elements like frequency modifiers or comments expressing certainty should be moved from free-form text to structured data. However, the general arguments for the necessity of free-form text remain valid, and free-form text elements should be included in the new standard.
Textual data elements in descriptions:
Each of these textual data elements requires a separate discussion.
The need for free-form text characters arises in the following situations:
Although free-form text characters are ultimately the best choice in relatively few situations, their presence greatly enhances the flexibility of applications. A standard that requires every content author to define terminology rigorously and laboriously is far more likely to be accepted if free-form text characters (and also user-definable state extensions, see below) allow a strict terminology to be introduced gradually, rather than forcing the content author to solve terminological problems before starting to enter data.
The text in a text-character may be:
More information about the proposal to introduce basic formatting markup into the SDD standard interchange format is available in the document "Formatted free-form text".
It is often difficult to capture all categories of a property (e. g. "shape") in the terminology before starting to work on the descriptions, because state categories may be rare ("hamate shape") or applicable only to a few organisms. Many existing DELTA-based descriptive data applications (e. g. LucID, Pankey/Pandora, the CSIRO DELTA programs, or DeltaAccess) allow the dynamic addition of character states while working on object descriptions. This enables the content and terminology author to identify a problem, refactor the terminology, and continue data entry work.
This approach, however, will often be impossible in collaborative working situations. A document-based system requires that only a single person is working on a project at the time of state-reorganization. A database-based system may allow reorganizations of the terminology as long as all users work on the same central database. Still, if distributed copies or database replication are used, changing the terminology becomes increasingly difficult. Furthermore, in large collaborative projects, it may be desirable to centralize the responsibility for changing the terminology, or changes must be agreed upon after a discussion process.
Consequently, a free-form text "category" as a catch-all remainder is an important feature. Such a category ("Other") is frequently encountered in sociological or market research questionnaires and serves to supply rare or unexpected information:
Profession:
Farmer
Teacher
...
Nuclear Physicist
Other (please specify): [___________________]
Such a feature can act as an important feedback mechanism about categories that may have been overlooked during the definition of the terminology. Information present in the "Other" free-form text category can be analyzed and migrated to new categories if the terminology is expanded.
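The feedback and migration step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `CategoricalCharacter`, its fields, and the method names `record` and `promote` are hypothetical and are not part of DELTA or of any SDD proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CategoricalCharacter:
    """Hypothetical character with defined states plus an "Other" text remainder."""
    name: str
    states: List[str]
    # Object name -> state chosen from the defined state list.
    coded: Dict[str, str] = field(default_factory=dict)
    # Object name -> free-form text entered under "Other".
    other: Dict[str, str] = field(default_factory=dict)

    def record(self, obj: str, value: str) -> None:
        """Store a defined state as coded data, anything else as "Other" text."""
        if value in self.states:
            self.coded[obj] = value
        else:
            self.other[obj] = value

    def promote(self, text: str) -> int:
        """Make an "Other" text a regular state and migrate matching objects.

        Returns the number of object descriptions that were migrated.
        """
        if text not in self.states:
            self.states.append(text)
        moved = [obj for obj, value in self.other.items() if value == text]
        for obj in moved:
            self.coded[obj] = text
            del self.other[obj]
        return len(moved)
```

A terminology author could, for example, call `record("Species B", "hamate")` during data entry and later call `promote("hamate")` once the state has been agreed upon. Only exact text matches are migrated here; real "Other" entries would usually need manual review first.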
This mixture of structured states and free-form text is inherently present in the DELTA format, although it is not well documented. Any categorical (UM/OM) or numerical (IN/RN) character can always be used as a text character. If character 2 is TE (text) and character 3 is UM (nominal categorical), the following can be coded:
| 2,<free-form text> | |
| 3,<free-form text> | (here the text replaces a categorical state) |
| 3,4<free-form text> | (here the text is a comment on the first state) |
| 3,4/<free-form text> | (here the text is a second state) |
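To make the three coding forms above concrete, the following Python sketch separates the state codes from the free-form text. It is a simplified illustration, not a DELTA parser: it assumes a single attribute of the form "character number, comma, values", and it ignores ranges, "&" combinations, and all other DELTA syntax. The names `Value` and `parse_attribute` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# "<character number>,<values>", e.g. "3,4/<free-form text>".
ATTRIBUTE = re.compile(r"^(\d+),(.+)$")
# One alternative: an optional state number followed by optional <text>.
ALTERNATIVE = re.compile(r"^(\d+)?(?:<(.*)>)?$")


@dataclass
class Value:
    state: Optional[int]  # coded state number, or None if only text is given
    text: Optional[str]   # free-form text, or None if absent


def split_alternatives(values: str) -> List[str]:
    """Split on "/" but leave slashes inside <...> text untouched."""
    parts, current, depth = [], [], 0
    for ch in values:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "/" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_attribute(attribute: str) -> Tuple[int, List[Value]]:
    """Return the character number and its list of state/text alternatives."""
    match = ATTRIBUTE.match(attribute.strip())
    if match is None:
        raise ValueError(f"not a character attribute: {attribute!r}")
    values = []
    for part in split_alternatives(match.group(2)):
        alt = ALTERNATIVE.match(part)
        if not part or alt is None:
            raise ValueError(f"unsupported value: {part!r}")
        state = int(alt.group(1)) if alt.group(1) else None
        values.append(Value(state, alt.group(2)))
    return int(match.group(1)), values
```

In this sketch, a value with text but no state number corresponds to text replacing a state, a value with both corresponds to a comment attached to that state, and two values separated by "/" correspond to a coded state plus a separate text state.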
In this case the basic information is already captured in a structured format and only additions or annotations are added as free-form text.
When creating natural language reports, the "wording" or "phrasing" defined in the general terminology and the state-specific annotations within the description are combined to produce the report. An important issue is how closely these two are expected to work together.
The BioLink application proposes free-form text elements that can be inserted before or after the state code (as "textBefore" and "textAfter" attributes; see note 1). These elements have no semantic definition other than that they work hand in hand with the wording defined in the terminology for characters and states. This produces several problems. Firstly, it makes the creation of multilingual data sets very difficult, because data cannot be published without translating every single annotation. This is not the case if the annotations are output in a separate place, where it would be acceptable to have a readable Spanish description with occasional English annotations in brackets. Secondly, and most importantly, the task of expanding and improving the character terminology is made almost impossible. Any change in the labeling or wording of characters and states requires a reassessment and revision of every object description that uses this information in combination with a "textBefore"/"textAfter" annotation.
This is not the case if annotations and terminology are decoupled. Annotations should be considered independent data items that have to be understandable without the immediate context of the wording chosen in the terminology. For this reason, "textBefore"/"textAfter" elements should not be supported.
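The decoupling argued for above can be illustrated with a small Python sketch of report generation. Everything in it is an assumption made for illustration: the `WORDING` table, the term identifiers, and the `render` function are hypothetical and are not taken from BioLink, DELTA, or an SDD schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical terminology: wording per (term identifier, language).
WORDING = {
    ("petal_colour", "en"): "petals",
    ("petal_colour", "es"): "pétalos",
    ("red", "en"): "red",
    ("red", "es"): "rojos",
}


@dataclass
class StateValue:
    character: str                    # term identifier of the character
    state: str                        # term identifier of the state
    annotation: Optional[str] = None  # independent free-form annotation


def render(value: StateValue, language: str) -> str:
    """Build a report phrase; the annotation is appended separately in brackets."""
    phrase = (f"{WORDING[(value.character, language)]}: "
              f"{WORDING[(value.state, language)]}")
    if value.annotation:
        # The annotation is not woven into the wording, so it stays
        # understandable even if it is not translated.
        phrase += f" [{value.annotation}]"
    return phrase


value = StateValue("petal_colour", "red", "fading to pink in old flowers")
print(render(value, "es"))  # pétalos: rojos [fading to pink in old flowers]
```

Because the annotation never depends on the surrounding wording, the entries in `WORDING` can be revised or translated without revisiting any object description.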
1 The BioLink document "Enhanced Item Descriptions" (publ. 2001) states that "the restrictions on the placement of comments imposed by DELTA will be relaxed so that character values and comments can be intermixed as required". However, the preferred model is currently only slightly relaxed, using fixed "textBefore" and "textAfter" attributes (S. Shattuck, pers. comm. 2002).
No need for language-unspecific state annotations could be identified. It is therefore proposed that state annotations are always language specific.
Furthermore, content authors frequently request basic text formatting in annotations. It is proposed that state annotations provide limited text formatting; see the separate document "Basic text formatting". It may be desirable to allow terminology designers to limit this ability in user interfaces. This could be defined at the project, character, or state level. Currently, SDD does not propose such an option.
In contrast to DELTA and especially the DELTA-2 proposal, the following text features should be rejected:
This topic has not yet been sufficiently explored in the discussions. It strongly depends on the ability to validate information against external data sources or to pick it from such sources, e. g. lists of host names, pathogens, etc. Please do raise the subject in the discussions!
Text elements in DELTA / DELTA-2 descriptions:
| Character type | Descriptive data |
| Text character | [Comment] A text character may contain any [Comment] combination of text and [Comment]s |
| Categorical character | |
| Numerical character | |
Note that an explicit free-form text category for "Other" is not available. In categorical characters it can be simulated through an "other" category, but DELTA applications would treat this as analytically meaningful (e. g. in a key, the user could be asked: "is it 'other' or not?").
Text elements in SDD descriptions:
| Character type | Modifiers | State | Numerical data | Reported Note | Internal Note |
| Text character | - | (text state) | - | Free-form text may contain limited formatting, e. g. superscript, subscript, or italics. | Internal Note |
| Categorical character | usually | red | - | Reported Note | Internal Note |
| | rarely | green | - | Reported Note | Internal Note |
| | - | "Other:" (text state) | - | User-supplied category | Internal Note |
| Numerical character | | (minimum) | 1 | Reported Note | Internal Note |
| | | (lower range) | 2 | Reported Note | Internal Note |
| | | (mean) | 5 | Reported Note | Internal Note |
| | | (upper range) | 7 | Reported Note | Internal Note |
| | probably | (maximum) | 12 | Reported Note | Internal Note |
| | - | "Other:" (text state) | - | User-supplied statistic | Internal Note |
Any element may have an optional reported note, perhaps with the exception of coding status values such as "unknown" or "not applicable" (also called "missing data indicators" or "special states"); this needs to be discussed! Any element, including coding status values, may have an optional internal note. "-" indicates that the column is not applicable to the state.
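The columns of the table above can be read as the fields of one record per coded element. The following Python sketch shows one possible reading. The class `DescriptionElement`, its field names, and the `report` function are hypothetical and do not represent an SDD schema, and the note texts in the usage example are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DescriptionElement:
    state: str                           # e.g. "red", "maximum", or an "Other:" text
    modifier: Optional[str] = None       # e.g. "usually", "probably"
    number: Optional[float] = None       # numerical data, where applicable
    reported_note: Optional[str] = None  # included in generated reports
    internal_note: Optional[str] = None  # kept for authors, never reported


def report(element: DescriptionElement) -> str:
    """Render one element; the internal note is deliberately left out."""
    parts = [p for p in (element.modifier, element.state) if p]
    if element.number is not None:
        parts.append(f"{element.number:g}")
    text = " ".join(parts)
    if element.reported_note:
        text += f" ({element.reported_note})"
    return text


maximum = DescriptionElement(
    state="maximum",
    modifier="probably",
    number=12,
    reported_note="single specimen",        # invented example note
    internal_note="re-measure next season",  # invented example note
)
print(report(maximum))  # probably maximum 12 (single specimen)
```

Keeping both notes as optional fields on every element mirrors the rule that any element may carry them, while only the reported note reaches the output.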
The following constraints are considered desirable to simplify the structural model:
This proposal is designed to handle free-form text with a minimum of structural complexity in three positions: a) as the single child of a character, b) as a sister of other child elements of a character, and c) as a child element of states.
Please send your criticism or suggestions to the SDD mailing list or to the author.
Gregor Hagedorn; Vers. 2; 10. March 2003
Earlier versions: Version 1