SDD proposal: Free-form text data elements

TDWG working group: Structure of Descriptive Data (SDD)
See also the latest version

Introduction

Occasionally it has been argued that free-form text is unnecessary for the new SDD standard. Although free-form text may be irrelevant in many analysis situations (including phylogenetic analysis and interactive identifications), I believe it is still a valuable feature. This document attempts to provide arguments why free-form text should be supported and analyses which free-form data elements are required.

Free-form text in descriptions

Traditionally, all organism descriptions in biology have been prepared as free-form text. Most authors have realized the importance of consistent and well-defined descriptive terminology and have striven to create consistent and complete descriptions in free-form text. This is, however, difficult to achieve. Unsurprisingly, attempts to convert existing descriptions into structured databases usually reveal many shortcomings of current biodiversity descriptions.

Numerical taxonomy, including distance clustering and phylogenetic methods (e. g. parsimony analysis), was an important attempt to improve the objectivity and consistency of biodiversity descriptions. The great achievement of the DELTA format was to combine categorical information (nominal and ordinal scale) and numerical information (cardinal and interval scale) with free-format text elements. This made DELTA usable for general purpose descriptions as well as for identification and analysis applications.

Some free-form text elements present in DELTA should be reconsidered in the new SDD standard. To improve interactive identification, data analysis, and translation into multiple languages, elements like frequency modifiers or comments expressing probabilities/uncertainty should be moved from free-form text to structured data. However, the general arguments for the necessity of free-form text remain valid, and such elements should be included in a new standard.

Overview of textual data elements

(The following list considers only data elements which are present in the item descriptions. Many additional name and wording elements are present in the section concerned with the definition of terminology.)

Each of these textual data elements requires a separate discussion.

Free-form text as the only content of a character

The need for free-form text characters arises in the following situations:

Although free-form text characters are ultimately the best choice in only relatively few situations, the presence of free-form text elements greatly enhances the flexibility of applications. A standard that requires every content author to define terminology rigorously and laboriously will be accepted much more readily if free-form text characters (and also "additional-states-elements", see below) allow the gradual introduction of a strict terminology, rather than forcing the content author to solve problematic terminology before starting to enter data.

The text in a text-character may be:

More information about the proposal to introduce basic formatting markup into the SDD standard interchange format is available in the document "Character-formatted free-form text".

Free-form text as a user-definable extension to the defined categories

In many cases it is difficult to capture all categories of a property (e. g. "shape"), since some state categories may be rare ("hamate shape") or applicable to only a few organisms. Many existing DELTA-based descriptive data applications (e. g. Pankey/Pandora, the CSIRO DELTA programs, or DeltaAccess) allow the more or less dynamic addition or insertion of character states while working on item descriptions. This enables the content and terminology author to identify a problem, refactor the terminology, and continue data entry work.

This approach, however, will often be impossible in collaborative working situations. A document-based system requires that only a single person is working on a project at the time of state-reorganization. A database-based system may allow reorganizations of the terminology as long as all users work on the same central database. Still, if distributed copies or database replication is used, changing the terminology becomes increasingly difficult.

Furthermore, in large collaborative projects, it may be desirable to centralize the responsibility for changing the terminology, or changes must be agreed upon after a discussion process.

A frequently requested feature is therefore a free-form text "category" in addition to the defined categories. This approach is frequently encountered in sociological or market research questionnaires, where it allows respondents to supply rare or unexpected information:
   Profession:
   Farmer
   Teacher
   ...
   Other (please specify): [___ free-form text ___]

Such a feature can act as an important feed-back mechanism about categories that may have been overlooked during the definition of the terminology. Information present in the "Other" free-form text category can be analyzed, and if the terminology is expanded, migrated to new categories.
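
The feedback loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `OTHER` label, the `(category, text)` tuple layout, and both function names are hypothetical and not part of DELTA or of the SDD proposal.

```python
from collections import Counter

OTHER = "Other"  # hypothetical label for the free-form text "category"


def frequent_other_texts(observations, min_count=2):
    """List normalized "Other" texts that occur at least min_count times."""
    counts = Counter(
        text.strip().lower()
        for category, text in observations
        if category == OTHER and text and text.strip()
    )
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


def migrate(observations, new_category):
    """Move matching "Other" texts into a newly defined category."""
    key = new_category.strip().lower()
    migrated = []
    for category, text in observations:
        if category == OTHER and text and text.strip().lower() == key:
            migrated.append((new_category, None))  # now a defined category
        else:
            migrated.append((category, text))      # left unchanged
    return migrated


# Hypothetical shape observations as (category, free-form text) pairs.
observations = [
    ("round", None),
    (OTHER, "hamate"),
    (OTHER, " Hamate "),
    (OTHER, "irregular"),
    ("ovate", None),
]
candidates = frequent_other_texts(observations)
updated = migrate(observations, "hamate")
```

Here "hamate" would surface as a candidate for a new defined state, while the single "irregular" entry would stay in the free-form text category until the terminology is revised again.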

This mixture is inherently present in the DELTA format, although it is not well documented. Any categorical (UM/OM) or numerical (IN/RN) character can always be used as a text character. If character 2 is TE (text) and character 3 is UM (nominal categorical), the following can be coded:

2,<free form text>  
3,<free form text>   (here the text replaces a categorical state)
3,4<free form text>   (here the text is a comment on the first state)
3,4/<free form text>   (here the text is a second state)
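
To make the four forms above concrete, the following Python sketch splits such an attribute into its character number and its "/"-separated parts. It is a simplified illustration that covers only the forms shown here and assumes comments are not nested; it is not a full DELTA parser, and the names `Attribute`, `parse_attribute`, and `describe` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

Part = Tuple[Optional[str], Optional[str]]  # (state value, free-form text)


class Attribute(NamedTuple):
    character: int
    parts: List[Part]


# Split on "/" only when it lies outside an angle-bracket comment.
_SEPARATOR = re.compile(r"/(?![^<]*>)")
# A part is an optional state value followed by an optional <comment>.
_PART = re.compile(r"(?P<value>[^<>/]*)(?:<(?P<text>[^<>]*)>)?")


def parse_attribute(attribute: str) -> Attribute:
    """Parse e.g. '3,4/<free form text>' into character number and parts."""
    number, _, rest = attribute.partition(",")
    parts: List[Part] = []
    for raw in _SEPARATOR.split(rest.strip()):
        match = _PART.fullmatch(raw.strip())
        if match is None:
            raise ValueError(f"unsupported attribute part: {raw!r}")
        value = match.group("value").strip() or None
        parts.append((value, match.group("text")))
    return Attribute(int(number), parts)


def describe(part: Part) -> str:
    """Classify a part in the terms used in the examples above."""
    value, text = part
    if value and text is not None:
        return f"state {value} with comment"
    if text is not None:
        return "text in place of a state"
    return f"state {value}"
```

Under these assumptions, `parse_attribute("3,4/<free form text>")` yields two parts, the state value `4` and a separate text part, which is what distinguishes it from the comment form `3,4<free form text>`.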

Free-form text as annotation of an item-specific character state observation

In this case the basic information is already captured in a structured format and only additions or annotations are added as free-form text.

When creating natural language reports, the attributes defined in the general terminology and the item-and-state-specific annotations are combined to produce the report. However, I believe that these two parts should remain decoupled. The BioLink project proposes to make similar elements work closely hand-in-hand with free-form text that can be inserted before or after the state code (as "textBefore" and "textAfter" attributes). This causes several problems. Firstly, it makes multilingual data sets very difficult to maintain, because data cannot be published without translating every single annotation. This is not the case if the annotations are output in a separate place, where it would be acceptable to have a readable Spanish description with occasional English annotations in brackets. Secondly, and most importantly, expanding and improving the character terminology becomes extremely difficult, since any change in the labeling or wording of characters and states requires a reassessment and revision of every item record that uses this information in combination with a "textBefore"/"textAfter" annotation. Again, this is not the case if annotations are designed as independent data items that have to be independently understandable.
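
The decoupling argued for here can be illustrated with a small Python sketch. It assumes a hypothetical `WORDING` table holding per-language terminology and a hypothetical `render` function; the Spanish wording is invented for illustration. The annotation is appended in brackets as an independent item and is never woven into the state wording.

```python
# Hypothetical per-language wording taken from the terminology definition.
WORDING = {
    "en": {"leaf shape": "Leaf shape", "ovate": "ovate"},
    "es": {"leaf shape": "Forma de la hoja", "ovate": "ovada"},
}


def render(character, state, annotation=None, language="en"):
    """Build one report sentence; the annotation stays a separate unit."""
    words = WORDING[language]
    sentence = f"{words[character]}: {words[state]}"
    if annotation:
        # Output untranslated, in brackets, after the structured part.
        sentence += f" [{annotation}]"
    return sentence + "."


spanish = render("leaf shape", "ovate", "rarely almost round", language="es")
english = render("leaf shape", "ovate", "rarely almost round")
```

Because the annotation never depends on the wording of the state, changing an entry in `WORDING` requires no revision of the item records.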

No need for language-unspecific state annotations could be identified. It is therefore proposed that state annotations are always language specific. Furthermore, many content authors request character formatting for this element. It is therefore proposed that state annotations always enable character formatting. If this is considered undesirable, an option could be implemented either for the entire project or separately for each character. An item- and character-specific option would result in no additional data consistency, since this would be an individual definition rather than a definition relating to a class of objects.

In contrast to DELTA and especially the DELTA-2 proposal, the following text features should be rejected:

Proposal

  1. The structure of "text states" should be analogous to special states expressing that a character is inapplicable or unknown.
  2. A single "text state" can be added to the definition of categorical or numerical characters.
  3. A "text character" is a character with only the text state defined.
  4. A "text state" may have a label and wording defined in the terminology (for all items together).
  5. Each state used in an item description may have an annotation that is publicly visible (ReportedNote).
  6. Each state used in an item description may have an additional free-form annotation that is visible only to the authors and editors of the data set (InternalNote).
  7. Text and annotations may contain limited XHTML markup as specified in "Character-formatted free-form text".

The following constraints are considered desirable to simplify the structural model:

  1. Two or more text states per character are not supported.
  2. A text state provides no annotation in addition to the text; the annotation and the text are identical. This simplifies the model, allowing any free-form reportable text information to reside in the same data element.
  3. Annotations applicable to an entire character within an item are not supported. All item description annotations occur on the state level. If a reported or internal note is required for the entire character rather than for specific states, a text character state would have to be added to the character. (Note, to avoid misunderstanding: this says nothing about "character notes" in the character definition, which apply to all items together.)
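
One possible reading of the proposal and its constraints is sketched below as Python data classes. This is an assumption-laden illustration, not a normative model: the class and field names are hypothetical (only ReportedNote and InternalNote come from the proposal), numerical characters are left out, and a state of `None` is used here to stand for the text state.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CharacterDefinition:
    name: str
    states: List[str] = field(default_factory=list)  # defined categorical states
    has_text_state: bool = False            # a flag, so at most one text state
    text_state_label: Optional[str] = None  # label/wording from the terminology

    @property
    def is_text_character(self) -> bool:
        # A text character has only the text state defined.
        return self.has_text_state and not self.states


@dataclass
class StateObservation:
    character: CharacterDefinition
    state: Optional[str] = None          # None selects the text state
    reported_note: Optional[str] = None  # publicly visible (ReportedNote)
    internal_note: Optional[str] = None  # authors and editors only (InternalNote)

    def __post_init__(self):
        if self.state is None:
            if not self.character.has_text_state:
                raise ValueError(f"{self.character.name!r} has no text state")
        elif self.state not in self.character.states:
            raise ValueError(f"unknown state {self.state!r}")


shape = CharacterDefinition("shape", ["round", "ovate"], has_text_state=True)
# For the text state, the free-form text itself is the reported note.
rare = StateObservation(shape, reported_note="hamate")
usual = StateObservation(shape, "ovate", internal_note="check specimen label")
```

In this sketch, the text of a text state lives in `reported_note`, so text and annotation share one data element, and a note on the whole character can only be expressed by using the character's text state.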

Request for discussion

Please send your criticism or suggestions to the SDD mailing list or to the author.

Gregor Hagedorn; Vers. 1; 12. Nov. 2002




First published 2002-04-25, last update: 2002-11-12.
