SDD proposal: Free-form text data elements

TDWG working group: Structure of Descriptive Data (SDD)

Introduction

Occasionally it has been argued that free-form text is unnecessary for the new SDD standard. Although free-form text may be problematic or irrelevant in many analysis situations (including phylogenetic analysis and interactive identifications) it is a valuable feature in many other situations. This document attempts to provide arguments why free-form text should be supported and analyzes which free-form data elements are required. For a quick look at the results of the analysis please go to "Proposal" at the end.

Note: This discussion is only concerned with free-form text that may be present in the object descriptions. Many additional label and wording elements are available in the section concerned with the definition of terminology.

Free-form text in descriptions

Traditionally, all organism descriptions in biology have been prepared as free-form text. Most authors have realized the importance of consistent and well-defined descriptive terminology and have striven to create consistent and complete descriptions in free-form text. This is, however, difficult to achieve. Unsurprisingly, attempts to convert existing descriptions into structured databases usually reveal many shortcomings of current biodiversity descriptions.

Numerical taxonomy, including distance clustering and phylogenetic methods (e. g. parsimony analysis), was an important attempt to improve the objectivity and consistency of biodiversity descriptions. The great achievement of the DELTA format was to combine categorical information (nominal and ordinal scale) and numerical information (cardinal and interval scale) with free-format text elements. This made DELTA usable for general purpose descriptions as well as for identification and analysis applications.

Some free-form text elements present in DELTA should be reconsidered in the new SDD standard. To improve interactive identification, data analysis, and translation into multiple languages, elements like frequency modifiers or comments expressing certainty should be moved from free-form text to structured data. However, the general arguments for the necessity of free-form text remain valid, and free-form text elements should be included in the new standard.

Textual data elements in descriptions:

Each of these textual data elements requires a separate discussion.

Free-form text as the only content of a character

The need for free-form text characters arises in the following situations:

Although free-form text characters are ultimately the best choice in only relatively few situations, the presence of free-form text elements greatly enhances the flexibility of applications. A standard that requires every content author to define terminology rigorously and laboriously is much more likely to be accepted if free-form text characters (and also user-definable state extensions, see below) allow the gradual introduction of a strict terminology, rather than forcing the content author to solve terminological problems before starting to enter data.

The text in a text-character may be:

More information about the proposal to introduce basic formatting markup into the SDD standard interchange format is available in the document "Formatted free-form text".

Free-form text as user-definable extensions to the defined categories

It is often difficult to capture all categories of a property (e. g. "shape") in the terminology before starting to work on the descriptions, because state categories may be rare ("hamate shape") or applicable only to few organisms. Many existing DELTA-based descriptive data applications (e. g. LucID, Pankey/Pandora, the CSIRO DELTA programs, or DeltaAccess) allow the dynamic addition of character states while working on object descriptions. This enables the content and terminology author to identify a problem, refactor the terminology, and continue data entry work.

This approach, however, will often be impossible in collaborative working situations. A document-based system requires that only a single person is working on a project at the time of state-reorganization. A database-based system may allow reorganizations of the terminology as long as all users work on the same central database. Still, if distributed copies or database replication are used, changing the terminology becomes increasingly difficult. Furthermore, in large collaborative projects, it may be desirable to centralize the responsibility for changing the terminology, or changes must be agreed upon after a discussion process.

Consequently, a free-form text "category" as a catch-all remainder is an important feature. Such a category ("Other") is frequently encountered in sociological or market research questionnaires and serves to supply rare or unexpected information:
   Profession:
   Farmer
   Teacher
   ...
   Nuclear Physicist
   Other (please specify): [___________________]

Such a feature can act as an important feed-back mechanism about categories that may have been overlooked during the definition of the terminology. Information present in the "Other" free-form text category can be analyzed and migrated to new categories if the terminology is expanded.
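The feedback loop described above can be illustrated with a minimal sketch in Python. The record layout (a coded category plus an optional "other_text" field) and the function names are assumptions made for this illustration only; they are not part of DELTA or SDD.

```python
from collections import Counter

# Hypothetical example data: each record holds a coded category and,
# when the category is "Other", the free-form text supplied by the author.
records = [
    {"shape": "round", "other_text": None},
    {"shape": "Other", "other_text": "hamate"},
    {"shape": "Other", "other_text": "Hamate "},
    {"shape": "Other", "other_text": "kidney-shaped"},
]


def other_frequencies(records, field="shape"):
    """Count normalized free-form entries stored under the "Other" category."""
    return Counter(
        r["other_text"].strip().lower()
        for r in records
        if r[field] == "Other" and r["other_text"]
    )


def migrate(records, new_category, field="shape"):
    """Move "Other" entries whose text matches new_category into that category."""
    for r in records:
        text = (r["other_text"] or "").strip().lower()
        if r[field] == "Other" and text == new_category:
            r[field] = new_category
            r["other_text"] = None
    return records


# Step 1: find out which overlooked categories occur most often.
counts = other_frequencies(records)
# Step 2: after "hamate" has been added to the terminology, migrate the data.
migrate(records, "hamate")
```

In this sketch the frequency count tells the terminology author which free-form entries are common enough to deserve a defined state; entries that are not migrated simply remain in the "Other" text.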

This mixture of coded states and free-form text is inherently present in the DELTA format, although it is not well documented. Any categorical (UM/OM) or numerical (IN/RN) character can always be used as a text character. If character 2 is TE (text) and character 3 is UM (nominal categorical), the following can be coded:

2,<free-form text>  
3,<free-form text>   (here the text replaces a categorical state)
3,4<free-form text>   (here the text is a comment on the first state)
3,4/<free-form text>   (here the text is a second state)
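To make the three roles of the text explicit, the following Python sketch parses only the simplified forms shown in the four lines above. It is not a DELTA parser: the regular expressions, the function name parse_attribute, and the assumptions that angle brackets are never nested and that the parenthesized remarks are not part of the coded data are all choices made for this illustration.

```python
import re

# Matches one "/"-separated value; "/" inside <...> does not split.
# Assumption: angle brackets are never nested.
_PART = re.compile(r"(?:[^/<]|<[^>]*>)+")
# A value is an optional state code followed by an optional <text>.
_VALUE = re.compile(r"\s*([^<]*?)\s*(?:<([^>]*)>)?\s*")


def parse_attribute(attribute):
    """Return (character number, [(state code or None, text or None), ...])."""
    char_no, _, values = attribute.partition(",")
    parsed = []
    for part in _PART.findall(values):
        match = _VALUE.fullmatch(part)
        if match is None:
            raise ValueError(f"unsupported value syntax: {part!r}")
        state = match.group(1) or None  # empty code means "text only"
        parsed.append((state, match.group(2)))
    return int(char_no), parsed


text_only = parse_attribute("3,<free-form text>")        # text replaces a state
commented = parse_attribute("3,4<free-form text>")       # text comments on state 4
second_state = parse_attribute("3,4/<free-form text>")   # text is a second state
```

A value with a code and text is a commented state, while a value with text but no code acts as a free-form state of its own; this is the distinction that the DELTA syntax leaves implicit.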

Free-form text as annotation of an object-specific character state observation

In this case the basic information is already captured in a structured format and only additions or annotations are added as free-form text.

When creating natural language reports, the "wording" or "phrasing" defined in the general terminology and the state-specific annotations within the description are combined to produce the report. An important issue is how closely these two are expected to work together.

The BioLink application proposes free-form text elements that can be inserted before or after the state code (as "textBefore" and "textAfter" attributes; see footnote 1). These elements have no semantic definition, other than that they work hand in hand with the wording defined in the terminology for characters and states. This produces several problems. Firstly, it makes the creation of multilingual data sets very difficult, because data cannot be published without translating every single piece of annotation. This is not the case if the annotations are output in a separate place, where it would be acceptable to have a readable Spanish description with occasional English annotations in brackets. Secondly, and most importantly, the task of expanding and improving the character terminology is made almost impossible. Any change in the labeling or wording of characters and states requires a reassessment and revision of every object description using this information in combination with a "textBefore"/"textAfter" annotation.

This is not the case if annotations and terminology are decoupled. Annotations should be considered independent data items that have to be understandable without the immediate context of the wording chosen in the terminology. "textBefore"/"textAfter" elements should not be supported for this reason.

1 The BioLink document "Enhanced Item Descriptions" (publ. 2001) states that "the restrictions on the placement of comments imposed by DELTA will be relaxed so that character values and comments can be intermixed as required". However, the preferred model is currently only slightly relaxed, using fixed "textBefore" and "textAfter" attributes (S. Shattuck, pers. comm. 2002).
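The decoupling argued for above can be sketched in Python as follows. The wording table, the character and state keys, and the render function are hypothetical and serve only to show that an independent annotation survives a change of wording or output language; they do not reflect the SDD schema.

```python
# Hypothetical wording tables, one per language, as they might be
# defined in the terminology (not taken from any real data set).
WORDING = {
    "en": {"leaf_colour": "leaves", "green": "green"},
    "es": {"leaf_colour": "hojas", "green": "verdes"},
}


def render(character, state, annotation, lang):
    """Combine terminology wording with an independent annotation."""
    words = WORDING[lang]
    sentence = f"{words[character]} {words[state]}"
    if annotation:
        # The annotation is output in a separate place (in brackets), so it
        # never has to fit grammatically into the terminology wording.
        sentence += f" [{annotation}]"
    return sentence[0].upper() + sentence[1:] + "."


english = render("leaf_colour", "green", "reddish when young", "en")
spanish = render("leaf_colour", "green", "reddish when young", "es")
```

Because the annotation is appended as a self-contained item, the wording for a character or state can be revised, or a new language added, without revisiting the annotation in every object description.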

No need for language-unspecific state annotations could be identified. It is therefore proposed that state annotations are always language specific.

Furthermore, content authors frequently request basic text formatting in annotations. It is proposed that state annotations provide limited text formatting, see the separate document "Basic text formatting". It may be desirable to allow terminology designers to limit this ability in user interfaces. This could be defined on the project, character, or state level. Currently such an option is not proposed by SDD.

In contrast to DELTA and especially the DELTA-2 proposal, the following text features should be rejected:

Text as list elements

This topic has not yet been sufficiently explored in the discussions. It strongly depends on the ability to validate or pick list information from external data sources, e. g. lists of host names, pathogens, etc. Please do raise the subject in the discussions!


Structural comparison of text elements in DELTA and SDD

Text elements in DELTA / DELTA-2 descriptions:

   Character type         Descriptive data
   ---------------------  ---------------------------------------------------
   Text character         [Comment] A text character may contain any [Comment]
                          combination of text and [Comment]s
   Categorical character  [Comment]red[Comment]
                          [Comment]green[Comment]
   Numerical character    (minimum)        [Comment] 1[Comment]
                          (lower range)    [Comment] 2[Comment]
                          (value or mean)  [Comment] 5[Comment]
                          (upper range)    [Comment] 7[Comment]
                          (maximum)        [Comment] 12[Comment]

Note that an explicit free-form text category for "Other" is not available. In categorical characters it can be simulated through an "other" category, but this would be treated as analytically meaningful by DELTA applications (e. g. a user could be asked in keys: "is it 'other' or not?").


Text elements in SDD descriptions:

   Character type         Modifiers  State                    Numerical data  Reported Note                        Internal Note
   ---------------------  ---------  -----------------------  --------------  -----------------------------------  -------------
   Text character         -          (text state)             -               Free-form text may contain limited   Internal Note
                                                                              formatting, e. g. superscript,
                                                                              subscript, or italics.
   Categorical character  usually    red                      -               Reported Note                        Internal Note
                          rarely     green                    -               Reported Note                        Internal Note
                          -          "Other:" (text state)    -               User-supplied category               Internal Note
   Numerical character               (minimum)                1               Reported Note                        Internal Note
                                     (lower range)            2               Reported Note                        Internal Note
                                     (mean)                   5               Reported Note                        Internal Note
                                     (upper range)            7               Reported Note                        Internal Note
                          probably   (maximum)                12              Reported Note                        Internal Note
                          -          "Other:" (text state)    -               User-supplied statistic              Internal Note

Any element may have an optional reported note, perhaps with the exception of coding status values (also called "missing data indicators" or "special states") such as "unknown", "not applicable", etc.; this needs to be discussed! Any element (including coding status values) may have an optional internal note. "-" indicates that the column is not applicable to the state.



Proposal

  1. A single "text state" containing only free-form text (but having no other analytical properties) can be added to the definition of any categorical or numerical character.
  2. The structure of "text states" should be analogous to coding status values (expressing that a character is inapplicable or unknown).
  3. A "text state" may have a label and wording defined in the terminology section (for all descriptions together).
  4. A "text character" is a character containing only a text state (but no categorical, numerical, or statistical states). Special states expressing "unknown", "not applicable", etc. may optionally be present.
  5. Each categorical or statistical state used in an object description may have an annotation that is publicly visible (ReportedNote).
  6. Each state used in an object description may have an additional free-form annotation that is visible only to the authors and editors of the data set (InternalNote).
  7. Text and annotations may contain limited xhtml markup as specified in "Formatted free-form text".

The following constraints are considered desirable to simplify the structural model:

  1. Two or more text states per character are not supported.
  2. A text state provides no annotation (ReportedNote) in addition to the text. The annotation and the text are identical; all free-form reportable text in object description states resides in the same data element.
  3. Annotations applicable to an entire character within a description are not supported. All object description annotations occur on the state level. If a reported or internal note is required for the entire character rather than for specific states, a "text state" would have to be added to the character (to avoid a possible misunderstanding: this does not refer to "character notes" defined in the terminology!).
  4. Special states may or may not support annotations; this needs further discussion!

This proposal is designed to handle free-form text as a) the single child of a character, b) a sister of other child elements of a character, and c) a child element of states, with a minimum of structural complexity.
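The proposal and its constraints can be summarized as a small data model. The following Python sketch is one possible reading, not the SDD schema: the class and attribute names are assumptions, apart from ReportedNote and InternalNote, which are taken from the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class State:
    """A categorical or statistical state used in an object description."""
    label: str
    reported_note: Optional[str] = None  # publicly visible (ReportedNote)
    internal_note: Optional[str] = None  # authors and editors only (InternalNote)


@dataclass
class TextState:
    """Free-form text; the text itself is the reportable annotation."""
    text: str
    internal_note: Optional[str] = None  # no separate reported note (constraint 2)


@dataclass
class CharacterDescription:
    """The states recorded for one character in one object description."""
    name: str
    states: List[State] = field(default_factory=list)
    # A single optional field means two or more text states cannot occur
    # (constraint 1). There are no character-level notes (constraint 3).
    text_state: Optional[TextState] = None

    def is_text_character(self) -> bool:
        """True if the character contains only a text state."""
        return self.text_state is not None and not self.states


# a) text as the single child of a character
habitat = CharacterDescription("habitat", text_state=TextState("on decaying wood"))
# b) text as sister of other states, c) notes as children of a state
colour = CharacterDescription(
    "colour",
    states=[State("red", reported_note="fading with age")],
    text_state=TextState("iridescent"),
)
```

Special states such as "unknown" or "not applicable" are left out of this sketch, because the proposal leaves open whether they support annotations.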

Request for discussion

Please send your criticism or suggestions to the SDD mailing list or to the author.

Gregor Hagedorn; Vers. 2; 10. March 2003
Earlier versions: Version 1




First published 2002-04-25, last update: 2003-03-10.
