TDWG working group: Structure of Descriptive Data (SDD)
The formatting of text may occasionally have a semantic meaning rather than being a question of rendering or appearance. For example, the symbol "m<sup>2</sup>" is different from "m<sub>2</sub>", and an unformatted "m2" is ambiguous and should be considered an orthographical error.
A reference citation may be:
"<citation><author>Smith, J.</author><year>2000</year> <title>Development of S2-proteins during G1-phase</title></citation>".
In this example the superscript and subscript formatting occurs within the title and is part of a technical terminology. Such formatting may indicate a special meaning in some terminology, e. g. multiplicity in chemical formulas like "H2O". In general, however, the meaning of superscript or subscript formatting will not be readily apparent (as can be seen in the example above) and researching it for each article that is to be entered into a citation database will be impractical. The parsimonious solution is therefore to simply preserve selected character formatting statements in these data elements. It would thus be desirable to use mixed content for the title: "<title>Development of S<sub>2</sub>-proteins during G<sub>1</sub>-phase</title>". Alternatively, if mixed content should be avoided at all cost, the markup could be: "<title><normal>Development of S</normal><sub>2</sub><normal>-proteins during G</normal><sub>1</sub><normal>-phase</normal></title>".
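The two title forms above carry the same information, so one can be derived mechanically from the other. The following is a minimal sketch in Python, not part of the proposal; the function name to_non_mixed is hypothetical, and it assumes a flat title in which formatting elements contain only text.

```python
import xml.etree.ElementTree as ET

def to_non_mixed(title_xml: str) -> str:
    """Wrap every bare text run of a flat mixed-content element in <normal>."""
    src = ET.fromstring(title_xml)
    dst = ET.Element(src.tag, src.attrib)

    def add_normal(text):
        # Empty or missing text runs produce no <normal> element.
        if text:
            ET.SubElement(dst, "normal").text = text

    add_normal(src.text)
    for child in src:
        # Copy the formatting element itself (assumed to hold text only).
        copy = ET.SubElement(dst, child.tag, child.attrib)
        copy.text = child.text
        add_normal(child.tail)
    return ET.tostring(dst, encoding="unicode")

mixed = ("<title>Development of S<sub>2</sub>-proteins "
         "during G<sub>1</sub>-phase</title>")
print(to_non_mixed(mixed))
```

For the mixed-content title shown above, this prints the non-mixed variant with the "normal" elements. Nested formatting would need a recursive version.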
Furthermore, content authors may want to emphasize words within a free-form text comment. An example would be "... this is <em>not</em> identical with the situation in ...". DELTA content authors were already complaining about the lack of formatting support (DELTA 2 proposed adding general rtf-style formatting support). Most content authors are used to working with word processing applications and expect similar features from descriptive data applications. For the new descriptive data standard to be accepted, it is important that applications are able to satisfy the needs of content authors without sacrificing the benefits of structured and analyzable content that separates semantic markup from formatting issues. If an application does not provide such facilities for free-form text, authors of data sets are likely to revert to email-style character formatting like "... _not_ ...", "... *not* ...", etc. This is undesirable.
In practice, two more character formatting elements are frequently requested: underlining for emphasis and small caps for authors. I believe that underlining for emphasis should be discouraged, since standard typesetting rules discourage the use of underlining in proportionally spaced fonts. Small caps are not directly supported in html/xhtml, but are part of the cascading style-sheet specification (although not widely supported by current Windows browsers). Formatting abilities that are requested only very rarely are: text font and background color, and language for multilingual text.
The problem formulated here is also encountered in related XML schemas, e. g. where xhtml or MathML is contained in RDF/RDFS. RDF defines an attribute (rdf:parseType="Literal") to inform a parser that it should stop parsing. See Resource Description Framework (RDF) Model and Syntax Specification, chapter 7.5. "Values Containing Markup". However, I would prefer to constrain what can be used in formatted text rather than throwing it completely open (which complicates consumption and analysis applications).
The proposed data type for character-formatted text should be restricted to clearly defined elements within the schema. It would, for example, be undesirable to allow formatting markup in a list of geographical area names. The data type "simpleFormattedText" is therefore proposed for the ReportedNotes element present in each character state element within each item, and as one of several free-form text data types. Other free-form text data types should be limited to string data (no mixed content).
It is understood that the information in data elements containing character formatting is primarily intended for human consumption, rather than for machine processing (except perhaps full-text-indexing operations). In most situations, the formatting will be passed through directly to a report. Nevertheless, for processors that may want to gain additional insight into the content of simpleFormattedText, a semantic definition of the formatting elements should be provided (or it should be indicated that the semantics are ill-defined). An example of such a semantics-aware processor would be a full-text indexing engine that could react differently to italics or superscript text.
The provision of character formatting markup enables content authors to abuse the facility to format the entire element (e. g. all reported comments are manually enclosed in emphasis tags). For superscript and subscript, this can be prevented by not allowing the markup to enclose the entire text content of the element. However, this restriction may be unsatisfactory in the case of emphasis or taxon markup. It is quite possible that occasionally the entire title of a reference or content of a reported comment consists of a taxon name. A specific rule for superscript and subscript seems to be a disproportionate effort.
The following formatting elements derived from xhtml are supported in the data type "simpleFormattedText" (= character formatted free-form text):
| Description | XML Element | Semantics |
| subscript | <sub> | Belongs to the adjacent word and is necessary to understand its meaning. Include the markup when indexing. |
| superscript | <sup> | As above. |
| emphasis | <em> | The phrase is considered to be more important than other phrases. Do not include the markup in indexing, but the content may be weighted in indexing. |
| strong | <strong> | As above, the emphasis may be even stronger. |
| italics | <i> | The semantics are poorly defined. Use for italicized text not identifiable as emphasis or a taxon name, e. g. where Latin phrases have to be italicized. Remove markup for all semantic processing. |
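The semantics column above can be read as instructions for an indexing engine. The following Python sketch is one possible interpretation, not a prescribed behaviour. The function names, the enclosing element name "note", and the weight values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

KEEP_MARKUP = {"sub", "sup"}          # belongs to the adjacent word
WEIGHTS = {"em": 2.0, "strong": 3.0}  # hypothetical index weights

def to_index_string(elem):
    """Text for indexing: keep sub/sup markup, drop all other markup."""
    parts = [elem.text or ""]
    for child in elem:
        inner = to_index_string(child)
        if child.tag in KEEP_MARKUP:
            parts.append(f"<{child.tag}>{inner}</{child.tag}>")
        else:
            parts.append(inner)
        parts.append(child.tail or "")
    return "".join(parts)

def weighted_phrases(elem):
    """Phrases inside em/strong, paired with their assumed weight."""
    return [("".join(e.itertext()), WEIGHTS[e.tag])
            for e in elem.iter() if e.tag in WEIGHTS]

note = ET.fromstring(
    "<note>Area is <em>not</em> given in m<sup>2</sup>, "
    "see <i>in situ</i> data</note>"
)
print(to_index_string(note))
print(weighted_phrases(note))
```

Here the italics markup disappears, the superscript stays attached to "m", and the emphasized word is reported separately so that an engine may weight it.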
In addition, the following semantic elements extending xhtml should be supported in "simpleFormattedText":
| Description | XML Element | Semantics | Rendering recommendation |
| taxon name | <taxon> | Distinguishes italicized taxon names from other, unidentified italic elements | italics |
| taxon author | <taxonauthor> | Author names in the taxon name string | none or smallcaps |
| citation author | <citationauthor> | Author names in bibliographic citations | none or smallcaps |
Some formatting can be applied to formatted text, creating nested markup:
<em>e<sup>x<strong>-y</strong></sup></em>.
However, taxon, taxonauthor, and citationauthor should contain no further formatting. Note that the use of the mixed content model is not the only solution; the problem is further discussed in "Use of the mixed content model for basic text formatting".
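The element lists and the nesting rule above could be checked by an application roughly as follows. This is a hedged Python sketch, not the schema definition itself; the function name check_simple_formatted_text is hypothetical, and ReportedNotes is used only as an example of an enclosing element.

```python
import xml.etree.ElementTree as ET

FORMATTING = {"sub", "sup", "em", "strong", "i"}
LEAF_ONLY = {"taxon", "taxonauthor", "citationauthor"}

def check_simple_formatted_text(xml_text):
    """Return a list of problems; an empty list means these checks pass."""
    root = ET.fromstring(xml_text)
    problems = []

    def visit(elem):
        for child in elem:
            if child.tag not in FORMATTING | LEAF_ONLY:
                problems.append(f"element <{child.tag}> is not allowed")
            if elem.tag in LEAF_ONLY:
                problems.append(
                    f"<{elem.tag}> must not contain <{child.tag}>")
            visit(child)

    visit(root)  # the enclosing element itself is not checked
    return problems

ok = "<ReportedNotes><em>e<sup>x<strong>-y</strong></sup></em></ReportedNotes>"
bad = "<ReportedNotes><taxon>Aus <em>bus</em></taxon></ReportedNotes>"
print(check_simple_formatted_text(ok))
print(check_simple_formatted_text(bad))
```

The nested example from the text passes, while formatting inside a taxon element is reported.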
Proposed text (first draft):
Simple formatted text allows only a restricted set of character formatting operations. No paragraph formatting (like multiple paragraphs, numbered or bulleted lists) is available. Introducing these formatting elements would severely limit the options for different reporting styles. The main character formatting elements are: subscript, superscript, emphasis, strong, and italics. Use "emphasis" for words or parts of a sentence that should receive stress while reading the text. Emphasis will often be rendered as text in italic font. Use "strong" for very strong emphasis, e.g. keywords that should be immediately visible when looking at a paragraph. Strong should be used sparingly, since it will disturb the normal flow of reading. Strong will often be rendered as bold-face text. Underlining and bold are not available; use "emphasis" and "strong" instead. Use italics for text that should be marked up in italics by convention, e.g. Latin words in English text. Use "taxon" for text that should be italicized because it contains the name of a biological taxon.
Applications may support formatting markup or not. A simple implementation could leave the markup (<em></em>) visible and provide buttons on the user interface to switch markup during editing, or perhaps provide a validating parser. Another implementation could hide the markup and provide a WYSIWYG-style editing interface.
To provide interoperability between multiple applications dealing with descriptive data, however, the markup should not be removed during import procedures in applications that both read and write the standard. Furthermore, the markup should by some means remain transparently visible to the user during editing in applications that support editing.
Thus, any application can support the formatted text by simply treating it as unformatted free-form text, displaying the markup directly. If stripping of the markup is desired to improve the reports, the stripping should occur during reporting rather than during data import.
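Stripping at report time rather than at import can be as simple as the following Python sketch. The function name plain_text_for_report is hypothetical; the stored string keeps its markup and only the report output is flattened.

```python
import xml.etree.ElementTree as ET

def plain_text_for_report(stored_xml):
    """Return the text content with all markup removed; input is untouched."""
    return "".join(ET.fromstring(stored_xml).itertext())

stored = ("<ReportedNotes>this is <em>not</em> identical, "
          "volume in cm<sup>3</sup></ReportedNotes>")
print(plain_text_for_report(stored))  # this is not identical, volume in cm3
```

Note that stripping superscript loses meaning ("cm3"), which is one reason to keep the markup in the stored data and to decide per report whether stripping is acceptable.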
Applications are not required to validate the editing actions on individual instances of free-form text. However, if possible any application writing files in the SDD standard should validate the resulting file and draw the attention of the user to invalid parts.
The application must either silently escape the necessary characters as entities, so that both markup such as <i> and literal < or > characters can be embedded, or inform the user that this has to be done manually.
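Silent escaping could be done as in the following Python sketch, which uses the standard library function xml.sax.saxutils.escape. The helper build_note and its three-part signature are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def build_note(before, italic, after):
    """Escape literal text, then combine it with real <i> markup."""
    return (f"<ReportedNotes>{escape(before)}"
            f"<i>{escape(italic)}</i>"
            f"{escape(after)}</ReportedNotes>")

note = build_note("spores < 5 mm ", "in vitro", " & > 10 mm on host")
print(note)
# Parsing the result recovers the literal characters the user typed.
print("".join(ET.fromstring(note).itertext()))
```

The first line printed contains the entities "&lt;", "&amp;" and "&gt;", while the "i" element remains real markup.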
Is it possible or meaningful to allow any hierarchy of formatting markup, while preventing repeated use of the same element? <em>cm<sup>3</sup></em> should be possible, but <em>some text <em>some text</em> some text</em> is not desirable.
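Whether the schema itself can express this rule is left open here. Outside the schema, an application could detect the undesirable case with a check like the following Python sketch; the function name repeated_nesting and the enclosing element "n" are hypothetical.

```python
import xml.etree.ElementTree as ET

def repeated_nesting(xml_text):
    """List tags that occur inside an ancestor element of the same name."""
    root = ET.fromstring(xml_text)
    found = []

    def visit(elem, open_tags):
        for child in elem:
            if child.tag in open_tags:
                found.append(child.tag)
            visit(child, open_tags | {child.tag})

    visit(root, frozenset())  # the enclosing element is not counted
    return found

print(repeated_nesting("<n><em>cm<sup>3</sup></em></n>"))
print(repeated_nesting(
    "<n><em>some text <em>some text</em> some text</em></n>"))
```

The first call returns an empty list; the second reports the nested "em".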
Gregor Hagedorn; Vers. 2; 10. March 2003
Earlier versions: Version 1 (2002)