TDWG working group:
Structure of Descriptive Data (SDD)

Minutes of working sessions in Indaiatuba, Brazil, 14-17. October 2002

(Version 1.0)


Introduction

The meeting in Brazil was a very important step forward in several ways:

Firstly, the considerable time we are investing in in-person discussions (4 days in March 2002 and now again 4 full days, plus many hours alongside the following meetings) seems to be well invested. We are now able to present a "straw man" xml schema model and an associated example document. This "straw man" is intended for testing the concepts with real data. It is expected to change significantly over time, so that investments into implementing code based on it should remain limited. However, in contrast to the schema published after the last meeting, this schema is no longer auto-generated from an example document, but a true manually designed schema. All auto-generation artifacts are now removed, and the model is complete enough to code actual data with it.

For those who do not have an xml-schema viewer, the schema is also available as a graphical presentation (large file, 1.5 MB html-file plus 0.4 MB images!) generated by xml spy. You can click on the elements and types in the diagrams to see their definition in more detail.

Secondly, the development of GBIF is starting to have an impact on our group. GBIF is recognizing the importance of the TDWG SDD standards development for the future of GBIF. Currently GBIF is concentrating on parts of the global biodiversity information infrastructure (catalogue of names and specimen access) not directly affecting our work. However, the next phase of GBIF will concentrate on species descriptions and information beyond names, synonymy, and specimen information. This phase can be jeopardized by a knowledge and development gap, preventing the fast implementation of the second phase. SDD is filling an important prerequisite, which has to be in place when this GBIF phase starts. GBIF will therefore also attempt to support a limited amount of travel to SDD meetings.

Please see also the convener's report given at the main TDWG meeting. It also discusses the state of the development and the relations with GBIF. It is available as an html-document or as a PowerPoint file.

Note on the arrangement of the minutes: The discussions are not always presented in their original sequence. I have tried to rearrange the threads so that related topics are discussed together. Some repetition is still present, and please do alert me if anything has become confusing or misleading through this editing!


Table of Contents

Monday, 14. October 2002
1. Discussion of general architecture
2. Data types
3. Frequency and likelihood statements

Tuesday, 15. October 2002
1. Discussion of data types available to SDD descriptions
  1.1 Categorical data
  1.2 Numerical data
  1.3 Text data
  1.4 Arranging characters in arrays
2. Character hierarchies and character decomposition
  2.1 Revision of concepts proposed in Australia
  2.2 Is autogeneration of character names possible?
  2.3 Displaying multiple hierarchies in a combined tree
  2.4 Reuse of part hierarchy and natural language wordings
  2.5 Sequences of characters
  2.6 Sequences of states
3. Multilingual features: language and expertise
  3.1 Language
  3.2 Expertise Level
  3.3 Further differentiation
  3.4 Central audience definitions
  3.5 Position of natural language wording definitions
4. Exclusive vs. multistate states
5. Multistate in numerical characters

Wednesday, 16. October 2002
1. OpenKey presentation
2. Character attributes to improve identification
  2.1 Reports about some identification programs
  2.2 Which attributes to use?
  2.3 Conclusions about character identification attributes
3. ItemAbundance
4. Character groupings
  4.1 Groupings types
  4.2 Name for "CharacterGrouping"
  4.3 CharacterSubsets
5. Single valued numeric characters
6. Procedural discussions
7. Ordinal categorical types (again)
8. Discussion of state x item versus char x item matrix
9. Combining different data types (categorical, numerical and text) within a character

Thursday, 17. October 2002


Monday, 14. October 2002

Participants

Bob Morris (Univ. Mass., USA)
Gregor Hagedorn (BBA, Germany)
Guillaume Rousse (Paris 6, France)
Johan Duflost (Belgium Biodiversity Information, BEBIF, Belgium)
Mihail Carausu (DanBiF, IT Coordinator, Denmark)
Nicolas Bailly (MNHN, Unité Taxonomie et Collections, Paris, France)
Nozumi "James" Ytow (Univ. of Tsukuba, Japan)
Patricia Mergen (Belgium Biodiversity Information, Node manager, Belgium)

After a round of introductions and a discussion of the 4-day agenda, the following points were discussed on Monday:

1. Discussion of general architecture

A general expectation is that the item description should be controlled by a terminology provided in an xml schema. This expectation was raised repeatedly, and since it contrasts with the conclusions reached in the March 2002 meeting, it was considered appropriate to discuss and perhaps revise the architectural decisions regarding the general structure of the terminology and item description schemata and documents, to avoid overlooking a possibly simpler solution.

The following summary of the discussion has been provided by Guillaume Rousse:

The different parts of the system are undisputed:

However, problems arise about the relationships between these elements. Two kinds of relationships exist:

Depending on the relationships used, two different architectures have been discussed.

1. Dedicated vocabulary is a specialization of general vocabulary

  • general vocabulary is an XML schema
  • dedicated vocabulary is an XML schema inheriting from general vocabulary
  • item description is an XML document complying to dedicated vocabulary as ensured by standard schema validation

Pros:

  • technically trivial
  • easy nesting of dedicated vocabularies extending each other

Cons:

  • limitation to schema features only for validation of item description, which could prove insufficient for specific purposes
  • inability to assign values to descriptors themselves (as reliability), as they are implemented as data types

2. Dedicated vocabulary is an instantiation of general vocabulary

  • general vocabulary is an XML schema
  • dedicated vocabulary is an XML document complying to general vocabulary as ensured by standard schema validation
  • item description is an XML document complying to dedicated vocabulary as ensured by other kind of validation

Pros:

  • no limitation to schema features for validation of item description
  • ability to assign values to descriptors themselves (as reliability), as they are implemented as data

Cons:

  • technically more complex
  • no possibility to nest dedicated vocabularies

Discussion about how to achieve custom validation:

(Guillaume Rousse)

We concluded that xml schema (www.w3.org/2001/XMLSchema) would have to be extended to fulfill the requirements of SDD and that this is not appropriate. The solution selected in March was in the end considered to be likely the best. See also the proposal Document structure of standard for a detailed discussion of the problem.
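
To illustrate what the "other kind of validation" in the second architecture could look like, here is a minimal Python sketch in which the terminology is an ordinary XML document and the item description is checked against it by application code. All element and attribute names (Terminology, Character, State, ItemDescription, Score, key, charref, stateref) are hypothetical and are not taken from any SDD schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical terminology document (an instance, not a schema).
TERMINOLOGY = """
<Terminology>
  <Character key="c1">
    <State key="s1"/>
    <State key="s2"/>
  </Character>
</Terminology>
"""

# Hypothetical item description referring to the terminology by key.
ITEM = """
<ItemDescription>
  <Score charref="c1" stateref="s1"/>
  <Score charref="c1" stateref="s9"/>
  <Score charref="c7" stateref="s1"/>
</ItemDescription>
"""

def validate(terminology_xml, item_xml):
    """Return a list of error messages; an empty list means the item is valid."""
    allowed = {}
    for char in ET.fromstring(terminology_xml).iter("Character"):
        allowed[char.get("key")] = {s.get("key") for s in char.iter("State")}
    errors = []
    for score in ET.fromstring(item_xml).iter("Score"):
        c, s = score.get("charref"), score.get("stateref")
        if c not in allowed:
            errors.append(f"unknown character {c}")
        elif s not in allowed[c]:
            errors.append(f"unknown state {s} for character {c}")
    return errors

errors = validate(TERMINOLOGY, ITEM)
```

Because characters and states are data in this design, further values (for example a reliability) could be attached to them, which is the advantage listed for the second architecture.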


2. Data types

Note: Since the discussion was continued on the next day and most points were reconsidered then, a summary is documented under Tuesday, below.

3. Frequency and likelihood statements

The differences and relationship between frequency statements ("usually", "rarely") and likelihood statements ("probably") were discussed. Regarding the frequency statements a change of the March 2002 schema version was accepted: Frequency modifiers can now contain direct statement of frequency (LowerLimit and UpperLimit or Value) rather than only references to verbal terminology with a fixed associated frequency range (e. g. "usually" with a defined frequency range).
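
The accepted change can be pictured with a small Python sketch of a frequency modifier that carries either a reference to a verbal term, or a LowerLimit/UpperLimit pair, or a single Value. The class and field names, and the assumption that frequencies are expressed as fractions between 0 and 1, are illustrative only and not part of the schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FrequencyModifier:
    term: Optional[str] = None           # reference to verbal terminology, e.g. "usually"
    lower_limit: Optional[float] = None  # direct statement: LowerLimit
    upper_limit: Optional[float] = None  # direct statement: UpperLimit
    value: Optional[float] = None        # direct statement: Value

    def __post_init__(self):
        limits = (self.lower_limit, self.upper_limit)
        has_range = all(v is not None for v in limits)
        if any(v is not None for v in limits) and not has_range:
            raise ValueError("LowerLimit and UpperLimit must be given together")
        if has_range and self.value is not None:
            raise ValueError("give either a range or a single Value, not both")
        if not has_range and self.value is None and self.term is None:
            raise ValueError("empty frequency modifier")
        for v in (*limits, self.value):
            # Assumption: frequencies are fractions in [0, 1].
            if v is not None and not 0.0 <= v <= 1.0:
                raise ValueError("frequencies must lie between 0 and 1")
        if has_range and self.lower_limit > self.upper_limit:
            raise ValueError("LowerLimit exceeds UpperLimit")

verbal = FrequencyModifier(term="usually")
direct = FrequencyModifier(lower_limit=0.7, upper_limit=0.95)
```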

See also the proposals Certainty modifiers and Frequency modifiers


Tuesday, 15. October 2002

Participants

Arthur Chapman (Australia & CRIA, Brazil)
Bob Morris (Univ. Mass., USA)
Bryan Heidorn (Univ. of Illinois, USA)
Gregor Hagedorn (BBA, Germany)
Guillaume Rousse (Paris 6, France)
Isabel Calabuig (Denmark)
Jim Croft (Australian National Herbarium, Canberra, Australia)
Kevin Thiele (CPITT/LucID, Australia)
Mihail Carausu (DanBiF, IT Coordinator, Denmark)
Nicolas Bailly (MNHN, Unité Taxonomie et Collections, Paris, France)

1. Discussion of data types available to SDD descriptions

1.1 Categorical data

In most characters, it is necessary to score multiple categories ("states") for a given item. If descriptions are based on a single organism or object, some characters could have exclusive categories, whereas other categories could co-occur. However, the main use of descriptive databases is the description of taxa, which are collections (or populations) of objects. This may even be true where original observations are scored for natural history collection material, at least in any taxon where the individual is weakly defined and collections of multiple objects are frequent (grasses, corals, lichens). Consensus was reached that any categorical data type should allow "multistate" use, as is the case in the current DELTA and NEXUS formats.

The discussion then centered around various types of nominal and ordinal data and the need to clarify various types of ordering. The discussion points are covered by two SDD documents prepared after the meeting, please refer to Categorical data types (G. Hagedorn) and xml code to define tree, cyclic, and graph ordering of character states (Bob Morris).

1.2 Numerical data

A) Single dimension:

Two types of numerical data are recognized: cardinal data (counts) and interval scale data (= continuous measurements). Multiple statistics (mean, range, s.d., s.e., sample size) can be derived from a single variable or dimension (e. g. length), so that each data item is multi-valued. The data can, for example, be reported such as: "(2-) 2.5-3-3.6 (-3.9) sd=1.22 se=1.22".
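
As a rough illustration of how such a multi-valued statement decomposes into separate statistics, the following Python sketch parses only the range part of the example, "(2-) 2.5-3-3.6 (-3.9)". The key names (min, lower, central, upper, max) are assumptions made for this sketch; in particular the minutes do not say whether the middle value is a mean or a median.

```python
import re

NUM = r"\d+(?:\.\d+)?"
PATTERN = re.compile(
    rf"^\s*(?:\(\s*(?P<min>{NUM})\s*-\s*\)\s*)?"       # optional "(2-)"
    rf"(?P<lower>{NUM})\s*-\s*(?P<central>{NUM})\s*-\s*(?P<upper>{NUM})"
    rf"(?:\s*\(\s*-\s*(?P<max>{NUM})\s*\))?\s*$"       # optional "(-3.9)"
)

def parse_range(text):
    """Split a range statement into named statistics (floats)."""
    m = PATTERN.match(text)
    if m is None:
        raise ValueError(f"not a recognized range statement: {text!r}")
    return {k: float(v) for k, v in m.groupdict().items() if v is not None}

stats = parse_range("(2-) 2.5-3-3.6 (-3.9)")
```

The extreme values in parentheses are optional in this sketch, so "2.5-3-3.6" alone is also accepted.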

A discussion about technical representations of integer and real numeric data of different precision was started, but it was concluded that with the currently readily available precision (32 bit integer values, 8 byte double precision real) probably any precision requirement for normal biological data is covered. Implementations of the SDD standard may use higher precision, but are required not to use any lower precision. If an overflow occurs during import, the user must be informed.

The following document discusses numerical data in item descriptions in more detail and proposes a detailed list of features and constraints: Numerical data (G. Hagedorn). (Note: see also below for the longer discussion on Wednesday)

B) Finite, ordered set of multiple numeric dimensions:

The set can be ordered (n-dimensional array) or unordered (collection). The set may contain integer or interval data. Each data point may consist of statistic parameters like min, max, mean, etc. Example: Fourier transformation of leaf shapes. No decision could be reached about the importance of supporting such a feature. This needs to be further evaluated, balancing the need for the feature and the burden it places on software developers. See also the following section on "Arranging characters in arrays".

1.3 Text data

A) List-values, esp. from external query source:

Examples: host-plants, or geographical distribution. The importance of this type is that it is neither feasible nor desirable to add all geographical areas, all host plants, pollinators etc., which could be lists of several tens of thousands of names each, to the terminology definition. This is, however, the only option in current DELTA programs. Delta-2 proposed the introduction of a "list character". However, it did not address the main problem, since it still required that the databases on which the lists are based be kept in the DELTA files, just like categorical character definitions.

An external list character is similar to a categorical multistate data type where the state definitions are dynamically obtained through querying external data providers. Such a character would have some terminology definitions in the SDD data set (e. g. natural language wording, position in character groupings, etc.) and other parts would be externally managed.

The main problem in defining such a data type in an operationally meaningful way will be the definition of data interchange with providers. Once this has been solved, the values could be stored as an external ID ("foreign key") plus a cache of a human readable description of the item (e. g. a taxonomic name).
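
The storage idea (external ID plus cached human readable label) could look roughly like the following Python sketch. The class, the provider name, the IDs, and the lookup function are all hypothetical; the actual interchange with data providers is exactly the part that remains to be defined.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExternalValue:
    provider: str      # hypothetical identifier of the external data provider
    external_id: str   # the "foreign key" into the provider's list
    cached_label: str  # cache of a human readable description

def refresh(value, lookup):
    """Return a copy with the label refreshed from the provider.

    If the provider has no answer (lookup returns None), the cached label is kept.
    """
    label = lookup(value.provider, value.external_id)
    if label is None:
        return value
    return ExternalValue(value.provider, value.external_id, label)

# Stand-in for a real provider query (assumption for this sketch).
FAKE_PROVIDER = {("names", "42"): "Quercus robur L."}

def fake_lookup(provider, external_id):
    return FAKE_PROVIDER.get((provider, external_id))

host = ExternalValue("names", "42", "Quercus robur")
refreshed = refresh(host, fake_lookup)
offline = refresh(ExternalValue("names", "99", "Fagus sylvatica"), fake_lookup)
```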

B) Free-form text:

The necessity of free form text was confirmed. The following proposal attempts to give an overview of free-form text data in item descriptions and proposes a detailed list of features and constraints: Free-form text data elements (G. Hagedorn).


1.4 Arranging characters in arrays

The following example was created during a discussion of how to handle data that have structures like tables, arrays, or ordered collections.

Examples for complex arrangements of atomic characters into character arrays: "coral growth at different depth" or "growth diameter of fungal cultures on Petri-dishes". The latter example (provided by Gregor, I know it is just the fungi which make it difficult...) is shown below. The cultivation occurs on various media (OA = Oat-Agar, MA = Malt-Agar, SNA = Synthetic Nutrient-Poor Medium), at different temperatures (15, 20, 25 °C), and over different times (7, 14, 21 days):

15°C:   OA      MA      SNA
7d      8 mm    10 mm   - mm
14d     18 mm   21 mm   6 mm
21d     22 mm   40 mm   - mm

20°C:   OA      MA      SNA
7d      21 mm   40 mm   - mm
14d     39 mm   80 mm   38 mm
21d     60 mm   - mm    - mm

25°C:   OA      MA      SNA
7d      20 mm   38 mm   - mm
14d     36 mm   82 mm   31 mm
21d     58 mm   - mm    - mm

Note that several values are missing. The data challenge would be to record that SNA has only been tested after 14 days, and that the MA after 21 days at 20 and 25 °C could not be measured, since the size of the Petri dish restricts the maximum diameter to 90 mm.

Interesting may also be the ratio between values, or the ratio of a value divided by a group-defining variable like the number of days in the example above! Example: size measurement:

    length width  length/width-ratio
-----------------------------------------
      12     3           4
       4     2           2
-----------------------------------------
mean:  8     2.5         3
-----------------------------------------

Note that the mean of the length/width ratios (3) differs from the ratio of the length and width means (8 / 2.5 = 3.2). This is also discussed in the SDD data challenge Repeated observations of spore measurements. (Note: Bob thinks that there is a method to calculate the mean of ratios if only the means of the numerator and denominator are known; this would be interesting for anybody implementing such data analysis.)
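
The difference between the two quantities can be checked with a few lines of Python, using the two measurements from the table above (variable names are chosen for this sketch only):

```python
from statistics import mean

lengths = [12, 4]
widths = [3, 2]

# Mean of the individual length/width ratios: (4 + 2) / 2
mean_of_ratios = mean(l / w for l, w in zip(lengths, widths))

# Ratio of the means: 8 / 2.5
ratio_of_means = mean(lengths) / mean(widths)
```

The first value is 3.0 and the second 3.2, so a ratio character cannot simply be derived from the stored means of its two components.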

One way of coding this that was developed during the discussion would be:

<array xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dimensions>
    <dimension name="x" type="temp"/>
    <dimension name="y" type="time"/>
    <dimension name="z" type="medium"/>
  </dimensions>
  <values>
    <x>15
      <y>7
        <z keyref="OA">
          <numericaldata>
            <mean>15</mean>
            <samplesize>15</samplesize>
          </numericaldata>
        </z>
        <z keyref="MA">
          <numericaldata>
            <mean>15</mean>
            <standarddeviation>2.3</standarddeviation>
          </numericaldata>
        </z>
      </y>
      <y>14
        <z>30</z>
        <z>80</z>
        <z xsi:nil="true"/>
      </y>
    </x>
  </values>
</array>

The most basic question here is, whether to support these as explicit n-dimensional array types (i.e. allowing a character to have such a type), or as a secondary arrangement-view of flat characters (i.e. each character would be numerical in the case of the growth diameter)?

n-dimensional arrays would be specified in the item description rather than in the terminology. Advantage: logical arrangement, possibility to enter 15 days rather than 14 days without setting up a new character. Disadvantage: typing OAA instead of OA would not be constrained by terminology. Some information is cell specific, e. g. identification reliability, and for some purposes the processor needs to define a display-label for cells.

Would it be difficult to validate the arrangement or could different items have erroneous arrangements? Probably yes.

Result of the discussion on array data: the current preference is to support an arrangement facility in the terminology section (in line with the character grouping/character tree facility).
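
A minimal Python sketch of this option, a secondary arrangement view over flat numerical characters, is given here. The character keys (c1 to c6) and the dictionary layout are hypothetical; the values are taken from the 15 °C rows of the growth table above.

```python
# Hypothetical terminology: one flat numeric character per cell,
# arranged by (temperature in °C, days, medium).
ARRANGEMENT = {
    (15, 7, "OA"): "c1",
    (15, 7, "MA"): "c2",
    (15, 7, "SNA"): "c6",
    (15, 14, "OA"): "c3",
    (15, 14, "MA"): "c4",
    (15, 14, "SNA"): "c5",
}

# Item description: growth diameters in mm, scored per flat character.
# c6 (SNA after 7 days) was not tested and is therefore absent.
item_scores = {"c1": 8, "c2": 10, "c3": 18, "c4": 21, "c5": 6}

def cell(temp_c, days, medium):
    """Look up one cell of the array view; None means 'not scored'."""
    # A KeyError here means the cell is not defined in the terminology.
    char_id = ARRANGEMENT[(temp_c, days, medium)]
    return item_scores.get(char_id)
```

Because the arrangement lives in the terminology, a typing error such as "OAA" is rejected, at the price that a measurement after 15 instead of 14 days requires a new character.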


2. Character hierarchies and character decomposition

2.1 Revision of concepts proposed in Australia

(see also the minutes of the working sessions in Australia, 11-14. March 2002)

The same data should be presentable under different concepts:

The part hierarchy is the most frequent starting hierarchy. However, many examples exist, especially in small organisms with little structure, where the part hierarchy is rather meaningless. Even in higher plants a description of chemical assays performed at different parts of a plant would be more sensible to analyze under a methodological view, than under a part hierarchy.

We discussed whether certain hierarchies should be special as being "defining hierarchies". Can any character be fully decomposed into part, method and basic property? The result was that this needs at least one additional dimension. The three main decomposition hierarchies (part, method, and basic properties) are not sufficient to fully define a character (i.e. characters are not necessarily unique in their combination of the values of these 3 hierarchies). Examples for which this is the case are:

2.2 Is autogeneration of character names possible?

The discussion then tried to determine whether such an extra, auxiliary "decomposition" dimension would be useful. One criterion is whether it is possible to avoid having to define flat-list character labels in addition to the tree node labels in hierarchies or groupings. The flat list labels are required to allow, for example, a display of characters ranked/ordered by usefulness for continuing the next identification step (as used in the Intkey "BEST" command).

Could a character satisfactorily inherit all the labels from the "defining" hierarchies?

- plant
  - leaf
    - surface 
       - color (charid =10) // [full flat character name:] "leaf surface color"
                            // [generated character name:] "plant leaf surface color"
                                                           "couleur de la surface de la feuille de la plante"
                                                           "Farbe der Blattfläche der Pflanze"
       - color (charid =11) // [full flat character name:] "leaf surface color (defined by color standard)"
                            // [generated character name:] "plant leaf surface color standards"
    - edge
      - roughness (character id=2) // [full flat character name:] "leaf edge roughness (as seen by hand lens)"
         o smooth
         o rough
      - roughness (character id=3) // [full flat character name:] "leaf edge roughness (as observed by SEM)"
         o smooth
         o rough

Conclusion: Yes, it is possible to generate a unique character label either from a preferred hierarchy (adding a character-specific suffix label) or from a set of defining hierarchies (part, method, basic property + x) which in themselves guarantee a unique label. However, these labels are not likely to be informative and truly appropriate in different languages. A builder application could well decide to create flat character list labels on the fly as the user creates them in tree view, using some algorithm. It would be important, however, to allow the editor of the terminology to redefine these labels.
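
One conceivable algorithm for such on-the-fly labels, sketched in Python: join the node labels along the preferred hierarchy, append a character-specific suffix where needed, and let the terminology editor override the result. The data layout and function name are assumptions for this sketch; the example paths and suffixes follow the tree above.

```python
def generated_label(path, suffix=None):
    """Join the hierarchy node labels; add a character-specific suffix if given."""
    label = " ".join(path)
    return f"{label} ({suffix})" if suffix else label

# character id -> (path in the preferred hierarchy, optional suffix)
characters = {
    10: (["plant", "leaf", "surface", "color"], None),
    11: (["plant", "leaf", "surface", "color"], "defined by color standard"),
    2: (["plant", "leaf", "edge", "roughness"], "as seen by hand lens"),
    3: (["plant", "leaf", "edge", "roughness"], "as observed by SEM"),
}

generated = {cid: generated_label(p, s) for cid, (p, s) in characters.items()}

# The editor of the terminology may redefine any generated label.
overrides = {10: "leaf surface color"}
flat_labels = {cid: overrides.get(cid, label) for cid, label in generated.items()}
```

Without the suffixes, characters 10 and 11 (and 2 and 3) would receive identical labels, which is why the path alone is not sufficient.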

2.3 Displaying hierarchy from multiple character groupings in a single combined tree view

Two characters:
  id=2: leaf edge serration (as seen through hand lens)
  id=3: leaf edge serration (as observed with SEM)

Tree based only on part hierarchy:

- plant
  - leaf
    - edge
      - roughness (character id=2)
      - roughness (character id=3)

This is not very appropriate. The processor should get a differentiation from the methodological tree, joining the two trees at the end:

- plant
  - leaf
    - edge
      - hand lens characters 
        - roughness (character id=2)
      - SEM observation
        - roughness (character id=3)
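
How a processor might build such a combined view can be sketched in Python as follows. The record layout (part path, method node, property label per character) is an assumption for this sketch; the labels are those of the two trees above.

```python
characters = [
    {"id": 2, "part": ["plant", "leaf", "edge"],
     "method": "hand lens characters", "property": "roughness"},
    {"id": 3, "part": ["plant", "leaf", "edge"],
     "method": "SEM observation", "property": "roughness"},
]

def combined_tree(chars):
    """Nest each character under its part path, then under its method node."""
    tree = {}
    for c in chars:
        node = tree
        for label in c["part"] + [c["method"]]:
            node = node.setdefault(label, {})
        node[f'{c["property"]} (character id={c["id"]})'] = {}
    return tree

def render(tree, depth=0):
    """Return the tree as indented text lines."""
    lines = []
    for label, children in tree.items():
        lines.append("  " * depth + "- " + label)
        lines.extend(render(children, depth + 1))
    return lines

lines = render(combined_tree(characters))
```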

2.4 Reuse of part hierarchy and natural language wordings

The part hierarchy is often the primary hierarchy and rather complicated to define, so the designer does not want to do it more than once. It is therefore desirable to have wordings for different purposes within a single hierarchy. Should the same kind of keyref/key mechanisms be used, but with predefined key tokens?

Similarly, the natural language wordings are difficult to define. Although it is desirable to allow multiple definitions within a project, it would be desirable to be able to reuse them. No solution was proposed.

2.5 Sequences of characters

The sequence of characters is important for various presentation issues: Reporting of terminology, creation of dynamic editing interfaces, or generation of natural language descriptions. DELTA supports only a single sequence of characters, and DeltaAccess does the same. Users of DeltaAccess have repeatedly asked for an option to define a different sequence for a particular purpose, which is currently not possible. SDD proposed in the meeting in Australia to use the character grouping as the place to store character sequences. This decision was confirmed during this meeting. Furthermore, it was decided that the sequence of characters in the flat character list should be considered irrelevant. Providing sequences in both places would lead to difficult decisions during export or re-import that can easily be avoided. The character grouping feature should be the only place to express the sequence of characters.

During importing, the state, item, and character group sequence (sequence of instance data in the xml document) should be considered relevant, whereas the character sequence should be considered irrelevant. This implication needs to be stated separately, since it cannot be expressed in xml-schema (xml-sequence or -choice are syntax of elements, not of repeated occurrence of the same element!). In a relational database the sequence of records is by definition not maintained by the database engine and can be optimized at any time. Therefore, a database implementation has to create separate attributes for the relevant sequences, which will not appear in the SDD xml-schema.

2.6 Sequences of states

Do we need more than one sequence for the output of states, for one purpose states a - b - c, for another purpose b - c - a?

First: more than one order-by definition in terminology. Example (by Kevin): for geographical distribution, a report may be ordered either alphabetically or by geographical sequence. The importance of providing this feature could not be assessed. The point should be rediscussed once the picture of ordering options in character groups is clearer; perhaps state orderings in trees come in without much extra cost...

(Editor's Note: The topic of state ordering is related to a discussion about the exclusion/inclusion of characters and states in subsets. Subsets are currently defined through character groupings, which define both the set of characters included and an arrangement (including sequence and hierarchy of characters). If states are subject to subset definitions, it may be logical to define state sequences in the hierarchy as well. This may be a more consistent place anyway, but it is the same dilemma as with characters, insofar as a default sequence should have to be defined only once, not forcing the designer to constantly groom all multiple hierarchies...) Please read "Character and state subsets" dealing with this question; we urgently need a discussion about this!

Second: one order-by definition in terminology, but different ordering in specific descriptions. One object description is "leaves serrate or dentate", another object description "leaves dentate or serrate". This feature is present in standard DELTA, which always defines the state sequence in each individual item. This is, however, disadvantageous if the state terminology is redefined and a new default ordering is defined. Conclusion: An optional state sequence order may be provided in SDD, but no definite decision was taken. [Later addition: See "Scoring sequence of states in descriptions" for a proposal regarding this point.]


3. Multilingual features: language and expertise

The point of language and expertise markup was discussed extensively. Picking up from the discussions in March 2002, where it was concluded that language and expertise variants should NOT be limited by the SDD standard, it was realized that in fact both language and expertise level must be machine-readable to allow interoperable applications. If, for example, the definition of the expertise level is accessible only to human readers of the same culture, it is difficult to develop generalized applications.

3.1 Language

The xml:lang designation already includes language variants like Swiss German or US-English. We agreed that this should be an enumeration based on the value domain of the xml:lang attribute, e. g.:
   de
   fr
   en
and optionally cultures and language dialects like:
   en-us
   en-uk
   en-au.

3.2 Expertise Level

The expertise level, however, has no accepted standardization. As a minimum requirement it was discussed that the ordinality and end points of an expertise scale need to be defined. Jim Croft proposed to define expertise level from 0..1 with an unlimited number of possible values in between. As a disadvantage it was realized that although this allows great freedom of expression, it also places a heavy burden on the designer in requiring him or her to pick an exact value with few guidelines to do so. Furthermore, the decision would have to be made for every expertise-dependent language element, e. g. each human readable character, state, modifier, etc. label in the terminology.

The design would be greatly simplified by having fixed, named categories to which language elements are assigned (or for which language elements are created with a given expertise in mind). One possible set of categories was: Untrained, Trained, Advanced-Student, Expert.

It was concluded that between 3 and 5 fixed categories for expertise were generally sufficient for the purpose of interoperability, provided that the designer of the terminology can further differentiate its purposes in a human readable way. It was therefore concluded that the design should:

3.3 Further differentiation

The last point, avoiding limitations, was crucial in the decision of the previous meeting not to enumerate possible language and expertise levels. If an expertise level "general public" has been designed, it should still be possible to define more than one general public linguistic set for US-English, perhaps one for East-coast farmers and another for West-coast city dwellers. In some places you may use the term "thorns", in other places "sticker". Or one audience could be "medical personnel" identifying poisonous plants, another audience "farmer" identifying weeds.

If language dependent elements are defined only through language, the selection could be fully supported by the application and each wording, label etc. could be identified by two attributes (xml:lang and ExpertiseLevel). However, with the introduction of freely definable multiple instances for a single language (including culture and dialects), a mechanism is required to identify these definitions uniquely and to provide human readable unique labels in multiple languages.

One suggestion for the combination was "applicability" = which language/idiom to choose for which audience. There was however strong opposition and it was felt that applicability is not intuitively understood. The workgroup agreed instead to call the combination of language, expertise level, and label "AudienceDefinition".

3.4 Central audience definitions

The existing solutions known to the meeting are all limited to supporting the xml:lang attribute alone (e. g. the xml schema annotation elements). From the standpoint of xml processing, it would be advantageous to have the xml:lang directly in each language dependent xml element. However, two points led to this not being adopted:

First, any application or generator would probably centralize the definitions anyhow. Asking the biologist who is creating biological descriptions to choose, for each wording (e. g. an annotation of an item description element), from a language pick list, an expertise level pick list and yet another differentiating pick list calls for data entry errors and would be considered tedious by the designing biologist. Also, the differentiation by label is hierarchically dependent on the other selections.

Secondly, it was concluded that in most cases the language dependent elements should be treated together in a language dependent container, marked with the language or audience definition and containing elements like Label, Abbreviation, GlossaryEntry, etc. Modeling the language definition in the container would already run against standard expectations "xml:lang present on element itself".

If the three attributes were treated like independent data elements where every combination is possible, it would force SDD conforming applications to allow this and create problems in recreating the desirable centralized audience combinations. Therefore audience definitions must carry a human readable label, an xml:lang attribute and an ExpertiseLevel category, and the key of each definition is used in the remaining xml documents to differentiate between different audiences. This was agreed by all those present at the discussion.

A separate section called AudienceDefinitions at the start of the SDD terminology was reserved for the purpose of defining audiences. Language and expertise level are controlled enumerations, but multiple audiences may have the same language and expertise level, differentiated only by their freely definable additional label.
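
The agreed structure can be pictured with the following Python sketch. The class and field names, the keys, and the example labels are illustrative assumptions; the four expertise categories are the "one possible set" mentioned above, not a decided enumeration.

```python
from dataclasses import dataclass
from enum import Enum

class ExpertiseLevel(Enum):
    # One possible set of fixed categories (not a final decision).
    UNTRAINED = 1
    TRAINED = 2
    ADVANCED_STUDENT = 3
    EXPERT = 4

@dataclass(frozen=True)
class AudienceDefinition:
    key: str                   # referenced from the rest of the document
    lang: str                  # value of xml:lang, controlled enumeration
    expertise: ExpertiseLevel  # controlled enumeration
    label: str                 # freely definable, human readable

audiences = {a.key: a for a in (
    AudienceDefinition("aud1", "en-us", ExpertiseLevel.UNTRAINED, "East-coast farmers"),
    AudienceDefinition("aud2", "en-us", ExpertiseLevel.UNTRAINED, "West-coast city dwellers"),
)}

# Language dependent wordings refer only to the audience key.
state_labels = {("state7", "aud1"): "thorns", ("state7", "aud2"): "sticker"}

def label_for(state_key, audience_key):
    if audience_key not in audiences:
        raise KeyError(f"undefined audience {audience_key}")
    return state_labels[(state_key, audience_key)]
```

Both audiences share language and expertise level and differ only in their label, which is the case that two independent attributes on each wording could not express.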

If for processing purposes it is considered important, derived informational attributes (e. g. LanguageCache and ExpertiseLevelCache) could be added to each language dependent element. These attributes could be skipped by any conforming data reader, which would only be required to respect the information in the audience definitions, thus preventing problems with inconsistent markup.

Note: In the discussion itself no final name was found for language set collections and individual containers. Initially they were called "Internationalization" and "Language", which nobody was, however, very happy with. In fact, language, culture and expertise have very little to do with Nations, and the "Language" container does not contain a language. In later post meeting discussions between Bryan, Gregor, Bob and Kevin the terms "LinguisticSets" and "LinguisticSet" were preferred. Please propose more adequate names if you know any!

3.5 Position of natural language wording definitions

Kevin suggested moving natural language wording entirely from the character definition into the tree character nodes! This opens the option to actually store multiple different natural language definitions. Previously the SDD model kept natural language in the flat character list.

Note that in the current SDD proposal, the tree is one form of character grouping and the only place where the sequence of characters can be defined. Character sequence and natural language wordings are highly interdependent. The proposal to remove all natural language attributes from the flat character list and move it into character grouping was accepted.


4. Exclusive vs. multistate states

This applies to categorical and occasionally to numerical characters. The important distinction is, whether a character can have multiple states concurrently in a single organism or only in a population / a set of objects classified together. Example:

Currently there is no mechanism in SDD to distinguish these cases in the item description. No decision could be taken on whether it would be worthwhile to provide a data element that allows this distinction to be made. Perhaps it would be more useful at the item x char. level than in the character definition (= terminology)?

The question remained open, where to distinguish these cases. A restriction on the terminology level, defining a character as having only 0 or 1 states almost always fails, at least if classes like genus descriptions are to be covered as well. Also, even with within-organism multistates, it is likely that the more species are treated in a descriptive data project, the more constraints have to be lifted because at least a single organism is found where states can co-occur.


5. Multistate in numerical characters

It is possible to have multiple measurements of the same variable in objects from the object or class currently being described.

Current DELTA allows this, and offers no method to prevent any numeric character from being used as a potentially multi-valued set of measurements (note also that the distinction between Integer and Real numeric, which can be defined in the character definition, has no consequences):

  IN: 1/3/5    IN: (2-) 2.5-3-3.6 / (2-) 3.5-4-6 (-9)
  RN: 1/3/5    RN: (2-) 2.5-3-3.6 / (2-) 3.5-4-6 (-9)

The DELTA numeric multistate syntax is not supported by DeltaAccess, since the intention of DeltaAccess was to provide statistical and numerical analyses, which are very difficult to run with undefined numeric multistates. The most important requirement would be that some semantic information is provided on whether the repeated values refer to repeated measurements of the same variable or to differentiable entities which are only closely related. Examples:

In the first case statistical measures like mean or ranges can easily be calculated and used as the basis for further analysis or data retrieval (original repeated observations can be stored separately). However, in the second case it is very important to understand what the meaning of the first, second, etc. part is, otherwise statistical analysis becomes impossible.

Several solutions to store data about differentiable entities in the SDD standard are conceivable, none of which is fully satisfactory:

A special case of the differentiable entity problem arises when the original observation is undifferentiable (length of stem leaves), but a discontinuous range results as soon as data are collated (e. g. when generating a genus description (= class of things) from multiple species descriptions). For example, leaf length may be "2-5 or 9-11 cm". The discontinuity may be very useful for identification purposes! Note: DELTA can code the situation as "2-5/9-11" with discontinuous ranges. However, testing seems to reveal that at least Intkey then in fact generalizes the range during import to "2-11".
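
The difference between the two behaviours can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the second function merely mimics the generalization that testing seemed to show for Intkey; it is not Intkey's actual code.

```python
def collate_ranges(ranges):
    """Merge overlapping (low, high) ranges but keep the gaps between them."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1]:
            # Overlaps the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


def generalize_ranges(ranges):
    """Collapse all ranges into one envelope, losing any discontinuity."""
    return (min(low for low, _ in ranges), max(high for _, high in ranges))


# Leaf lengths (cm) of three species collated into a genus description.
species_ranges = [(2, 4), (9, 11), (3, 5)]
genus_discontinuous = collate_ranges(species_ranges)
genus_generalized = generalize_ranges(species_ranges)
```

With these inputs the first function keeps "2-5 or 9-11", whereas the second yields "2-11" and so can no longer exclude a specimen with leaves of 7 cm.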

Another problem:

Example where the use of multiple characters is especially unsatisfactory for identification: Rust fungi have up to 5 different spore types in a single taxon. Each spore type is differentiable with detailed study and knowledge about the life-cycle of the specific rust (requiring an identification). In the description the 5 measurements are usually kept separate and require multiple characters. A numeric multistate would not be a good solution, because it would not allow defining the required different names for each spore type. However, during identification even a trained user will often find it difficult or impossible to decide which spore type of an unknown rust fungus is at hand. One solution for this would be a mapping of multiple characters to a "fuzzy character" during identification.


Wednesday, 16. October 2002

Participants

Alexandre Marino (CRIA, Brazil)
Arthur Chapman (Australia & CRIA, Brazil)
Bob Morris (Univ. Mass., USA)
Bryan Heidorn (Univ of Illinois, USA)
Donald Hobern (GBIF program officer, Denmark)
Douglas Holland (Missouri Botanical Garden, St Louis, MO, USA)
Greg Whitbread (Australian National Herbarium, Canberra, Australia)
Gregor Hagedorn (BBA, Germany)
Guillaume Rousse (Paris 6, France)
Humberto Navarro de Mesquita Je (CRIA, Brazil)
Ingrid Koch (CRIA Unicamp, Brazil)
Jim Croft (Australian National Herbarium, Canberra, Australia)
Johan Duflost (Belgium Biodiversity Information, Belgium)
Junko Shimura (National Institute for Environmental Studies, Tsukuba, Japan)
Karen Wilson (Royal Botanic Gardens, Sydney, Australia)
Kevin Thiele (CPITT/LucID, Australia)
Marinez Ferreisa de Sigueira (CRIA, Brazil)
Mihail Carausu (DanBiF, Denmark)
Nicolas Bailly (MNHN, Paris, France)
Patricia Mergen (Belgium Biodiversity Information, Belgium)
Ricardo Scachetti Pereira (CRIA, Brazil)
Renato de Giovanni (CRIA, Brazil)
Rafael Luis Fonseca (CRIA, Brazil)
Sidnei de Sousa (CRIA, Brazil)

Agenda

1. OpenKey presentation

The OpenKey presentation by Bryan Heidorn is available as a PowerPoint file. (Note: See also his other talk given at TDWG on Sunday, 20th. Oct.: Search Features over Semi-Structured Taxonomic Documents.)

Comparison of his work with the March 2002 preliminary SDD model. Some differences:
Richer taxon information, like ABCD
For states:
 - SynonymousTerm
 - BroaderTerm
 - NarrowerTerm
 - RelatedTerm

Kevin: synonymy is transitive, which is not necessarily appropriate! Example:
- ovate (2-dimensional)
is synonymous to:
- egg-shaped
which is synonymous to:
- ovoid (3-dimensional)
which, however, is not synonymous to:
- ovate (2-dimensional)!

Gregor: For general purposes (including translating specific expert terminology into broader terms for public identification interfaces) a specific mapping of states to multiple states is desirable. The mapping should be n:m to also allow mapping of states that are frequently misunderstood, improving identification error tolerance. This mapping would be directional, and allow only one step (i. e. the search engine does not automatically search for transitive states). Would this solve the problem? Discussion will be continued when state mapping is discussed at a later meeting.
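
The contrast between a one-step directional mapping and transitive synonymy can be sketched as follows. The mapping table is hypothetical and built only from Kevin's example; no such structure exists in the SDD schema yet.

```python
# Hypothetical directional mapping, built from the example above:
# each state lists the states it should additionally match.
STATE_MAPPING = {
    "ovate": {"egg-shaped"},
    "egg-shaped": {"ovate", "ovoid"},
    "ovoid": {"egg-shaped"},
}


def expand_one_step(state, mapping):
    """Return the state plus its directly mapped states (one step only)."""
    return {state} | mapping.get(state, set())


def expand_transitively(state, mapping):
    """Return everything reachable by following mappings repeatedly."""
    reached, pending = set(), [state]
    while pending:
        current = pending.pop()
        if current not in reached:
            reached.add(current)
            pending.extend(mapping.get(current, ()))
    return reached


one_step = expand_one_step("ovate", STATE_MAPPING)
transitive = expand_transitively("ovate", STATE_MAPPING)
```

A search for "ovate" with the one-step expansion also matches "egg-shaped" but not "ovoid", while the transitive expansion reaches "ovoid" and thus reproduces the problem Kevin describes.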


2. Character attributes to improve identification

Which additional data elements are necessary or useful to guide identification programs in their choice of the best character to answer next?

2.1 Reports about some identification programs

CSIRO DELTA
LucID 3 (reported by Kevin)
XPER (reported by Nicolas)
DeltaAccess (reported by Gregor)

2.2 Which attributes to use?

Discussion about which attributes are desirable to cover additional information about characters. There was a general consensus that an unspecified "weight" should be avoided and that each attribute should be based on a definition independent of the methodology in a specific application. It should be possible to obtain the values in a "questionnaire" style from biologists unacquainted with any actual processing program.

Terms discussed were: IdentificationReliability / Ambiguousness / ObservationRepeatability / Ease of use during determination. No conclusions could be reached. Consensus exists that "DiscriminativePower" should not be stated, but rather calculated in an algorithm similar to CSIRO-DELTA's "BEST" algorithm. Note: the interpretability score of a character state as being possibly misinterpreted is related (handled through modifiers).

Donald Hobern: 3 things exist:

Introduce a "UsabilityRanking" of all characters? Or "Confidence": how much you can trust the character (this is probably a summary statement!)

Problem of tags versus characters (LucID): Could this be solved by introducing some concept of "hidden" management characters? Characters could be excluded from interactive key or natural language report. In most cases this can be achieved by omitting them from the character grouping used for this purpose, but the 100 worst weeds may pop up in interactive identification if a BEST next character search (which is independent of the tree) is performed. No conclusion found.

A character may be difficult to observe in the field with a hand lens (but still valuable), but very easy to observe in the laboratory with a stereo microscope. Separately perhaps: IdentificationAccessibilityInField / IdentificationAccessibilityInLaboratory?

One problem is the ability of a user of an identification package to make the same observation as the builder of the item description data. Rather than reliability the concept should perhaps be named RepeatabilityEstimate? This is taxon dependent in most cases. Example "Number of spines somewhere on a fish": "3 large spines" is reliable and highly repeatable, but "27 tiny spines" is easily misinterpreted.

Repeatability also depends on ExpertiseLevel

Scope: set of taxa to which this reliability figure pertains? Global or subset of taxa! Problem: Reliability is a global feature of the entire character, i. e. it applies to all items! Perhaps Reliability needs to be defined together with item groups?

Gregor: scope problem can perhaps be alternatively solved by scoping to project, but providing the planned mechanism to inherit characters/state definition from a global master project, but overwriting repeatability etc. for the current scope.

2.3 Conclusions about character identification attributes

... Decisions need to be postponed until a general item scoping mechanism has been defined.

Note: The scoping mechanism should not tie the terminology section to a given set of item descriptions, i. e. if a character scope is defined only for a subset of items, this should be declared primarily in the item descriptions, not in the character definition. This is necessary to enable architectures where a central terminology is used by item descriptions in different places, possibly in multiple projects. At least no project-specific item identifiers (like an item key or ID) should be present in the terminology. Generic item identifiers (genera, family names) would be possible however, as long as they automatically apply to all item descriptions under different management.

See Ratings for identification suitability for an ongoing discussion of the material that was first discussed in Brazil.


3. ItemAbundance

Similar to the character information discussed above, some additional information may be added to items. The only data item identified in the discussion (besides the character scoring, which is the base of any item identification) is the Item abundance, a statement about the relevance, weight, etc. of an item relative to other items.

The element is present in the DELTA ITEM ABUNDANCES directive and the ItemAbundance attribute in DeltaAccess. LucID does not support this. In DELTA the values have no semantic definition, but are defined based on the results obtained in the CSIRO programs. From the DELTA user guide, 4.10 (1999): "This directive specifies the abundances or weights of the items. The interpretation of the abundances or weights depends on the program for which they are intended. Generally, items with high abundances or weights will be given emphasis in some way - for example, they tend to come out early in keys."

XPER also uses frequency of item in an area, which is similar to DELTA's item abundance statement.

Gregor: It is generally questionable, whether the most frequent species should appear first in a given key. Even in keys or interactive identification aimed at the general public the most frequent organisms will be rarely sought, since they are well known to experts and amateurs alike. If anything, item abundances are useful to put really rare species more towards the end of the key. Everything else should probably be treated identically.

Is sufficient frequency information available at all? Kevin: Often not, but e. g. in the case of medical diseases it is known to a great degree of precision.

The usefulness of Item Abundances strongly depends on the area scope. They become increasingly less useful as data sets are becoming global. Item Abundances can therefore be modeled more appropriately as a distribution character with a frequency modifier:

Character distribution:
  Germany: frequently present
  France: rarely present
  UK: absent

Note that the SDD frequency model supports verbal statements with an associated frequency range as well as exact statements of measured frequency (compare Frequency modifiers). This should provide for any functionality that can be expected from a separate ItemAbundance data element.

Conclusion: Support of ItemAbundance is rejected for the SDD model. See Discussion of Item Abundances for a separate documentation of the conclusion. The document repeats the arguments summarized above.


4. Character groupings

4.1 Groupings types

Besides the "typification" (TYPE element in the March 2002 SDD schema) of character groupings (especially trees), character groupings may also be audience specific. One hierarchy may be intended primarily for school-children, another for farmers. Unfortunately, whereas in the first example the tree usually depends only on expertise, but not on language, the example of the farmer is language specific (the linguistic register and concepts specific to farming may be used in the tree).

Proposal Gregor: separate three different issues: a single character grouping type (method, parts, basic property groupings), character grouping purposes (interactive identification, guided key-builder, natural language report, default user interface, etc.) and audience tagging. For audiences, use audience definitions instead of types in the character groupings. Advantage: Character groupings can be defined for a specific audience, e. g. a subset definition for school children, using the standard audience mechanism.

Action in schema: 1. The Types element with repeated Type elements was eliminated; there is now only a single "type" attribute of CharacterGroupDefinition, specifying an Is-A statement about the character grouping. The enumeration for type was not discussed in Brazil. Note Gregor: I added a restriction to six types ("BasicProperties", "ObservationMethodHierarchy", "PartHierarchy", "FlatSubsets", "UserDefinedHierarchy", and "CharacterArray") to the straw man schema, but the issue of types needs to be discussed in the group at the next meeting!

2. A Purposes/Purpose element combination was added to the schema. The number of purposes is restricted by the schema, see the schema documentation. Note that the purposes mechanism is exclusively intended for interoperability, so that user-definable extensions have no value. Extensions required by application developers should be communicated to SDD and will be incorporated into future versions of the standard.

3. The attribute "default" (identifying the CharacterGroupDefinition as a "DefaultGrouping") present in the March 2002 SDD schema was dropped in favor of multiple, differentiated default purposes. The value of the purpose annotation is in fact restricted to declaring defaults and no mechanism exists any longer to declare another hierarchy as intended for natural language reporting, but not as default.

Note Gregor: are we confusing the issue of default and purpose here? On the one hand it may be useful to select any character grouping for a reporting or editing purpose, on the other hand the designer may want to declare several hierarchies as being designed and tested for natural language reporting, or interactive identification. This issue probably needs revision at the next meeting!

4.2 Name for "CharacterGrouping"

Revision of the decision to call the character arrangement facility (responsible for selecting character subsets, arranging the sequence of characters, and optionally displaying characters as trees or in 2-dimensional tables) "CharacterGroups":

Other options are CharacterViews, CharacterCollection, CharacterSet. Although a set can be ordered, the term "set" in general implies a non-ordered arrangement. "Collection" would be appropriate, since in object-oriented modeling it specifically includes ordered sets and trees. Gregor prefers views, but no one else does. Decision: leave it at "CharacterGroups".

CharacterGroupItems do not need the name attributes present in the previous model. Instead each needs label elements in multiple languages!

4.3 CharacterSubsets

Clarification: As decided in Canberra, character groupings should be usable both as flat definitions and hierarchical character arrangements. Flat character group definitions are created exclusively to define a character subset (a filter to display only selected characters). However, it is desirable that applications allow the use of any character grouping as a filter. If the user selects "root" from the part-hierarchy of a plant, it should display any character that is present in the tree starting at "root". The procedure to collect all characters from a tree into a set to be used for filtering is considered relatively simple and easy to implement.
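
The collection procedure mentioned above could look roughly like the following Python sketch. The GroupNode class and the example hierarchy are assumptions made for illustration and do not mirror the element names of the schema.

```python
from dataclasses import dataclass, field


@dataclass
class GroupNode:
    """Hypothetical node of a character grouping tree."""
    label: str
    characters: list = field(default_factory=list)
    children: list = field(default_factory=list)


def find_node(node, label):
    """Depth-first search for the node carrying the given label."""
    if node.label == label:
        return node
    for child in node.children:
        found = find_node(child, label)
        if found is not None:
            return found
    return None


def collect_characters(node):
    """Gather all characters at or below a node into a flat filter set."""
    collected = set(node.characters)
    for child in node.children:
        collected |= collect_characters(child)
    return collected


plant = GroupNode("plant", children=[
    GroupNode("root", ["root type"],
              [GroupNode("root hairs", ["root hair density"])]),
    GroupNode("leaf", ["leaf length", "leaf shape"]),
])
root_filter = collect_characters(find_node(plant, "root"))
```

Selecting "root" then yields a filter containing the characters of the root node and of all nodes below it, which an application can treat exactly like a flat character subset.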

Idea from Donald Hobern: Two kinds of subsets may be desirable.

An example for the latter would be a key to pathogens, that also includes non-pathogenic species which may have been initially mistaken for a pathogen. The identification quality will generally be better, if leads to these potential identification targets outside the original scope of interest are offered.


5. Single valued numeric characters

(See also the short discussion on Tuesday above.)

Numerical values in descriptions have both a data type (e. g. integer or real numeric) and a property defining their "statistical semantics". Examples are: "mean", "sample size", "minimum", "standard deviation", "5%-confidence interval". The definition of a numeric state within a character adds a further semantic level, which is very similar to the case of categorical states. However, the generic statistical semantics need to be expressed separately.

First a discussion was held about an appropriate name subsuming the examples given above. Proposals were:

The "Encyclopedia Britannica" uses the term "statistical measures" for "mean", "standard deviation", etc. The measure concept does not fully apply to single values or sample size, but the similarity was considered sufficient.

Note: Confidence intervals or percentiles require two values to be expressed. These are treated as two separate measures in the SDD model and it is up to the application to handle the relationship where necessary.

Decision to set up the MeasureDefinitions section in the Terminology section, providing a global declaration of available statistical measures. For each character 0-n measures can then be enabled, restricting the availability to selected ones. In contrast to DELTA, SDD will be able to constrain the use of measures to, e. g., only minimum and maximum.

This could look roughly like:
character...
  <Measures>
    <Measure key="00001" keyref="mean"/>
  </Measures>
character...

Within the item description, the measure is then identified through a keyref to the key of the definition within the character (e. g. keyref="00001"), not towards the key of the global definition (e. g. keyref="mean").
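
The two-step reference can be sketched in Python as follows. The dictionaries stand in for the MeasureDefinitions section and the per-character Measures list; the keys and labels other than "00001" and "mean" are invented for illustration.

```python
# Global MeasureDefinitions of the terminology (keys are illustrative).
MEASURE_DEFINITIONS = {"mean": "Mean", "min": "Minimum", "max": "Maximum"}

# Measures enabled for one character: local key -> key of the global definition.
CHARACTER_MEASURES = {"00001": "mean", "00002": "min"}


def resolve_measure(local_keyref, character_measures, measure_definitions):
    """Resolve an item-level keyref via the character to the global measure."""
    if local_keyref not in character_measures:
        raise KeyError(f"measure {local_keyref!r} is not enabled for this character")
    return measure_definitions[character_measures[local_keyref]]


label = resolve_measure("00001", CHARACTER_MEASURES, MEASURE_DEFINITIONS)
```

Because the item description can only name keys that the character has enabled, a measure such as "max" that is globally defined but not enabled for this character cannot be used, which is the constraint that DELTA lacks.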

Editorial Note: During the discussions we assumed that the measure usage in the item description would look like: "<Measure keyref="00001">23.8</Measure>". However, this would only work with languages using the same number-formatting rules as English, since xml numbers must follow these rules. To allow other languages, and support the intended usage as a markup language, I consider it preferable to move the numeric value into an xml attribute. See "Xml representation of numeric values" for a short discussion.

Regarding the relation between the SDD standard and a terminology instance, the following options were discussed:

The last was favored, but deferred until xinclude is widely available to allow testing.

A basic problem is again that xml does not provide any mechanism to inherit content data from a standard list. The SDD group would like to predefine certain measures to maximize application interoperability, while still allowing the user to extend this list. Thus predefined and user-definable key values share the same domain referenced by keyref statements.

Discussion of how many measures to predefine in the standard and how much to rely on extensibility. Extensibility of measures should be planned, but has not been further discussed. Providing 1% steps in percentiles and confidence intervals would be possible, but would bloat the measure list and is probably unnecessary. A compromise was sought, and we tried to generalize this [Editor's note: the method of generalization was changed and improved in SDD schema versions Brazil, Paris, and Lisbon, see there.]


6. Procedural discussions

Bob suggests that we set up a subgroup on how to externalize/modularize parts of the standards

Bob proposes administrative structure:
- open issues
- current practices
- desired practices (things people want to do)
- observations

Open issues marked out and list what can be done without resolving them

Nobody volunteered to do that work. Gregor thinks that this task may be easier with a more mature draft than the current versions, when it is clearer which material has to be modularized/externalized.

Donald Hobern proposes an object model overview to help understanding the general concepts. Gregor will attempt to provide some general starter documentation for people new to the discussion. [@ Not yet done! @]


7. Ordinal categorical types (again)

Bob shows a solution for organizing character states: a solution for tree, cyclic and directed acyclic graph (the representation works with cyclic graphs as well). See also his proposal Character state graphs.

Consensus to represent everything with a graph; import routines in applications need cycle detection (an advisory attribute in the standard could be set erroneously, so the process has to do it anyway). Bob recommends an advisory attribute nevertheless.

(Note GH: Proposal accepted, but details of how and where to implement it in the terminology section of SDD not resolved. Therefore not yet present in the released "straw man" versions.)
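
The cycle detection required of import routines is a standard graph check. A minimal Python sketch using a depth-first search follows; the adjacency-dictionary representation is an assumption, since the representation of state graphs in SDD has not been decided.

```python
def has_cycle(graph):
    """Detect a directed cycle; graph maps each state to its successor states."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for successor in graph.get(node, ()):
            state = color.get(successor, WHITE)
            if state == GREY:
                return True  # back edge: successor is on the current path
            if state == WHITE and visit(successor):
                return True
        color[node] = BLACK
        return False

    return any(color.get(node, WHITE) == WHITE and visit(node)
               for node in list(graph))


tree = {"1": ["2"], "2": ["3", "4"]}
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

An application could run such a check on import and compare the result with the advisory attribute, reporting a mismatch instead of trusting the attribute.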


8. Discussion of state x item versus char x item matrix

Some current practices:

Things can partly be converted. Problems exist with special states (see [@ sorry, not yet finished @]...) that have character scope. Example: the single statement "unknown" or "not applicable" is in DELTA considered to make a statement about all states within a character.

A comparison of DELTA and LucID coding highlights some of the problems, especially when data already exist and new character states are introduced:

DELTA  LucID
16,1/2/3  11100
16,2/1/3  11100 * state order can not be expressed here!
16,? * DELTA problem?  11?00
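
The conversion of simple cases can be sketched in Python as below. This is a strong simplification made for illustration: it handles only plain state lists and the "U" (unknown) statement for a single character, and ignores comments, ranges and all other DELTA syntax. The function name is hypothetical.

```python
def delta_to_lucid(attribute, state_count):
    """Convert a simple DELTA state list such as '1/2/3' to a LucID-like score string."""
    if attribute == "U":
        # Character-level unknown becomes unknown for every state.
        return "?" * state_count
    present = {int(state) for state in attribute.split("/")}
    return "".join("1" if number in present else "0"
                   for number in range(1, state_count + 1))


first = delta_to_lucid("1/2/3", 5)
reordered = delta_to_lucid("2/1/3", 5)
```

Both calls give the same result, which shows the loss of state order noted in the table; the reverse conversion cannot recover it.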

What happens if new states are added to the definition?

DELTA  LucID
16,U  ?????
16,U remains unchanged if state 6 is added  LucID changes to ?????0 - should this be "??????"?
16,1  10000
16,1 remains unchanged if state 6 is added  100000 or 10000? Kevin is uncertain. Which is more appropriate?

Attempt to clarify (Gregor): Two kinds of "characters" exist. In some characters (flower color) if the character has ever been observed, new states can be added to definition and new states can be scored as absent.

In some characters (enzyme test with various enzymes, geographical distribution "pseudo-characters", etc.) the new enzyme should be scored as unknown! These characters are perhaps not "true" characters, but a collection of individual characters that are so closely related that they are considered a single character.

If this were somehow defined as a property of the character, what should happen in the case of new items? If a distribution character has 200 states and a single state is scored present, should all others change from unknown to absent? Perhaps:
  when character is "touched" for the first time
  200 color names -> change other states to "0"?
  200 enzymes -> leave other states at "?" ?

A character definition attribute for state-default could be: "default (ask first time)", "(unknown)", "0", or "1". Should this be defined a single time for all states within a character, or on a state-by-state level?

Note Kevin: LucID does not support a true "unknown" statement so far, new states are uncertain, not unknown! In current SDD model state-level probability can be expressed with a modifier, but state-level unknown is not possible!

Note on database implementations: The structural issue of state versus character matrix must be distinguished from implementation issues. Keeping information for each state may be easier to handle in certain situations and lead to faster responses to queries. However, this can be achieved both for a structural state or character matrix model. For example, in a relational database implementation it is possible to code at least one special state by not adding any state record to the character. For example, DeltaAccess adds explicit unknown ("U") state records if the character has been checked, but adds no state records to characters that have not yet been checked. This model simplifies refactoring of the character definition, but has drawbacks for interactive identification purposes. The reason is that the "not yet scored" should be treated as an implicit unknown state, and is for identification purposes equivalent to the explicit unknown state. However, in a relational database searching for missing matches (unscored characters) is a relatively slow (expensive) operation. It would however be possible to rewrite DeltaAccess so that states are inserted and managed rather than left out.

The discussion of state or character matrix touched again on the issue of how unique (locally/globally, if globally GUID model or URN model) the keys for states should be. This is especially important as soon as data are federated rather than shared, as in Bryan's example. The discussion on this was postponed until the next meeting and discussion documents should be prepared (a contribution by Gregor is now available: GUID Usage).


9. Combining different data types (categorical, numerical and text) within a character

Related to the problem of state and character matrix (and thus the question of "what is a character") is the question whether a character should possibly have more than one data type (this is a continuation of the discussion on Monday about data types).

Example 1: mix of ordered and unordered states in a character:

States may have two kinds of ordering defined in the terminology definition. Both are (partly) supported by DELTA: the sequential output definition for reporting purposes and the definition of an inherent ordering. The latter distinguishes the data types ordered categorical data (DELTA "OM") and unordered (i.e. nominal scale) categorical data (DELTA "UM"). The distances between ordered states are defined by the distance along the ordering graph, whereas all unordered states are equidistant.

The "orderedness" of a character could be defined on the character level, i.e. as a data type for all states within that character (as discussed in the session on Tuesday, see above). However, it may be desirable to mix ordered and unordered/nominal data in a single character. This would require an "ordered"-attribute on each state:

  1   ordered
  2   ordered
  3   ordered
  4   (unordered)
  5   (unordered)
  6   ordered

The calculation of distances is not overly difficult. It follows the rule that distances within the set of all unordered states and between any ordered and any unordered state are 1, and distances among ordered states are calculated like normal ordered states distances. The above would result in distances (1,2) = 1, (1,3) = 2, (1,4) = (1,5) = 1, (4,5) = 1, (1,6) = 3, (3,6) = 1, etc. and (1 < 2 < 3 < 6), (4 <> 5), (1 <> 4), etc.
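
The rule can be written down in a few lines of Python. This is a sketch only; the function name is hypothetical and the ordered states are passed as a list giving their sequence, with every other state treated as unordered.

```python
def mixed_distance(a, b, ordered):
    """Distance between two states when only some states are ordered.

    `ordered` lists the ordered states in sequence; every state not in it
    is treated as unordered (nominal).
    """
    if a == b:
        return 0
    if a in ordered and b in ordered:
        # Both ordered: distance along the ordering.
        return abs(ordered.index(a) - ordered.index(b))
    # At least one unordered state: always distance 1.
    return 1


ORDERED_STATES = [1, 2, 3, 6]  # states 4 and 5 are unordered
```

Applied to the example above, the function reproduces the listed distances, e. g. 3 for the pair (1,6) and 1 for the pair (4,5).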

However, no good real-world example was brought up in the discussion where a clear-cut separation between truly ordered and truly unordered states is present. Most examples are complex state sets where some states are known to be ordered and others are too insufficiently known to postulate an ordering hypothesis. These can be handled directly as generalized ordered graphs, since unordered states can be expressed as a graph where any state is connected to any other state:

     a--b
     |\/|
     |/\|
     c--d

Furthermore, to support the phylogenetic NEXUS data standard, which is one of the aims of the group, branched order state "trees":

       3
       |
     1-2-4

must be supported anyway, requiring a complex distance matrix or other means of defining the graph.
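
One possible means of defining the graph is an edge list from which distances are derived by a breadth-first search. The Python sketch below is an assumption about how an application might do this, not a decided SDD mechanism; it covers both diagrams above.

```python
from collections import deque


def state_distance(edges, start, goal):
    """Shortest path length between two states in an undirected state graph."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distance[node]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return None  # states are not connected


# Unordered states: every state is connected to every other state.
UNORDERED = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
# Branched ordered states: 1-2-4 with 3 attached to 2.
BRANCHED = [(1, 2), (2, 4), (2, 3)]
```

In the fully connected graph every pair of states has distance 1, so unordered characters need no special treatment, while in the branched graph states 1, 3 and 4 are each two steps apart.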

Consensus was reached not to define ordered/unordered on a state level, but rather to distinguish on the character level as unordered/nominal, "linearly ordered", or complex ordered (with a graph definition, supporting quasi-nominal subsets of characters.) The "linearly ordered" value would be provided as a short-cut to minimize the burden of the developer of the terminology, avoiding the need to define state graphs too frequently. No need was perceived to introduce special types for cyclic ordered or other special forms of state ordering.

Example 2: mix of numerical and categorical states:

In four items the following statements shall be coded:
   item 1: 2-8 bristles
   item 2: More than 20 bristles
   item 3: More than 100 bristles
   item 4: Less than 3 bristles

Two solutions are possible to code such data:

a) Create a character that contains a mixture of numeric and categorical states, like
Character:
   Min (numerical)
   Max (numerical)
   "More than 20" (ordered state)
   "More than 100" (ordered state)

b) Code the items as value ranges leaving either the minimum or the maximum empty, i.e. code 2-8, 21-(empty), 101-(empty), and (empty)-2. Now provide a rule-based mechanism similar to the DELTA OmitValues directive to suppress the meaningless part, rendering the other part in natural language using "up to"/"at least" or "more than"/"less than".
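
Solution b) could be rendered along the following lines. This Python sketch is an assumption about one possible rule set; the function name and the choice of "at least"/"up to" wording are illustrative, and None stands for an empty minimum or maximum.

```python
def render_range(minimum, maximum, noun="bristles"):
    """Render a possibly open-ended range, suppressing the empty part."""
    if minimum is None and maximum is None:
        raise ValueError("at least one of minimum and maximum is required")
    if maximum is None:
        return f"at least {minimum} {noun}"
    if minimum is None:
        return f"up to {maximum} {noun}"
    return f"{minimum}-{maximum} {noun}"


# The four items coded as 2-8, 21-(empty), 101-(empty), (empty)-2.
items = [(2, 8), (21, None), (101, None), (None, 2)]
rendered = [render_range(low, high) for low, high in items]
```

For integer counts "at least 21" is equivalent to "more than 20"; a real implementation would need a rule to choose between the two wordings.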

Example 3: mix of categorical states and text information:

This form is frequently encountered in questionnaires and serves to be able to supply rare or even unexpected information:
Profession:
   Farmer
   Teacher
   ...
   Other, please specify: [free-form text]

Given the ability of descriptive data management programs to update the terminology on the fly, it is not absolutely required for single-authored projects. However, in collaborative projects the decoupling of terminology revisions and item description data entry becomes very important.

This mixture is inherently (though undocumented) present in the DELTA format. Any categorical (UM/OM) or numerical (IN/RN) character can always be used as a text character. If character 2 is TE (text) and character 3 is UM (nominal categorical), the following can be coded:

2,<free form text>      
3,<free form text>   (here the text replaces a categorical state)
3,4<free form text>   (here the text is a comment on the first state)
3,4/<free form text>   (here the text is a second state)

A proposal how to handle mixtures of text and other data types is: "Free-form text data elements" (G. Hagedorn).


Finally, on Wednesday the important topic of coding unknown / uncertain / inapplicable through special states was started. The discussion was continued on the next day.


Thursday, 17. October 2002

Note: Originally the SDD meeting was planned to not overlap with the TDWG-ABCD working group meeting. However, ABCD needed more discussion time than originally foreseen and had to meet on Thursday as well. Consequently, many participants of SDD also participating in ABCD had to split their time or be entirely absent from this day.

Participants

Arthur Chapman (Australia & CRIA, Brazil)
Bob Morris (Univ. Mass., USA)
Bryan Heidorn (Univ of Illinois, USA)
Greg Whitbread (Australian National Herbarium, Canberra, Australia)
Gregor Hagedorn (BBA, Germany)
Jim Croft (Australian National Herbarium, Canberra, Australia)
John Aubry (American Museum of Natural History, New York)
Kevin Thiele (CPITT/LucID, Australia)
Nicolas Bailly (MNHN, Paris, France)
Yde de Yong (Fauna Europaea, Netherlands)

The day was primarily used to flesh out two things that had already been partly discussed: the special character states and the coding of numerical states (mean, standard deviation, etc.; the discussion of this started already on Wednesday). The result of this work can be found mostly in the "Brazil straw man" xml-schema and the associated example document (especially the list of statistical measures planned to be supported in the initial version). The discussion of special character states is summarized (and revised) as a separate document on SpecialStates/CodingStatus (read the version written after Brazil, or the revised version).

The name for the strict container (as opposed to the lenient container "NaturalLanguageDescription") was changed from "FormalDescription" to "CodedDescription". All members present agreed to this and considered it desirable.

Finally, Nicolas Bailly volunteered to organize a meeting in February 2003 in Paris! [Editorial note: minutes are now available!]


Please send any necessary corrections to G.Hagedorn@bba.de
(Gregor Hagedorn, Convener)




First published 2002-10-19, last update: 2003-11-11.
