TDWG working group:
Structure of Descriptive Data (SDD)

Minutes of the SDD meeting in Lisbon, Portugal, 20-26. October 2003

(Version 1.0)

The meeting in Lisbon at the TDWG Annual Meeting, 21-26 October 2003, was the longest face-to-face meeting we ever had. We had a preparatory meeting and three full sessions, plus extra sessions on Thursday and Friday (several "SDDlers" volunteered to carry on). Our warm thanks go to Pedro Fernadez and the Instituto Gulbenkian de Ciência, Oeiras, Portugal for providing us with excellent facilities, including the extra space for the unscheduled discussions!

At the end of the meeting we decided that we would intensively work on the schema for another month and then call a public review of the schema (edited by Gregor), accompanied by a primer document (Kevin) and some basic reference documentation (Gregor). Due to Bob's greatly appreciated insistence we did keep the deadline. The schema itself is available here:
  XML schema, version 0.9 (in technical xml format, released 1. Dec. 2003)
  Technical instance document, version 0.9
To view the schema linked above you need to have a schema editor. However, we provide schema documentation formatted as web documents which can be viewed in any browser.

Important note: to prevent these minutes from remaining unfinished until the next meeting (as happened to the Paris minutes), I am fairly rigorous in leaving out details from discussions. Consequently, this document is a somewhat unbalanced and rough representation of the discussions... I welcome any additions and corrections where my own account is considered insufficient!


Monday, 20. October 2003

This was a preparatory meeting between Gregor, Bob, and Kevin, trying to sort out points that had been raised in the run-up to the meeting and during Kevin's preparation of a primer manual. Some of the points had already been discussed between Bob and Kevin during the testing of the SDD Wiki, which Bob had set up for us. Much of the discussion centered on which elements may be optional and which have to be required. We largely agreed on making things optional wherever possible to simplify the job for SDD generators and to minimize the burden on developers implementing a simple SDD export. However, technical considerations occasionally limit our choice about which elements may be optional, especially the desire to validate the object references inside the document using schema identity constraints (e. g. to ascertain that only states defined in the terminology may be used in a description). Furthermore, some type names (which are visible only in the schema and not in instance documents) were changed and the functioning of the NaturalLanguageDescription markup rediscussed.
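
The kind of reference check that schema identity constraints provide can be illustrated outside the schema. The following is only a minimal Python sketch and not part of SDD: the element names Project, Terminology, State, Description and StateRef are hypothetical, and only the key/keyref attribute pattern is taken from the discussion.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; element names are illustrative only.
DOC = """
<Project>
  <Terminology>
    <State key="1"/>
    <State key="2"/>
  </Terminology>
  <Description>
    <StateRef keyref="1"/>
    <StateRef keyref="3"/>
  </Description>
</Project>
"""

def dangling_state_refs(xml_text):
    """Return keyref values in descriptions that match no State key."""
    root = ET.fromstring(xml_text)
    defined = {s.get("key") for s in root.iter("State")}
    return [r.get("keyref") for r in root.iter("StateRef")
            if r.get("keyref") not in defined]

print(dangling_state_refs(DOC))
```

In a schema-validated document such a dangling reference would be rejected by the validator itself, which is why the referenced elements cannot always be optional.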

Tuesday, 21. October 2003

Participants

Ario, Arturo H.; artarip-at-unav.es
Aurala, Markus (Finnish Mus. Nat. History) markus.aurala-at-helsinki.fi
Brugman, Marc (ETI) mbrugman-at-eti.uva.nl
Buis, Rob (ETI) rbuis-at-eti.uva.nl
Carausu, Mihail (Danish Biodiversity Inf. Facility) mccarausu-at-zmuc.ku.dk
Chapman, Arthur (CRIA) biodiv-at-achapman.org
Correia, Ana Isabel (Botanical Garden Lisbon) aicorreia-at-fc.ul.pt
Dias, Nuno (Dept. Botany, Coimbra, Portugal) nunogdias-at-ci.uc.pt
Esprito-Santo, Dalila; dalilaesanto-at-isa.utl.pt
Flemons, Paul (Australian Museum) paulf-at-austmus.gov.au
Gallut, Cyril (LIS, Univ. Paris 6) gallut-at-ccr.jussieu.fr
Graf, Mickal (GBIF Sweden) mickael.graf-at-nrm.se
Rousse, Guillaume (LIS, Univ. Paris 6) rousse-at-ccr.jussieu.fr
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Heidorn, Patrick Bryan (Univ. Illinois) pheidorn-at-uiuc.edu
Kirk, Paul (CABI) p.kirk-at-cabi.org
Knüpffer, Helmut; knupffer-at-ipk-gatersleben.de
Miller, Chuck; chuck.miller-at-mobot.org
Morris, Robert (U. Mass. Boston) ram-at-cs.umb.edu
Sales, Fatima (Dept. Botany, Coimbra, Portugal) fsales-at-ci.uc.pt
Sheppard, John; john.sheppard-at-mobot.org
Souza, Sidnei (CRIA) sidnei-at-cria.org.br
Thiele, Kevin; K.Thiele-at-cbit.uq.edu.au

Agenda

Talks by Kevin

The first talk was "Federated Description Services and the Library of Life - or - What can we do with SDD anyway?" by Kevin Thiele. It was prepared with funding from the Centre for Biological Information Technology (University of Queensland) and the Moore Foundation.

The second talk introduced the draft version of the SDD primer by Kevin Thiele. [Publicly available since December 1, 2003.]

Discussion of current version of schema

First Gregor presented an introduction to the "Symbols used in XML Schema diagrams" prepared to help in interpreting the schema diagrams we are using.

The following overview and discussion of the schema using xml-spy got stuck right at the project definition, which turned out to need much discussion. At the previous meeting in Paris we had already confirmed the importance of the project envelope and that it must be capable of expressing authorship and intellectual property as well as revision status. The RevisionStatus element (tentatively introduced after Paris by Gregor) was rediscussed and revised. The version and the date elements were restructured and extended (see the new version of the schema).

ProjectDefinition: GloballyUniqueName

Much time was spent discussing the proposed GloballyUniqueName element and its semantics. The element is not intended as a "name" or "title" of the project (which is defined in the block of language specific project definition elements), but as a technical name that can be prefixed to any local identifier within a project (e. g. the key of a character state) so that the combined identifier becomes a globally unique identifier. Beyond being unique, this identifier should never change after it has been introduced (unique immutable value).

Alternative models of achieving global uniqueness were discussed:

  1. GUID: very long numbers (16 Byte = 128 bit, e. g. "{4B0AF0B9-8683-4D4E-9538-7AC49D9D1767}"). These can be automatically generated by software without user input. They are guaranteed to be globally unique.
  2. Registry: The project author could select a concise project name (e. g. "German microfungi") and register it in a registry on a first come first served basis.
  3. URN (uniform resource name): The project author could select a URN incorporating her or his institutional URL and adding a unique path within that (example: http://bba.de/hagedorn/coelomycetes).

Each solution has some advantages and disadvantages. A disadvantage of GUID numbers is that they look unwieldy and are difficult to type. This prevents their use as a way of referring to, e. g., character definitions in print. A sentence "the character definition is essentially identical with that of German microfungi/8136382" is technical, but still readable, whereas the same sentence with "{4B0AF0B9-8683-4D4E-9538-7AC49D9D1767}/8136382" is unusable for human communication.
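
The prefixing idea and the readability difference can be tried out with a few lines of Python. This is only a sketch: the function name global_id and the "/" separator are assumptions for illustration (the separator follows the example sentence above), not something fixed by the schema.

```python
import uuid

def global_id(project_name, local_key):
    """Prefix a local key with the project's globally unique name."""
    return f"{project_name}/{local_key}"

# Option 1: a generated GUID needs no user decision but is hard to read.
guid_name = "{" + str(uuid.uuid4()).upper() + "}"
print(global_id(guid_name, 8136382))

# Options 2 and 3: a registered or URN-style name stays readable.
print(global_id("German microfungi", 8136382))
```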

The registry solution primarily needs a dependable service that runs it. Furthermore, the applications used by biologists must know about this service and point the project designer there. Questions: Would GBIF run such a registry? Could a UDDI registry be easily expanded so that it includes a uniqueness check for some attributes registered, and rejects registration if another UDDI entry already uses the same value? If GBIF creates a repository/UDDI for SDD project names the use of these names could be made a requirement by SDD.

Finally, the unregistered URN solution requires an informed choice, which may be difficult. Many authors know or expect they will change institutions, or they may retire. If a project continues to use the name of the institution where it was initiated, some institutions may generously allow that or even consider it good advertisement. However, government agencies in particular usually forbid such use and may actually enforce that (the reason is that they are politically motivated not to be associated with anything they can no longer control).

An important goal of SDD is that complex projects (including federated databases) should be provided with the necessary features, but that this does not hinder the creation of simpler projects (be it Ph. D. student, amateur, or a school project). Unfortunately, whereas in the case of GUID numbers an application can shield the user from difficult decisions, either the registration or the wisely chosen URN solution require significant understanding from the person setting up a new project. This may well lead to inhibitions in using an application. Applications based on SDD should remain usable for school children!

Unless there is a good reason to use a GBIF/UDDI registry solution, the currently proposed solution is: The GloballyUniqueName element is optional and should not be used until it is needed to link to objects within a project from other locations on the web. Once GloballyUniqueName is needed, it should be of the URN type. The application should present the user with a recommendation like the following example:

"Recommendation: Avoid choosing simple names that are likely to be created multiple times ('plants', 'French bees', etc.). Authors who work at research institutions and expect to continue to do so may use institutional-URI/personal or team name/project label. Note that this is only an identifier and does NOT help to locate any real resource on the web."

[The discussion is also summarized in the SDD documentation on "GloballyUniqueName"]


The language specific elements in project definition were skipped to move forward. SDD should be compared with the ABCD schema; perhaps some decisions on copyright and intellectual property can be synchronized.

Type of key/keyref attributes

Before lunch break we discussed the value type of the document key/keyref connections within a project. Although it is currently not possible to put constraints on relations between xml documents or projects, this is expected to become the case in the future (using XInclude or other mechanisms). The keys of characters and character states would then no longer be a matter of a local document that may change at any time, but a globally accessible resource. As a consequence, it becomes essential that keys are stable (immutable) over time as well as unique at any given point in time.

Gregor proposed to use integer numeric keys instead of the pseudo-semantic text strings used in the discussion so far. The problem is that terminology often evolves (especially so in the first minutes after it has been created!). However, most applications that try to hide the key creation from the designer will have to derive the key by an abbreviation mechanism from the first label created.

As a consequence semantic keys either have to be curated so that the keys remain synchronized with developments in the terminology (and consequently the keys will not be stable), or they will soon be out of sync with the semantics they describe (and the semantic readability is likely to be more confusing than helpful).

The requirement that keys should not be semantically meaningful can easily be accomplished with text strings as well ("x3shjr"). However, we did agree that integer numbers are more natural to use for this purpose and will guide developers in avoiding semantic keys.

The only problem is that integer keys, given the amount of key/keyref object relations present in the SDD schema, make it significantly more difficult to debug an application. Gregor presented a proposal to supplement the key/keyref with a "debugref" label as an optional attribute. This can be automatically created and filled by an xsl script. [In the meantime a first version of this has been written by Jacob Asiedu. Also, in post-Lisbon discussions on the WIKI, Bob suggested adding a "debugtext" on the keyed objects themselves. This has been added to the schema; we now have a "debugkey" on the keyed objects, and a "debugref" on the reference objects (those having a keyref).]
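
How such an automatic fill might work is sketched below in Python rather than xsl. This is not the script mentioned above, only an illustration under stated assumptions: the element names Character and CharRef are hypothetical, all keys are treated as one namespace (a simplification), and only the attribute names key, keyref, debugkey and debugref are taken from the discussion.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only key, keyref, debugkey and debugref are
# attribute names taken from the discussion.
DOC = """
<Project>
  <Character key="17" debugkey="leaf_shape"/>
  <Character key="18" debugkey="leaf_length"/>
  <CharRef keyref="18"/>
  <CharRef keyref="99"/>
</Project>
"""

def fill_debugref(root):
    """Copy each keyed object's debugkey onto the elements referring to it."""
    labels = {e.get("key"): e.get("debugkey")
              for e in root.iter() if e.get("key") is not None}
    for e in root.iter():
        ref = e.get("keyref")
        if ref in labels and labels[ref] is not None:
            e.set("debugref", labels[ref])
    return root

root = fill_debugref(ET.fromstring(DOC))
for e in root.iter("CharRef"):
    # A reference without a matching key simply gets no debugref.
    print(e.get("keyref"), e.get("debugref"))
```

The integer keys remain the only values that carry meaning for the application; the debug attributes are purely a reading aid and can be regenerated at any time.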

Please refer to Object-relational design of SDD for further information and discussion of this point.


In the afternoon we had two long discussions that were both continued on the following days:

Revision of audience specific containers

It was decided that while we may continue to use the general key/keyref naming pattern of the attributes defining object relations (rather than calling them charkey, charref, modifierkey, modifierref, conceptkey, conceptref, etc.), at least references to audiences should be an exception. The key attribute is called "audiencekey", the references simply "audience". This modeling pattern is analogous to the use of xml:lang.

We all agreed that the term "TextualDefinition" tentatively introduced in the Paris schema was unsatisfactory. The ensuing discussion is difficult to report in detail, but the following conclusions were drawn on the next day:

(In discussions after the meeting, the following further changes were tentatively made:)

GlossaryEntry is changed into an independent flat collection, combining entries for hierarchies, characters, states, etc. These refer to a GlossaryEntry by a key/keyref method. The main advantage is that it will be possible to exchange terminological knowledge by means of SDD documents, and to import Glossaries that are independent of character definitions when starting new projects. The original solution of placing GlossaryEntries in a 1 : 1 relation was intended to have the advantage of enforcing well defined terminologies, because a definition would have to uniquely refer to a single object. This constraint is now no longer in force and it is easier to produce degenerate terminologies, where a concept is used in multiple states, rather than revising the character list to use a single generic state multiple times. However, we believe that this trade-off is necessary to improve the federation and collaboration functionality of SDD.

The naming of Labels/Label was revisited. It seems confusing that some pluralized collections actually contain multiple objects (States, RDF collection type set), whereas others contain only representations of a single thing (Labels/Label, Statements/Statement, RDF collection type alt). It would be preferable to be able to distinguish between set collections and audience-alternative collections. The following options were discussed:

We researched the current patterns used when using xml:lang or html lang attributes. In most cases the preferred pattern is to simply enumerate the singular elements without providing a separate container element. This is equivalent to Kevin's proposal. However, Bob strongly argued in favor of providing container elements in the schema, to simplify the mapping of elements to classes in object oriented languages, or tables in a relational database. With the current version of SDD we follow this route, and decided to test using Label/Representation as a general design pattern in the schema. Representation is generally used now, examples are ReportedNotes/Representation, Statement/Representation, QuestionText/Representation, LanguageSpecificData/Representation.

Note on LanguageSpecificData: During the discussions in Lisbon this was modeled as being only language-, not audience-specific. Although there is no need for audience-specificity, after Lisbon this was changed to make the whole model more consistent.

Taxonomic hierarchies in SDD

This complex and long discussion was held in three parts on three different days. To simplify the presentation, the material from Tuesday is presented together with that of Wednesday, see there.


Wednesday, 22. October 2003

Participants

Armstrong, Kate; k.armstrong-at-rbge.org.uk
Brugman, Marc (ETI) mbrugman-at-eti.uva.nl
Buis, Rob (ETI) rbuis-at-eti.uva.nl
Chapman, Arthur (CRIA) biodiv-at-achapman.org
Gallut, Cyril (LIS, Univ. Paris 6) gallut-at-ccr.jussieu.fr
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Heidorn, Patrick Bryan (Univ. Illinois) pheidorn-at-uiuc.edu
Morris, Robert (U. Mass. Boston) ram-at-cs.umb.edu
Souza, Sidnei (CRIA) sidnei-at-cria.org.br
Thiele, Kevin; K.Thiele-at-cbit.uq.edu.au

Agenda

Audience specific elements revisited (decisions)

(Reported under Revision of audience specific containers on Tuesday)

Name of elements for Taxa and Specimen in the schema

Kevin proposed to change the "bio-centric" names Taxa and Specimen to more general terms. SDD can easily be used outside the biological knowledge domain and using concepts restricted to biodiversity research may prevent developers from understanding the concepts. For example, we expect that users from medicine and other diagnostic sciences have highly similar requirements. In fact, DELTA approaches have been used on archeological pottery, musical instruments in museums, etc. Furthermore, even in biology the classes identified may occasionally not be identical with biological taxa, as is the case when diseases are identified (e. g. a single organism may have multiple disease names on different host organisms). Kevin proposed to drop the distinction between Taxon and Specimen and call both "Entity".

Gregor argued against that, pointing out that a reverse confusion may be just as detrimental to the success of SDD. Biologists who manage the development of biodiversity software may be confused about the meaning of a generic and already heavily used term (as in database entity-relationship modeling). The terms taxa and specimens are easily understood and easy to explain, even to non-biologists. Furthermore, the distinction between taxa (= classes) and specimens (= objects) is supported by them having different data elements in SDD and different applicability. Only a specimen can be identified to a taxon, in guided keys only taxa but not specimens may be keyed out, etc.

Amongst the developers present at the discussion a general consensus was that the biological terms are better avoided and that more general terms should be used in the schema. A major argument was that the schema addresses more programmers than biologists, and the terms should be intuitive to programmers. A test vote resulted in Class to be used for taxon and Instance for specimen.

Over the following days the discussion returned to this topic and the following additional changes were introduced:

1. The term instance was considered as being easily misinterpreted. It suggests that a specimen can be "instantiated" from classes and that specimens can only exist as an instance of a class (taxon). This is not necessarily the case, since taxa are human-made concepts, and individual specimens are natural objects. Furthermore, specimens may not yet be identified and consequently are not assigned to a class name. A better solution seemed to be the pair Class-Object for Taxon-Specimen, which was chosen.

2. The term Entity was introduced as a new section to contain classes and objects (instances). Until now both Class and Object elements were in collections inside the Resource section. They are in fact derived from the same base type ResourceConnector as Publications, MediaResources, etc. However, Kevin argued that Classes and Objects are special in that they are essential in defining a description: they are the thing or concept that is being described. This special place should be emphasized by placing them in a more central section for pedagogical reasons. Gregor preferred to maintain the structural simplicity of having all external interfaces/connectors as resources in a single section, but was outvoted.


Taxonomic hierarchies in SDD, inheriting and deducing descriptions

(The discussion about Taxon hierarchies from Tuesday was continued. To simplify the presentation, the results have been combined into the following report covering both days. Note that a related discussion further continued on Thursday, see Discussion of class hierarchy)

Introduction

One of the primary tasks of the SDD group was to allow descriptive data to be inherited up and down taxonomic trees. This means that, for example, a genus description can be automatically created as the generalization of all species descriptions in the data sets. Conversely, subspecies should inherit data specified only at the species level, unless they are "overwritten" by data in the lower taxon.
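
The two directions of inference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the taxa, characters and states are invented, "generalization" is interpreted here as the union of the states coded in the direct child taxa (one possible reading), and the function names generalize and inherit are hypothetical.

```python
# Hypothetical data: taxon -> parent, and taxon -> {character: set of states}.
PARENT = {"sp. A": "Genus", "sp. B": "Genus", "subsp. A1": "sp. A"}
CODED = {
    "sp. A": {"petal color": {"white"}, "leaf shape": {"ovate"}},
    "sp. B": {"petal color": {"white", "pink"}},
    "subsp. A1": {"petal color": {"pink"}},
}

def generalize(taxon):
    """Upward: union of the states coded in the direct children of taxon."""
    result = {}
    for child, parent in PARENT.items():
        if parent == taxon:
            for char, states in CODED.get(child, {}).items():
                result.setdefault(char, set()).update(states)
    return result

def inherit(taxon):
    """Downward: states from ancestors, overwritten by the lower taxon."""
    chain = []
    while taxon is not None:
        chain.append(taxon)
        taxon = PARENT.get(taxon)
    result = {}
    for t in reversed(chain):            # highest taxon first
        result.update(CODED.get(t, {}))  # lower taxa overwrite higher ones
    return result

print(generalize("Genus"))
print(inherit("subsp. A1"))
```

Both results are computed on demand and never written back into the coded data, so inferred statements stay distinguishable from statements that were actually recorded for a taxon.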

The aim of SDD is to provide a generalized knowledge inference mechanism on taxonomic trees so that data can be coded as exactly as possible. Descriptive statements deduced from family descriptions should not be copied down to each species, since they have never been verified there (and may well not apply to a number of species in the family). In DELTA it is possible neither to dynamically inherit data from lower taxa (they can be compiled into new descriptions, however) nor to inherit data from higher taxa (except through the special "variant taxa" mechanism provided for infraspecific taxa or races).

Furthermore, several management tasks are envisaged on taxonomic trees. For example, DELTA's project-wide "Implicit state" mechanism should be replaced by statements that are local to selected taxa, and it should be possible to "scope out" characters for specific branches of taxonomic trees.

(Note: the topic has been discussed both on Tuesday and Wednesday, both discussions are presented together here)

Dual definition of taxonomic trees?

The topic outlined above has been discussed repeatedly in SDD meetings. We found, however, that defining taxonomic trees inside SDD conflicts with the aim that we should not duplicate data in SDD that are better handled elsewhere: SDD should not become a taxonomic name database, specimen collection database, citation database, image database, etc. So far we have solved the problem using "resource connectors", connecting to outside data sources, but providing a cache as well as a replacement object for objects unavailable in the service (e. g. a newly published taxonomic name).

The discussion in Lisbon centered on the following alternatives:

Gregor and Kevin proposed to implement the third option by directly placing the description objects not in a flat list (as proposed so far), but in an xml tree. This xml tree could have named as well as unnamed nodes. It would simplify the task of inference and management operations, since these no longer have to use key/keyref relations, but can directly follow the path of descriptions.

As soon as we considered the situation in federated projects (descriptions are served from multiple servers), however, it became apparent that this design is very problematic. If we wanted to limit the inference to the relation between infraspecific taxa and species (similar to variant taxa in DELTA), a federation might still be manageable. However, SDD explicitly does not want to limit taxon tree management and inference mechanisms to specific cases. In this case the integration of multiple sets of descriptions, each of which is organized into an operational hierarchy, turns out to be unmanageable.

Consequently, the taxon tree definition (= class hierarchy) has to be defined separately from the descriptions. We could not find convincing arguments whether this tree should be considered an externally updatable resource or an independently defined SDD-specific class hierarchy. For version 0.9 we decided to try implementing a resource based model. This should allow us to better test the problems that may be associated with this.

Should specimen (Object elements) be included in the tree?

Next we discussed two proposals how the relationship between specimen (Object elements) and taxa (Class elements) should be defined. Kevin proposed to integrate both Object and Class into a single tree. The identification of a specimen occurs by means of searching for the next higher node that is a taxon. Gregor proposed to keep specimens and taxa separate and limit the tree to taxon hierarchies. The specimens are separately identified directly in the specimen list within the Resource section.

The following white board images illustrate the two concepts. In both diagrams trees are drawn on the left side, the resource lists in the middle, and descriptions are illustrated as green squares on the right side. Note that, unfortunately, the colors red and blue are used with opposite meaning in the two diagrams (very unprofessional of us...).

First Kevin's concept of a single resource list containing both specimens and taxa, differentiated perhaps by a boolean data element symbolized by red [specimens] or blue [taxa] check marks. Both specimens (red) and taxa (blue) are in a single tree (left), and descriptions have only a single connector to the single resource list. Note that it is possible to have specimens that could be identified only to a higher taxon, e. g. family or order:
Class Tree (Kevin's proposal)


Next Gregor's concept of separate resource sections for taxon names (red, top) and specimen (blue, bottom) and a tree limited to taxa. The specimens inside resources point to taxa to which they are identified (this may be higher taxa as well). The descriptions on the right side have interfaces to either specimen or taxa, or both. The top description denotes an unidentified description.
Class Tree (Gregor's proposal)

The beauty of Kevin's concept is its structural simplicity. It requires only a single resource list, no extra identification elements within that list and a single tree. As a result it seems easier to implement if the software largely ignores the differences between taxa and specimens.

The advantage of Gregor's proposal is that it moves more information into areas that can be covered by external resource services. For example, a typical taxonomic hierarchy obtained from a biodiversity service will be restricted to taxa. Inserting specimens into the tree will make it very difficult to update at a later time. Similarly, specimen identifications typically belong to the domain of collection databases (see, e. g. the TDWG ABCD proposal). It may be desirable to update them by querying the resource service. In that way it is possible that later work on specimens in an institutional collection database can improve the quality of the descriptive project like a family revision.

A problem with Gregor's original proposal was that it has the specimen identifications directly on the descriptions. Descriptions of specimens also carry a taxon name reference, which in effect becomes the specimen identification. This has been in the SDD model through all previous meetings. It has the following advantages and disadvantages:

We decided on a model close to Gregor's proposal, but with the modification to remove the double interface of descriptions to both specimens and taxa. It is now a required choice: either a taxon reference or a specimen reference must be given for each description. The identification of specimens occurs in the Resource specimen list (Resources/Objects) by reference to the taxon names (Resources/Classes). We are very interested to get feedback on this proposal, perhaps experiences from people trying to implement the model!
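
For implementers, the decided model can be pictured with a small Python sketch. It is only an illustration and not derived from the schema: the class names ClassEntry, ObjectEntry and Description, their field names, and the helper taxon_of are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassEntry:                 # a taxon name in Resources/Classes
    key: int
    name: str

@dataclass
class ObjectEntry:                # a specimen in Resources/Objects
    key: int
    label: str
    identified_as: Optional[int] = None   # key of a ClassEntry, if identified

@dataclass
class Description:
    class_ref: Optional[int] = None
    object_ref: Optional[int] = None

    def __post_init__(self):
        # Required choice: exactly one of the two references must be given.
        if (self.class_ref is None) == (self.object_ref is None):
            raise ValueError("give either a class or an object reference")

def taxon_of(desc, objects):
    """Return the class key a description belongs to, or None if unidentified."""
    if desc.class_ref is not None:
        return desc.class_ref
    return objects[desc.object_ref].identified_as

classes = {1: ClassEntry(1, "Ustilago violacea")}
objects = {10: ObjectEntry(10, "specimen 10", identified_as=1),
           11: ObjectEntry(11, "specimen 11")}

print(taxon_of(Description(class_ref=1), objects))
print(taxon_of(Description(object_ref=10), objects))
print(taxon_of(Description(object_ref=11), objects))
```

Because the identification lives only on the specimen entry, re-identifying a specimen changes one value in the resource list and leaves the descriptions untouched.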

(This report summarizes the discussions on Tuesday and Wednesday; see also the discussion on Thursday, Discussion of class hierarchy!)


Search for intuitive terms for Local and Global States

In the Brazil schema we used "global" and "local" states to distinguish between states reusable in multiple characters and those defined only in a single character. However, already in Paris it turned out that this naming convention is immediately understood only by programmers, but not by biologists. When talking about global state definitions, biologists tend to think of truly world-wide standards of how, for example, a color should be named. Such a standard is desirable and it is hoped that SDD paves the way for terminology that is accepted by more than single projects. However, the technical meaning of the global states in SDD is currently limited to "within-project globality".

The following alternatives were discussed:
  Shared states (considered possible)
  Project-wide states (exact, but verbose and difficult to use during discussion)
  Generic states (implying reusability, and not being limited to an operational definition)
  [Concept states would also be possible, this term was not discussed in Lisbon]

[Editor's note: during the discussion in Lisbon we still had a flat list of state sets defined directly in Terminology, intended for reuse in multiple characters. The discussion reported here was about the appropriate term for this entity. After Lisbon we moved the list of reusable states into the Concept trees. However, the discussion about an appropriate name is still relevant, with the exception that we could now also use the name "ConceptStates". If you think that ConceptStates (i. e. encompassing property states for color and shape, as well as structural types like berry, capsule, etc. as kinds of fruit) is more appropriate, please raise an argument on the SDD Wiki.]

The schema version 0.9 currently uses the term GenericStates for states defined at Concept nodes within the concept trees. Please raise any objections as soon as possible on the WIKI, so that we may reach a stable terminology soon!


Additional (unscheduled) session Thursday, 23. October 2003, 9-13:00

Participants

Brugman, Marc (ETI) mbrugman-at-eti.uva.nl
Buis, Rob (ETI) rbuis-at-eti.uva.nl
Chapman, Arthur (CRIA) biodiv-at-achapman.org
Gallut, Cyril (LIS, Univ. Paris 6) gallut-at-ccr.jussieu.fr
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Heidorn, Patrick Bryan (Univ. Illinois) pheidorn-at-uiuc.edu
Morris, Robert (U. Mass. Boston) ram-at-cs.umb.edu
Thiele, Kevin; K.Thiele-at-cbit.uq.edu.au

Agenda

Discussion of class hierarchy (continued, review of ad-hoc proposal)

Based on yesterday's discussion, Gregor presented a proposal how to integrate the class hierarchy in the SDD schema. Placing the class (= taxonomic) hierarchy inside the Entities element was agreed. The class hierarchy as a whole is seen as an external resource that may have an ID on a Provider servicing e. g. taxonomic hierarchies. (In practice, however, it is foreseen that the hierarchy will usually be created inside the SDD model for the coming years.)

An intensive discussion about whether we need multiple hierarchies ensued. We see practical need for multiple trees in relatively few cases and a number of complications if multiple trees are supported. However, in the case of large projects, project members may find it difficult to agree on a single tree.

Multiple trees cause conceptual problems with the descriptions of higher taxa. The descriptions of higher taxa are separate from the tree topology only when the descriptions exclusively depend on outside data sources (from literature, from specimens, etc.) and are viewed as a (possibly contradictory) knowledge base. If the descriptions are created or revised by the authors and viewed as homogeneous work, descriptions and hierarchy depend on each other. In this case the data are repeatedly analyzed under a specific class hierarchy (taxon hierarchy) and contradictions are expected to be removed or annotated. Viewing the descriptions under a different class hierarchy will usually introduce contradictions.

However, the problem arises only because different higher taxon concepts are traditionally sloppily identified by the same name. Traditional biology and taxonomy do not sufficiently recognize that if one author has a wide concept and another a narrow concept of "Liliaceae", these families should not have the same name. Ideally, all higher taxa would be used with a "sensu" or "secundum concept-author" suffix.

If we realize that the description of "Liliaceae" under one taxon hierarchy cannot possibly be identical with the description under another hierarchy, and if we consequently name the taxa differently, it would be no problem to have two descriptions in the list of descriptions:
  description 1 refers to "Liliaceae sec. author1"
  description 2 refers to "Liliaceae sec. author2"

Now tree 1 could use the "Liliaceae sec. author1" which is connected with the appropriate description 1, and tree 2 could use "Liliaceae sec. author2" which is connected with the appropriate description 2. It would be possible to build a large federated database, where all the species and lower descriptions are shared, but where different organizations prefer different taxon hierarchies and provide different and appropriate descriptions for higher taxa that differ among the treatments.

Further thought needs to be given to the fact that a specimen can currently be identified only a single time. This may cause problems with synonyms, e. g. if the identification points to "Ustilago violacea", which part of the project group considers adequate, whereas others assume it must be removed from Ustilago and recombined into a different genus (Microbotryum violaceum). This problem can perhaps be solved by placing specimens in the tree either under the accepted name or under a name listed as a synonym.

Currently for the first version of the schema only a single hierarchy tree is supported. It is, however, enclosed in a collection element to indicate that multiple trees may be supported in the future. In this case an additional data item should be added indicating which taxonomic tree is considered the default tree.

Finally, we could not resolve the question of whether to allow anonymous nodes in taxon trees (= ClassHierarchies). The advantage of anonymous nodes is that detailed phylogenetic knowledge (esp. molecular or cladistic) about tree topology can be integrated into the tree. For example, groupings of species within a genus can be defined, regardless of whether these groupings have been given taxonomic names or not. The disadvantage is that trees containing anonymous nodes may be more difficult to handle in user interfaces, because such nodes can only be handled in a tree view editor, not through simple interfaces displaying nodes as lists with parents.

SDD schema version 0.9 does implement anonymous nodes in ClassHierarchies. Please raise a discussion on the SDD Wiki if you have arguments not to do this.


Statistical parameters and repeated observations

Gregor presented a proposal to support repeated original observations (see CharacterDataRefType in the schema).

A coded description may consist of any sequence of character data and observation sets. The observation set contains repeated observations, each of which may contain multiple characters that have been observed together. An example is "leaf shape, length, and width". The repeated original observations may be numerical or categorical.

Probably (and not yet implemented) it should not be possible to define statistical parameters inside the ObservationSet. This requires a further type, lacking the option to refer to statistical parameters.

Bryan Heidorn pointed out that the binding between the descriptive univariate statistical measures and the original observations is missing. It is possible that several sets of measurements, each with a mean, could be entered, or that both a set of original observations (for which a mean can be calculated) and statistical measures externally calculated from these observations have been entered. This issue is unresolved.


Character dependency

[Editor's note: the discussion on Character dependency is particularly roughly represented here. The issue is particularly problematic, and I am currently confused myself about whether indeed DELTA inapplicable and applicable directives are convertible (both Cyril and I think they are not, Mike Dallwitz maintains that they are)]

Character dependency in general is a statement expressed in the terminology (i. e. for all descriptions) between one character/state combination and other characters.

DELTA provides two different directives: a positive APPLICABLE CHARACTERS and a negative statement INAPPLICABLE CHARACTERS (= DEPENDENT CHARACTERS). They differ in the applicability of the controlled characters in the case that the controlling character is not scored at all, and they differ in their behavior when more than 1 state has been scored in a character. Applicable Char. and Inapplicable Char. directives in DELTA exclude each other.

The DELTA user's guide gives the following definition for APPLICABLE CHARACTERS: "This directive specifies the values of 'controlling' characters which make other 'dependent' characters applicable. If, in a given item, a controlling character takes only values which make its dependent characters inapplicable, or if the controlling character itself is inapplicable, then the dependent characters must not be given any values (other than the pseudo value 'inapplicable' (see Section 2.2), which is redundant and will be removed by translation into DELTA format (see TRANSLATE INTO))." and displays as an example:

* APPLICABLE CHARACTERS
10,1:11
16,1:17
20,1:21-24
32,1:33-38
39,2:40-43
47,3:48-51
55,1:56
57,1:58-59
68,1:69
78,1:79-80
78,2:79:81
78,3:82
78,4:83
78,5:79:84
The DELTA users guide states that "This is equivalent to the DEPENDENT CHARACTERS directive in Section 3.4.1", which is:
*DEPENDENT CHARACTERS
10,2:11
16,2:17
20,2:21-24
32,2:33-38
39,1:40-43
47,1/2:48-51
55,2:56
57,2:58-59
68,2:69
78,1:81-84
78,2:80:82-84
78,3:79-81:83-84
78,4:79-82:84
78,5:80-83
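
The entries of both directives share one shape: a controlling character, its controlling state(s), and one or more colon-separated ranges of controlled characters. A minimal Python sketch of reading such entries follows. It assumes exactly the syntax visible in the two listings above (with "/" read as separating alternative states); the function names are hypothetical and this is not a complete DELTA parser.

```python
def expand_range(text):
    """Expand '79' or '82-84' into a list of character numbers."""
    if "-" in text:
        first, last = text.split("-")
        return list(range(int(first), int(last) + 1))
    return [int(text)]


def parse_dependency_entry(entry):
    """Parse one entry such as '78,2:80:82-84'.

    Returns (controlling character, set of controlling states,
    sorted list of controlled characters).
    """
    head, *ranges = entry.strip().split(":")
    char_text, states_text = head.split(",")
    states = {int(s) for s in states_text.split("/")}
    controlled = []
    for part in ranges:
        controlled.extend(expand_range(part))
    return int(char_text), states, sorted(controlled)


print(parse_dependency_entry("47,1/2:48-51"))
print(parse_dependency_entry("78,2:80:82-84"))
```

Checking whether an APPLICABLE list and a DEPENDENT list are really equivalent would additionally require the number of states of each controlling character, which the listings do not give.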

Examples of DELTA-like statements:

Character 1: "Leaf presence" (Dependency or Inapplicability statement)
  1. Present   RULE: (none)
  2. Absent    RULE: If scored char. 5-9 are inapplicable, else applicable
Character 2: "Place of infection" (Applicability statement)
  1. Leaf     RULE: Only if scored char. 10-14 are applicable, else inapplicable
  2. Root     RULE: Only if scored char. 15-18 are applicable, else inapplicable
  3. Ovary   RULE: (none)

These rules cannot easily be converted into each other, even if the default (unscored controlling character) behavior is neglected. DeltaAccess attempted to use the complement of controlling states in the case of the applicability statement, but this only works in some situations. For example:

[To do: Create equivalent to XPER example that fails in DeltaAccess: ...]


Cyril Gallut presented the solution for character dependency in XPER:

Cerastium Example (by Cyril)

The length L in the illustration is only defined, if a species of Cerastium has both an apical horn A and at least one anterior horn B or C:

   /\
  |  |
  |  |  apical horn A
  |  |
  |  |
  /   \
 /      ---\     ===
|           |     |
|           |     |
|           |     |
|           |     | Length
|           |     |   L
|           |     |
|           |     |
|           |     |
 \  ____   /     ===
  | |   \  |
  | |    | |
  ||     | |
  ||     | |    left/right
         | |    anterior horn
  B      | |
         | |  C
         \_/

That is, the character L is applicable only if A: present AND (B: present OR C: present). Can this be expressed in terms of DELTA-like applicable/inapplicable statements?

The following seems to be a solution:
  L is inapplicable if A: absent
  L is applicable if at least B: present
  L is applicable if at least C: present

The assumption here is that if two applicability statements control the same character (here L), it is sufficient that one of them turns the character from its default applicability "inapplicable" to applicable. The behavior is slightly odd insofar as L is inapplicable if B and C remain unscored, and becomes applicable if B or C is scored as present even though A remains unscored. Also note that this solution is not possible in DELTA, since applicable and inapplicable directives cannot be combined.
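
The odd behavior can be made visible by enumerating all scorings. The sketch below rests on assumptions not stated in the discussion: an unscored character is represented as None, the intended rule treats an unscored A as "not present", and the inapplicability statement takes precedence over the two applicability statements.

```python
PRESENT, ABSENT, UNSCORED = "present", "absent", None


def intended(a, b, c):
    """Intended rule: A present AND (B present OR C present)."""
    return a == PRESENT and (b == PRESENT or c == PRESENT)


def three_statements(a, b, c):
    """The three-statement solution; L is inapplicable by default."""
    if a == ABSENT:                    # L is inapplicable if A: absent
        return False
    if b == PRESENT or c == PRESENT:   # one applicability statement suffices
        return True
    return False                       # nothing switched L on


values = (PRESENT, ABSENT, UNSCORED)
differences = [
    (a, b, c)
    for a in values
    for b in values
    for c in values
    if intended(a, b, c) != three_statements(a, b, c)
]
for case in differences:
    print(case)
```

Under these assumptions the two formulations disagree only where A is unscored and B or C is present, which is exactly the case described as slightly odd above.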

The DELTA statements seem to combine an And/Or logic with different settings for the default (i. e. when the controlling character has not been scored).

We need examples showing whether it is important to allow one character to be controlled by several controlling characters. Cyril Gallut: This would not yet work in the current logic employed by XPER.

Character dependency can clearly be hierarchical. The character "leaflet length" can depend on a character/state defining that the leaf is compound and has leaflets, which may in turn depend on a character that defines whether the leaf is present or absent.

How is this hierarchy followed? If a controlling character becomes inapplicable, does this mean that automatically all controlled characters should become inapplicable? Does this behavior differ for the Inapplicable and Applicable rule?

Variant character dependency implementation in XPER

Cyril Gallut informed us about a variant way to express character dependency. In contrast to DELTA, where the controlling states are independent (see comment by Mike Dallwitz on the SDD email list), XPER uses sets of controlling states. That is, the controlling condition "Character 1, states 1 and 2 control Characters 5-10" is not identical with the DELTA definition "Character 1,1-2 control Characters 5-10" (which defines that either 1 or 2 controls 5-10).

The XPER definitions have the form of a 3-tuple (controlling character, set of controlling states, set of controlled characters). The XPER rule is that if the set of currently scored states in the controlling character and the complement of the set of controlling states intersect (non-empty intersection set), the controlled characters are applicable, else inapplicable. XPER currently does not use a rule for the case that no state is scored in a controlling character, but for the discussion we have allowed an additional rule to make controlled characters applicable whenever no state has been scored. Using the examples from above:

Character 1: "Leaf presence" (Dependency or Inapplicability statement)
  1. Present   RULE: (none)
  2. Absent    RULE: If scored char. 5-9 are inapplicable, else applicable
Character 2: "Place of infection" (Applicability statement)
  1. Leaf     RULE: Only if scored char. 10-14 are applicable, else inapplicable
  2. Root     RULE: Only if scored char. 15-18 are applicable, else inapplicable
  3. Ovary   RULE: (none)


Char. 1: Controlling states = {2}, complement = {1}; controlled char. 5-9
a) if unscored -> Applicable
b) if 1 is scored -> {1} intersection {1} <> {} -> Applicable
c) if 2 is scored -> {2} intersection {1} = {} -> Inapplicable

Char. 2: Controlling states = {2,3}, complement = {1}; controlled char. 10-14
Char. 2: Controlling states = {1,3}, complement = {2}; controlled char. 15-18
a) if unscored -> Applicable (char. 10-14), Applicable (char. 15-18)
b) if 1 is scored ->
       {1} intersection {1} <> {} -> Applicable (char. 10-14)
       {1} intersection {2} = {} -> Inapplicable (char. 15-18)
c) if 1 and 2 are scored ->
       {1,2} intersection {1} <> {} -> Applicable (char. 10-14)
       {1,2} intersection {2} <> {} -> Applicable (char. 15-18)
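
The same evaluation can be written as a short Python sketch. It follows the rule as recorded in these minutes, including the extra "unscored means applicable" rule that was allowed only for the discussion; the function name and argument layout are hypothetical and do not reflect XPER's actual code.

```python
def xper_applicable(scored, controlling_states, all_states):
    """Return True if the controlled characters are applicable.

    scored             -- states currently scored in the controlling character
    controlling_states -- states that switch the controlled characters off
    all_states         -- every state defined for the controlling character
    """
    if not scored:
        # Extra rule allowed for the discussion; XPER itself has no such rule.
        return True
    complement = set(all_states) - set(controlling_states)
    return bool(set(scored) & complement)


# Char. 1 "Leaf presence" (states 1-2), controlling states {2}, char. 5-9
print(xper_applicable(set(), {2}, {1, 2}))        # True
print(xper_applicable({1}, {2}, {1, 2}))          # True
print(xper_applicable({2}, {2}, {1, 2}))          # False

# Char. 2 "Place of infection" (states 1-3)
print(xper_applicable({1}, {2, 3}, {1, 2, 3}))    # True  (char. 10-14)
print(xper_applicable({1}, {1, 3}, {1, 2, 3}))    # False (char. 15-18)
print(xper_applicable({1, 2}, {1, 3}, {1, 2, 3})) # True  (char. 15-18)
```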


Problems with this:


Can character dependency rules be integrated into the character grouping trees?

Kevin further proposed that in the example of leaf presence and place of infection, it may be helpful to have dependency acting on states, rather than on entire characters. The state "leaf" could be hidden if it is known that leaves are absent. No current application seems to support this behavior. It was agreed to consider this a possible future extension of the dependency options, but not to consider it now.


Additional (unscheduled) session Friday, 24. October 2003, 10-15:00

Participants

Chapman, Arthur (CRIA) biodiv-at-achapman.org
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Heidorn, Patrick Bryan (Univ. Illinois) pheidorn-at-uiuc.edu
Morris, Robert (U. Mass. Boston) ram-at-cs.umb.edu
Thiele, Kevin; K.Thiele-at-cbit.uq.edu.au

The "Dinner Problem"

The discussion was primarily dedicated to a serious problem detected a few days earlier (over dinner ...), i. e. that the Brazil and Paris schemata do not work as expected regarding the project-wide definition of reusable state sets. The keyref of the State element in Descriptions would have to refer alternatively to a locally defined state within the current character, or directly to a project-wide state definition. This causes problems with the xml key constraint in the SDD schema. One way to achieve this would be to define the constraints using wild card selectors (and probably to rename the key attributes so that the wild card path can refer only to these attributes).

Even if such (slow performing) wild card statements are accepted, they no longer control the association between characters and global states (i. e. that only states defined in a character can be referred to in the description for this character). If global state 1 is referred to in the definition of character 1, but not in the definition of character 2, a reference in the description from inside character 2 is just as valid as one in character 1, since the keyref/key connection works through wild card mechanisms. We agreed that it is not absolutely necessary that all validation be performed based on information in the schema (additional xslt code could be created for validations beyond those a validating parser would automatically do), but that it is highly desirable to control the validity of the terminology through schema mechanisms, esp. since this is already the case everywhere else in the current SDD proposal. To enable this control it would be necessary to distinguish whether a given state is locally defined in a character or defined through reference to a project-wide definition, and to use two different element names for these cases. This, however, runs contrary to the design aim of providing a smooth mechanism to extend the definition inherited from project-wide state sets. It complicates applications and, even worse, it does not allow any redefinition of states, e. g. changing a locally defined state later to a shared, project-wide state.

In the Brazil/Paris schemata we had planned two alternative ways of referring to shared state sets: Either a set is accepted as a whole, or only a subset of selected states from a set is accepted in a given character. The intent was to allow relatively logically defined sets of property states (e. g. all 2-dimensional shapes a leaf may have), but then to allow restrictions to only select the few shapes known to occur in the inflorescences (e. g. removing all ground-leaf-shapes).

These two alternatives have significantly different properties. The restriction mechanism lists the selected states and allows defining a key that is local to the character. Consequently, it could potentially be handled by the same mechanism as locally defined states. The whole-set mechanism refers only to the set, which cannot be reflected in description data. However, a reference to the entire set has advantages if the character definition is changed (evolution of terminology). If a set of color states is used in many characters and an additional color is added, references to the entire set are automatically updated, whereas definitions using listed references to states need to be manually updated in each character.

We considered it undesirable to define different mechanisms to refer to shared states from within the description, depending on whether they are inherited as a set or whether the set is restricted to a list of state references.

Replacing character containers in descriptions by direct references to states

The question of how desirable it is to keep state keys independent of characters was an important secondary point. Our hope was that resolving this point would give guidance on the direction in which to seek a solution for the "Dinner problem":

State references from the descriptions to locally defined states (i. e. defined within a character, not shared) use a project-wide unique key value. Consequently, the character reference (currently provided by enclosing state references in the descriptions with a character container, i. e. <character keyref="123"><state keyref="123"></state></character>), is redundant. The same is true for statistical parameters, which although referring to project-wide definitions, provide an additional key within each character. The descriptions refer to this within-character key, not to the project-wide key. In pseudo-instance-code:
StatisticalParameters:
   StatisticalParameter key="999" / Label="mean"
CharacterDefinitions:
   Character key="1" / Label="Leaf length"
       StatisticalParameter key="101" keyref="999"
   Character key="2" / Label="Leaf width"
       StatisticalParameter key="201" keyref="999"
Descriptions:
 CodedDescription:
   Character keyref="1"
       StatisticalParameter keyref="101"
   Character keyref="2"
       StatisticalParameter keyref="201"
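
That the character container is redundant can be shown with a small Python sketch. The dictionaries below simply mirror the pseudo-instance-code above; the layout and the function name are hypothetical and are not part of the schema.

```python
# Project-wide definitions (key -> label)
statistical_parameters = {"999": "mean"}

# Characters with their within-character keys pointing to the global key
characters = {
    "1": {"label": "Leaf length", "parameters": {"101": "999"}},
    "2": {"label": "Leaf width", "parameters": {"201": "999"}},
}


def resolve(local_key):
    """Find character and global definition from a within-character key."""
    for character in characters.values():
        if local_key in character["parameters"]:
            global_key = character["parameters"][local_key]
            return character["label"], statistical_parameters[global_key]
    raise KeyError(local_key)


print(resolve("101"))  # ('Leaf length', 'mean')
print(resolve("201"))  # ('Leaf width', 'mean')
```

Since the within-character key alone identifies both the character and the global definition, a description that cites key 101 does not need an enclosing Character element to be unambiguous. A reference to the global key 999, by contrast, would not say which character is meant.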

Screen shot of white board during discussion of this topic:

[Image: whiteboard snapshot]

Right side: Reusable are CodingStatus, StatisticalParameters [= StatisticalMeasures], and Shared (or global-) States. Characters can have locally defined states plus references to globally defined states. Statistical parameters always refer to the global definition (as would CodingStatus do, if we decide to define them inside character on a per character basis). Shared States are organized into sets. The first Shared State Set A contains states 1, 2, and 3. Referring to Shared state sets may be partial as in "Refer State Set A" with only 2 out of 3 states, or total as the reference to set B. In Descriptions, Statistic references do not refer to the global definition, but to the within-character enabling definition (which in turn links to the global one). States should work the same way.

Left side: Should local and global state definitions inside a character be different, or can they use the same method?

[Image: whiteboard snapshot] Global or Shared States (here "blue") can be reused in several characters, here with Eye together with Red and Wing together with Gold. Pointing from description to these requires no character context, when pointing to the within-character keys (red arrows), but would be ambiguous when pointing to the global key (green arrow).

 

Having character-specific state keyrefs has the following advantages or disadvantages:

The disadvantage above can be avoided if the descriptions make no reference to the characters at all (only to states). We discussed which implications this would have.

Both options are needed. In some cases a researcher would like to explicitly define state sequence differently in each description (because the order of statements is traditionally considered to be a weak expression of relevance, frequency or probability, i. e. "round or obovate" is considered "usually round, sometimes obovate").

Which option to choose is decided by the value of the Sequence element in a character reference in the description (Descriptions/CodedDescription/Characters/Character/Sequence). Given that this is currently modeled as a character property, it is not possible to say: This state should come first, and the others in the sequence in which they are defined in the terminology. This limitation was considered and accepted.


XML schema diagram, showing the proposal to remove Characters and Character from CodedDescriptions and replace them with an unordered bag of state-, statistic-, etc. references:

[Image: XML schema diagram]

Note that even in the case of the Characters/Character collection the character sequence in the xml file is not considered informative! The sequence within the character is considered informative only if a special data element "Sequence" is set to "description" rather than to its default "terminology".
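
How a consuming application might honor the two Sequence values can be sketched as follows. The function, its arguments and the state order in the example terminology are hypothetical; only the values "description" and "terminology" come from the discussion.

```python
def ordered_states(terminology_order, described_states, sequence="terminology"):
    """Return the states of one character in the order to be reported."""
    if sequence == "description":
        # Order as written in the description is informative; keep it.
        return list(described_states)
    if sequence != "terminology":
        raise ValueError("unknown Sequence value: " + repr(sequence))
    # Default: ignore the order in the file, use the terminology order.
    rank = {state: index for index, state in enumerate(terminology_order)}
    return sorted(described_states, key=rank.__getitem__)


terminology = ["obovate", "round", "elliptic"]   # assumed definition order
scored = ["round", "obovate"]                    # order found in the file

print(ordered_states(terminology, scored))                 # ['obovate', 'round']
print(ordered_states(terminology, scored, "description"))  # ['round', 'obovate']
```

Because the switch applies to the character as a whole, the sketch cannot express "this state first, the rest in terminology order", which is the limitation accepted above.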


Saturday, 25. October 2003

Participants

Armstrong, Kate; k.armstrong-at-rbge.org.uk
Franz, Nico (SEEK)
Glück, Karl (BGBM) k.glueck-at-bgbm.org
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Heidorn, Patrick Bryan (Univ. Illinois) pheidorn-at-uiuc.edu
Hinchcliffe, Sally (RBG Kew) S.Hinchcliffe-at-kew.org
Kennedy, Jessie j.kennedy-at-napier.ac.uk
Morris, Robert (U. Mass. Boston) ram-at-cs.umb.edu
Neves, Susana (Itqb)
Pullan, Martin (RBG Edinburgh) m.pullan-at-rgbe.org
Souza, Sidnei (CRIA) sidnei-at-cria.org.br
Thiele, Kevin; K.Thiele-at-cbit.uq.edu.au
Thiers, Barbara (NY Botanical Garden) bthiers-at-nybg.org
Whitbread, Greg (Australian National Botanical Garden) ghw-at-anbg.gov.au

Agenda

Kevin Thiele: Talk

Because most participants present on Saturday had not been present on Tuesday, Kevin repeated a shortened version of the introductory talk he had already presented on Tuesday. The slides of the presentation can be seen online under "Federated Description Services and the Library of Life - or - What can we do with SDD anyway?".

Bryan Heidorn: Digital Models for Taxonomic Description and TeleNature (Talk)

Bryan presented his work on bibe (biological information browsing environment) and OpenKey in the context of the digital Flora of North America, and TeleNature, a citizen science project with the aim that "Anybody should be able to identify all known species. They will then be able to know when they are looking at new, previously unknown species."

Attribution Model in SDD

Introduction

Attribution is important for recording intellectual property. Detailed records are important in large collaborative projects to improve trust and the willingness to collaborate. The model as shown in the SDD schema was discussed. Attribution is available in SDD both for descriptive and for ontological/terminological work. [Editor's note: in the released version SDD 0.9, most information can be found in a container named RevisionData].

The discussion is complicated, since several issues are closely related:

Some of these issues, especially the IPR aspect, are obligatory to provide usable and revisable SDD data sets. Other issues are really management issues, which could either be supported in SDD to provide some degree of interoperability, or which could be supported by individual applications at their own choice (no interoperability).

We already discussed this highly controversial topic in Paris, see: Attribution, credit or acknowledgment for contributions and work on the project ("meta data").

Gregor's attribution model

Attribution on a single level is provided for all kinds of objects, e. g. character or concept definitions in terminology. In Descriptions we have the additional need to cite the data source as part of the IPR recording. (This is an unfortunate break in the revision model, since data already in SDD format are directly revised, whereas outside data have to be cited. There seems to be no way around this, since it is impossible to digitize descriptive data in a fully structured, terminologically clean format without interpreting and revising them.)

Since multiple containers are required for the purpose of citation, these can also be used for contributions. Thus it is possible both to revise an existing Description by adding or removing data, and to add a separate description. In the first case, the authors work as a team, and individual contributions within a description cannot be separated; in the second case the IPR of the contributors is held completely separate. The consuming application would combine all descriptions referring to a class name (= taxon) to obtain the class/taxon description.

Kevin's attribution model

Single author for project, or
multiple authors:
  organized by taxon (author 1 tax. 1-20, author 2 tax. 21-40)
  organized by character (author 1 char. 1-20, author 2 char. 21-40)
  or change of a single data item

Kevin proposes to find some model of inheritable details, i. e. statements could be made on different levels and consuming applications have the responsibility to trace the hierarchy of objects to find the relevant attribution data. Gregor: This provides a great deal of flexibility in making attribution statements, and minimizes the amount of storage space or duplication needed, but also places a heavy burden on consuming applications.

Sally Hinchcliffe: attribution in IPNI

IPNI is: name + citation details plus lookup tables: authors, ranks, publications. Problem: even a single-letter change may be relevant and needs to be traced.

All database entities are versionable, as in a CVS. Every edit adds a new version of a record. Is there a difference between the last one and the last good one? Some users have a status under which their changes are implemented automatically; other users' changes need approval.

A contribution is split into (database-object-who-why-when) plus multiple (fieldchanged-valuebefore-valueafter) entries. Only the latest version is kept, plus a change log to document it or roll it back. Pending contributions have not yet changed the record and exist only in the change log; approved contributions are executed on the record.

A further problem is concurrency between authorized users; currently IPNI simply generates a conflict that is for humans to resolve.
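
The change-log model reported above can be sketched in Python. This is a guess at the general shape only: the class names, field names and example values are invented for illustration and do not describe IPNI's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class FieldChange:
    field_name: str
    value_before: str
    value_after: str


@dataclass
class Contribution:
    object_id: str   # which database object
    who: str
    why: str
    when: str
    changes: list    # list of FieldChange
    approved: bool = False


def approve(record, contribution):
    """Execute a pending contribution on the record."""
    for change in contribution.changes:
        if record.get(change.field_name) != change.value_before:
            # Someone else changed the field meanwhile: left to humans.
            raise ValueError("conflict on field " + change.field_name)
    for change in contribution.changes:
        record[change.field_name] = change.value_after
    contribution.approved = True


def roll_back(record, contribution):
    """Undo an approved contribution using the logged before-values."""
    for change in reversed(contribution.changes):
        record[change.field_name] = change.value_before
    contribution.approved = False


record = {"epithet": "alba"}   # invented example record
edit = Contribution("name-1", "editor A", "spelling", "2003-10-25",
                    [FieldChange("epithet", "alba", "album")])

print(record)          # pending: the record is still unchanged
approve(record, edit)
print(record)          # approved: the change has been executed
roll_back(record, edit)
print(record)          # rolled back to the logged before-value
```

Only the latest state of the record is stored; earlier states are recoverable because every logged change carries its before-value.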

Conclusion: if a system tracking all changes is required anyway to provide sufficient trust in large projects, it will automatically generate data that fulfill the management and IPR requirements of attribution.

Main discussion

[Editor's note: My apologies, I have too few notes on this topic to provide a truly adequate report of this discussion.]

Changing a class description may change the concept. Taxonomic concepts and descriptions are interrelated. How do you change a concept without providing a description; is that possible at all? Object/specimen is perhaps easier than Class/taxon?

It is difficult for computers to make a distinction between "leaves hairy (usually at tip)" to "leaves hairy (usually at tip)" and "leaves hairy (never at tip)". The latter changes the concept, UNLESS it was already a typo before.

Addition to specimen description: a) additions of new data. Order may be relevant. Distinction between add and edit is weak, especially in an atomic model where you change the character through adding or removing states.

Sally: perhaps option to publish version history through url separately from the SDD document. Interest to recreate previous version (current is 1.2, need version 1.0).

Kevin proposes that SDD does not store history tracking, but applications may keep track of it.

Contradiction and controversy

The SDD model currently deals inadequately with contradiction and controversy in large databases. First we need to remember that the Description object can not be identified with the concept of a "Taxon" (this is similar to DELTA, but most people using DELTA make such an identification). As a consequence, it is quite possible in the current model to have unresolved controversy by having different descriptions identified to the same class (or even specimen = object) name. Two issues can be distinguished:

We could not come to a satisfactory solution here.


Correct Term for specimen-generalization: Instances or Objects?

On Thursday we had decided to change Taxon to Class and Specimen to Instance to make it clearer that the SDD standard is not limited to biological/biodiversity descriptions and identifications. This subject was reopened today (Saturday) after Kevin had used the term "object" rather than "instance" throughout his introductory talk. Furthermore, it was noted by Martin Pullan that Instance is a result of classification, not a term for a specimen itself. The real-world object in biology is independent of instantiation; instantiation is rather the assignment of a class name to an object in identification. Consequently, while keeping the term class as a generalization for taxa, we decided to change "Instance" to "Object" in the schema.


Correct term for processes inferring information in object or class trees

Since we now have a proposal for object and class hierarchies, we went back to the discussion of the inference processes, and the correct term to use for inference on a tree, and viewing information up and down the tree. A discussion on the list was previously already summarized in "Choice of appropriate term for inferring information on object or class trees". The discussion in Lisbon has been added to this document.

Summary:

Inference is the preferred head term to refer to the following processes:

See the discussion document for my minority vote on this :-).


Character Ontology

PartOf is in the tree, but KindOf (berry and capsule are kinds of fruit) is not. It can be expressed as a character, but then the relation to the structure is missing. The current situation seems to be rather unsatisfactory in this regard.

[Editor's note: After the meeting Gregor proposed to remove the flat list of shared state sets from the SDD model and place them into the ConceptTrees (formerly CharacterGroupings). These trees can express structural or part hierarchies, property hierarchies, or method hierarchies. Most shared state sets would be placed in the property hierarchy (values for color, shape, etc.), but the kind-of information would be placed in the structural tree. Also method states (light microscope available or not) could be placed, although these may need additional properties. I envision the latter as unrecorded decision states, introduced only for the purpose of dependency rules. This needs further discussion!]

The question is intimately related to the dependencies that automatically arise from the character ontology. If the PartHierarchy contains Structures and types, then leaflet characters are only applicable if the leaf type is compound leaf, having leaflets. Is there a more elegant way than the classical DELTA-like character dependency model?

Bryan uses: related term, broader term, narrower term.

Gregor will attempt to provide a more complex model of expressing ontological knowledge within the current GlossaryEntries. [This is now included in SDD 0.9 released Dec. 1. 2003]


Name for "SDD" standard

The choice of name for the standard was on the agenda, but in the end the time was insufficient to discuss this topic on Saturday. However, in the plenary SDD report on Sunday we touched on this topic. The important question of whether all TDWG standards should try to follow a "naming pattern" to make them recognizable as a family of standards was raised. At the moment only one standard (ABCD) has a relatively fixed name. Although it was recognized that a naming pattern would help to advertise TDWG and to point from one standard to another, the general opinion seems to be that it is too difficult to find names that are easy to memorize and have such a pattern. The family of XBio, XTax, XSpec (= collections), XDelta, XMonograph, etc. names present in the proposed names for SDD found (to my chagrin :-) no positive echo.

As a result, the final name for the SDD standard is still undecided!


Points that were on the agenda but were not discussed:


Please send any necessary corrections to G.Hagedorn@bba.de
(Gregor Hagedorn, Convener)




First published 2003-11-25, last update: 2003-12-12.
