TDWG working group:
Structure of Descriptive Data (SDD)

Minutes of the SDD meeting in Christchurch, New Zealand, 11-17. October 2004

(Version 0.9)


Summary

A primary goal of this year's meeting was to advance SDD on the standards track. The procedures of TDWG require that in order to vote on a standard, the TDWG meeting in the preceding year has to review the standard. 60 days prior to the meeting a review version must be circulated. Accordingly, SDD version 1.0 beta 2 was published in due time and a call for review was made on several email lists. At the meeting in Christchurch itself a plenary session (Tuesday afternoon) was dedicated to the public review and criticism. The plenary review was favorable with little criticism.

However, in discussion during the main meeting in the two dedicated days of SDD workgroup meetings, important points were raised that need to be addressed. A focus of the meeting was to compare the current features of SDD with the features required by Lucid 3 and the CIPRES project. This meeting document first contains a protocol of the plenary session, and then minutes of the SDD workgroup meetings.

Note: The changes to the standard have to be finalized 180 days before next year's meeting, at which SDD is up for voting. This date is approximately March/April 2005!

The updated and most recent version of the schema that resulted from the discussions can currently be found on the WIKI under CurrentSchemaVersion.


Table of Contents

Tuesday afternoon, 12. October 2004 - Plenary TDWG Session to review SDD standard proposal

Friday morning, 15. October 2004 - "Cafeteria meeting" with focus on the phylogenetic CIPRES project

Saturday, 16. October 2004 SDD Working group session, day 1

  1. Links with GBIF and species bank
  2. Data interoperability and roundtripping
  3. SDD light - which subsets to create
  4. Information about Linnaeus II / ETI
  5. Information about Lucid 3 - relations to SDD
  6. Scope of illustrations that are specific to a description (in contrast to those illustrating terminology)
  7. Should character have rich label/wording/images (like concepts and states)?
  8. Character external linking (sharing the operational terminology)
  9. Deprecating characters


Sunday, 17. October 2004 SDD Working group session, day 2

  1. Provide true LSIDs or shortened ID (removing the constant part?)
  2. Glossary - Ontology - use OWL?
  3. Primer and other documentation
  4. Name for DataInterface/Proxydata
  5. Should images be describable?
  6. Review of coding status design
  7. Revision of modifiers (applicability to state versus character)
  8. Repeating characters within a single description container
  9. Explicit statement of the boolean rules underlying characters and states in a description
  10. Problem with Audience/Language optionality and schema validation

 


Tuesday afternoon, 12. October 2004
Plenary TDWG Session to review SDD standard proposal

(Protocol by Shirley Cohen - many thanks to her! The text is slightly edited by Gregor Hagedorn, all errors are his!)

Schedule:
14:30 Introduction to SDD (Gregor Hagedorn, BBA, Germany)
15:10 SDD support tools (Jacob Asiedu & Robert A. Morris, Univ. of Mass. at Boston)
(15:30 Coffee)
16:00 Example instance documents (coded description) (Robert A. Morris, Univ. of Massachusetts at Boston)
16:20 Abstract characters and modifiers (Gregor Hagedorn, BBA, Germany)
16:40 Open discussion - "Your FAQs"
17:30 Close

The following presentations are available as powerpoint files:

Note: Some background on the use of UBIF and the UBIF proxy data proposal can be found in the following presentations held in other sessions at TDWG:

 

Introduction to SDD (Gregor Hagedorn, BBA, Germany)

(Available as 0.9 MB powerpoint and 8 MB pdf.)

What are descriptive data? "Descriptive data inform about the state of repeatably observable, inherent properties of objects (= individual organisms) and classes (= taxon)". Descriptive data are mostly about morphological data. They have a long history in biology. The definition of descriptive data includes "chemical/enzymatic features or molecular data". The driving force is identification. Relevant are hierarchies of phylogenetic relations. Descriptive data are the mapping between the real world and abstract world. This mapping can be called the "library of life" or "key to life". Better names are most welcome!

Different types of descriptive data exist. First, there is a need for terminology. Coming up with a terminology requires expertise and experience. There is a split between the terminology that is fundamentally defined and the selection of a preferred terminology. The preferred terminology in SDD is referred to as an operational terminology (definition of characters, states, etc.). The standard keeps operational terminology short and all other terms are defined only in a glossary. Operational terminology allows you to add illustrations and refer for more information to the fundamental terminology (glossary entries). Glossary entries are not mandatory. "Concept" trees represent hierarchies of characters (including their states). In addition, SDD allows you to make assumptions about characters. It has multilingual support. You only have to translate the terminology, then you have a multilingual description.

Once you have a terminology, you can then create descriptions. This process can be conceptualized as creating a data matrix of character x taxon item whereby every cell in the matrix can have multiple values or observations. Another way to conceptualize this is a questionnaire form. A form restricts the way you can answer questions. It gives you some structure, simplifying the interpretation of answers.
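
The matrix view described above can be sketched in a few lines of Python. This is only an illustration of "one cell, several observations"; the taxon names, character names, and values are invented and are not taken from any SDD data set.

```python
from collections import defaultdict

# Hypothetical character x taxon matrix: each cell holds a list of
# observations, so one taxon can have several values per character.
matrix = defaultdict(list)

def score(taxon, character, value):
    """Record one more observation in the (taxon, character) cell."""
    matrix[(taxon, character)].append(value)

def cell(taxon, character):
    """Return all observations for a cell (empty list if unscored)."""
    return list(matrix.get((taxon, character), []))

score("Taxon A", "leaf shape", "elliptic")
score("Taxon A", "leaf shape", "ovate")        # second value in the same cell
score("Taxon A", "leaf length (mm)", 42.0)
score("Taxon B", "leaf shape", "linear")
```

Here `cell("Taxon A", "leaf shape")` holds two values, while an unscored cell such as `cell("Taxon B", "leaf length (mm)")` is simply empty.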

Another form of description is natural language. This is not ideal because it can lead to errors of interpretation. There is too much freedom associated with this form of description. What are needed are tools to produce higher-level quality data. DELTA and NEXUS-based tools are examples. Moreover, once you have structured data, there is no reason you can't generate the natural language descriptions from the structured data.

DELTA has a large user base. However, it has legacy problems and 170 directives. It has some principal limitations. In 1999, there was a complete revision of DELTA. It was concluded that DELTA needed a complete redesign, which led to the start of SDD. SDD is currently at version 1.0 beta. SDD remains complex. One possibility is to have a light version of SDD that would make it easier for most people to code against. SDD is backwards compatible with DELTA and Lucid: except for selected rejected concepts, DELTA and Lucid data should be fully expressible in SDD.

SDD Design Goals: The schema is strongly-typed. It has very few anonymous types. It is close to an object-oriented schema. The inheritance mechanisms are limited, but the schema allows for extensions. SDD is less concerned with human readability and more concerned with the relationships between objects. SDD should not be exclusively bound to the biological domain. That is why we use the term "class" and not "taxon", and object instead of specimen or observation.

SDD should be in one complete format and be able to manage data during the analysis process as well as expressing final, fully revised work. It should be able to support sample values. It supports images. The form structure should limit one's ability to express oneself using natural language. DELTA in some respects required too much to be expressed in annotations and comments. In SDD, many of these annotations have structure through the use of modifiers.

SDD is a UBIF application. The metadata section will be identical to that of ABCD's. Element container: "External Data Interfaces" = interfaces to the outside world, such as taxa, agents, publications. The "class" section contains taxa. Agents are e. g. persons. All are proxies that may refer to external data sources. Within the UBIF general structure, the SDD data are in the DescriptiveData element. The description section consists of: terminology, coded descriptions, natural language, identification key. Configuration section allows setting of defaults.

It would be useful to have a common library of terminology. However, this feature won't be part of this version of SDD. If you allow for modification, you need to have support for character revisions. In 20 years, we hope to have a more stable terminology, but for now, we need to have flexibility. Increased support for reusability of definitions across projects is desirable, but the options are still under debate.

Stinger asked if there are any formal use cases and Bob said they would produce some in the coming weeks.

SDD support tools (Jacob Asiedu & Robert A. Morris, Univ. of Mass. at Boston)

(Available as powerpoint.)

Developed 2 SDD support tools: SDD Description Editor and SDD Debug Ref. The editor is used to construct SDD documents. The editor is written in C#. The editor assumes you have a valid terminology in the SDD document. The editor enforces constraints and prevents scoring of states. The editor is still a reference implementation. We haven't decided to turn it into a product yet.

SDD's ids can be seen as primary keys and refs can be seen as foreign keys. These ids are constrained by XPath, e. g. they have to be unique within a document. The Debug Ref tool has several modes of operation. Showed demo. Projects are supported by Boston Electronic Field Guide.
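
The primary key / foreign key analogy can be sketched as a small Python check. This is not the Debug Ref tool and does not use the real SDD schema; the element names (`Dataset`, `Character`, `Score`) and the `id`/`ref` attribute layout are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element and attribute names are illustrative only
# and do not reproduce the actual SDD schema.
DOC = """
<Dataset>
  <Characters>
    <Character id="c1"/>
    <Character id="c2"/>
  </Characters>
  <Description>
    <Score ref="c1"/>
    <Score ref="c9"/>
  </Description>
</Dataset>
"""

def check_ids(xml_text):
    """Return (duplicate ids, refs that point to no id) for one document."""
    root = ET.fromstring(xml_text)
    seen, duplicates = set(), set()
    for elem in root.iter():
        ident = elem.get("id")
        if ident is None:
            continue
        if ident in seen:
            duplicates.add(ident)   # "primary key" used twice
        seen.add(ident)
    refs = {e.get("ref") for e in root.iter() if e.get("ref") is not None}
    return duplicates, refs - seen  # "foreign keys" without a target

duplicates, dangling = check_ids(DOC)
```

In this invented document every id is unique, but the reference `c9` has no matching id and is reported as dangling.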

Example instance documents (Robert A. Morris, Univ. of Massachusetts at Boston)

(Available as powerpoint.)

Showed us what an SDD document looks like through XMLSpy. Browsers are decent viewers. XML should be produced by exchange tools such as a database application, not by hand. Most of the schema sections are optional. Introduced NSF project "Ants of the American Museum Congo Expedition". How good is SDD as a vehicle for answering questions using XQuery?

SDD doesn't prescribe how to make a description. SDD is a meta schema. Its purpose is to constrain how to write descriptions. Showed 2 examples of SDD documents from the Ant project. SDD editor wouldn't let you score a thorax shape as blue. However, an XML schema could. Therefore, there is value in having an SDD editor like Jacob's.
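
The "thorax shape as blue" example can be sketched as follows. This is a minimal illustration of the kind of check an editor can make on top of schema validation, not the behavior of the actual SDD editor; the terminology dictionary and its state names are invented.

```python
# Hypothetical terminology: each character lists the states it allows.
TERMINOLOGY = {
    "thorax shape": {"rounded", "elongate"},
    "body color": {"blue", "black"},
}

def score_state(character, state):
    """Accept a score only if the state belongs to the character."""
    allowed = TERMINOLOGY.get(character)
    if allowed is None:
        raise KeyError(f"unknown character: {character}")
    if state not in allowed:
        raise ValueError(f"{state!r} is not a state of {character!r}")
    return (character, state)
```

With these assumptions, `score_state("body color", "blue")` is accepted, while `score_state("thorax shape", "blue")` raises a `ValueError`, even though both strings would pass a purely structural check.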

Concept trees are ways to organize states (e. g. body parts such as head, eyes, mouth).

Introduction of XQuery: A recently standardized language for searching through XML documents. Existing XQuery engines: Berkeley DB.XML. Bob's student did a study on how well the different XQuery engines perform. Results will be posted on the Wiki.

Abstract characters and modifiers (Gregor Hagedorn, BBA, Germany)

(Available as powerpoint and pdf; this is the same file referenced for the first talk.)

What is a character? Characters can be categorical, quantitative, numeric, statistical, range (e. g. color range). SDD allows you to define a new character type. SDD needs to support molecular types. Character shapes could be defined by a mathematical function. Abstract characters versus concrete characters. Process for creating a new character type: Create 1) definition type 2) coded data type 3) markup text type and 4) perhaps sample data type.

There may be a mapping between character types. These mappings are defined in the terminology of a character.

Modifiers act on a statement. Types of modifiers: frequency, probability, spatial, temporal, degree/kind.


Friday morning, 15. October 2004
"Cafeteria meeting" with focus on the phylogenetic CIPRES project

Participants

Asiedu, Jacob (Univ. Boston) kasiedu-at-cs.umb.edu
Cohen, Shirley (Texas Advanced Computing Center, Austin, CIPRES)
Guralnick, Robert (University of Colorado)
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Miranker, Daniel (University of Texas at Austin, CIPRES)
Morris, Robert (Univ. Boston) ram-at-cs.umb.edu
Thiele, Kevin (K.Thiele-at-cbit.uq.edu.au)

Discussion

This unscheduled meeting was held because Shirley Cohen and Dan Miranker, both working on the CIPRES project, could not attend the SDD workgroup discussions on the weekend. The main purpose of the meeting was to increase the knowledge of the SDD group about CIPRES and vice versa.

Shirley Cohen explained the use case of adding characters/states information to regions of interest (ROI), either as free-form text annotations (like an arrow pointing from text to feature) or as analytical entities (as defined in SDD characters or concepts), to 3-dimensional images (scanning project, PI Julian Humphries).

Dan Miranker outlined some of the architectural plans for CIPRES. CIPRES is based on matrix data, mostly from the Treebase project. It then uses matrix data for phylogenetic reconstructions. Treebase is currently using NEXUS, but as part of CIPRES it is planned to convert Treebase to a new "Treebase 2" structure having richer representations. In this context, an improved data format would be desirable. Data in current Treebase are often big string fields with relatively little structure. A new version of Nexus ("NexML" - great name) is considered, but SDD is also considered as a link between character databases and the central storage. The preliminary assessment of SDD by CIPRES was favorable.

The relation between SDD and NEXUS was discussed. Especially the Assumptions block in each format needs further comparison. An example is the ordering of states in a tree. This was already considered by SDD in Brazil, but so far not introduced into the schema because we hoped for input from the phylogenetic analysis/NEXUS community.

A major question was which relatively simple changes/additions could be made to SDD 1.0 so that its utility to the CIPRES project and the phylogenetic data analysis community in general is increased. SDD is certainly lacking some features of NEXUS that specifically address phylogenetic analysis procedures. However, no specific recommendations or requests to change SDD could be identified at the moment. The basis of further discussion could be an internal draft design/requirement/road map document by Piel for Treebase2 (or the new NEXUS data model?). Dan and Shirley will ask whether this can be made available to SDD.

One point identified was that although SDD already handles multiple taxon trees in the form of the ClassHierarchy, it does not provide for branch lengths / distance measures. Only the pure topology (but including multifurcations) of a NEXUS tree can be expressed in SDD at the moment. Questions: a) How should this be added to SDD - can we have semantics or do we need a semantic-free floating point value field? b) What would be the appropriate element name, keeping it as general as possible? PhylogeneticBranchDistance? c) Is a single floating point value sufficient?

Also: New Nexus model stores trees as edges - consequences for SDD? Should we give the xml-hierarchy approach up and store our trees in a more relational way as well? The representations are convertible, but the correct decision should be taken now. This is an urgent point to decide! Note that the protocols (Biocase/Tapir) presented in Christchurch also have problems with trees of unlimited depth.
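
The statement that the two tree representations are convertible can be sketched in Python. The tuple layout, the taxon names, and the optional branch length per edge are assumptions for illustration; they are neither the SDD ClassHierarchy nor the new NEXUS model.

```python
# Nested ("xml-hierarchy") form: (name, [children]).
# Names are invented; multifurcations are allowed.
nested = ("root", [
    ("A", []),
    ("B", [("B1", []), ("B2", []), ("B3", [])]),
])

def to_edges(node, parent=None, lengths=None):
    """Flatten a nested tree into (parent, child, length) rows.

    `lengths` is an optional, hypothetical mapping child -> branch length;
    a missing length is stored as None (pure topology).
    """
    lengths = lengths or {}
    name, children = node
    rows = []
    if parent is not None:
        rows.append((parent, name, lengths.get(name)))
    for child in children:
        rows.extend(to_edges(child, name, lengths))
    return rows

def to_nested(rows, root):
    """Rebuild the nested form from edge rows (branch lengths are dropped)."""
    children = [c for p, c, _ in rows if p == root]
    return (root, [to_nested(rows, c) for c in children])

edges = to_edges(nested, lengths={"A": 0.12})
```

In the edge form a branch length is just one more column on each row, whereas the nested form needs a new attribute or element on every node. The round trip `to_nested(edges, "root")` restores the original topology.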

Another issue very important to NEXUS is nucleotide or protein sequences and alignments. Both have two forms: original sequence data and aligned matrix form. The alignment is usually local to a set of sequences prepared for comparison. Global alignments are difficult and rare (they exist e. g. in the ARB project). Original sequences are essentially text and easily supported. Unfortunately, since few analyses operate on them directly, and since they are usually vouchered in GenBank, EMBL etc., support is not very important. On the other hand, alignments are important for analysis. Alignments have the property that for one sequence and one taxon, multiple alignments may exist, if the sequence has been compared with multiple, non-congruent sets of taxa. Strictly, this is supported neither in NEXUS nor in SDD.

Since NEXUS uses character strings to store taxon data and identifies character position (= column) with character (except in the case of polymorphic coding), expressing sequence alignments in NEXUS is straightforward. In SDD, strictly, each sequence position equates to a character (the label of which presumably would be just a position number). Breaking down the data like that is desirable for descriptive data where characters have significant metadata and where it is desired to manage and rearrange characters. This is the case in morphological data, where in the NEXUS model it would be extremely difficult to manage a matrix of 1000 characters, perhaps many with polymorphic coding. However, it is not appropriate for sequences. One reason for this is that in sequence data the states at an alignment position are not truly scientific observations (or measurable); they are the result of a first step of the alignment analysis process. It is often reasonable to insert or delete gaps in some sequences in an alignment, essentially shifting a block of data into the adjacent characters.
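
The effect of a gap insertion on a position-per-character model can be sketched in Python. The sequences are invented, and the "one character per column" view is only a simplified stand-in for how an alignment would look if each position were an SDD character.

```python
# Two hypothetical aligned sequences of equal length ('-' marks a gap).
alignment = {
    "Taxon A": "ACGTAC-",
    "Taxon B": "ACTACG-",
}

def insert_gap(seq, position):
    """Insert a gap at `position`, shifting the rest one column right.

    A trailing gap is dropped so the aligned length stays the same.
    """
    if not seq.endswith("-"):
        raise ValueError("no trailing gap to absorb the shift")
    return seq[:position] + "-" + seq[position:-1]

def as_characters(aln):
    """Per-position view: one 'character' per column, one state per taxon."""
    length = len(next(iter(aln.values())))
    return [{taxon: seq[i] for taxon, seq in aln.items()} for i in range(length)]

before = as_characters(alignment)
alignment["Taxon B"] = insert_gap(alignment["Taxon B"], 2)
after = as_characters(alignment)
changed_columns = [i for i, (b, a) in enumerate(zip(before, after)) if b != a]
```

In the string form the edit touches one sequence. In the per-position form the same edit changes the state of every "character" from the insertion point onward (columns 2 to 6 here), which illustrates why the block shift is awkward to manage character by character.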

Dan Miranker criticized that SDD is too "monolithic" in that it attempts to describe all the relations among its objects. As far as I understood, he would prefer, e. g., terminology and the different kinds of descriptions in separate schemata. Gregor's view is that for data to be permanently valuable, both terminology and description need to be combined in some container. Although this does not need to be a single document, it is convenient to allow it in one document, to simplify data archiving and document-like exchange among scientists. However, it is recognized that many use cases will want to exchange individual objects from the SDD schema. An identification web service may request only the minimal base information for the characters and the preferred concept tree, and later request individual character details, or single descriptions without requesting the terminology again. It remains unclear how the schema could be reorganized so that both use cases are equally obvious. Gregor's current assumption is that a few lines of xslt code (which should be made available with future distributions of SDD) create an all-optional schema with the identity constraints removed, making it easier to exchange objects as fragments of full SDD.
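
The idea of deriving an all-optional schema without identity constraints can be sketched as follows. The minutes propose xslt for this; the sketch below uses Python instead, operates on a tiny invented stand-in schema rather than SDD, and only relaxes local element declarations (required attributes, for example, are not handled).

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ELEMENT = f"{{{XS}}}element"
CONSTRAINTS = {f"{{{XS}}}{name}" for name in ("key", "keyref", "unique")}

# Tiny stand-in schema; the element names are invented and this is not SDD.
SCHEMA = f"""
<xs:schema xmlns:xs="{XS}">
  <xs:element name="Dataset">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Terminology" type="xs:string"/>
        <xs:element name="Description" type="xs:string" minOccurs="1"/>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="CharacterKey">
      <xs:selector xpath="Terminology"/>
      <xs:field xpath="@id"/>
    </xs:key>
  </xs:element>
</xs:schema>
"""

def relax(schema_text):
    """Make local elements optional and drop identity constraints."""
    root = ET.fromstring(schema_text)
    for parent in list(root.iter()):       # snapshot, since children are removed
        for child in list(parent):
            if child.tag in CONSTRAINTS:
                parent.remove(child)
            elif child.tag == ELEMENT and parent is not root:
                # global declarations (children of xs:schema) cannot carry minOccurs
                child.set("minOccurs", "0")
    return root

relaxed = relax(SCHEMA)
```

The relaxed tree can be written out with `ET.tostring(relaxed)` and used as a non-normative schema for validating fragments.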

We agreed that within 60 days (i. e. until December 15.) a request from CIPRES should be made to add certain features that are present in NEXUS or required in other ways by CIPRES, but currently missing in SDD. This request should include features considered general (some features of NEXUS may be more appropriate for application-specific extensions to SDD, using the SDD CustomExtension mechanism).

Shirley will research available options to define regions of interest within 2- or 3-dimensional images (SDD.MediaResources). She is looking at "GML 3", which is the standard that the spatial data people are using and supports volumes.


Saturday, 16. October 2004
SDD Working group session, day 1

Participants

Asiedu, Jacob (Univ. Boston) kasiedu-at-cs.umb.edu
Buis, Rob (ETI) rbuis-at-eti.uva.nl
Chenin, Eric (IRD France, French node manager)
Endresen, Dag Terje (Nordic Gene Bank, POB 41, SE-230 53 Alnarp, Sweden)
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Morris, Robert (Univ. Boston) ram-at-cs.umb.edu
Richardson, Ben (Western Australian Herbarium, Dept. of CALM)
Saddi, Chandramohona (http://www.findingspecies.org/)
Thiele, Kevin (K.Thiele-at-cbit.uq.edu.au)
Ulate, William (INBio Costa Rica)
Wilton, Aaron (Landcare NZ, Lincoln; "eBiota")

Agenda

 


Links with GBIF and species bank

Eric Chenin: Species bank GBIF/ENBI workshop ETI in March.

Everyone at the meeting thought the GBIF name "Species bank" should be changed:

Where are the data? Almost always in the hands of commercial publishers. Copyright and database rights are an issue; database rights are worse, since they prevent the recreation of databases. If someone has created a database on a topic, no one else may redo it, not even from scratch. A motion by Gregor to license SDD under the condition that data should be made available after a reasonable period of time ("10-20 years?"), rather than being made available to science only 70 years after the death of the last collaborating author, was not considered relevant and was rejected.

Some existing systems:


Data interoperability and roundtripping

Problems when an application is changing data. Situations:

Graceful degradation of data is desired in the design of SDD. For example, if modifiers are not supported, the remaining data are likely to remain usable for an identification application only supporting characters and states.

Important clarification: Roundtripping is NOT intended to go from coded to natural language description and back. Natural language description can be created from coded descriptions, but this should be considered "cached information", if it is stored at all. Consumers should not assume that coded descriptions and natural language descriptions have a relationship! The schema gives no reason to assume so.


SDD light - which subsets to create

Bob Morris: Jpg2000 - gives reader advice on required features. Features used in SDD can be detected, but parsing all may be time consuming. Writer could inform about which features have been used in a summary. But this is like caching a previous parsing of existing data. Could be supported by a separate optional schema and stored in CustomExtension in the SDD Metadata element. Is not essential for SDD at the moment.

What do we mean by "SDD light"? Primarily, the issues are social or educational. SDD light is primarily a "SDD light documentation" rather than a schema. Used for communication about respective capabilities! However, non-normative schema files may be created for illustration and documentation purposes.

Possible SDD light profiles

Other view of SDD light: Advice to consumers and producers about a subset that is reasonable to start with. But don't die on instance documents that contain more! If they do, inform the user, preferably listing the tag names that have been ignored (see Data interoperability and roundtripping above). Graceful degradation to light versions is largely possible (e. g. removing modifiers). However, dependencies between parts of the schema exist: obvious ones like terminology definition to terminology usage, but also less obvious ones. Example: when ignoring concept trees, the feature of state-set reuse is also missing. In an explicit light version this can be removed. Thus the light versions may give advice about recommended "degradations".
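
The recommendation not to fail on richer documents, and to report what was ignored, can be sketched in Python. The element names (`Description`, `Character`, `State`, `Modifier`, `Note`) are invented for illustration and do not reproduce the SDD schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document with more features than a "light"
# consumer supports; tag names are assumptions.
DOC = """
<Description>
  <Character ref="c1">
    <State ref="s1"/>
    <Modifier ref="m1"/>
  </Character>
  <Note>free text</Note>
</Description>
"""

def read_light(xml_text):
    """Collect (character, state) scores and report the tags that were ignored."""
    root = ET.fromstring(xml_text)
    scores, ignored = [], set()
    for character in root:
        if character.tag != "Character":
            ignored.add(character.tag)
            continue
        for child in character:
            if child.tag == "State":
                scores.append((character.get("ref"), child.get("ref")))
            else:
                ignored.add(child.tag)   # e.g. modifiers: skip, do not fail
    return scores, sorted(ignored)

scores, ignored = read_light(DOC)
```

The consumer still obtains the character/state score and can tell the user that `Modifier` and `Note` were skipped.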

Should SDD light also remove validation constraints from the schema? It would be possible to have fewer validation / facet / identity constraints, with a recommendation to test any SDD-light export under SDD full. However, this would remove the guarantee that SDD light is always valid under full. It would be better to uphold this! Conclusion: this should not be the same issue as "light". Low validation for reduced documentation size or as an intermediate schema during application testing is a different issue and can be done separately.

Aside, an important recommendation / clarification: On successive exports, the IDs of the same object (e. g. character) should be identical.

Perhaps even an "SDD super light": the "hello world" of SDD?

What needs to be there? List of priorities!

=== Terminology ===

=== CD ===

=== KEY ===

Bob will produce a data set of "hello world", "super light encoded". Maybe also identification key. It would be good to create a "hello NLD world" - who volunteers?


Information about Linnaeus II / ETI

ETI is principally interested in SDD, but SDD is not considered helpful to increase income of ETI - therefore no resources can be invested.

Editor's note: The old Linnaeus used free-form formatted text (without semantic or structured markup) for descriptions, and separately NEXUS matrices for categorical information used in interactive multiple-entry identification. No information regarding features required for Linnaeus but perhaps missing in SDD was available at the meeting.

Rob: Species bank: Wouter Addink is responsible


Information about Lucid 3 - relations to SDD

Kevin Thiele gave a very impressive demonstration of Lucid 3 features, from which several discussion topics spun off. Illustrations and character images are shown below, another discussion of modifiers was started, but continued on Sunday.


Scope of illustrations that are specific to a description (in contrast to those illustrating terminology)

Kevin: desirable to illustrate individual states in descriptions, in addition to the existing SDD ability to associate a media resource with an entire description or a character in a description.

(Clarification: in character and state definition, it is possible to associate media with states; the current discussion refers to a specific illustration, like the stamen shape in a specific species. The question is whether it is sufficient to say: these images illustrate stamen shape in this species, or whether it is necessary to say: this image illustrates this shape, this image illustrates that shape.)

We did not have full agreement on the importance of allowing images to be specific to states rather than the character in a given taxon. However, Kevin agreed that it is important to associate an image with an entire character, esp. if the image illustrates all or many states. Thus the discussion is not about either character or state, but rather that character images in individual descriptions are agreed upon, and that Kevin asks to also support state-specific images. Therefore, the first version of SDD may limit itself to character-description, and state-specific description images can be added in later versions as extensions, without breaking existing software.

Editor's note: In the hierarchical schema MediaResources are currently added in two places: at the description, and at individual characters (plus also in terminology). A relational implementation could instead create an entity DescriptionMediaResources, with required media resource and description id foreign keys and optional character id foreign key. Both a model following the hierarchical schema and the relational entity could easily be extended to make an illustration specific to a state in a specific description.
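
The relational entity mentioned in the editor's note can be sketched with SQLite. Only the entity name DescriptionMediaResources and its three keys come from the note; the other table names, the column names, and the sample rows are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Descriptions   (id INTEGER PRIMARY KEY);
CREATE TABLE Characters     (id INTEGER PRIMARY KEY);
CREATE TABLE MediaResources (id INTEGER PRIMARY KEY, uri TEXT NOT NULL);
-- Required media and description keys, optional character key.
CREATE TABLE DescriptionMediaResources (
    media_id       INTEGER NOT NULL REFERENCES MediaResources(id),
    description_id INTEGER NOT NULL REFERENCES Descriptions(id),
    character_id   INTEGER REFERENCES Characters(id)  -- NULL = whole description
);
INSERT INTO Descriptions VALUES (1);
INSERT INTO Characters VALUES (2);
INSERT INTO MediaResources VALUES (10, 'habit.jpg');
INSERT INTO MediaResources VALUES (11, 'stamen.jpg');
INSERT INTO DescriptionMediaResources VALUES (10, 1, NULL);
INSERT INTO DescriptionMediaResources VALUES (11, 1, 2);
""")

rows = con.execute(
    "SELECT m.uri, d.character_id FROM DescriptionMediaResources d "
    "JOIN MediaResources m ON m.id = d.media_id "
    "WHERE d.description_id = 1 ORDER BY m.id"
).fetchall()
```

Under this layout, a state-specific illustration would only need one further optional foreign key column (for example a hypothetical `state_id`), leaving existing rows valid.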


Should character have rich label/wording/images (like concepts and states)?

Current SDD terminology:
Concept: RichLabel - with Wording, icon, images
Character:SimpleLabel - no icon etc.!
State: RichLabel - with Wording, icon, images

Background information: the character labels in a concept tree are not character-specific, but tree-specific. For each character x concept tree combination, additional character labels have to be stored in the trees. Example:
Concept Tree 1
  Colors
    Leaf - charid=2
Concept Tree 2
  Leaf
    Color - charid=2

To avoid the introduction of special character x concept-tree node-types to represent the terminal nodes (leaves) in the concept tree, each character reference in a concept tree is modeled through a concept node. This node, instead of being empty or containing further nodes, contains a single character reference. Thus, concept objects provide for each character in each tree rich formatting information, including wording, icons, images.

If both the concept terminal node around the character and the character itself provide rich icon/wording information, rules of precedence would have to be formulated. If an Icon/Image is only present at Concept, this would be used. If an Icon/Image is only present at Character, this would be used. If both are present, the Concept could take precedence. However, if an icon is present at Concept and other selector images at characters, a decision would have to be taken whether precedence applies to the whole rich label object, or to its individual data items.
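
The two precedence options (whole rich label object versus individual data items) can be sketched in Python. The `RichLabel` class and its three fields are a simplified assumption, not the SDD type of the same name.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class RichLabel:
    """Simplified stand-in for a rich label; field names are assumptions."""
    wording: Optional[str] = None
    icon: Optional[str] = None
    image: Optional[str] = None

    def is_empty(self):
        return all(getattr(self, f.name) is None for f in fields(self))

def resolve_whole(concept, character):
    """Precedence on the whole object: use Concept unless it is empty."""
    return character if concept.is_empty() else concept

def resolve_per_item(concept, character):
    """Precedence per data item: Concept wins item by item."""
    merged = {}
    for f in fields(RichLabel):
        value = getattr(concept, f.name)
        merged[f.name] = value if value is not None else getattr(character, f.name)
    return RichLabel(**merged)

concept = RichLabel(icon="leaf.ico")
character = RichLabel(wording="Leaf color", image="leaf.jpg")
```

For the case described above (icon at Concept, other data at Character), `resolve_whole` discards the character's wording and image, while `resolve_per_item` keeps them next to the concept's icon. Two applications choosing differently would display different results from the same data.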

The purpose of leaving the character a relatively abstract object (with only label) without rich formatting features is to remove the potential duplicity of where the relevant data may be. This duplication is considered potentially confusing to both content creators and application developers and viewed as a potential source of interoperability problems.

On the other hand, it may be considered more systematic to provide identical features for all kinds of objects. However, this argument would have to be thought through for the case of measures, coding status states, etc. as well. Proposals / a discussion document on this are welcome! For the time being we decided to start with the character being restricted, under the assumption that it may be easier to extend the richness of the schema in later versions, rather than restrict it.


Character external linking (sharing the operational terminology)

It seems desirable to handle objects from other domains (like taxon names) and objects from the domain of descriptive data (e. g., a character from an external terminology) symmetrically (or similarly) regarding internal/external linking. However, currently they are handled differently. One major difference between the two cases is that the external data can be abstracted and only a short set of "interface data" is used, whereas the descriptive data objects are required in full:

External data source            Internal (in dataset)
Taxa                            SDD: ClassName proxy
 = full data                     = only short interface,
    (e. g. LinneanCore or TCS)      especially Label, Link, Rank
Character                       Character
 = rich data                     = rich data

We all agreed that the decision in SDD to currently treat descriptive and non-descriptive objects differently is not ideal. However, nobody could propose a truly better general architectural plan. Some attempts to solve the linking problem, and how to use the GUIDs, which would in any case be the backbone of a solution, are discussed below:

GUIDs/LSIDs on entire terminology or on individual objects like character/states? How to structure? Examples:

--- External: --- 
Terminology GUID=111
  Characters
    Character id=1
Terminology GUID=222
  Characters
    Character id=1

--- Local: ---
Local Terminology GUID=444
  Characters
    Character id=1
Description
   Character ref=111/1 (= this refers to external character)
   Character ref=1 (= this refers to local character)

Problem: 1. no value-based ref integrity. 2. no cached access. If no internet access is available, a hand-held computer is not able to perform identification. Further problem: desire to extend information in external objects, specifically to add labels in new languages. One option would be to integrate external terminology into local terminology, and allow extensions in the local objects:

--- External: --- 
Terminology GUID=111
  Characters
    Character id=1
      Label language="en": "Leaf shape"
      State id=1 
        Label language="en": "elliptic"        
Terminology GUID=222
  Characters
    Character id=1

--- Local: --- 
Locally integrated Terminology LSID=444
  Characters
    Character id=111/1
      Label language="de": "Blattform"
  Characters
    Character id=222/1
  Characters
    Character id=1 (local without LSID)
Description
   Character ref=111/1
   Character ref=222/1
   Character ref=1
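
The "locally integrated terminology" option can be sketched in Python. The dictionaries mirror the example above (GUIDs 111 and 222, the "GUID/id" key form, the German label); the label "Petal number" for the purely local character is invented, and the data layout as a whole is an assumption.

```python
# Hypothetical external terminologies keyed by GUID, then by character id.
EXTERNAL = {
    "111": {"1": {"labels": {"en": "Leaf shape"}}},
    "222": {"1": {"labels": {}}},
}

def integrate(external, local_characters, extensions):
    """Copy external characters under 'GUID/id' keys and apply local additions."""
    integrated = {}
    for guid, characters in external.items():
        for char_id, char in characters.items():
            # copy, so local extensions never touch the external original
            integrated[f"{guid}/{char_id}"] = {"labels": dict(char["labels"])}
    for key, labels in extensions.items():
        integrated[key]["labels"].update(labels)
    integrated.update(local_characters)   # purely local ids carry no prefix
    return integrated

def is_external(ref):
    """A ref with a 'GUID/' prefix points at an integrated external object."""
    return "/" in ref

local = integrate(
    EXTERNAL,
    local_characters={"1": {"labels": {"en": "Petal number"}}},
    extensions={"111/1": {"de": "Blattform"}},
)
```

Because the external characters are copied into the local terminology, all three references in the description resolve without network access, and the German label lives only in the local copy.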

With LSIDs, since the data behind an LSID should never change, we need to discuss what happens if the external Terminology is updated. If character xml-data are considered data behind the LSID, a new LSID is required - at least on revision level. Acceptance can be under manual control or rule-based - review and accept/reject. Question: Could instead all character data other than the id be considered metadata in the LSID sense? Then no new LSID would have to be issued if the editors consider a change minor (e. g. an orthographic correction in a label), but a new version could be issued if necessary.

Work flow when the internal integrated Terminology is extended, e. g. when a label in a new language is added: If the prefix of the id is identical to the local LSID, different rules apply; I may make changes, but perhaps have to increase the version part. Now, for the example above, I want to add new German labels. I have to create a copy of the original shared terminology, and locally edit that.

Terminology GUID=999
  Characters
    Character id=1
      Label language="en": "Leaf shape"
      Label language="de": "Blattform"
      State id=1 
        Label language="en": "elliptic"
        Label language="de": "elliptisch"

Fundamental scenarios of sharing terminologies

What are the properties?

Fundamental question: Are links references to the remote objects, or are they copies?

The sequence of importing/linking to external terminology and changing/extending it, can be complicated. Example:

Day: Action:
1   Take char 1, 2, 3
2   add German labels to local representation of linked ones
2   create local characters 4-9
3   get additional external characters 10, 12
4   add German labels to those
4   improve phrasing of label for fully local and locally extended characters

The argument here is that with large and complex character terminologies (e. g., 800 characters/5000 states is not unusual) a simple document-based CVS system cannot be the solution. Although CVS tracks every change perfectly, a relatively straightforward desire to propagate improvements (e. g. added German labels) back to the provider to be considered for integration may be impractical. A relatively straightforward request like: "Which extensions were made to all characters obtained in the last 2 years from terminology provider A? List these and inform the provider to consider for integration!" may require thousands of changes to be analyzed manually (provided the xml-file has been entered into the CVS every day).

Another problem: currently there is no "locked" terminology mechanism in SDD. You also cannot easily know which parts come from external source, and which from internal!

One solution could be to add a second attribute that explicitly references something from the external terminology. Solution 1: simply add information about the source Terminology: Character id="1" sourceterminology="urn:lsid:...." sourceid="2"

Solution 2: Duplicate, keep a proxy/cache of the original TerminologyInterface: ExternalCharacter sourceterminology="urn:lsid:...." sourceid="". Plus then in the integrated terminology have a locally modified version?
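
Editor's illustration (not a decision of the meeting): if each local character carried the source attributes of Solution 1 and a cached original as in Solution 2 were kept, the request "which extensions were made to characters from provider A?" would become a simple comparison. The Python sketch below assumes plain dictionaries as data model. The provider LSID and the function name list_extensions are hypothetical, as is the second character ("Petal length").

```python
PROVIDER_A = "urn:lsid:example.org:terminology:A"  # hypothetical provider LSID


def list_extensions(local_chars, cached_originals, provider):
    """Return (sourceid, language, label) for labels added or changed locally."""
    report = []
    for char in local_chars:
        if char.get("sourceterminology") != provider:
            continue  # fully local character or from another provider
        original = cached_originals.get((provider, char["sourceid"]), {})
        for language, label in sorted(char["labels"].items()):
            if original.get(language) != label:
                report.append((char["sourceid"], language, label))
    return report


local_chars = [
    {"id": "1", "sourceterminology": PROVIDER_A, "sourceid": "2",
     "labels": {"en": "Leaf shape", "de": "Blattform"}},
    {"id": "4", "sourceterminology": None, "sourceid": None,
     "labels": {"en": "Petal length"}},
]
# Proxy/cache of the originals as obtained from the provider.
cached_originals = {(PROVIDER_A, "2"): {"en": "Leaf shape"}}

print(list_extensions(local_chars, cached_originals, PROVIDER_A))
```

The sketch reports only label changes. Restricting the report to a time span ("in the last 2 years") would additionally require a date of import per character, which neither solution records as stated.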

No final solution or decision could be found.


Deprecating characters

Proposal by Ben Richardson: add to terminology objects an attribute like: deprecatedOn="-datevalue-" (or "deprecated_since"?). If in public shared terminologies concepts have to be changed in a new version in ways that are no longer backward compatible, a new character will have to be created, but it is usually undesirable to remove the old character immediately. Instead it could be "deprecated", i. e. existing data may still refer to the character, but the authors of the terminology recommend against using it for new data and indicate that the character may be removed in the future.
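
Editor's illustration: an application could use such an attribute to warn about deprecated characters without rejecting existing data. The Python sketch below assumes that deprecatedOn holds a date and that a character without the attribute is not deprecated. The function name, the dictionary layout and the label "Leaf outline" are hypothetical.

```python
from datetime import date


def deprecation_warnings(description_refs, terminology, today):
    """List warnings for refs to characters deprecated on or before today."""
    warnings = []
    for ref in description_refs:
        deprecated_on = terminology[ref].get("deprecatedOn")
        if deprecated_on is not None and deprecated_on <= today:
            warnings.append(
                f"character {ref} deprecated since {deprecated_on.isoformat()}"
            )
    return warnings


terminology = {
    "1": {"label": "Leaf shape", "deprecatedOn": date(2004, 10, 16)},
    "2": {"label": "Leaf outline"},  # replacement character, not deprecated
}

for warning in deprecation_warnings(["1", "2"], terminology, date(2005, 1, 1)):
    print(warning)
```

Existing data referring to character 1 stay valid. The warning only tells authors that the terminology recommends against using it for new data.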

Important question: how important is the date? It is useful but would require an additional attribute. The status of deprecated itself would fit into the pattern of RevisionStatusEnum, by adding a value past "FullyRevised"! On the other hand, some uses of revision status may not desire the over-the-edge level of deprecated. Also, both term and concept may not be known to many people who are comfortable with the other levels of revision status. This seems to indicate that separate handling may be appropriate.


Sunday, 17. October 2004
SDD Working group session, day 2

(Note: the day continued first with the discussion on sharing terminologies, which in its entirety is reported under Saturday, above)

Provide true LSIDs or shortened ID (removing the constant part?)

True:
 Character id="urn:lsid:lsid.gbif.org:SDD-Piepenbring-Smutfungi:Char/1"
Pseudo:
 Character id="lsid.gbif.org:SDD-Piepenbring-Smutfungi:Char/1"
(and in Description the refs:)
 Character ref="urn:lsid:lsid.gbif.org:SDD-Piepenbring-Smutfungi:Char/1"
 Character ref="lsid.gbif.org:SDD-Piepenbring-Smutfungi:Char/1"

"Bob now asserts that you need LSIDs for all objects in terminology if this is how you are going to reference external - LSIDs should be semantically opaque."

Result: Link/LSID changed from LSIDBody type (i.e. excluding the constant prefix 'urn:lsid:') to full LSID always including it, despite being constant. This change increases file size and amount of data to be transferred, but it simplifies interaction with agents that infer the type from a URI string rather than having knowledge about an element being typed. Compare the discussion on the WIKI under UBIF.GuidLinking.
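
Editor's illustration: data written with the old LSIDBody form can be upgraded mechanically. The Python sketch below only prepends the constant prefix where it is missing and does not validate the remaining LSID parts. The function name to_full_lsid is hypothetical.

```python
LSID_PREFIX = "urn:lsid:"


def to_full_lsid(value):
    """Return the full LSID for a value in either the full or the LSIDBody form."""
    if value.startswith(LSID_PREFIX):
        return value
    return LSID_PREFIX + value


body = "lsid.gbif.org:SDD-Piepenbring-Smutfungi:Char/1"
print(to_full_lsid(body))
```

Applying the function twice gives the same result as applying it once, so it can be run safely over files that mix both forms.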


Glossary - Ontology - use OWL?

Can Glossary be built on an extension of OWL? How?

Bob/Dave Thau:

Does OWL support the following Glossary features (or can it be extended to do so)?
a) multilingual text
b) sensu label
c) images/media resources
d) citation

Bob: OWL intended only for the ontology parts!

Resolution: no problem, since the ontology part is actually marked "__" = development only. Text/images/citation part: probably nothing is gained by it.


Primer and other documentation

Kevin showed an outline of the current primer, with different trails. The term "track" was preferred over "trail". Resolution: Move the primer entirely to the SDD Wiki. It is desired that other people add to the primer. Try to add an extra discussion page to the primer pages to separate the carefully edited content from free discussions.

The terms: "Primer", "Discussion", "Documentation" should always be part of the Heading.

A discussion about the use of css styles on the WIKI followed.


Name for DataInterface/Proxydata

Discussion on name of ExternalDataInterface. External is misleading! Proposals:

No clear preference could be found at the meeting. Conclusion: the name should be changed and the topic raised on WIKI and email list. See http://efgblade.cs.umb.edu/twiki/bin/view/UBIF/NameForProxyOrInterfaceSection.

Compare also the separate presentations on UBIF (Powerpoint, pdf) and proxy data model (Powerpoint, pdf).


Should images be describable?

In CodedDescription, the first choice in the Header could be extended to be a choice of ClassName, Unit, and (new) MediaResource. This would allow stating explicitly that something is a description of a media resource (image etc.). The case where an image is the information source is different from images illustrating the description of e. g. a specimen. Currently, in SDD 1.0 beta 2, images in descriptions are only treated as illustrations.

In the discussion we considered whether Unit may cover this already, i. e. a "media resource unit" could be the object of description. Since units may be derived from other units, they also may include microscopic slides or photographic prints associated with a specimen in a collection. Accordingly, we concluded that the case of images as the described object is already included in the SDD model. The annotation of the CodedDescription/Header/Unit was changed: "Refers to an individual physical object (e. g., a biological specimen). The unit may be a collected and preserved object or an observation. Furthermore a unit may be derived from other units (a microscopic preparation from a specimen, a picture derived from an observation). The ClassName identification is defined in the ExternalDataInterface/Units list. Note: Since a unit may be a media resource, it is possible to describe a single media resource representation of a biological specimen."

Conclusion: No action necessary.


Review of coding status design

The first topic was the comparison of the use of "unknown", "inapplicable", and "scoped out" in SDD and Lucid 3. Lucid stores "unknown" on each individual state (it combines uncertainty with unknown: if only some states of a character are marked, the SDD equivalent is an uncertainty modifier; if all states are marked, the SDD equivalent is the character coding status "unknown"). Lucid 3 supports a new "scope out" coding status, which is stored on a character level and which is equivalent to SDD "Not to be coded". Lucid does not support any other coding status, especially not "inapplicable" yet. In identification, inapplicable is treated as equivalent to unknown in Lucid (Intkey does use inapplicable to infer controlling states, if a character dependency has been defined). Additional coding status values may be supported in future Lucid versions to improve the manageability of data in the builder application rather than to directly improve the identification results.
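
Editor's illustration: the mapping of Lucid's per-state "unknown" marks to SDD described above can be sketched in Python. The input is assumed to be a dictionary from state label to a flag "marked unknown in Lucid". The result keys coding_status and uncertain_states are hypothetical names, not SDD element names.

```python
def map_lucid_unknown(unknown_flags):
    """Map per-state 'unknown' marks of one character to an SDD-style result."""
    marked = [state for state, flag in unknown_flags.items() if flag]
    if unknown_flags and len(marked) == len(unknown_flags):
        # All states marked: the whole character has coding status "unknown".
        return {"coding_status": "unknown", "uncertain_states": []}
    # Only some (or no) states marked: uncertainty modifier on those states.
    return {"coding_status": None, "uncertain_states": marked}


print(map_lucid_unknown({"oval": True, "linear": True}))
print(map_lucid_unknown({"oval": True, "linear": False}))
```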

The next topic was a review whether the current, fully designer-extensible mechanism of coding status is appropriate. From the Lucid presentation it became clear that it would require fewer changes in Lucid to support a fixed number of coding status values, than an unlimited number of terminology-designer defined values. Can we accept coding status values being fixed to enumerated values? This would be analogous to the change already performed in the statistical measures, which in SDD 1.0 have been switched to enumerated values.

In both cases, the fully user (= terminology-designer) extensible design and the flexibility available (and required) in defining new semantics seem to confuse many people approaching SDD. Having to define the coding status values in one part of the document/SDD in order to be able to use them in the descriptions seems unintuitive to some users. Indeed, the existing DELTA or Lucid designs, and even databases in general (Null value or Not-a-number value), provide system-defined, fixed sets of missing-data or coding-status values.

The intended pattern of using both these extensible features was that all users would pick up a common template file for integration into their own terminologies, and extend this template as needed. From the test implementations, we learned with certainty that this pattern was not followed for the statistical measures. This may certainly have been a result of insufficient documentation, but it may also be a result of split expertise: in contrast to the definition of characters, states, and structural or methodological concepts, the definition of statistical measures or coding status values is not normally the expertise of the typical terminology designer (= a biologist). Application developers would have to provide wizards and use templates to bridge this gap. The conclusion for statistical measures was already that an enumeration is feasible. So far the feedback about the enumeration was much more positive than about the old design. Should a similar change be made for Coding status values?

The SDD documentation on coding status (version 2) discusses specifically only the following coding status values:

Discussion: Kevin: add CodingStatus "To be checked" = "Coded, but should be revisited". Is this different from RevisionStatus? Two situations can be distinguished: a) the coding has been completed but needs revision. In principle, the intended feature for this situation is RevisionStatus - however RevisionStatus acts on the entire description, not on individual characters in a description! It could be extended to individual characters (but no decision could be found at the meeting). b) only some information has been entered in a character (not all states scored, or modifier/notes not yet added). For this purpose the "Unfinished work" indicator is intended (if combined with existing data). Thus "to be checked" is an alternative label to "unfinished work".

Discussion: "Unknown" may be too wide and easily misinterpreted. In SDD it is intended to refer to the situation that information is thought to exist (no reason is known why it could not exist, as would be the case if the character were inapplicable), but could not be found. Kevin proposed to rename "unknown" to "not found"; another proposal was "not available". Test: coding status terms should be readable in combination with character labels, similar to states.

Testing coding status with character labels, for three examples, including the alternative wordings for coding status values discussed. The test phrases are shown in combination with an actual state value because this shows the intuitiveness more clearly:

Leaves                                Leaf shape                             flower color
(e. g. present/absent)                (e. g. oval/linear)                    (e. g. red/yellow)

Leaves present or unfinished work     Leaf shape oval or unfinished work     flower color red or unfinished work
Leaves present or to be checked       Leaf shape oval or to be checked       flower color red or to be checked

Leaves present or not to be coded     Leaf shape oval or not to be coded     flower color red or not to be coded
Leaves present or not applicable      Leaf shape oval or not applicable      flower color red or not applicable
Leaves present or not interpretable   Leaf shape oval or not interpretable   flower color red or not interpretable

Leaves present or unknown             Leaf shape oval or unknown             flower color red or unknown
Leaves present or not found           Leaf shape oval or not found           flower color red or not found
Leaves present or not available       Leaf shape oval or not available       flower color red or not available
Leaves present or data not found      Leaf shape oval or data not found      flower color red or data not found
Leaves present or data unavailable    Leaf shape oval or data unavailable    flower color red or data unavailable

Conclusions: The phrase "to be checked" seems to be preferable over "unfinished work". The phrases "not found" or "not available" in the case of a presence character sound like a synonym of "leaves absent". Even though very wide, the phrase "unknown" seems preferable. Editor's note: it seems that the addition of "data" to "not available" or "not found" resolves the confusion, and I propose to use one of these.

Editor's comment (and apology - esp. to Steve Shattuck!) on the decision to "go back" to enumerated coding status values: I brought the topic up at the meeting for mainly two reasons: 1. The two terminology parts that are very system-close and the semantics of which largely have to be interpretable by applications, namely statistical measurements and coding status values, now used different methods. After the response to the measurement redesign presented in SDD 1.0 was uniformly positive, I think SDD is much easier to learn and apply if a similar solution is used for coding status values. 2. The coding status feature never received testing from any SDD implementation test, and the attempt to create a coding status system that is both fully redefinable and provides enough semantics (through the secondary required enumerations "basic coding status" and "present of information") was never reviewed or commented upon - despite my repeated pleas to do so. Since I consider the semantic layer vital in the case of coding status values (the status values indicate special situations, to which applications should respond appropriately), I felt very uneasy about releasing the untested system. -- Finally, I believe that it is possible to "open up" the system in future versions of SDD, if so required. The only necessary backwards compatibility is that certain id-values would be fixed to the semantics defined in SDD 1.0.


Revision of modifiers (applicability to state versus character)

Kevin thinks that "certainty" should be applicable to individual states as well as to entire characters. Currently only frequency and state-specific characters can be applied to individual states ("usually weakly pointed, sometimes strongly pointed").

Three groups of modifiers can be distinguished:

  1. Some modifiers could be expressed as different characters: Developmental Stage / temporal and Spatial modifiers.
  2. Certainty-modifiers ("perhaps", "probably"), and approximation-modifiers ("ca.", "about") affect values of states in general (not specific to categorical data).
  3. Some modifiers are relatively specific to categorical values (states). These are:

Conclusion: Remove modifiers from Character abstract base type. For Categorical characters add modifiers only at the state level. For Quantitative characters add modifiers before the measures (= roughly at the same place where they would have been when inherited from Character abstract base type).


Repeating characters within a single description container

A new feature of SDD 1.0 is that the occurrence of a character in a description no longer needs to be unique. This allows expressing measurement summary data from multiple samples, and it covers cases where an aggregation is not desirable (e. g. in a genus description based on two species, 1: petals 2-4 mm, 2: petals 20-28 mm).

The multi character design seems to simplify many data aggregation use cases. Data aggregation (from different federated data sources, from specimens to taxa, from lower-rank taxon descriptions to higher taxa) is very central to the design of SDD. Note that application software can still highlight or warn about the presence of character duplicates to avoid accidental use of the feature.

In the case of categorical states, multiple characters in a single description are only required if a combination of and-states and or-states in a single character is intended (using the optional "Model" element before the State elements in a description). After the changes decided at the meeting, categorical characters have all modifier information on individual states, reducing the use of multiple characters. But to express different modifiers on quantitative/numeric or other future character types (shape through function, etc.), it probably should remain a general feature of all SDD character types. We need to learn more about this new format first.

One consequence of this change is that the boolean rules in descriptions need to be restated:


Explicit statement of the boolean rules underlying characters and states in a description

Categorical and quantitative information in SDD descriptions is distributed in multiple structural containers. In general, some rules determine which data are assumed to be combined by "and" and which by "or". (I, Gregor, am not certain whether the DELTA rules have been made explicit. I do not remember having read them explicitly in the users guide; please help if you can point us to the page.)

The usual DELTA rules are: states in different characters are assumed to be both present (combined with "and"), states within a single character may either both be present or be present as alternatives (combined with "and" or "or"). DELTA makes this distinction using the two operators "/" for "or" and "&" for "and". In the case of combinations ("1/2&3/4&5") to my knowledge the precedence is undefined in DELTA. SDD makes a similar distinction using the optional Model element used to express And, Or, Between, and Set versus Sequence of sets.

However, due to the possibility of multiple character containers for a single character term (see section above) in a single description, SDD also needs to define the implied operator between such characters:

Within a single description:
Different characters: "Char 1":"state 1" and "Char 2":"state 1"
Multiple instances of same character: "Char 1":"state 1" or "Char 1":"state 2"
Multiple states within a character container:
  No Model statement: "Char 1": ("state 1" or "state 2")
  And-Model statement: "Char 1": ("state 1" and "state 2")
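
Editor's illustration: the implied operators can be made explicit by a small Python routine that turns the character containers of one description into a boolean expression. The sketch assumes that states in a container are combined with "or" unless an And model is given; the other Model values (Between, Set, Sequence of sets) are not handled. The function names and the tuple layout are hypothetical.

```python
def container_expr(states, model=None):
    """Combine the states of one character container ('or' unless model is 'And')."""
    operator = " and " if model == "And" else " or "
    body = operator.join(f'"{state}"' for state in states)
    return f"({body})" if len(states) > 1 else body


def description_expr(containers):
    """containers: list of (character, states, model) in document order."""
    by_character = {}
    for character, states, model in containers:
        by_character.setdefault(character, []).append(container_expr(states, model))
    parts = []
    for character, exprs in by_character.items():
        # Multiple instances of the same character are alternatives ("or").
        joined = " or ".join(f'"{character}":{expr}' for expr in exprs)
        parts.append(f"({joined})" if len(exprs) > 1 else joined)
    # Different characters are assumed to be both present ("and").
    return " and ".join(parts)


containers = [
    ("Char 1", ["state 1"], None),
    ("Char 1", ["state 2"], None),
    ("Char 2", ["state 1", "state 2"], "And"),
]
print(description_expr(containers))
```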


Problem with Audience/Language optionality and schema validation

The label should be a unique identifier for characters. No two characters should have the same label in the same language. If this is not the case, human users have to see and operate on the technical id rather than a human-readable label. This is part of SDD validation. Invalid example:
  Char 1, Representation language="en" audience="1": Number of legs
  Char 2, Representation language="en" audience="1": Number of legs
Valid examples are:
audience differs:
  Char 1, Representation language="en" audience="1": Number of legs
  Char 2, Representation language="en" audience="2": Number of legs
language differs (the term "legs" may just accidentally exist in different languages and denote different things):
  Char 1, Representation language="de" audience="1": legs
  Char 2, Representation language="ro" audience="1": legs

The purpose of audiences is in addition to language/culture specific labels (already included in the language attribute):
  Char 2, Representation language="en-us" audience="1": scab
  Char 2, Representation language="en-uk" audience="1": black leg
to also support differences between audiences of different expertise (especially student/trained professional like customs staff/expert):
  Char 2, Representation language="en" audience="1": flowers
  Char 2, Representation language="en" audience="5": Inflorescence

Now the problem: in xml schema it is possible to define uniqueness constraints, so that the label within a language/audience combination must be unique, i. e. for a given language / audience combination no two characters can have the same label. However, it turns out that this does not work with optional audience attributes. Even though the audience has a default value, if the attribute is missing the uniqueness constraint is not evaluated by many parsers (I don't know whether this is a bug or by design).

Possible solutions: a) ignore the problem and let external validation handle it, b) make all audience attributes required, or c) limit the uniqueness to language alone. SDD 1.0 beta 2 tended to use option a), but Gregor now thinks that option c) is perhaps more appropriate. If an audience is not available, the recommended application behavior is to use the nearest available audience. So instead of using the same label for different audiences:
  Char 1, Representation language="en" audience="1": Number of legs
  Char 1, Representation language="en" audience="2": Number of legs
  Char 1, Representation language="en" audience="3": Number of legs
a single label could be defined, thus not violating the uniqueness of labels within a language:
  Char 1, Representation language="en" audience="1": Number of legs
A problem may occur if two sets of labels are to be defined for more than two audiences. However, even:
  Char 1, Representation language="en" audience="1": Number of legs
  Char 1, Representation language="en" audience="2": leg count
  Char 1, Representation language="en" audience="3": leg count
can be defined as:
  Char 1, Representation language="en" audience="1": Number of legs
  Char 1, Representation language="en" audience="2": leg count
Since audience 3 is closer to 2, audience 2 would be used (note that this is a simplification: according to the recommendation, it is not the audience ID that is evaluated, but the expertise level in the audience definition). The only situation not expressible is:
  Char 1, Representation language="en" audience="1": Number of legs
  Char 1, Representation language="en" audience="2": leg count
  Char 1, Representation language="en" audience="3": Number of legs
which seems rather artificial.
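
Editor's illustration: the "nearest available audience" behavior can be sketched in Python. The sketch assumes that each audience definition carries a numeric expertise level and that ties are broken by the audience id; both are assumptions, as is the function name pick_label.

```python
def pick_label(representations, expertise_levels, wanted_audience):
    """Return the label of the audience whose expertise level is nearest."""
    wanted_level = expertise_levels[wanted_audience]
    nearest = min(
        representations,
        key=lambda audience: (abs(expertise_levels[audience] - wanted_level), audience),
    )
    return representations[nearest]


# Hypothetical expertise levels for the audiences 1 to 3 of the example.
expertise_levels = {"1": 1, "2": 2, "3": 3}
# Only two labels are defined; audience 3 has none of its own.
representations = {"1": "Number of legs", "2": "leg count"}

print(pick_label(representations, expertise_levels, "3"))
```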

Conclusion: attempt to tighten the uniqueness constraint, considering audience only informational and not part of the constraint. This will be included in the next version of the SDD schema. If practical problems occur, it may be changed later with relatively few problems, since SDD 1.0 would remain valid under SDD 2.0.
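
Editor's illustration: the same rule can also be checked outside of xml schema validation. The Python sketch below reports labels that are used by more than one character within a language, ignoring the audience. It compares labels as exact strings, which is an assumption; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def duplicate_labels(representations):
    """representations: iterable of (char_id, language, audience, label)."""
    owners = defaultdict(set)
    for char_id, language, _audience, label in representations:
        owners[(language, label)].add(char_id)
    return {key: sorted(chars) for key, chars in owners.items() if len(chars) > 1}


representations = [
    ("1", "en", "1", "Number of legs"),
    ("2", "en", "2", "Number of legs"),  # differs only in audience: now invalid
    ("1", "de", "1", "legs"),
    ("2", "ro", "1", "legs"),            # differs in language: still valid
]
print(duplicate_labels(representations))
```

A character that repeats its own label for several audiences is not reported, since only labels shared between different characters are counted.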


PS: In a discussion during the meeting another potentially problematic point was raised by Lynn Kutner: although SDD supports incremental markup in NLD, there is no indication of who did which markup! This is especially important if the markup is partially done by a software agent! In the CodedDescription case, different "IPR" is treated by handling the data in multiple descriptions, but this is not possible when incrementally marking up existing natural language descriptions. I find this point worrisome, since it may indicate that the fundamental model may have to be changed. If it is desirable to have a markup-authorship feature in NLD, it may be logical to provide something similar in Coded Descriptions.


Please send any necessary corrections to G.Hagedorn@bba.de
(Gregor Hagedorn, Convener)



First published 2004-10-30, last update: 2004-12-10.
