(Version 0.9)
A primary goal of this year's meeting was to advance SDD on the standards track. TDWG procedures require that, before a standard can be put to a vote, the TDWG meeting in the preceding year must review it, and that a review version be circulated 60 days prior to that meeting. Accordingly, SDD version 1.0 beta 2 was published in due time and a call for review was made on several email lists. At the meeting in Christchurch itself, a plenary session (Tuesday afternoon) was dedicated to public review and criticism. The plenary review was favorable, with little criticism.
However, in discussions during the main meeting and in the two dedicated days of SDD workgroup meetings, important points were raised that need to be addressed. A focus of the meeting was to compare the current features of SDD with the features required by Lucid 3 and the CIPRES project. This meeting document first contains a protocol of the plenary session, and then minutes of the SDD workgroup meetings.
Note: The changes to the standard have to be finalized 180 days before next year's meeting, at which SDD is up for a vote. This date is approximately March/April 2005!
The updated and most recent version of the schema that resulted from the discussions can currently be found on the WIKI under CurrentSchemaVersion.
Tuesday afternoon, 12. October 2004 - Plenary TDWG Session to review SDD standard proposal
Friday morning, 15. October 2004 - "Cafeteria meeting" with focus on the phylogenetic CIPRES project
Saturday, 16. October 2004 SDD Working group session, day 1
Sunday, 17. October 2004 SDD Working group session, day 2
(Protocol by Shirley Cohen - many thanks to her! The text is slightly edited by Gregor Hagedorn, all errors are his!)
Schedule:
14:30 Introduction to SDD (Gregor Hagedorn, BBA, Germany)
15:10 SDD support tools (Jacob Asiedu & Robert A. Morris, Univ. of Mass. at Boston)
(15:30 Coffee)
16:00 Example instance documents (coded description) (Robert A. Morris, Univ. of Massachusetts at Boston)
16:20 Abstract characters and modifiers (Gregor Hagedorn, BBA, Germany)
16:40 Open discussion - "Your FAQs"
17:30 Close
The following presentations are available as powerpoint files:
Note: Some background on the use of UBIF and the UBIF proxy data proposal can be found in the following presentations held in other sessions at TDWG:
(Available as 0.9 MB powerpoint and 8 MB pdf.)
What are descriptive data? "Descriptive data inform about the state of repeatably observable, inherent properties of objects (= individual organisms) and classes (= taxon)". Descriptive data are mostly about morphological data. They have a long history in biology. The definition of descriptive data includes "chemical/enzymatic features or molecular data". The driving force is identification. Relevant are hierarchies of phylogenetic relations. Descriptive data are the mapping between the real world and abstract world. This mapping can be called the "library of life" or "key to life". Better names are most welcome!
Different types of descriptive data exist. First, there is a need for terminology. Coming up with a terminology requires expertise and experience. There is a split between the terminology that is fundamentally defined and the selection of a preferred terminology. The preferred terminology in SDD is referred to as an operational terminology (definition of characters, states, etc.). The standard keeps operational terminology short and all other terms are defined only in a glossary. Operational terminology allows you to add illustrations and refer for more information to the fundamental terminology (glossary entries). Glossary entries are not mandatory. "Concept" trees represent hierarchies of characters (including their states). In addition, SDD allows you to make assumptions about characters. It has multilingual support. You only have to translate the terminology, then you have a multilingual description.
Once you have a terminology, you can then create descriptions. This process can be conceptualized as creating a data matrix of character x taxon item whereby every cell in the matrix can have multiple values or observations. Another way to conceptualize this is a questionnaire form. A form restricts the way you can answer questions. It gives you some structure, simplifying the interpretation of answers.
Another form of description is natural language. This is not ideal because it can lead to errors of interpretation. There is too much freedom associated with this form of description. What are needed are tools to produce higher-level quality data. DELTA and NEXUS-based tools are examples. Moreover, once you have structured data, there is no reason you can't generate the natural language descriptions from the structured data.
DELTA has a large user base. However, it has legacy problems and 170 directives, as well as some fundamental limitations. In 1999, there was a complete revision of DELTA. It was concluded that DELTA needed a complete redesign, which led to the start of SDD. SDD is currently at version 1.0 beta. SDD remains complex. One possibility is to have a light version of SDD that would make it easier for most people to code against. SDD is backward compatible with DELTA and Lucid: except for selected rejected concepts, DELTA and Lucid data should be fully expressible in SDD.
SDD Design Goals: The schema is strongly-typed. It has very few anonymous types. It is close to an object-oriented schema. The inheritance mechanisms are limited, but the schema allows for extensions. SDD is less concerned with human readability and more concerned with the relationships between objects. SDD should not be exclusively bound to the biological domain. That is why we use the term "class" and not "taxon", and object instead of specimen or observation.
SDD should be one complete format, able to manage data during the analysis process as well as to express final, fully revised work. It should be able to support sample values. It supports images. The form structure should limit one's ability to express oneself using natural language. DELTA in some respects required too much to be expressed in annotations and comments. In SDD, many of these annotations are given structure through the use of modifiers.
SDD is a UBIF application. The metadata section will be identical to that of ABCD's. Element container: "External Data Interfaces" = interfaces to the outside world, such as taxa, agents, publications. The "class" section contains taxa. Agents are e. g. persons. All are proxies that may refer to external data sources. Within the UBIF general structure, the SDD data are in the DescriptiveData element. The description section consists of: terminology, coded descriptions, natural language, identification key. Configuration section allows setting of defaults.
It would be useful to have a common library of terminology. However, this feature won't be part of this version of SDD. If you allow for modification, you need to have support for character revisions. In 20 years, we hope to have a more stable terminology, but for now, we need to have flexibility. Increased support for reusability of definitions across projects is desirable, but the options are still under debate.
Stinger asked if there are any formal use cases and Bob said they would produce some in the coming weeks.
(Available as powerpoint.)
Developed 2 SDD support tools: SDD Description Editor and SDD Debug Ref. The editor is used to construct SDD documents. The editor is written in C#. The editor assumes you have a valid terminology in the SDD document. The editor enforces constraints and prevents scoring of states. The editor is still a reference implementation. We haven't decided to turn it into a product yet.
SDD's ids can be seen as primary keys and refs can be seen as foreign keys. These ids are constrained by XPath, e. g. they have to be unique within a document. The Debug Ref tool has several modes of operation. Showed demo. Projects are supported by Boston Electronic Field Guide.
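The primary-key / foreign-key analogy can be illustrated with a minimal sketch. The element and attribute names in the sample (`Character`, `Scored`, `id`, `ref`) are simplified placeholders chosen for illustration, not the actual SDD schema, and the check below is not the Debug Ref tool itself; it only shows the two conditions such a tool would need to test.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical document: element names are placeholders,
# not the actual SDD schema.
SAMPLE = """
<Dataset>
  <Terminology>
    <Character id="c1"/>
    <Character id="c2"/>
  </Terminology>
  <Description>
    <Scored ref="c1"/>
    <Scored ref="c9"/>
  </Description>
</Dataset>
"""


def check_ids_and_refs(xml_text):
    """Return (duplicate_ids, dangling_refs) found in the document."""
    root = ET.fromstring(xml_text)
    seen = set()
    duplicates = []
    for elem in root.iter():
        ident = elem.get("id")
        if ident is None:
            continue
        if ident in seen:
            duplicates.append(ident)  # the "primary key" is used twice
        seen.add(ident)
    # A ref is dangling if no element in the document carries that id.
    dangling = [
        elem.get("ref")
        for elem in root.iter()
        if elem.get("ref") is not None and elem.get("ref") not in seen
    ]
    return duplicates, dangling


duplicates, dangling = check_ids_and_refs(SAMPLE)
print("duplicate ids:", duplicates)
print("dangling refs:", dangling)
```

In the sample, `c9` is referenced but never defined, so it is reported as a dangling ref.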
(Available as powerpoint.)
Showed us what an SDD document looks like through XMLSpy. Browsers are decent viewers. XML should be produced by exchange tools such as a database application, not by hand. Most of the schema sections are optional. Introduced NSF project "Ants of the American Museum Congo Expedition". How good is SDD as a vehicle for answering questions using XQuery?
SDD doesn't prescribe how to make a description. SDD is a meta schema. Its purpose is to constrain how to write descriptions. Showed 2 examples of SDD documents from the Ant project. SDD editor wouldn't let you score a thorax shape as blue. However, an XML schema could. Therefore, there is value in having an SDD editor like Jacob's.
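The "thorax shape scored as blue" case can be sketched as a check against the terminology rather than against the schema. The character and state names below are invented for illustration and do not come from the Ant project data.

```python
# Hypothetical terminology: each character lists the states defined for it.
TERMINOLOGY = {
    "thorax_shape": {"rounded", "elongate"},
    "body_color": {"blue", "black"},
}


def is_valid_score(character, state, terminology=TERMINOLOGY):
    """True only if the state is defined for this particular character."""
    return state in terminology.get(character, set())


print(is_valid_score("thorax_shape", "blue"))  # state exists, but for another character
print(is_valid_score("body_color", "blue"))
```

A generic structural check only sees that "blue" is some defined state; the editor additionally knows which character the state belongs to.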
Concept trees are ways to organize states (e. g. body parts such as head, eyes, mouth).
Introduction of XQuery: A recently standardized language for searching through XML documents. Existing XQuery engines: Berkeley DB.XML. Bob's student did a study on how well the different XQuery engines perform. Results will be posted on the Wiki.
(Available as powerpoint and pdf; this is the same file referenced for the first talk.)
What is a character? Characters can be categorical, quantitative, numeric, statistical, range (e. g. color range). SDD allows you to define a new character type. SDD needs to support molecular types. Character shapes could be defined by a mathematical function. Abstract characters versus concrete characters. Process for creating a new character type: Create 1) definition type 2) coded data type 3) markup text type and 4) perhaps sample data type.
There may be a mapping between character types. These mappings are defined in the terminology of a character.
Modifiers act on a statement. Types of modifiers: frequency, probability, spatial, temporal, degree/kind.
Asiedu, Jacob (Univ. Boston) kasiedu-at-cs.umb.edu
Cohen, Shirley (Texas Advanced Computing Center, Austin, CIPRES)
Guralnick, Robert (University of Colorado)
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Miranker, Daniel (University of Texas at Austin, CIPRES)
Morris, Robert (Univ. Boston) ram-at-cs.umb.edu
Thiele, Kevin (K.Thiele-at-cbit.uq.edu.au)
This unscheduled meeting was held because Shirley Cohen and Dan Miranker, both working on the CIPRES project, could not attend the SDD workgroup discussions on the weekend. The main purpose of the meeting was to increase the SDD group's knowledge about CIPRES and vice versa.
Shirley Cohen explained the use case of adding character/state information to regions of interest (ROI) in 3-dimensional images (scanning project, PI Julian Humphries), either as free-form text annotations (like an arrow pointing from text to a feature) or as analytical entities (as defined in SDD characters or concepts).
Dan Miranker outlined some of the architectural plans for CIPRES. CIPRES is based on matrix data, mostly from the Treebase project. It then uses matrix data for phylogenetic reconstructions. Treebase is currently using NEXUS, but as part of CIPRES it is planned to convert Treebase to a new "Treebase 2" structure having richer representations. In this context, an improved data format would be desirable. Data in current Treebase are often big string fields with relatively little structure. A new version of NEXUS ("NexML" - great name) is being considered, but SDD is also being considered as a link between character databases and the central storage. The preliminary assessment of SDD by CIPRES was favorable.
The relation between SDD and NEXUS was discussed. Especially the Assumptions block in each format needs further comparison. An example is the ordering of states in a tree. This was already considered by SDD in Brazil, but so far not introduced into the schema because we hoped for input from the phylogenetic analysis/NEXUS community.
A major question was which relatively simple changes/additions could be made to SDD 1.0 so that its utility to the CIPRES project and the phylogenetic data analysis community in general is increased. SDD is certainly lacking some features of NEXUS that specifically address phylogenetic analysis procedures. However, no specific recommendations or requests to change SDD could be identified at the moment. The basis of further discussion could be an internal draft design/requirement/road map document by Piel for Treebase2 (or the new NEXUS data model?). Dan and Shirley will ask whether this can be made available to SDD.
One point identified was that although SDD already handles multiple taxon trees in the form of the ClassHierarchy, it does not provide for branch lengths / distance measures. Only the pure topology (but including multifurcations) of a NEXUS tree can be expressed in SDD at the moment. Questions: a) How should this be added to SDD - can we have semantics or do we need a semantic-free floating point value field? b) What would be the appropriate element name, keeping it as general as possible? PhylogeneticBranchDistance? c) Is a single floating point value sufficient?
Also: New Nexus model stores trees as edges - consequences for SDD? Should we give the xml-hierarchy approach up and store our trees in a more relational way as well? The representations are convertible, but the correct decision should be taken now. This is an urgent point to decide! Note that the protocols (Biocase/Tapir) presented in Christchurch also have problems with trees of unlimited depth.
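The claim that the hierarchical and the edge-based tree representations are convertible can be illustrated with a small sketch. The dictionary layout, the `length` key for a single floating point branch length, and the taxon names are assumptions made for this example; they are not SDD or NEXUS element names.

```python
def tree_to_edges(node, parent=None, edges=None):
    """Flatten a nested tree into (parent, child, branch_length) rows."""
    if edges is None:
        edges = []
    if parent is not None:
        # length is None when only the pure topology is known
        edges.append((parent, node["name"], node.get("length")))
    for child in node.get("children", []):
        tree_to_edges(child, node["name"], edges)
    return edges


def edges_to_tree(edges, root_name):
    """Rebuild the nested form from the edge rows."""
    children = {}
    for parent, child, length in edges:
        children.setdefault(parent, []).append((child, length))

    def build(name, length):
        node = {"name": name}
        if length is not None:
            node["length"] = length
        kids = [build(c, l) for c, l in children.get(name, [])]
        if kids:
            node["children"] = kids
        return node

    return build(root_name, None)


# A multifurcating root with three children, one of them an inner clade.
TREE = {"name": "root", "children": [
    {"name": "A", "length": 0.1},
    {"name": "B", "length": 0.2},
    {"name": "clade", "length": 0.05, "children": [
        {"name": "C", "length": 0.3},
        {"name": "D", "length": 0.4},
    ]},
]}

EDGES = tree_to_edges(TREE)
for row in EDGES:
    print(row)
```

The edge form has a fixed depth regardless of tree depth, which is the property relevant to the protocol problems mentioned above. The sketch assumes node names are unique within a tree.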
Another issue very important to NEXUS is nucleotide or protein sequences and alignments. Both have two forms: original sequence data and aligned matrix form. The alignment is usually local to a set of sequences prepared for comparison. Global alignments are difficult and rare (they exist e. g. in the ARB project). Original sequences are essentially text and easily supported. Unfortunately, since few analyses operate on them directly, and since they are usually vouchered in GenBank, EMBL etc., support is not very important. On the other hand, alignments are important for analysis. Alignments have the property that for one sequence and one taxon, multiple alignments may exist, if the sequence has been compared with multiple, non-congruent sets of taxa. Strictly, this is supported neither in NEXUS nor in SDD.
Since NEXUS uses character strings to store taxon data and identifies character position (= column) with character (except in the case of polymorphic coding), expressing sequence alignments in NEXUS is straightforward. In SDD, strictly, each sequence position equates to a character (the label of which presumably would be just a position number). Breaking down the data like that is desirable for descriptive data where characters have significant metadata and where it is desired to manage and rearrange characters. This is the case in morphological data, where in the NEXUS model it would be extremely difficult to manage a matrix of 1000 characters, perhaps many with polymorphic coding. However, it is not appropriate for sequences. One reason for this is that in sequence data the states at an alignment position are not truly scientific observations (or measurable); they are the result of a first step of the alignment analysis process. It is often reasonable to insert or delete gaps in some sequences in an alignment, essentially shifting a block of data into the adjacent characters.
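The effect of a gap insertion on a position-as-character model can be sketched as follows. The sequence is an invented toy example, and the helper names are hypothetical.

```python
def insert_gap(aligned, position):
    """Insert a single gap, shifting everything to its right by one column."""
    return aligned[:position] + "-" + aligned[position:]


def changed_columns(before, after):
    """Columns (= characters in the SDD sense) whose state changed."""
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


before = "ACGTACGT"
after = insert_gap(before, 2)
print(after)
print(changed_columns(before, after))
```

If every column were modeled as a separate character, one gap in one sequence would rewrite the coded state of every character to its right, although no new observation has been made.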
Dan Miranker criticized that SDD is too "monolithic" in that it attempts to describe all the relations among its objects. As far as I understood, he would prefer, e. g., terminology and the different kinds of descriptions in separate schemata. Gregor's view is that for data to be permanently valuable, both terminology and description need to be combined in some container. Although this does not need to be a single document, it is convenient to allow it in one document, to simplify data archiving and document-like exchange among scientists. However, it is recognized that many use cases will want to exchange individual objects from the SDD schema. An identification web service may request only the minimal base information for the characters and the preferred concept tree, and later request individual character details, or single descriptions without requesting the terminology again. It remains unclear how the schema could be reorganized so that both use cases are equally obvious. Gregor's current assumption is that a few lines of xslt code (which should be made available with future distributions of SDD) create an all-optional schema with the identity constraints removed, making it easier to exchange objects as fragments of full SDD.
We agreed that within 60 days (i. e. until December 15.) a request from CIPRES should be made to add certain features that are present in NEXUS or required in other ways by CIPRES, but currently missing in SDD. This request should include features considered general (some features of NEXUS may be more appropriate for application-specific extensions to SDD, using the SDD CustomExtension mechanism).
Shirley will research available options to define regions of interest within 2- or 3-dimensional images (SDD.MediaResources). She is looking at "GML 3", which is the standard that the spatial data people are using and supports volumes.
Asiedu, Jacob (Univ. Boston) kasiedu-at-cs.umb.edu
Buis, Rob (ETI) rbuis-at-eti.uva.nl
Chenin, Eric (IRD France, French node manager)
Endresen, Dag Terje (Nordic Gene Bank, POB 41, SE-230 53 Alnarp, Sweden)
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Morris, Robert (Univ. Boston) ram-at-cs.umb.edu
Richardson, Ben (Western Australian Herbarium, Dept. of CALM)
Saddi, Chandramohona (http://www.findingspecies.org/)
Thiele, Kevin (K.Thiele-at-cbit.uq.edu.au)
Ulate, William (INBio Costa Rica)
Wilton, Aaron (Landcare NZ, Lincoln; "eBiota")
Eric Chenin: Species bank GBIF/ENBI workshop ETI in March.
Everyone at the meeting thought the GBIF name "Species bank" should be changed:
Where are the data? Almost always in the hands of commercial publishers. Copyright and database rights are an issue; database rights are worse since they prevent the re-creation of databases: if someone has created a database on a topic, no one else may redo it, not even from scratch. A motion by Gregor to license SDD under the condition that data be made available after a reasonable period of time ("10-20 years?"), rather than becoming available to science only 70 years after the death of the last collaborating author, was not considered relevant and was rejected.
Some existing systems:
Problems when an application is changing data. Situations:
Graceful degradation of data is desired in the design of SDD. For example, if modifiers are not supported, the remaining data are likely to remain usable for an identification application only supporting characters and states.
Important clarification: Roundtripping is NOT intended to go from coded to natural language description and back. Natural language description can be created from coded descriptions, but this should be considered "cached information", if it is stored at all. Consumers should not assume that coded descriptions and natural language descriptions have a relationship! The schema gives no reason to assume so.
Bob Morris: Jpg2000 - gives reader advice on required features. Features used in SDD can be detected, but parsing all may be time consuming. Writer could inform about which features have been used in a summary. But this is like caching a previous parsing of existing data. Could be supported by a separate optional schema and stored in CustomExtension in the SDD Metadata element. Is not essential for SDD at the moment.
What do we mean by "SDD light"? Primarily, the issues are social or educational. SDD light is primarily a "SDD light documentation" rather than a schema. Used for communication about respective capabilities! However, non-normative schema files may be created for illustration and documentation purposes.
Other view of SDD light: Advice to consumers and producers about a subset that is reasonable to start with. But don't die on instance documents that have more in them! If they do, inform the user, preferably listing the tag names that have been ignored (see Data interoperability and roundtripping above). Graceful degradation to light versions is largely possible (e. g. removing modifiers). However, dependencies between parts of the schema exist: obvious ones like terminology definition to terminology usage, but also less obvious ones. Example: when ignoring concept trees, the feature of state-set reuse is also missing. In an explicit light version this can be removed. Thus the light versions may give advice about recommended "degradations".
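A consumer that degrades gracefully and reports what it ignored could look roughly like the sketch below. The tag names and the set of supported elements are placeholders for illustration, not the real SDD element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical "light" consumer: these are the only elements it understands.
SUPPORTED = {"Dataset", "Description", "Character", "State"}

SAMPLE = """
<Dataset>
  <Description>
    <Character ref="c1">
      <State ref="s1"><Modifier ref="m1"/></State>
      <State ref="s2"/>
    </Character>
    <Note>free text</Note>
  </Description>
</Dataset>
"""


def read_light(xml_text, supported=SUPPORTED):
    """Return (scored state refs, sorted list of ignored tag names)."""
    root = ET.fromstring(xml_text)
    scores = []
    ignored = set()

    def walk(elem):
        if elem.tag not in supported:
            ignored.add(elem.tag)  # remember it so the user can be informed
            return                 # skip this element and everything below it
        if elem.tag == "State":
            scores.append(elem.get("ref"))
        for child in elem:
            walk(child)

    walk(root)
    return scores, sorted(ignored)


scores, ignored = read_light(SAMPLE)
print("states read:", scores)
print("ignored tags:", ignored)
```

The states remain usable for identification even though the modifier and the note are dropped, and the list of ignored tags can be shown to the user.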
Should SDD light also remove validation constraints from the schema? It would be possible to have fewer validation / facet / identity constraints, with a recommendation to test any SDD-light export under SDD full. However, this would remove the guarantee that an SDD light document is always valid under the full schema. It would be better to uphold this guarantee! Conclusion: this should not be the same issue as "light". Reduced validation, for smaller documentation size or as an intermediate schema during application testing, is a different issue and can be done separately.
Aside, an important recommendation / clarification: On successive exports, the IDs to the same object (e. g. character) should be identical.
Perhaps even an "SDD super light": the "hello world" of SDD?
=== Terminology ===
=== CD ===
=== KEY ===
Bob will produce a data set of "hello world", "super light encoded". Maybe also identification key. It would be good to create a "hello NLD world" - who volunteers?
ETI is principally interested in SDD, but SDD is not considered helpful to increase income of ETI - therefore no resources can be invested.
Editor's note: The old Linnaeus used free-form formatted text (without semantic or structured markup) for descriptions, and separately NEXUS matrices for categorical information used in interactive multiple-entry identification. No information regarding features required for Linnaeus but perhaps missing in SDD was available at the meeting.
Rob: Species bank: Wouter Addink is responsible
Kevin Thiele gave a very impressive demonstration of Lucid 3 features, from which several discussion topics spun off. Illustrations and character images are discussed below; another discussion of modifiers was started, but continued on Sunday.
Kevin: desirable to illustrate individual states in descriptions, in addition to the existing SDD ability to associate a media resource with an entire description or a character in a description.
(Clarification: in character and state definitions, it is possible to associate media with states; the current discussion refers to a specific illustration, like the stamen shape in a specific species. The question is whether it is sufficient to say: these images illustrate stamen shape in this species, or whether it is necessary to say: this image illustrates this shape, this image illustrates that shape.)
We did not have full agreement on the importance of allowing images to be specific to states rather than to the character in a given taxon. However, Kevin agreed that it is important to associate an image with an entire character, esp. if the image illustrates all or many states. Thus the discussion is not about either character or state; rather, character images in individual descriptions are agreed upon, and Kevin asks to also support state-specific images. Therefore, the first version of SDD may limit itself to character-level description images, and state-specific description images can be added in later versions as extensions, without breaking existing software.
Editor's note: In the hierarchical schema MediaResources are currently added in two places: at the description, and at individual characters (plus also in terminology). A relational implementation could instead create an entity DescriptionMediaResources, with required media resource and description id foreign keys and optional character id foreign key. Both a model following the hierarchical schema and the relational entity could easily be extended to make an illustration specific to a state in a specific description.
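The relational variant described in the editor's note can be sketched with an in-memory SQLite database. Only the idea of a link entity with two required foreign keys and one optional foreign key comes from the note; the lower-case table and column names are assumptions made for this example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE descriptions (id INTEGER PRIMARY KEY);
CREATE TABLE characters (id INTEGER PRIMARY KEY);
CREATE TABLE media_resources (id INTEGER PRIMARY KEY);
CREATE TABLE description_media_resources (
    media_resource_id INTEGER NOT NULL REFERENCES media_resources(id),
    description_id    INTEGER NOT NULL REFERENCES descriptions(id),
    character_id      INTEGER REFERENCES characters(id)  -- optional
);
"""


def build_example():
    """Create one description with a general and a character-specific image."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO descriptions (id) VALUES (1)")
    conn.execute("INSERT INTO characters (id) VALUES (1)")
    conn.executemany("INSERT INTO media_resources (id) VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO description_media_resources "
        "(media_resource_id, description_id, character_id) VALUES (?, ?, ?)",
        [(1, 1, None), (2, 1, 1)],
    )
    return conn


def media_for(conn, description_id, character_id=None):
    """Media ids for a description, optionally restricted to one character."""
    sql = ("SELECT media_resource_id FROM description_media_resources "
           "WHERE description_id = ?")
    params = [description_id]
    if character_id is not None:
        sql += " AND character_id = ?"
        params.append(character_id)
    sql += " ORDER BY media_resource_id"
    return [row[0] for row in conn.execute(sql, params)]


conn = build_example()
print(media_for(conn, 1))
print(media_for(conn, 1, character_id=1))
```

Making an illustration specific to a state would, under this sketch, amount to adding one more optional foreign key column to the link table.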
| Current SDD terminology: | |
|---|---|
| Concept: | RichLabel - with Wording, icon, images |
| Character: | SimpleLabel - no icon etc.! |
| State: | RichLabel - with Wording, icon, images |
Background information: the character labels in a concept tree are not character-specific, but tree-specific. For each character x concept tree combination, additional character labels have to be stored in the trees. Example:
Concept Tree 1
  Colors
    Leaf - charid=2
Concept Tree 2
  Leaf
    Color - charid=2
To avoid the introduction of special character x concept-tree node-types to represent the terminal nodes (leaves) in the concept tree, each character reference in a concept tree is modeled through a concept node. This node, instead of being empty or containing further nodes, contains a single character reference. Thus, concept objects provide for each character in each tree rich formatting information, including wording, icons, images.
If both the concept terminal node around the character and the character itself provide rich icon/wording information, rules of precedence would have to be formulated. If an Icon/Image is only present at Concept, this would be used. If an Icon/Image is only present at Character, this would be used. If both are present, the Concept could take precedence. However, if an icon is present at Concept and other selector images at characters, a decision would have to be taken whether precedence applies to the whole rich label object, or to its individual data items.
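The two candidate precedence rules can be contrasted in a short sketch. Rich labels are modeled here as plain dictionaries, and the file names are invented; neither rule is part of SDD, since the character was deliberately left without rich formatting.

```python
def merge_itemwise(concept_label, character_label):
    """Precedence per data item: concept wins, character fills the gaps."""
    merged = dict(character_label)
    merged.update({k: v for k, v in concept_label.items() if v is not None})
    return merged


def whole_object(concept_label, character_label):
    """Precedence for the whole rich label: use the concept label if it has any content."""
    has_content = any(v is not None for v in concept_label.values())
    return dict(concept_label) if has_content else dict(character_label)


# An icon at the Concept, an icon plus a selector image at the Character.
concept = {"icon": "concept_icon.png"}
character = {"icon": "char_icon.png", "image": "selector.jpg"}

print(merge_itemwise(concept, character))
print(whole_object(concept, character))
```

Under item-wise precedence the selector image of the character survives; under whole-object precedence it is lost. Two applications choosing differently would display the same data differently, which is the interoperability concern raised here.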
The purpose of leaving the character a relatively abstract object (with only label) without rich formatting features is to remove the potential duplicity of where the relevant data may be. This duplication is considered potentially confusing to both content creators and application developers and viewed as a potential source of interoperability problems.
On the other hand, it may be considered more systematic to provide identical features for all kinds of objects. However, this argument would have to be thought through for the case of measures, coding status states, etc. as well. Proposals / a discussion document on this are welcome! For the time being we decided to start with the character being restricted, under the assumption that it may be easier to extend the richness of the schema in later versions, rather than restrict it.
It seems desirable to handle objects from other domains (like taxon names) and objects from the domain of descriptive data (e. g., a character from an external terminology) symmetrically (or similarly) regarding internal/external linking. However, currently they are handled differently. One major difference between the two cases is that the external data can be abstracted and only a short set of "interface data" is used, whereas the descriptive data objects are required in full:
| External data source | Internal (in dataset) |
|---|---|
| Taxa = full data (e. g. LinneanCore or TCS) | SDD: ClassName proxy = only short interface, especially Label, Link, Rank |
| Character = rich data | Character = rich data |
We all agreed that the decision in SDD to currently treat descriptive and non-descriptive objects differently is not ideal. However, nobody could propose a truly better general architectural plan. Some attempts to solve the linking problem, and how to use the GUIDs which would in any case be the backbone of a solution, are discussed below:
GUIDs/LSIDs on entire terminology or on individual objects like character/states? How to structure? Examples:
--- External: ---
Terminology GUID=111
  Characters
    Character id=1
Terminology GUID=222
  Characters
    Character id=1
--- Local: ---
Local Terminology GUID=444
  Characters
    Character id=1
Description
  Character ref=111/1 (= this refers to external character)
  Character ref=1 (= this refers to local character)
Problem: 1. no value-based ref integrity. 2. no cached access. If no internet access is available, a hand-held computer is not able to perform identification. Further problem: desire to extend information in external objects, specifically to add labels in new languages. One option would be to integrate external terminology into local terminology, and allow extensions in the local objects:
--- External: ---
Terminology GUID=111
  Characters
    Character id=1
      Label language="en": "Leaf shape"
      State id=1
        Label language="en": "elliptic"
Terminology GUID=222
  Characters
    Character id=1
--- Local: ---
Locally integrated Terminology LSID=444
  Characters
    Character id=111/1
      Label language="de": "Blattform"
  Characters
    Character id=222/1
  Characters
    Character id=1 (local without LSID)
Description
  Character ref=111/1
  Character ref=222/1
  Character ref=1
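How a consumer might resolve the two kinds of refs in the outlines above, without internet access, can be sketched as follows. The `GUID/id` notation is taken from the examples; the dictionaries standing in for the local terminology and for a locally cached copy of the external terminologies, and the label texts for terminology 222 and the local character, are assumptions for illustration.

```python
# Local characters, keyed by local id.
LOCAL = {"1": "Local character"}

# Locally cached copies of external terminologies, keyed by GUID.
CACHE = {
    "111": {"1": "Leaf shape"},
    "222": {"1": "Character from terminology 222"},
}


def resolve(ref, local=LOCAL, cache=CACHE):
    """Resolve 'GUID/id' against the cache and a plain 'id' against local data."""
    if "/" in ref:
        guid, char_id = ref.split("/", 1)
        terminology = cache.get(guid)
        if terminology is None:
            # Without a cached copy, an offline device cannot identify.
            raise LookupError("terminology %s is not cached locally" % guid)
        return terminology[char_id]
    return local[ref]


for ref in ("111/1", "222/1", "1"):
    print(ref, "->", resolve(ref))
```

The sketch makes the two problems visible: nothing guarantees that a `GUID/id` string points at an existing character, and resolution fails as soon as an external terminology is neither cached nor reachable.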
With LSIDs, since the data behind an LSID should never change, we need to discuss what happens if the external Terminology is updated. If character xml-data are considered data behind the LSID, a new LSID is required - at least on revision level. Acceptance can be under manual control or rule-based - review and accept/reject. Question: Could instead all character data other than the id be considered metadata in the LSID sense? Then no new LSID would have to be issued if the editors consider a change minor (e. g. an orthographic correction in a label), but a new version could be issued if necessary.
Work flow when an internal integrated Terminology is extended, e. g. when a label in a new language is added: If the prefix of the id is identical to the local LSID, different rules apply: I may make changes, but perhaps have to increase the version part. Now suppose I want to add new German labels to the external terminology shown above. I have to create a copy of the original shared terminology and edit that locally.
Terminology GUID=999
  Characters
    Character id=1
      Label language="en": "Leaf shape"
      Label language="de": "Blattform"
      State id=1
        Label language="en": "elliptic"
        Label language="de": "elliptisch"
Fundamental scenarios of sharing terminologies
What are the properties?
Fundamental question: Are links references to the remote objects, or are they copies?
The sequence of importing/linking to external terminology and changing/extending it, can be complicated. Example:
Day: Action:
1 Take char 1, 2, 3
2 Add German labels to local representation of linked ones
2 Create local characters 4-9
3 Get additional external characters 10, 12
4 Add German labels to those
4 Improve phrasing of label for fully local and locally extended characters
The argument here is that with large and complex character terminologies (e. g., 800 characters/5000 states is not unusual) a simple document-based CVS system cannot be the solution. Although CVS perfectly allows tracking any changes, a relatively straightforward desire to propagate improvements (e. g. German labels added) back to the provider to be considered for integration may be impractical. A request like: "Which extensions were made to all characters obtained in the last 2 years from terminology provider A? List these and inform the provider to consider for integration!" may require thousands of changes to be analyzed manually (provided the xml-file has been entered every day into the CVS).
Another problem: currently there is no "locked" terminology mechanism in SDD. You also cannot easily know which parts come from external source, and which from internal!
One solution could be to add a second attribute that explicitly references something from an external terminology. Solution 1: simply add information about the source Terminology: Character id="1" sourceterminology="urn:lsid:...." sourceid="2"
Solution 2: Duplicate, keep a proxy/cache of the original TerminologyInterface: ExternalCharacter sourceterminology="urn:lsid:...." sourceid="". Plus then in the integrated terminology have a locally modified version?
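With source attributes as in Solution 1, the request "which extensions were made to characters obtained from provider A?" becomes a simple comparison instead of a manual CVS analysis. The sketch below assumes the `sourceterminology` and `sourceid` attributes proposed above; the LSID value, the `Label` element layout, and the provider data are invented for illustration.

```python
import xml.etree.ElementTree as ET

PROVIDER_A = "urn:lsid:example.org:terminology:A"  # hypothetical LSID

# Labels as published by the provider, keyed by the provider's character id.
PROVIDER_LABELS = {"2": {"en": "Leaf shape"}}

LOCAL_TERMINOLOGY = """
<Terminology>
  <Character id="1" sourceterminology="urn:lsid:example.org:terminology:A" sourceid="2">
    <Label language="en">Leaf shape</Label>
    <Label language="de">Blattform</Label>
  </Character>
  <Character id="4">
    <Label language="en">Purely local character</Label>
  </Character>
</Terminology>
"""


def local_extensions(local_xml, provider_lsid, provider_labels):
    """Labels added locally to characters that came from the given provider."""
    root = ET.fromstring(local_xml)
    report = {}
    for char in root.iter("Character"):
        if char.get("sourceterminology") != provider_lsid:
            continue  # fully local, or from another provider
        source_id = char.get("sourceid")
        known = provider_labels.get(source_id, {})
        added = {
            label.get("language"): label.text
            for label in char.findall("Label")
            if label.get("language") not in known
        }
        if added:
            report[source_id] = added
    return report


print(local_extensions(LOCAL_TERMINOLOGY, PROVIDER_A, PROVIDER_LABELS))
```

The report is keyed by the provider's own ids, so it could be sent back to the provider as a proposal for integration. It only detects added languages, not rephrased labels.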
No final solution or decision could be found.
Proposal by Ben Richardson: add to terminology objects an attribute like: deprecatedOn="-datevalue-" (or "deprecated_since"?). If in public shared terminologies concepts have to be changed in a new version in ways that are no longer backward compatible, a new character will have to be created, but it is usually undesirable to remove the old character immediately. Instead it could be "deprecated", i. e. existing data may still refer to the character, but the authors of the terminology recommend against using it for new data and indicate that the character may be removed in the future.
Important question: how important is the date? It is useful but would require an additional attribute. The status of deprecated itself would fit into the pattern of RevisionStatusEnum, by adding a value past "FullyRevised"! On the other hand, some uses of revision status may not desire the over-the-edge level of deprecated. Also, both term and concept may not be known to many people who are comfortable with the other levels of revision status. This seems to indicate that separate handling may be appropriate.
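Editor's sketch of the intended behavior of a deprecation date, assuming the attribute name deprecatedOn from the proposal above. The in-memory dictionary, the character ids, and the function names are hypothetical; the sketch only illustrates that existing data remain valid while use for new data is discouraged.

```python
from datetime import date

# Hypothetical in-memory view of terminology objects. "deprecatedOn" follows the
# attribute name proposed above; ids and labels are invented for illustration.
characters = {
    "c1": {"label": "Leaf shape", "deprecatedOn": None},
    "c2": {"label": "Leaf form (old)", "deprecatedOn": date(2004, 10, 16)},
}


def usable_for_new_data(char_id, today):
    """A character is recommended for new data unless it is deprecated as of today."""
    deprecated_on = characters[char_id]["deprecatedOn"]
    return deprecated_on is None or today < deprecated_on


def check_reference(char_id, today):
    """Existing data stay valid; a deprecated reference only yields a warning."""
    if usable_for_new_data(char_id, today):
        return "ok"
    return "warning: character is deprecated and may be removed in the future"


print(check_reference("c1", date(2005, 1, 1)))
print(check_reference("c2", date(2005, 1, 1)))
```

With a plain status value instead of a date, the first function would reduce to a simple flag test; the date additionally allows an application to tell how long a character has been deprecated.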
(Note: the day continued first with the discussion on sharing terminologies, which in its entirety is reported under Saturday, above)
"Bob now asserts that you need LSIDs for all objects in terminology if this is how you are going to reference external - LSIDs should be semantically opaque."
Result: Link/LSID changed from LSIDBody type (i.e. excluding the constant prefix 'urn:lsid:') to full LSID always including it, despite being constant. This change increases file size and amount of data to be transferred, but it simplifies interaction with agents that infer the type from a URI string rather than having knowledge about an element being typed. Compare the discussion on the WIKI under UBIF.GuidLinking.
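Editor's sketch of what the change means for software: values stored in the old LSIDBody form need the constant prefix added, and a generic agent can then recognize the identifier type from the string alone. The function names and the example value are hypothetical, and the syntax check is deliberately simplified (it only counts the colon-separated parts urn:lsid:authority:namespace:object with an optional revision).

```python
LSID_PREFIX = "urn:lsid:"


def to_full_lsid(value):
    """Return the LSID including the constant prefix, adding it if it is missing."""
    if value.lower().startswith(LSID_PREFIX):
        return value
    return LSID_PREFIX + value


def looks_like_lsid(uri):
    """Simplified test an agent could apply to an untyped URI string."""
    parts = uri.split(":")
    # Expected parts: urn, lsid, authority, namespace, object and optionally revision.
    if len(parts) not in (5, 6):
        return False
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        return False
    return all(parts[2:5])  # authority, namespace and object must be non-empty


body = "example.org:characters:42"  # invented example value in the old LSIDBody form
print(to_full_lsid(body))
print(looks_like_lsid(body), looks_like_lsid(to_full_lsid(body)))
```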
Can Glossary be built on an extension of OWL? How?
Bob/Dave Thau:
Does OWL support the following Glossary features (or can it be extended to do so)?
a) multilingual text
b) sensu label
c) images/media resources
d) citation
Bob: OWL intended only for the ontology parts!
Resolution: no problem, since the ontology part is actually marked "__" = development only. Text/images/citation part: probably nothing is gained by it.
Kevin showed an outline of the current primer, with different trails. The term "track" was preferred over "trails". Resolution: Move the primer entirely to the SDD Wiki. It is desired that other people add to the primer. Try to add an extra discussion page to each of the primer pages to separate the carefully edited content from free discussions.
The terms: "Primer", "Discussion", "Documentation" should always be part of the Heading.
A discussion about the use of css styles on the WIKI followed.
Discussion on name of ExternalDataInterface. External is misleading! Proposals:
No clear preference could be found at the meeting. Conclusion: the name should be changed and the topic raised on WIKI and email list. See http://efgblade.cs.umb.edu/twiki/bin/view/UBIF/NameForProxyOrInterfaceSection.
Compare also the separate presentations on UBIF (Powerpoint, pdf) and proxy data model (Powerpoint, pdf).
In CodedDescription, the Header, first choice, could be extended to be a choice of ClassName, Unit, and (new) MediaResource. This would make it possible to state explicitly that something is a description of a media resource (image etc.). The case where an image is the information source is different from images illustrating the description of e. g. a specimen. Currently in SDD 1.0 beta 2 images in descriptions are only treated as illustrations.
In the discussion we considered whether Unit may cover this already, i. e. a "media resource unit" could be the object of description. Since units may be derived from other units, they also may include microscopic slides or photographic prints associated with a specimen in a collection. Accordingly, we concluded that the case of images as the described object is already included in the SDD model. The annotation of the CodedDescription/Header/Unit was changed: "Refers to an individual physical object (e. g., a biological specimen). The unit may be a collected and preserved object or an observation. Furthermore a unit may be derived from other units (a microscopic preparation from a specimen, a picture derived from an observation). The ClassName identification is defined in the ExternalDataInterface/Units list. Note: Since a unit may be a media resource, it is possible to describe a single media resource representation of a biological specimen."
Conclusion: No action necessary.
The first topic was the comparison of the use of "unknown", "inapplicable", and "scoped out" in SDD and Lucid 3. Lucid stores "unknown" on each individual state (it combines uncertainty with unknown: if only some states of a character are marked, the SDD equivalent is an uncertainty modifier; if all states are marked, the SDD equivalent is the character coding status "unknown"). Lucid 3 supports a new "scope out" coding status, which is stored on the character level and which is equivalent to SDD "Not to be coded". Lucid does not yet support any other coding status, in particular not "inapplicable". In identification, Lucid treats inapplicable as equivalent to unknown (Intkey does use inapplicable to infer controlling states, if a character dependency has been defined). Additional coding status values may be supported in future Lucid versions to improve the manageability of data in the builder application rather than to directly improve the identification results.
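Editor's sketch of the translation rule for Lucid's per-state "unknown" marks as described above. The function name and the returned dictionary layout are hypothetical and serve only to make the three cases explicit; neither Lucid nor SDD defines such an interface.

```python
def lucid_unknown_to_sdd(all_states, unknown_states):
    """Translate Lucid per-state "unknown" marks for one character.

    - every state marked -> character-level coding status "unknown"
    - some states marked -> an uncertainty modifier on those states
    - no state marked    -> nothing to translate
    """
    marked = [state for state in all_states if state in unknown_states]
    if not marked:
        return {"coding_status": None, "uncertain_states": []}
    if len(marked) == len(all_states):
        return {"coding_status": "unknown", "uncertain_states": []}
    return {"coding_status": None, "uncertain_states": marked}


states = ["oval", "linear", "lanceolate"]
print(lucid_unknown_to_sdd(states, {"oval", "linear", "lanceolate"}))
print(lucid_unknown_to_sdd(states, {"linear"}))
```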
The next topic was a review whether the current, fully designer-extensible mechanism of coding status is appropriate. From the Lucid presentation it became clear that it would require fewer changes in Lucid to support a fixed number of coding status values, than an unlimited number of terminology-designer defined values. Can we accept coding status values being fixed to enumerated values? This would be analogous to the change already performed in the statistical measures, which in SDD 1.0 have been switched to enumerated values.
In both cases, the fully user (= terminology-designer) extensible design and the flexibility available (and required) in defining new semantics seem to confuse many people approaching SDD. Having to define the coding status values in one part of the SDD document in order to be able to use them in the descriptions seems unintuitive to some users. Indeed, the existing DELTA and Lucid designs, and even databases in general (Null value or Not-a-number value), provide system-defined, fixed sets of missing-data or coding-status values.
The intended pattern of using both these extensible features was that all users would pick up a common template file for integration into their own terminologies, and to extend this template as needed. From the test implementations, we learned with certainty, that this pattern was not followed for the statistical measures. This may certainly have been a result of insufficient documentation, but it may also be a result of split expertise: in contrast to the definition of characters, states, and structural or methodological concepts, the definition of statistical measures or coding status values is not normally the expertise of the typical terminology designer (= a biologist). Application developers would have to provide wizards and use templates to bridge this gap. The conclusion for statistical measures was already that an enumeration is feasible. So far the feedback about the enumeration was much more positive than about the old design. Should a similar change be made for Coding status values?
The SDD documentation on coding status (version 2) discusses specifically only the following coding status values:
Discussion: Kevin: add CodingStatus "To be checked" = "Coded, but should be revisited". Is this different from RevisionStatus? Two situations can be distinguished: a) the coding has been completed but needs revision. In principle, the intended feature for this situation is RevisionStatus - however RevisionStatus acts on the entire description, not on individual characters in a description! It could be extended to individual characters (but no decision could be found at the meeting). b) only some information has been entered in a character (not all states scored, or modifier/notes not yet added). For this purpose the "Unfinished work" indicator is intended (if combined with existing data). Thus "to be checked" is an alternative label to "unfinished work".
Discussion: "Unknown" may be too wide and easily misinterpreted. In SDD it is intended to refer to the situation that information is thought to exist (no reason is known why it could not exist, i. e. why the character would be inapplicable), but could not be found. Kevin proposed to rename "unknown" to "not found"; another proposal was "not available". Test: coding status terms should be readable in combination with character labels, similar to states.
Testing coding status with character labels, for three examples, including the two alternative wordings for coding status values discussed. The test phrases are shown in combination with an actual state value because this underscores the intuitiveness more clearly:
| Leaves (e. g. present/absent) | Leaf shape (e. g. oval/linear) | flower color (e. g. red/yellow) |
|---|---|---|
| Leaves present or unfinished work | Leaf shape oval or unfinished work | flower color red or unfinished work |
| Leaves present or to be checked | Leaf shape oval or to be checked | flower color red or to be checked |
| Leaves present or not to be coded | Leaf shape oval or not to be coded | flower color red or not to be coded |
| Leaves present or not applicable | Leaf shape oval or not applicable | flower color red or not applicable |
| Leaves present or not interpretable | Leaf shape oval or not interpretable | flower color red or not interpretable |
| Leaves present or unknown | Leaf shape oval or unknown | flower color red or unknown |
| Leaves present or not found | Leaf shape oval or not found | flower color red or not found |
| Leaves present or not available | Leaf shape oval or not available | flower color red or not available |
| Leaves present or data not found | Leaf shape oval or data not found | flower color red or data not found |
| Leaves present or data unavailable | Leaf shape oval or data unavailable | flower color red or data unavailable |
Conclusions: The phrase "to be checked" seems to be preferable over "unfinished work". The phrases "not found" or "not available" in the case of a presence character sound like a synonym of "leaves absent". Even though very wide, the phrase "unknown" seems preferable. Editor's note: it seems that the addition of "data" to "not available" or "not found" resolves the confusion, and I propose to use one of these.
Editor's comment (and apology - esp. to Steve Shattuck!) on the decision to "go back" to enumerated coding status values: I brought the topic up at the meeting for mainly two reasons: 1. The two terminology parts that are very system-close and the semantics of which largely have to be interpretable by applications, namely statistical measurements and coding status values, now used different methods. After the response to the measurement redesign presented in SDD 1.0 was uniformly positive, I think SDD is much easier to learn and apply if a similar solution is used for coding status values. 2. The coding status feature never received testing from any SDD implementation test, and the attempt to create a coding status system that is both fully redefinable and provides enough semantics (through the secondary required enumerations "basic coding status" and "present of information") was never reviewed or commented upon - despite my repeated pleas to do so. Since I consider the semantic layer vital in the case of coding status values (the status values indicate special situations, to which applications should respond appropriately), I felt very uneasy about releasing the untested system. -- Finally, I believe that it is possible to "open up" the system in future versions of SDD, if so required. The only necessary backwards compatibility is that certain id-values would be fixed to the semantics defined in SDD 1.0.
Kevin thinks that "certainty" should be applicable to individual states as well as to entire characters. Currently only frequency and state-specific modifiers can be applied to individual states ("usually weakly pointed, sometimes strongly pointed").
Three groups of modifiers can be distinguished:
Conclusion: Remove modifiers from Character abstract base type. For Categorical characters add modifiers only at the state level. For Quantitative characters add modifiers before the measures (= roughly at the same place where they would have been when inherited from Character abstract base type).
A new feature of SDD 1.0 is that the occurrence of a character in a description no longer needs to be unique. This makes it possible to express measurement summary data from multiple samples, or to handle cases where an aggregation is not desirable (e. g. in a genus description based on two species, 1: petals 2-4 mm, 2: petals 20-28 mm).
The multi character design seems to simplify many data aggregation use cases. Data aggregation (from different federated data sources, from specimens to taxa, from lower-rank taxon descriptions to higher taxa) is very central to the design of SDD. Note that application software can still highlight or warn about the presence of character duplicates to avoid accidental use of the feature.
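Editor's sketch of the warning mentioned above: an application can list the characters that occur more than once in a description and ask the user to confirm that this is intended. The representation of a description as a list of (character id, value) pairs and the character ids are hypothetical simplifications.

```python
from collections import Counter


def duplicated_characters(description):
    """Return ids of characters occurring more than once, in first-seen order."""
    counts = Counter(char_id for char_id, _value in description)
    return [char_id for char_id, n in counts.items() if n > 1]


# Genus description based on two species, as in the petal example above.
description = [
    ("petal_length", "2-4 mm"),
    ("petal_length", "20-28 mm"),
    ("leaf_shape", "oval"),
]

for char_id in duplicated_characters(description):
    print("Note: character occurs more than once:", char_id)
```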
In the case of categorical states, multiple characters in a single description are only required if a combination of and-states and or-states in a single character is intended (using the optional "Model" element before the State elements in a description). After the changes decided at the meeting, categorical characters have all modifier information on individual states, reducing the use of multiple characters. But to express different modifiers on quantitative/numeric or other future character types (shape through function, etc.), it probably should remain a general feature of all SDD character types. We need to learn more about this new format first.
One consequence of this change is that the boolean rules in descriptions need to be restated:
Categorical and quantitative information in SDD descriptions is distributed over multiple structural containers. In general, rules are needed that define which data are assumed to be combined by "and" and which by "or". (I, Gregor, am not certain whether the DELTA rules have been made explicit. I do not remember having read them explicitly in the users guide, please help if you can point us to the page.)
The usual DELTA rules are: states in different characters are assumed to be both present (combined with "and"), states within a single character may both be present or present as alternatives (combined with "and" or "or"). DELTA makes this distinction using the two operators "/" for "or" and "&" for "and". In the case of combinations ("1/2&3/4&5") to my knowledge the precedence is undefined in DELTA. SDD makes a similar distinction using the optional Model element used to express And, Or, Between, and Set versus Sequence of sets.
However, due to the possibility of multiple character containers for a single character term (see section above) in a single description, SDD also needs to define the implied operator between such characters:
Within a single description:
Different characters: "Char 1":"state 1" and "Char 2":"state 1"
Multiple instances of same character: "Char 1":"state 1" or "Char 1":"state 2"
Multiple states within a character container:
No Model statement: "Char 1": ("state 1" or "state 2")
And-Model statement: "Char 1": ("state 1" and "state 2")
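Editor's sketch of one possible reading of the implied operators listed above, written as a test of whether a set of observed states satisfies a description. The container representation (character id, model, list of states), the model value "and", and the function names are hypothetical; only the combination rules are taken from the list above.

```python
from collections import defaultdict


def container_matches(model, states, observed):
    """States inside one container: "or" unless an And-Model is declared."""
    hits = [state in observed for state in states]
    return all(hits) if model == "and" else any(hits)


def description_matches(containers, observed):
    """containers: list of (char_id, model, states); observed: {char_id: set of states}."""
    results_by_char = defaultdict(list)
    for char_id, model, states in containers:
        results_by_char[char_id].append(
            container_matches(model, states, observed.get(char_id, set()))
        )
    # Multiple instances of the same character are combined with "or",
    # different characters are combined with "and".
    return all(any(results) for results in results_by_char.values())


containers = [
    ("Char 1", None, ["state 1"]),              # first instance of Char 1
    ("Char 1", None, ["state 2"]),              # second instance: implied "or"
    ("Char 2", "and", ["state 1", "state 2"]),  # And-Model: both states required
]
print(description_matches(containers, {"Char 1": {"state 2"}, "Char 2": {"state 1", "state 2"}}))
print(description_matches(containers, {"Char 1": {"state 2"}, "Char 2": {"state 1"}}))
```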
The label should be a unique identifier for characters. No two characters should have the same label in the same language. If this is not the case, human users have to see and operate on the technical id rather than a human-readable label. This is part of SDD validation. Invalid example:
Char 1, Representation language="en" audience="1": Number of legs
Char 2, Representation language="en" audience="1": Number of legs
Valid examples are:
audience differs:
Char 1, Representation language="en" audience="1": Number of legs
Char 2, Representation language="en" audience="2": Number of legs
language differs (the term "legs" may just accidentally exist in different languages and denote different things):
Char 1, Representation language="de" audience="1": legs
Char 2, Representation language="ro" audience="1": legs
The purpose of audiences is in addition to language/culture specific labels (already included in the language attribute):
Char 2, Representation language="en-us" audience="1": scab
Char 2, Representation language="en-uk" audience="1": black leg
to also support differences between audiences of different expertise (especially student/trained professional like customs staff/expert):
Char 2, Representation language="en" audience="1": flowers
Char 2, Representation language="en" audience="5": Inflorescence
Now the problem: in XML Schema it is possible to define uniqueness constraints, so that the label within a language/audience combination must be unique, i. e. for a given language/audience combination no two characters can have the same label. However, it turns out that this does not work with optional audience attributes. Even though the audience has a default value, if the attribute is missing the uniqueness constraint is not evaluated by many parsers (I don't know whether this is a bug or by design).
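Editor's sketch of how the uniqueness rule could be checked outside the schema validator, applying the default audience before comparing so that a missing attribute does not escape the check. The tuple layout, the function name, and the default audience value "1" are assumptions made for illustration; the actual default is defined in the schema. The use_audience switch also allows testing uniqueness within a language alone.

```python
from collections import defaultdict

DEFAULT_AUDIENCE = "1"  # assumed default value; the schema's actual default may differ


def duplicate_labels(representations, use_audience=True):
    """Find labels shared by more than one character.

    representations: iterable of (char_id, language, audience_or_None, label).
    Returns {key: sorted character ids} for every violated uniqueness key.
    """
    seen = defaultdict(set)
    for char_id, language, audience, label in representations:
        audience = audience or DEFAULT_AUDIENCE  # apply the default before comparing
        key = (language, audience, label) if use_audience else (language, label)
        seen[key].add(char_id)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}


representations = [
    ("1", "en", None, "Number of legs"),  # audience attribute missing
    ("2", "en", "1", "Number of legs"),   # same label, explicit default audience
]
print(duplicate_labels(representations))
```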
Possible solutions: a) ignore the problem and let external validation handle it, b) make all audience attributes required, or c) limit the uniqueness to language alone. SDD 1.0 beta 2 tended to use option a), but Gregor now thinks that option c) is perhaps more appropriate. If an audience is not available, the recommended application behavior is to use the nearest available audience. So instead of using the same label for different audiences:
Char 1, Representation language="en" audience="1": Number of legs
Char 1, Representation language="en" audience="2": Number of legs
Char 1, Representation language="en" audience="3": Number of legs
a single label could be defined, thus not violating the uniqueness of labels within a language:
Char 1, Representation language="en" audience="1": Number of legs
Problems may occur if two sets of labels for more than two audiences are to be defined. However, even:
Char 1, Representation language="en" audience="1": Number of legs
Char 1, Representation language="en" audience="2": leg count
Char 1, Representation language="en" audience="3": leg count
can be defined as:
Char 1, Representation language="en" audience="1": Number of legs
Char 1, Representation language="en" audience="2": leg count
Since audience 3 is closer to 2, audience 2 would be used (note that this is a simplification: according to the recommendation, it is not the audience ID that is evaluated but the expertise level in the audience definition). The only situation not expressible is:
Char 1, Representation language="en" audience="1": Number of legs
Char 1, Representation language="en" audience="2": leg count
Char 1, Representation language="en" audience="3": Number of legs
which seems rather artificial.
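Editor's sketch of the recommended fallback to the nearest available audience, using the expertise level from the audience definition rather than the audience id. The expertise levels, the function name, and the tie-breaking rule (prefer the lower expertise level) are assumptions; the recommendation itself does not specify how ties are resolved.

```python
def nearest_label(labels, expertise, requested):
    """Pick the label for the requested audience, or for the nearest one.

    labels: {audience_id: label} defined for one character and language.
    expertise: {audience_id: expertise level} from the audience definitions.
    """
    if requested in labels:
        return labels[requested]
    target = expertise[requested]
    # Nearest expertise level wins; on a tie the lower level is preferred (assumption).
    best = min(labels, key=lambda aud: (abs(expertise[aud] - target), expertise[aud]))
    return labels[best]


labels = {"1": "Number of legs", "2": "leg count"}
expertise = {"1": 1, "2": 2, "3": 3}  # hypothetical expertise levels
print(nearest_label(labels, expertise, "3"))
```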
Conclusion: attempt to tighten the uniqueness constraint, considering audience as informational only and not as part of the constraint. This will be included in the next version of the SDD schema. If practical problems occur, it may be changed later with relatively few problems, since SDD 1.0 would remain valid under SDD 2.0.
PS: In a discussion during the meeting another potentially problematic point was raised by Lynn Kutner: although SDD supports incremental markup in NLD, there is no indication of who did which markup! This is especially important if the markup is partially done by a software agent! In the CodedDescription case, different "IPR" is treated by handling the data in multiple descriptions, but this is not possible when incrementally marking up existing natural language descriptions. I find this point worrisome, since it may indicate that the fundamental model may have to be changed. If it is desirable to have a markup-authorship feature in NLD, it may be logical to provide something similar in Coded Descriptions.
Please send any necessary corrections to G.Hagedorn@bba.de
(Gregor Hagedorn, Convener)