TDWG working group:
Structure of Descriptive Data (SDD)

Minutes of the SDD meeting in Berlin, Germany, 17-18. May 2004

(Version 1.0)


Summary

The meeting in Berlin was partially funded by GBIF so that a timely review could take place before the submission deadline for standards to be voted on at TDWG 2004 in New Zealand (the NZ meeting is around 11-18 Oct., but the deadline for submission is 10. August!). The meeting brought together members from three standards groups (SDD, ABCD, TaxonNames), and consequently an important task was to discuss overarching structures and types applicable to multiple standards. Significant progress was made, but the improved proposals still need review, and their success will ultimately depend on true acceptance!

The updated and most recent version of the schema that resulted from the discussions can currently be found on the WIKI under CurrentSchemaVersion.


Table of Contents

Monday, 17. May 2004
1. Introduction to the SDD schema - Overview and infrastructure issues (Bob Morris)
2. Common Approaches - ABCD in relation to SDD (Walter Berendsohn)
3. Project Description & TransformationHistory (Gregor Hagedorn)
4. Is DiGIR adequate for SDD? (Bob Morris)

Tuesday, 18. May 2004
1. Introduction to Prometheus II (Trevor Paterson)
2. Implementation Experiences using the Castor data binding framework (Jacob Asiedu)
3. OWL and SDD (David Thau)
4. Federation of terminology

Available presentations from the meeting

All presentations below are available as PowerPoint files:

 


Monday, 17. May 2004
Lead topic: Overarching issues applicable to several TDWG standards

Participants

Addink, Wouter [WA] (ETI) waddink-at-eti.uva.nl
Asiedu, Jacob [JA] (Univ. Boston) kasiedu-at-cs.umb.edu
Berendsohn, Walter [WB] (BGBM Berlin) wgb-at-zedat.fu-berlin.de
Buis, Rob (ETI) rbuis-at-eti.uva.nl
Chalubert, Antoine [AC] (LIS, Univ. Paris 6)
de la Torre, Javier [JT] (BGBM Berlin) j.torre-at-bgbm.org
Döring, Markus [MD] (BGBM Berlin) m.doering-at-bgbm.org
Gallut, Cyril [CG] (LIS, Univ. Paris 6) gallut-at-ccr.jussieu.fr
Güntsch, Anton [AG] (BGBM Berlin) a.guentsch-at-bgbm.org
Hagedorn, Gregor [GH] (BBA Germany) g.hagedorn-at-bba.de
Heidorn, Patrick Bryan [PH] (Univ. Illinois, Champaign) pheidorn-at-uiuc.edu
Hobern, Donald [DH] (GBIF, Copenhagen) dhobern-at-gbif.org
Kennedy, Jessie [JK] (Edinburgh) J.Kennedy-at-napier.ac.uk
Morris, Robert [BM] (Univ. Boston) ram-at-cs.umb.edu
Paterson, Trevor [TP] (Edinburgh) T.Paterson-at-napier.ac.uk
Thau, Dave [DT] (San Francisco) thau-at-learningsite.com
Vignes-Lebbe, Régine [RV] (LIS, Univ. Paris 6)

Agenda

Introduction to the SDD schema - Overview and infrastructure issues (Bob Morris)

(The slides of this talk are available.) [Editor's note: I am very grateful to Markus Döring for helping me with some notes summarizing the presentations and discussions on Monday. Since I am extremely short on time and want these minutes to be published quickly, I have left many remarks with little editing. Any errors are mine of course, and will remain here unless someone points them out to me!]

... SDD looks complex and requires a good primer. Once the basics are mastered, it is easier than expected. SDD tells you how to make descriptions. Document metadata are especially interesting for inter-application exchange. "Certain helper objects" for measurements etc. are in GeneralDeclarations. Most importantly, you need to define your terminology in <Terminology>. <Entities> holds the things you want to describe: "Classes" = taxa, "Objects" = specimens. Class hierarchies suit phylogenetic research. SDD imposes no hierarchies but allows them to be designated; class hierarchies are optional. A project working on a successor to NEXUS (a phylogenetic data format) is taking a look at SDD for this and is very interested. <Descriptions> contains the data. <IdentificationKeys> is the latest contribution. Key structure = tree.

Design philosophy:
- strongly typed
- use schema types to ease extensibility/evolution
- single root element
- heavy use of key/keyref mechanism to use context based ID/IDREFs (similar to db PK/FK)

Semantic isolation
- IDs get a datatype. A key reference pointing to the wrong type therefore results in an invalid document. References use keyref instead of labels.
- Labels and human-targeted text have multiple "audiences". These differ not only in language, but mainly in expertise level.
- Optional ontology expressed in "Glossary" and "Concept Trees". It remains controversial whether it needs to be externalized.
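
The effect of typed IDs can be illustrated with a small Python sketch. It is not SDD code: the kind names ("character", "state") and the helper check_refs are hypothetical and only mimic what a typed key/keyref pair enforces, namely that a reference must resolve within the key space of its own type.

```python
# Hypothetical illustration of typed keys; none of these names come from the SDD schema.
from typing import Dict, List, Set, Tuple

# Each kind of object has its own key space, comparable to a separate typed key.
keys: Dict[str, Set[int]] = {
    "character": {1, 2},
    "state": {10, 11},
}


def check_refs(refs: List[Tuple[str, int]]) -> List[str]:
    """Return one message per reference that does not resolve within its declared kind."""
    errors = []
    for kind, value in refs:
        if value not in keys.get(kind, set()):
            errors.append(f"{kind} ref {value} does not resolve")
    return errors


# The last reference claims to be a state but carries a character key, so it is rejected
# even though key 1 exists elsewhere in the document.
errors = check_refs([("character", 1), ("state", 10), ("state", 1)])
print(errors)
```

With untyped IDs the third reference would have resolved silently to the wrong kind of object.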

Infrastructure Components
- Elementary Base Types
- - simple Types in SDD-SimpleTypeLib.
- - complex Types in SDD-CommonTypeLib
WB: Both are meant to share types between ABCD, TaxonNames and SDD.
- Types used in descriptive data taken from terminology definition, not from schema.

Simple Types: mostly restrictions of XML Schema types for stronger typing.
- String: non-empty xs:string.
- Probability: between 0 and 1.
Complex Types: several String types (and any URI), extended by 2 attributes (language + preferred), and several special-purpose types, e.g. IPR.
WB: locale instead of language + audience?
GH: language already uses language + culture ISO codes
CG: yes, it also allows the language code + culture
BM: Maybe this needs more discussion with other groups. Locale specifies more than language, e.g. number and date formats.
GH: There is a choice to use or not use the culture-specific notion.
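
The two simple types mentioned above (a non-empty string and a probability between 0 and 1) amount to value restrictions. A minimal Python sketch of equivalent checks follows; the function names are hypothetical, and treating "non-empty" as "at least one character" is an assumption, not something taken from SDD-SimpleTypeLib.

```python
# Hypothetical validators mirroring the two restrictions; not taken from the SDD type library.

def non_empty_string(value: str) -> str:
    """Accept any string with at least one character (assumed meaning of 'non empty')."""
    if not isinstance(value, str) or len(value) == 0:
        raise ValueError("string must not be empty")
    return value


def probability(value: float) -> float:
    """Accept a number in the closed interval [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"probability out of range: {value}")
    return value


print(non_empty_string("Leaf shape"), probability(0.25))
```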

Key Value: simple type; most keys are typed with this. Current best practice is assumed to be the use of integers. Integer keys are hard for humans to read and check, so everywhere key + keyref is used it is extended by debugkey/debugref.
BH: Where is this defined?
GH: There is a simple data type called "Key Value" which is used by all keys
BM: Not to be confused with keys for identifications
CG: How does the debug system work?
Debugkey attribute = string to help debugging, giving context to IDs. XSLT exists to fill this. Jacob wrote code, see WIKI.
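
As an illustration of the idea only (this is neither the XSLT nor Jacob's code mentioned above), the following Python sketch copies human-readable labels into debug attributes. The element names Dataset, Character, Label and CharacterRef and the attribute names key, keyref, debugkey and debugref are assumptions made for the example, not names confirmed from the schema.

```python
# Sketch: fill debug attributes from labels. All element/attribute names are hypothetical.
import xml.etree.ElementTree as ET

SAMPLE = """
<Dataset>
  <Character key="1"><Label>Leaf shape</Label></Character>
  <Character key="2"><Label>Petal colour</Label></Character>
  <Description>
    <CharacterRef keyref="2"/>
    <CharacterRef keyref="1"/>
  </Description>
</Dataset>
""".strip()


def add_debug_attributes(root: ET.Element) -> ET.Element:
    """Annotate every key with its label and every keyref with the label it points to."""
    labels = {}
    for character in root.iter("Character"):
        label = character.findtext("Label", default="")
        labels[character.get("key")] = label
        character.set("debugkey", label)
    for ref in root.iter("CharacterRef"):
        # An unresolved reference is marked rather than silently skipped.
        ref.set("debugref", labels.get(ref.get("keyref"), "UNRESOLVED"))
    return root


root = add_debug_attributes(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

A reader of the annotated document can then see that keyref="2" means "Petal colour" without looking up the integer key.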

Audience = a label made out of language + expertise level
Vocabulary Def. Base Type, abstract type. Most labels support basic HTML formatting, mainly to support italicizing names and super/subscripts as in H2O.
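
How an application might choose among labels for different audiences can be sketched as follows. The tuple representation of an audience, the sample labels and the fallback rule (same language, any expertise level) are all assumptions made for illustration; the minutes only state that an audience combines language and expertise level.

```python
# Hypothetical label selection by audience = (language code, expertise level).
from typing import Dict, Optional, Tuple

Audience = Tuple[str, str]

labels: Dict[Audience, str] = {
    ("en", "expert"): "Lamina ovate",
    ("en", "novice"): "Leaf egg-shaped",
    ("de", "novice"): "Blatt eiförmig",
}


def pick_label(available: Dict[Audience, str], language: str, expertise: str) -> Optional[str]:
    """Prefer an exact audience match; otherwise fall back to any label in the same language."""
    exact = available.get((language, expertise))
    if exact is not None:
        return exact
    for (lang, _level), text in available.items():
        if lang == language:
            return text
    return None


print(pick_label(labels, "en", "novice"))
print(pick_label(labels, "de", "expert"))
```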

Coding States Def. Type = signals absent data. Have we got that right?

Univariate Statistical Measure Def. Type. The Nexus problem with continuous measurements needs to be overcome. Multivariate statistics have not been dealt with much yet.

Measurement Unit Def. Type = specification of units, incl. audience labels, references to international standards, glossaries and data boss to other units (e.g. millimeter and meter)

Entities: No character or taxon is named in the schema, only how to define them. Classes: minimally a key and a label (like Nexus).

Example presentation
BM: Terminology relates to data set. Would it be better to have it globally? Terminologies are hard to distinguish.
GH: Subject for the afternoon. XML identity constraint validation desired.
DH: key/keyref won't work when several datasets are merged using the same keys?
BM: ID/IDRef won't work, but key/keyref is scoped and works
DH: Datasets important for GBIF as central repository.
GH: They shouldn't be related and share info. Therefore they do not have a consequence for further SDD.
WB: ABCD is for access to data, SDD is aimed primarily at data exchange. Therefore sometimes different views of things.
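
The point made in this exchange, that document-wide IDs clash when data sets reusing the same keys are merged while scoped keys do not, can be shown with a short Python sketch. The dictionary layout and both function names are hypothetical; the sketch only imitates the uniqueness rules of ID/IDREF (one scope per document) and key/keyref (one scope per data set).

```python
# Hypothetical comparison of document-wide IDs and per-dataset scoped keys.
from typing import Dict, List

Datasets = Dict[str, Dict[str, List[int]]]


def validate_scoped(datasets: Datasets) -> List[str]:
    """key/keyref-like rule: keys are unique and references resolve inside each data set."""
    errors = []
    for name, ds in datasets.items():
        if len(ds["keys"]) != len(set(ds["keys"])):
            errors.append(f"{name}: duplicate key")
        for ref in ds["refs"]:
            if ref not in ds["keys"]:
                errors.append(f"{name}: unresolved ref {ref}")
    return errors


def validate_global(datasets: Datasets) -> List[str]:
    """ID/IDREF-like rule: every ID must be unique across the whole merged document."""
    seen = set()
    errors = []
    for name, ds in datasets.items():
        for key in ds["keys"]:
            if key in seen:
                errors.append(f"{name}: ID {key} already used")
            seen.add(key)
    return errors


# Two data sets that both use key 1, merged into one document.
merged = {
    "A": {"keys": [1, 2], "refs": [1]},
    "B": {"keys": [1, 3], "refs": [3]},
}
print(validate_scoped(merged))
print(validate_global(merged))
```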

General declarations: Audience example
GH: Statistical Measure Definition is extension mechanism to express a statistical method.
CG: Extensibility is limited here for any new method?
BM: Yes. New statistical method might need a schema change.
JK: To be able to compose max methods in different places, you need to know if 2 "characters" are comparable.
RV: Is method to compare descriptions/characters in the scope of SDD or is it at application level.
JK: add to have it in SDD itself and not externalized.
DT: SEEK developed a method ontology. Might be shared.
GH: all things in general declarations should ideally be externalized, in contrast to terminology SDD wants audiences for "ontologies", that's a problem for using external lists.
JA: Can we define other max methods that are dimensionless?
GH: No. Method and Method Value are the defining parts and used to identify a distinct method.

Data set holds key definitions. Several elements can be used to build unique keys.

GH: In general the annotations in the schema need attention and are the best documentation available at the moment. Please pass any comments and corrections on to me!

------ Coffee break ------

Common Approaches - ABCD in relation to SDD (Walter Berendsohn)

ABCD, SDD, Names are comprehensive and use container elements around repeated elements
- ABCD is strictly hierarchical, not using references apart from refs to units themselves
Two proposals:
a) metadata envelope
b) use shared types
- Recursive transformation needed? problems for some software
- Roll back of data needed? Workflow description is a whole new task!
WD: 3 Issues:
- IPR
- Transformation history (tracing)
- Data reconstruction via roll back and transformation history
WB: deal with IPR and techn. transformation separate!
DH: DataSet
Provenance
actions
IPR
generator

------ Lunch ------

Project Description & TransformationHistory (Gregor Hagedorn)

Geographic and TaxonCoverage versus Scope
- Coverage already exists in the document. An external index service is better suited for discovery.
- Scope is additional info. Valuable to terminology lists for SDD?
GH, JK: Index Fungorum - Names schema
BH: Qualified Dublin Core is suitable.
GH: Coverages could be omitted and scope elements used only.
CG: Adding a single Aim/Scope Element with plain text instead of different scope elements.
GH: This would be the same as description - human readable text.
BH: Need a coverage section. Text string or list of typed elements.
GH, BM: SDD covers medical diseases as well.

Merging/combination of data sets
JK: Name schema sees Sp2K GSD as revisions
GH: Quality of data can only be detected in relation to scope of project.
MD: SDD project is metadata about the full SDD, which does not need to be fully present.
GH: Project definition is assumed to be authored by humans, able to reflect complex IPR situation in appropriate statements.
WB, BM: "must cite" flag for IPR statements?
WB: IPR type is lacking "who said this" for an IPR statement. Needed?
BM: Often happens in copyright: "This may contain..."
WB: IPR only on SDD level, manually edited, with specification of agent proposed. Transformation history.
JK: IPR issues needed on object level as well. For Names, specimen, etc.
GH: Revision status type could be used in Names Schema as well. Please comment.
GH: Project definition could be called header or cover page.
WB: Split project in 2 - general and specific.
DH: Couldn't we use a common/shared schema root with SDD and metadata?
WB: Yes. That's the goal.
GH, WB: Try to find shared metadata and slots for ABCD/SDD/Names on common SDD structure.
GH: Project def. defines the last manual edit step. Transformations are only relevant since then.
If new manual work is done, a new project is created with erased transf. history.
WB: One conclusion is that the cost provider has the burden to care about IPR, acknowledgment etc.
GH: Transformation is mainly for online service passing/changing data and has not to be stored in databases.
WB: The technical contact, which UDDI uses at business level, should be at data set level.
GH: WIKI discussion for specific proposals.
DH: Datasets, SDD, header with transformations + metadata, body with SDD/ABCD/Names whatever schema mechanism.
GH: Walter and Gregor do a proposal and select common elements.

Proxy Objects
GUID = Project ID and local ID --> provided by a GBIF-like service
GH: LSID might be suitable for projects as well
JK, WB: where do GUIDS come from when adding new data?
Proxy objects to deal with external, unreliable (asynchronous) sources or even non existing sources (e.g. non digital specimen)
GH: We need a service telling me whether a given specimen is available.
GH: updated label element disputable. Should keep newer label to protect original label.
BA: Label term not correct.
BM: No. It's not the full object
BH: Atomized add on data feels very uncomfortable.
Only if used for data entry only - never updated from external sources that need to be parsed eventually.
JK: If GUID exists for object, add on data should come from other namespace for specimen, ABCD or names schema.
GH: Because an SDD application can stay stable when ABCD is updated. It only needs to understand ABCD to update the proxy.
JK, GH, BM: Prometheus view how specimen/taxon description relate.
CG: Linking to concepts problematic?
JK: Can't link a specimen to a class (taxon), when creating a new class.
GH: Only temporary state. Once published, a valid link can be established.
GH: Proxy objects apart from base type might be reusable common types for TDWG?
Agrees with JK that if ABCD used proxy objects, SDD could use them directly and wouldn't need to extend the base proxy type.
GH: Renaming needed for proxy object.
DH: 1) linking extension 2) label 3) SDD specific mandatory add ons 4) add ons to better identify object without existing external source.
Object link - choice instead of sequence?
BM: Sequence of diff. methods with fall back to URL possible - application has choice.
[Editor's note: sequence of different urn methods implemented, but LSID, DOI only once, URL multiple times]
WA: Feeling a URL could be separated in base + doc
Agreement that ObjectLink becomes repeatable choice instead of sequence.
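
The fall-back behaviour discussed here (an application tries the available link methods in turn and settles for a URL if nothing better resolves) might look roughly like this in application code. This is a sketch under assumptions: the resolver functions are stand-ins that perform no network access, and the identifiers are made-up examples.

```python
# Hypothetical fall-back resolution over a list of object links; no real services are called.
from typing import Callable, Dict, List, Optional, Tuple

Link = Tuple[str, str]  # (method, identifier)
Resolver = Callable[[str], Optional[str]]


def resolve_first(links: List[Link], resolvers: Dict[str, Resolver]) -> Optional[str]:
    """Try each link in document order and return the first result that resolves."""
    for method, identifier in links:
        resolver = resolvers.get(method)
        if resolver is None:
            continue  # the application does not support this method
        result = resolver(identifier)
        if result is not None:
            return result
    return None


def lsid_offline(identifier: str) -> Optional[str]:
    """Stand-in for an LSID service that is currently unreachable."""
    return None


def fetch_url(identifier: str) -> Optional[str]:
    """Stand-in for a plain URL fetch that succeeds."""
    return f"fetched {identifier}"


links = [
    ("LSID", "urn:lsid:example.org:specimen:123"),
    ("DOI", "10.0000/example"),
    ("URL", "http://example.org/specimen/123"),
]
result = resolve_first(links, {"LSID": lsid_offline, "URL": fetch_url})
print(result)
```

Because the application iterates over whatever links are present, it does not matter to it whether the schema models them as a fixed sequence or as a repeatable choice.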

Is DiGIR adequate for SDD? (Bob Morris)

DiGIR suitability: - No!
BM: substitution group mechanism problematic. Schemata using DiGIR must specifically inherit from it.
MD: No records exist in SDD.
GH: Descriptions with everything else related. Specialized SOAP service? XQuery on a native XML DBMS?


Tuesday, 18. May 2004
Lead topic: Review of contentious issues of the descriptive core of SDD

Participants

Addink, Wouter (ETI) waddink-at-eti.uva.nl
Asiedu, Jacob (Univ. Boston) kasiedu-at-cs.umb.edu
Buis, Rob (ETI) rbuis-at-eti.uva.nl
Chalubert, Antoine (LIS, Univ. Paris 6)
de la Torre, Javier (BGBM Berlin) j.torre-at-bgbm.org
Gallut, Cyril (LIS, Univ. Paris 6) gallut-at-ccr.jussieu.fr
Hagedorn, Gregor (BBA Germany) g.hagedorn-at-bba.de
Heidorn, Patrick Bryan (Univ. Illinois, Champaign) pheidorn-at-uiuc.edu
Hobern, Donald (GBIF, Copenhagen) dhobern-at-gbif.org
Kennedy, Jessie (Edinburgh) J.Kennedy-at-napier.ac.uk
Morris, Robert (Univ. Boston) ram-at-cs.umb.edu
Paterson, Trevor (Edinburgh) T.Paterson-at-napier.ac.uk
Thau, Dave (San Francisco) thau-at-learningsite.com
Vignes-Lebbe, Régine (LIS, Univ. Paris 6)

Agenda

Note: we have no discussion notes from Tuesday, so mainly the presentation slides are available here...

Federation of terminology

This discussion tried to attack one of the big remaining problems for SDD. So far we have focused on making it work for locally complete data sets that do not use external terminologies or distributed descriptions over multiple servers. We do explicitly handle the External data sources/resources (whatever the difference between these may be :-)) with proxy data objects and the ExternalDataInterface, see ProxyDataModel on the WIKI. However, for sharing the things specific to SDD we have no clear model.

A major difference to the externals is that in the internals we want rich data. We do not simplify/encapsulate data and reduce it to an interface definition, but we want the whole thing. The ObjectLink mechanism in the ProxyObjects is still relevant, but not the rest. Two models are on the table:


The last discussion, about the problem of the different handling of categorical states and statistical measures and the lacking set definition for the latter, was presented and discussed, but no solution emerged and time was too short to pursue it. This needs to be discussed on the WIKI. Currently the SDD schema is broken at this point, because it partly offers two alternatives.


Please send any necessary corrections to G.Hagedorn@bba.de
(Gregor Hagedorn, Convener)




First published 2004-05-30, last update: 2004-08-19.
