TDWG working group:
Structure of Descriptive Data
Minutes of working sessions in Australia, 11-14 March 2002
(Version 1.1)
The meeting was split into two parts (Canberra and Sydney) to maximize
participation. Many thanks go to Greg Whitbread and Karen Wilson, who provided
the local facilities in Canberra and Sydney, respectively!
Minutes SDD meeting March 11-12 in Canberra
Participants
Jim Croft, Neil Fitzsimmons, Gregor Hagedorn, Liz Kolster, Bob Morris, Steve Shattuck, Kevin
Thiele, Greg Whitbread, Eric Zurcher.
Major discussion points 11.3.2002
Standard as a whole:
- Use cases for the standard:
- interactive identification
- generation of guided keys
- natural language reporting in xhtml in multiple languages
- form-based data entry and revision
- markup (complete or incomplete) of legacy descriptions
- phylogenetic analysis
- analysis of character distribution
- analysis of errors
- improvement of scientific terminology used for descriptions
- The standard should support data exchange of
- entire projects (defining item descriptions as well as terminology)
- fine-grained partial data (e.g. through web-services)
- For entire projects: Data should be as complete as possible, the format should be suitable for
archival of scientific data. Data exchange between applications should be possible with minimal
information loss.
Markup of legacy descriptions versus content generation from scratch:
- For markup purposes, a hierarchy markup (e.g. organism-parts hierarchy) of the description is
required and should be supported.
- Strict standard (and xsd) for fully validated and cross-referenced information, lenient
standard (and xsd) for markup which may be potentially incomplete (mixed xml-content). This is
similar to xhtml strict versus transitional.
- It is desirable to be able to treat generated documents identically to marked-up legacy
descriptions. The generated natural language descriptions should therefore contain character as
well as hierarchy mark-up.
- It is possible to define only a single document type, which contains the full structured
description as well as natural language text interspersed through the item descriptions. A hierarchy
could be added as markup, which, however, has no bearing on the data. Under this model, the strict
version should be valid under the lenient schema, but not vice versa.
- Example for markup of existing text, allowing mixed content and adding hierarchy/structural
markup. The markup should not change the original description (which may use different terminology
in different items), and should allow characterizations that are not yet:
<Document><Items><Item>
  <NaturalLanguageDescription lang="en" ExpertiseLevel="basic">[...]
    <charactergroup keyref="stipules">stipules
      sharply pointed,
      3 mm wide,
      <character keyref="stipule-color">
        <state keyref="brown">
          <modifier keyref="dark">darkish</modifier>
          brown;
        </state>
      </character>
    </charactergroup> [...]
- If mixed content should be avoided, free text could be embedded in elements, e.g.
<Document><Items><Item>
  <NaturalLanguageDescription lang="en" ExpertiseLevel="basic">[...]
    <charactergroup keyref="stipules">
      <wording>stipules</wording>
      <wording>sharply pointed, 3 mm wide,</wording>
      <character keyref="stipule-color">
        <wording>darkish brown;</wording>
        <state keyref="brown">
          <modifier keyref="dark"/>
        </state>
      </character>
    </charactergroup>[...]
  </NaturalLanguageDescription>
</Item></Items>
<Terminology>[...]</Terminology>
</Document>
In both cases, "sharply pointed, 3 mm wide" remained unparsed. Note that the entire character
wording is left in a single block here, to minimize markup. Description text that has not yet been
assigned to structure or character data can easily be recognized. The group agreed on the version
without mixed content.
- The terminology may be defined, but does not have to be present in the lenient xml-schema part
in the NaturalLanguageDescription element. This enables markup projects to start markup and
validate their document syntactically, before creating the full terminology for it. A first
terminology could be automatically created by a processor from the parsed item descriptions. The
terminology could then be revised and the markup validated using it. This would require a variant
of the main schema, where the keyrefs in the NaturalLanguageDescription element are validated, as
they generally are in the FormalDescription element.
- Alternatively, two document types could be generated, one optimized for natural language and
containing character and hierarchy markup, but not all data, the other containing all data, but not
the natural language markup. Agreement was reached that the second solution is probably preferable.
The creation of an example instance document was continued under this assumption (however, this
decision was reversed 2 days later in Sydney, see below).
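The automatic creation of a first terminology from parsed item descriptions, mentioned above, could be sketched as follows. This is a minimal illustration in Python, not part of the meeting results: the element names are taken from the wording-based example above, while the layout of the generated Terminology skeleton (one flat definition element with a "key" attribute per distinct keyref) and the function name draft_terminology are assumptions.

```python
import xml.etree.ElementTree as ET

# Element names follow the wording-based example in these minutes; the layout
# of the generated <Terminology> skeleton is an assumption for illustration.
SAMPLE = """
<Document><Items><Item>
  <NaturalLanguageDescription lang="en" ExpertiseLevel="basic">
    <charactergroup keyref="stipules">
      <wording>stipules</wording>
      <wording>sharply pointed, 3 mm wide,</wording>
      <character keyref="stipule-color">
        <wording>darkish brown;</wording>
        <state keyref="brown"><modifier keyref="dark"/></state>
      </character>
    </charactergroup>
  </NaturalLanguageDescription>
</Item></Items></Document>
"""

MARKUP_TAGS = ("charactergroup", "character", "state", "modifier")


def draft_terminology(document: ET.Element) -> ET.Element:
    """Collect every keyref used in the markup into a skeleton Terminology."""
    terminology = ET.Element("Terminology")
    seen = set()
    for description in document.iter("NaturalLanguageDescription"):
        for element in description.iter():
            keyref = element.get("keyref")
            if element.tag not in MARKUP_TAGS or keyref is None:
                continue
            if (element.tag, keyref) in seen:
                continue
            seen.add((element.tag, keyref))
            # One flat definition per distinct (element type, key) pair.
            ET.SubElement(terminology, element.tag, {"key": keyref})
    return terminology


if __name__ == "__main__":
    draft = draft_terminology(ET.fromstring(SAMPLE))
    print(ET.tostring(draft, encoding="unicode"))
```

The draft deliberately stays flat. Nesting states under their characters, and adding names or wordings, would be part of the manual revision step described above.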
Discussion about repeated observations / raw data.
- The need for this was generally accepted, but no formal solutions were proposed. Challenge
cases and further discussions are needed.
Top-down modeling of major classes in UML
- Character is a definitional assemblage of
- property with property states (color: red, green, ...; shape: globose, elliptical, cylindrical, ...)
- structural part of organism
- method of observation
Attempt at simple xml-document
- Should cope with DELTA while being more flexible, on the basis of Kevin's first challenge
case.
- Test of hypothesis that the basic structure of DELTA can be reused, with modifications.
- Characters can be arranged hierarchically (basic property + properties, methods, structural
parts, etc.). We established the need for multiple character hierarchies. They will be
defined in the terminology section separately from the item instances.
Characters, states, and charactergroups need to have a unique id attribute.
- This attribute may be numeric or alphanumeric. For debugging and testing purposes it is
desirable to define it in a readable way, e.g. using the initial character or state name, and
adding random numbers if required to make unique.
- It must be an attribute and cannot be replaced by markup:
"<charactergroup><name>stipules</name>" looks equivalent to "<charactergroup name="stipules">stipules". However, the
element form cannot work in multiple languages: in "<charactergroup><name>Nebenblätter</name>" the identifier itself
is translated, whereas "<charactergroup name="stipules">Nebenblätter" is possible.
- This id-attribute cannot be "id" (predefined in xml) and should not be "name" (confusion with
the real character or state name). Options were "code" and "key"; the latter was preferred since
key/keyref is standard xml terminology.
- The id-attribute may not be changed after its initial creation. It may be used in
markup-projects that are disconnected from the database.
- States may have a key that is unique only within a given character. Bob verified that a
combined key of character keyref and state keyref can be supported in xml-schema, although this is
more complicated than a document-wide unique state key.
- In Canberra and in the working document, these assumptions are used. However, later in Sydney,
the discussion was reversed and document-wide unique, artificial state keys were preferred (see
below).
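The suggestion above to derive readable keys from the initial character or state name, adding random numbers only where needed for uniqueness, might look like the following Python sketch. The function name make_key, the lower-case hyphenated form, and the four-digit suffix are assumptions made for illustration. The simple pattern also reduces non-ASCII letters to hyphens, which a real implementation would probably want to handle differently.

```python
import random
import re


def make_key(name: str, existing: set, rng: random.Random) -> str:
    """Return a readable key derived from `name` that is not yet in `existing`."""
    # Lower-case the initial name and replace runs of other characters by "-".
    base = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-") or "key"
    key = base
    while key in existing:
        # Collision: append a random number, as suggested in the minutes.
        key = f"{base}-{rng.randrange(1000, 10000)}"
    existing.add(key)
    return key


if __name__ == "__main__":
    keys = set()
    rng = random.Random(2002)
    print(make_key("Stipule color", keys, rng))  # readable key, no suffix needed
    print(make_key("stipule color", keys, rng))  # same base, so a suffix is added
```

Once issued, such a key would be stored and never regenerated, in line with the requirement that the id-attribute is not changed after its initial creation.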
Major discussion points 12.3.2002
- Discussion of the outer elements, item versus taxon.
- Each item, which may be taxon, object, part of object, specimen, disease, etc. may have several
descriptions.
- The item is a referencing container, referencing the source of data (direct specimen
observations, a single published reference, or a newly authored description based on a collation of
personal observations and information from the literature).
- The description is an authorship container defining the person or collaboration team
responsible for the treatment and the interpretations necessary.
- The origin of a piece of information is defined as being either authored, collated from below (e.g.
species from specimen, genus from species), or implied from above.
- The origin could be defined at the description level. However, this would not allow mixed
descriptions, where some information is inherited from above, some is collated from below,
and some is directly authored. This topic needs further discussion!
- The outer elements of the document root include a resource container, that may either point to
URIs, or directly contain the encoded media content. The format of this container needs to be
further specified in future work.
- A stored natural language description could be an element of Description. It could contain the
full text from a flora, which may or may not be partially marked up, or it may contain an
autogenerated natural language description as a cache. These situations can be distinguished
through the origin attribute.
Feature versus character revisited
- Character should only have one level. Feature is more general and may occur recursively.
Kevin's proposal to use recursive feature structures that include the hierarchy elements and use
terminal value nodes was discussed. Having a hierarchy here as well requires two different methods to
read and write hierarchies: one for the hierarchy embedded in the item descriptions (which may be
incomplete, since the item output does not contain all hierarchy elements) and one for the general
multiple hierarchy definitions in the terminology section.
- From Collins English Dictionary:
feature n. 1. any one of the parts of the face, such as the nose, chin, or mouth. 2. a prominent or
distinctive part or aspect, as of a landscape, building, book, etc. 3. the principal film in a
programme at a cinema. 4. an item or article appearing regularly in a newspaper, magazine, etc.: a
gardening feature. 5. Also called: feature story. a prominent story in a newspaper, etc.: a feature
on prison reform. 6. a programme given special prominence on radio or television as indicated by
attendant publicity...
character n. 1. the combination of traits and qualities distinguishing the individual nature of a
person or thing. 2. one such distinguishing quality; characteristic. 3. moral force; integrity: a
man of character. 4. a. reputation, esp. a good reputation. b. (as modifier): character
assassination. 5. a summary or account of a person's qualities and achievements; testimonial: my
last employer gave me a good character. 6. capacity, position, or status: he spoke in the character
of a friend rather than a father. 7. a person represented in a play, film, story, etc.; role ...
Character hierarchies (CharacterGroup element):
- Bob showed how to use key/keyref xml-schema definitions to validate the integrity of
references.
- An example case was designed.
- Hierarchy items may have child hierarchy items as well as characters on the same level (e.g. a
character directly for 'entire organism'). Unnamed hierarchy items, introduced so that a level
contains either only hierarchy items or only characters, are considered an implementation issue and
should not appear in the output.
- Character hierarchies will be used to define
- Hierarchical trees
- Flat classifications
- Character subsets (e.g. for reporting or editing smaller views)
- Gregor's proposal to introduce an attribute 'EnsureCompleteness', which would give the application
the opportunity to try to automatically add new characters to those character groupings that are
so marked, was rejected.
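The kind of integrity check that key/keyref definitions provide can be imitated outside a schema processor. The Python sketch below is a stand-in for such a check, not the schema shown at the meeting. It uses the Canberra assumption that a state key is unique only within its character, so a state reference is resolved through the combined (character, state) key. The layout of the Terminology element and the function name dangling_references are assumptions.

```python
import xml.etree.ElementTree as ET

# The Terminology layout (character elements with nested state elements,
# each carrying a "key" attribute) is an assumption for illustration.
SAMPLE = """
<Document>
  <Items><Item><FormalDescription>
    <character keyref="leaves, shape"><state keyref="round"/></character>
    <character keyref="leaves apex, shape"><state keyref="acute"/></character>
  </FormalDescription></Item></Items>
  <Terminology>
    <character key="leaves, shape"><state key="round"/></character>
    <character key="leaves apex, shape"><state key="round"/></character>
  </Terminology>
</Document>
"""


def dangling_references(document: ET.Element) -> list:
    """Return (character keyref, state keyref) pairs lacking a definition."""
    defined = set()
    for character in document.iterfind("Terminology/character"):
        defined.add((character.get("key"), None))
        for state in character.iterfind("state"):
            # Combined key: a state key is unique only within its character.
            defined.add((character.get("key"), state.get("key")))

    missing = []
    for character in document.iterfind("Items/Item/FormalDescription/character"):
        char_ref = character.get("keyref")
        if (char_ref, None) not in defined:
            missing.append((char_ref, None))
        for state in character.iterfind("state"):
            pair = (char_ref, state.get("keyref"))
            if pair not in defined:
                missing.append(pair)
    return missing


if __name__ == "__main__":
    for char_ref, state_ref in dangling_references(ET.fromstring(SAMPLE)):
        print(f"undefined reference: character={char_ref!r}, state={state_ref!r}")
```

In the sample, the state "round" is defined under both characters, which is legal under the combined key, while the reference to "acute" has no definition and is reported.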
Finally...
- The distinction between frequency modifiers and other modifiers, advocated by Gregor and
challenged by Steve, was discussed.
- The instance was converted to xml-schema and revised.
Minutes SDD meeting March 14 in Sydney
Participants
Stan Blum, Gregor Hagedorn, Liz Kolster, Bob Morris, Dave Thau, Greg Whitbread.
Major discussion points 14.3.2002
- Outer structure revised, the recommendation to store unprocessed information and pass it
through in the next output was formulated.
- The decision to split the project into two document types with associated schema was reversed.
Both types, the strict, validated data and the lenient data from legacy descriptions (not parsed,
partially parsed, or fully parsed and marked up), are accommodated in two elements within the item
element. The names "Description" and "DescriptionFreeText" were considered confusing, and from the
following proposals "NaturalLanguageDescription" (for authored or generated natural language
descriptions) and "FormalDescription" (for strict descriptions fully validated against the
terminology defined in the Terminology element) were selected.
| Proposed names for strict container (containing only data) | Proposed names for lenient container (containing wording and optional parsed markup) |
| FormalDescription | NaturalLanguageDescription |
| FormalDescription | InformalDescription |
| CodedDescription | NaturalLanguageDescription |
| StructuredDescription | FreeformDescription |
| StrictDescription | LenientDescription |
| Teutonic | Jamaican |
Natural language markup was rediscussed and the examples improved.
- The attributes lang="en" ExpertiseLevel="basic" should be optionally used in the
DescriptionWording element used for markup. It may be advantageous to parse and mark up the
existing natural language descriptions to the extent of using the elements WordingBeforeStates,
StateWording, and WordingAfterStates instead of Wording alone.
- For the ease of translating authored free text descriptions, it may be desirable to allow
multiple languages side-by-side at the character group or the character level, rather than at the
DescriptionWording level. The individual Wording elements should therefore allow the "lang" and
"ExpertiseLevel" attributes. The schema should then contain the requirement that either the
entire DescriptionWording is a container specifying the language, and individual wordings lack these
attributes, or that the DescriptionWording lacks the language attributes, and each individual wording
element contains them.
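The either/or requirement on the language attributes can be expressed as a small check. The following Python sketch is only an illustration of the rule as worded above. The function name language_placement_ok is hypothetical, and treating "lang" as mandatory on each Wording in the second case (with "ExpertiseLevel" remaining optional) is an assumption, since the minutes do not settle this.

```python
import xml.etree.ElementTree as ET

LANG_ATTRIBUTES = ("lang", "ExpertiseLevel")


def language_placement_ok(description_wording: ET.Element) -> bool:
    """Check that language attributes sit on the container or on every Wording."""
    wordings = list(description_wording.iter("Wording"))
    container_has = any(a in description_wording.attrib for a in LANG_ATTRIBUTES)
    if container_has:
        # Case 1: the container specifies the language; no Wording may repeat it.
        return not any(a in w.attrib for w in wordings for a in LANG_ATTRIBUTES)
    # Case 2: the container is silent; every Wording must state its language.
    return all("lang" in w.attrib for w in wordings)


if __name__ == "__main__":
    side_by_side = ET.fromstring(
        '<DescriptionWording>'
        '<Wording lang="en">stipules</Wording>'
        '<Wording lang="de">Nebenblätter</Wording>'
        '</DescriptionWording>'
    )
    print(language_placement_ok(side_by_side))
```

In an actual schema this constraint would more likely be met by offering two alternative content models, because a plain attribute declaration cannot express such a co-occurrence rule directly.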
New discussion about meaningful or artificial keys for charactergroups, characters, and
states.
- It was realized that the xml-schema would be simpler if the keys of state definitions were
document-wide unique.
- As long as key and keyrefs are only valid within an application or document, the keys can be
freely changed. However, future applications should be able to refer to the terminology of other
projects. A project may define that a given state has the same semantics as another state in a
centrally accessible terminology document. This will enable applications to compare or collate
descriptions from multiple sources. To support such functionality, the keys should remain unchanged
when being migrated from one application to another, and should remain constant as long as
possible.
- Many character/state rearrangements do not change the semantics of a state. For example,
whether something observed by different methods is stored in one or two characters is often a
matter of taste. Solution one may be:
Surface roughness (as observed by light microscope)
- rough
- smooth
Surface roughness (as observed by SEM)
- rough
- smooth
and solution two may be:
Surface roughness:
- rough (as observed by light microscope)
- rough (as observed by SEM)
- smooth (as observed by light microscope)
- smooth (as observed by SEM)
It was therefore concluded that persistent, document-wide unique character state keys are
preferable.
- The keys should be numeric, to avoid taking the key strings at face value. If keys are to be
permanent and unchangeable to allow external references, an alphanumeric meaningful key (as still
used in the instance document created during the discussions) can only be based on the initial name
of an element. A character "leaf surface color" may, however, later be changed, for example, to
"leaf upper surface color". This is especially true for character states:
<state keyref="leaves surface characteristics,glossy"/>
could be moved into another character so that it really would become:
<state keyref="leaves glossiness,glossy"/> (it would have to remain unchanged, however).
- Conclusion: in the example instance the situation (item formal description):
<character keyref="leaves, shape">
  <state keyref="round"/>
</character>
<character keyref="leaves apex, shape">
  <state keyref="round"/>
</character>
should be changed to just:
<state keyref="1238782374"/>
<state keyref="1238782344"/>
- Discussion of level of authorship and responsibility attribution. The current proposal is to define
it at a container level (FormalDescription) rather than at individual state scores within an item.
The final decision needs further discussion, especially considering the need to integrate collation
of partial descriptions on the one hand, and the need to be able to contradict an erroneous
statement on the other hand. One solution is to do a deep copy of collated information, and delete
the erroneous statement. The alternative could be a "DeleteState" or "Do not use this state"-statement
on the item x character state level.
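The first solution mentioned above, a deep copy of the collated information followed by deletion of the erroneous statement, can be illustrated with a short Python sketch. The flat FormalDescription with numeric state keyrefs follows the conclusion on state keys above. The function name collate_without is hypothetical, and the alternative "DeleteState" approach is not shown.

```python
import copy
import xml.etree.ElementTree as ET

SOURCE = """
<FormalDescription>
  <state keyref="1238782374"/>
  <state keyref="1238782344"/>
</FormalDescription>
"""


def collate_without(source: ET.Element, rejected_key: str) -> ET.Element:
    """Deep-copy a description and drop the state considered erroneous."""
    collated = copy.deepcopy(source)
    # Materialize the traversal first so that removal does not disturb it.
    for parent in list(collated.iter()):
        for state in parent.findall("state"):
            if state.get("keyref") == rejected_key:
                parent.remove(state)
    return collated


if __name__ == "__main__":
    original = ET.fromstring(SOURCE)
    revised = collate_without(original, "1238782344")
    print([s.get("keyref") for s in revised.iter("state")])
    print(len(original.findall("state")))  # the source description is untouched
```

The sketch also shows the cost of this solution: after the copy, the collated description no longer records that a statement was contradicted, which is what the explicit "Do not use this state" alternative would preserve.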
XML documents
The following xml example documents and xml schema have been developed during the sessions in Canberra and Sydney:
How to proceed?
- Bob proposes to use a browser-based discussion forum rather than the email list. Gregor is
undecided, since it disrupts current procedures, and may lead to further decreased
participation.
- An important point would be that the forum should be readable by all, writeable only after
login, and that the read/unread status is shown for individual users.
- We should attempt to obtain financing for a new meeting with at least three, preferably four, days of discussion.
- Opportunities to meet:
- TDWG 2002 Campinas/Brazil October 18-22.
[Note added in August: The SDD meeting is now planned for October 14-17]
- special European SDD-meeting (replacement for the canceled Paris meeting in March)
- GBIF interoperability group
Please send any necessary corrections to G.Hagedorn@bba.de
(Gregor Hagedorn, Convener)
First published 2002-04-17, last update: 2002-10-19.
