SDD document: Object-relational design of SDD

TDWG working group: Structure of Descriptive Data (SDD)

Synopsis

SDD uses an object-relational model. To validate this model we make liberal use of key and keyref identifiers that are checked using XML schema identity constraints. This document discusses some aspects of these keys and explains the optional debugkey/debugref mechanism provided.

UNFINISHED DOCUMENT!

Keyed objects and human readable representation

Introduction

Most objects that are used in object relations must be identifiable both to humans and to machines. The requirements for these two types of consumers differ:

These requirements can either be fulfilled with a single identifier, or with a pair of identifiers, one addressing humans and another one machines.

Numbers or text as identifiers?

We can consider four different models for unique identifiers:

Type           | Size            | Example                                  | Machine generated (1) | Uniqueness | Memorizability | Semantics | Likelihood of change
integer number | 32 bit          | "872773"                                 | yes                   | local      | low            | none      | low
GUID number    | 128 bit         | "{4B0AF0B9-8683-4D4E-9538-7AC49D9D1767}" | yes                   | global     | none           | none      | low
text code      | variable        | "zqxtoslnpr"                             | yes                   | local      | low            | none      | low
text label     | variable (long) | "leaf shape lanceolate"                  | no                    | local (2)  | high           | high      | very high

Notes: (1) "Machine generated" = can easily be generated without user interaction. (2) Local uniqueness of semantically meaningful text labels requires interaction between the program and the user.

A few years ago the compact and constant size of 32 bit numeric identifiers compared with most text strings was an important argument in favor of using numbers to identify objects. Although this efficiency advantage of integer numbers is still valid, given the increase in available storage space and processing speed it no longer needs to be a decisive argument. The properties regarding uniqueness and likelihood of change thus become the guiding factors in the choice of identifiers.

Perhaps the most important property is the likelihood that an identifier will be changed. We expect the SDD standard soon to be used in federated projects, and we expect to add validation once XInclude or other mechanisms become available. The keys of characters, states, etc. are thus not a matter of a locally defined document or application, but a globally accessible resource. As a consequence, it becomes essential that keys are stable (immutable) over time as well as unique at any given point in time.

Although semantically meaningful identifiers have significant advantages (e. g. when debugging applications), the semantic load makes it likely that the identifier will change over time. The problem is that terminology does evolve (especially so in the first minutes after it has been created!). However, most applications that try to hide the key creation from the designer will have to derive the key by an abbreviation mechanism from the first label created.

As a consequence semantically loaded keys:

(A well known case against semantically loaded identifiers is the past experience with the design of most URLs on the World Wide Web. Everybody responsible for web design agrees that URLs should not change. Nevertheless, the average survival time of a hyperlink is relatively short. In most of these cases the content has not even been truly removed, but the folder structure has been reorganized or the content moved to a different provider. Whenever folder structures reflect hierarchical organization, that hierarchy is likely to change over time, preferably at the root to maximize the damage...)

Among the identifier types that are not loaded with semantics, the simple integer number seems to be preferable. The GUID has few advantages, given that any public identifier is a combination of the globally unique project identifier and a within-project key value. Integer numbers seem to be more natural to use for non-semantic identifiers than random text strings. Also, constraining the values to integers will guide developers in avoiding the problems associated with semantic keys.

An important point related to the use of GUIDs: in a federated situation, the project sits above several federated services. Each service must manage its own identifier space. Would it be advisable to introduce a globally unique modifier per service? Alternatively, the uniqueness model could be extended to three levels: Project-Service-Key.
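
The two-level and three-level uniqueness models discussed above could be sketched as follows. This is an illustration only: the class name PublicId, its field names, and the "/" separator are hypothetical and not defined by SDD.

```python
from dataclasses import dataclass
from typing import Optional
import uuid


@dataclass(frozen=True)
class PublicId:
    """Hypothetical public identifier: project GUID, optional service GUID, local integer key."""
    project: uuid.UUID
    key: int
    service: Optional[uuid.UUID] = None  # only used in the three-level model

    def __str__(self) -> str:
        parts = [str(self.project)]
        if self.service is not None:
            parts.append(str(self.service))
        parts.append(str(self.key))
        return "/".join(parts)


project = uuid.UUID("4B0AF0B9-8683-4D4E-9538-7AC49D9D1767")
service = uuid.uuid4()  # assumption: each service is assigned its own GUID once

two_level = PublicId(project=project, key=872773)
three_level = PublicId(project=project, key=872773, service=service)

print(two_level)
print(three_level)
```

In both variants the within-project key stays a plain integer; only the prefix that makes it globally unique grows.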

Alternative labels for human consumption

A consequence of selecting numeric identifiers is that alternative labels for human consumption must be provided. Examples of the use of these labels are the selection of characters or states when describing or identifying objects, the selection among alternative concept trees, guided keys, etc. The SDD schema requires multilingual labels (= label element with multiple audience-specific representations) for most objects that have a numeric object identifier.

For most objects in the SDD schema the human readable identifier is called Label and provided for an unlimited number of different audiences. (At this point it is interesting to note that the requirement for multiple audiences or languages would require an alternative label in any event, even if labels in some language would be considered required and persistent identifiers...)

In the case of resources the identifier is monolingual and called FreeFormDescription. This label may be a cache of data provided by the service provider for a given resource. As a consequence, it is difficult to require that it must belong to defined audiences. (Should resource labels be called Label as well, although not audience specific? Should they be made audience specific? How, if external services can update them?)

The human readable object labels are required through XML schema cardinality constraints, and their uniqueness is guaranteed through xs:unique identity constraints. For example, no two characters in a project and no two states within one character may have the same label for a given audience.

Note: it is possible that different objects have the same label for different audiences. If labels are generally short (as in states) the same sequence of characters may have different meanings in two languages, and both meanings may be applicable to the definition. Example: Karyopse in German and English!
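
The label uniqueness rule above (unique per audience within one scope) can be illustrated with a small check that mimics what the xs:unique constraint enforces. This is a sketch only; the sample fragment, the audience values "en1" and "de1", and the helper name duplicate_labels are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; element nesting and audience values are assumed.
SAMPLE = """
<Characters>
  <Character key="1">
    <Label>
      <Representation audience="en1"><Text>leaf shape</Text></Representation>
      <Representation audience="de1"><Text>Blattform</Text></Representation>
    </Label>
  </Character>
  <Character key="2">
    <Label>
      <Representation audience="en1"><Text>leaf shape</Text></Representation>
    </Label>
  </Character>
</Characters>
"""


def duplicate_labels(parent, child_tag):
    """Return (audience, text) pairs used by more than one child of `parent`."""
    seen = Counter()
    for child in parent.findall(child_tag):
        for rep in child.findall("Label/Representation"):
            text = (rep.findtext("Text") or "").strip()
            seen[(rep.get("audience", ""), text)] += 1
    return sorted(pair for pair, count in seen.items() if count > 1)


root = ET.fromstring(SAMPLE)
print(duplicate_labels(root, "Character"))
```

Because the audience is part of the compared pair, the same text under two different audiences is not reported, which matches the note above.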


Enriching key and keyref information to improve debugging efficiency

One problem when using numeric key/keyref values will be that, although SDD instance documents are not explicitly designed for human consumption, a certain readability is an important feature to help application developers to be more productive during the debugging process. Given the amount of key/keyref object relations present in the SDD schema, the exclusive use of numerical key-references in the descriptions will be a significant strain during debugging. It is therefore proposed that a mechanism is defined to enrich all defining key and keyref attributes with corresponding non-defining human readable debug information.

One option to achieve this would be to use XML comments after the states to provide this information (<!-- debug: leaf shape#lanceolate -->). However, the association between comments and states is often not preserved during processing (e. g. XmlSpy software will misplace comments if the file is edited in tabular mode), and it may be difficult to find the object that is associated with a comment if the file is analyzed through a DOM (document object model, one of several standard ways to access hierarchical XML information). Furthermore, it is desirable that the debug labeling of objects can be inserted, removed, and updated. If comments are used, some mechanism must be devised to distinguish such comments from other kinds of comments that may be present in the XML file.

It is therefore proposed to add an optional debugkey attribute to every element that has a numeric key attribute, and an optional debugref attribute to elements containing a keyref. The debugkey is a human readable representation of the key value. For most objects this can easily be created based on human readable labels. These are required to be present at least for some audience and must already be unique. Appropriate identity constraints for labels have been defined in the SDD schema (see below for Rules for creating debug labels). Debugref attributes will contain the corresponding debugkey value, just as keyref attributes contain the key value of the corresponding object.

Since debugkey and debugref have identical values, corresponding objects can be found with a simple text search. This may even be easier than searching for the numeric keys, because different uniqueness domains may contain duplicate values (e. g. both a character and a state may have key="1").

A generic XSL transformation: http://www.cs.umb.edu/efg/SDD/sdd/debugSDD.xsl has already been written by Jacob K Asiedu to create, update, or delete these attributes in SDD documents. The XSL deals with the situation that the default or the preferred audience may not be present and an alternative label may have to be used.

The two attributes are named "debug..." to clearly indicate that they are informational only and should not be relied on for any other purposes than debugging applications. Especially, they should always be ignored during import, and not normally generated when exporting data. The use of the generic XSL transformation is by far preferable to application-specific code generating debugkey/ref information.
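
An importer that follows the rule above (ignore debug attributes) might simply drop them before further processing. The following is a minimal sketch, not part of SDD; the function name and the sample elements are hypothetical.

```python
import xml.etree.ElementTree as ET

DEBUG_ATTRIBUTES = ("debugkey", "debugref")


def strip_debug_attributes(root):
    """Remove debugkey/debugref from every element; return how many were removed."""
    removed = 0
    for element in root.iter():
        for name in DEBUG_ATTRIBUTES:
            if name in element.attrib:
                del element.attrib[name]
                removed += 1
    return removed


# Hypothetical two-element document used only to exercise the function.
doc = ET.fromstring(
    '<Doc>'
    '<Character key="1" debugkey="leaf shape"/>'
    '<State keyref="1" debugref="leaf shape"/>'
    '</Doc>'
)
removed_count = strip_debug_attributes(doc)
print(removed_count)
```

The defining key and keyref attributes are left untouched, so the object relations remain intact.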

See also the WIKI on this topic: http://wiki.cs.umb.edu/twiki/bin/view/SDD/DebugrefAutomation.


Rules for creating debug labels

The material in this chapter is aimed at developers interested in how to generate debugkey/debugref information. If you are simply using the available xsl (see above), you do not have to concern yourself with the following information!

Generating debugkey/debugref information is a two-stage process. In a first step a debugkey attribute is generated (based on available label information) for each object that carries a key. In a second step the debugref is created for all objects with a keyref attribute. The key/keyref identity constraint is followed, the corresponding debugkey is found, and written as a debugref.

This process is slightly complicated by the fact that the generation of some debugkeys requires debugrefs to be already present (see Special cases below). The best strategy may be a 2-pass process.
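
The two steps described above could look roughly like this for one key/keyref pair. It is a sketch under stated assumptions: the sample document, the element name CharacterRef, and the hard-coded path table are invented for illustration, and the label is simply taken from the first representation found.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance fragment; names and nesting are assumed for illustration.
SAMPLE = """
<Document>
  <Characters>
    <Character key="1">
      <Label><Representation audience="en1"><Text>leaf shape</Text></Representation></Label>
    </Character>
  </Characters>
  <Descriptions>
    <CharacterRef keyref="1"/>
  </Descriptions>
</Document>
"""

# Hard-coded pairs: (path to keyed elements, path to the elements referring to them).
KEY_TO_KEYREF_PATHS = [
    ("Characters/Character", "Descriptions/CharacterRef"),
]


def add_debug_attributes(root):
    for key_path, ref_path in KEY_TO_KEYREF_PATHS:
        # Step 1: write a debugkey for every keyed object that has a label.
        debugkeys = {}
        for keyed in root.findall(key_path):
            label = keyed.findtext("Label/Representation/Text")
            if label is not None:
                keyed.set("debugkey", label)
                debugkeys[keyed.get("key")] = label
        # Step 2: follow each keyref to its key and copy the debugkey as debugref.
        for ref in root.findall(ref_path):
            label = debugkeys.get(ref.get("keyref"))
            if label is not None:
                ref.set("debugref", label)


root = ET.fromstring(SAMPLE)
add_debug_attributes(root)
print(root.find("Descriptions/CharacterRef").get("debugref"))
```

The lookup table is built per key/keyref pair, so identical key values in different uniqueness domains do not collide.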

Most keyed objects in the SDD schema carry a unique human-readable representation. This representation is not necessarily present in the language of choice, nor does it necessarily exist for the defaultaudience, but it must exist at least for one audience. The following rules point out where these representations can be found.

Simple labeled objects

In the case of labels abbreviated expressions may already exist. Therefore the following general "LabelRule" should be applied:

The majority of keyed objects can be treated using this "LabelRule" directly on the element that carries the key attribute. For example, the debugkey label for Terminology/CodingStatusValues/CodingStatus/@key can be found by applying the LabelRule to Terminology/CodingStatusValues/CodingStatus/Label. The following is a list of the XPaths to these objects:

Terminology/CodingStatusValues/CodingStatus/@key
Terminology/StatisticalMeasures/StatisticalMeasure/@key
Terminology/Modifiers/Probability/Modifier/@key
Terminology/Modifiers/Frequency/Modifier/@key
Terminology/Modifiers/General/Modifier/@key
Terminology/Modifiers/Sets/Set/@key

Terminology/Characters/Character/@key
Terminology/ConceptTrees/ConceptTree/@key

Keys/Key/@key

A closely related case is:   Terminology/Glossary/GlossaryEntry/@key
where Terminology/Glossary/GlossaryEntry/Representation[@audience=default | 1]/Term can be used. This is very similar to the label rule, except that no abbreviations exist and Term replaces Text.
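
The selection "default audience, otherwise the first representation" used for the glossary term might be sketched as follows. The function name pick_representation and the audience values are assumptions; the element names follow the path given above.

```python
import xml.etree.ElementTree as ET


def pick_representation(parent, default_audience, text_tag="Text"):
    """Prefer the representation for the default audience, else fall back to the first one."""
    representations = parent.findall("Representation")
    for rep in representations:
        if rep.get("audience") == default_audience:
            return rep.findtext(text_tag)
    if representations:
        return representations[0].findtext(text_tag)
    return None


# Hypothetical glossary entry with two audiences.
entry = ET.fromstring(
    '<GlossaryEntry key="7">'
    '<Representation audience="de1"><Term>Blatt</Term></Representation>'
    '<Representation audience="en1"><Term>leaf</Term></Representation>'
    '</GlossaryEntry>'
)

print(pick_representation(entry, "en1", text_tag="Term"))
print(pick_representation(entry, "fr1", text_tag="Term"))
```

The text_tag parameter reflects the one difference named above: Term is read for glossary entries where Text would be read for labels.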

Resources

Comparatively simple is the situation for all objects derived from ResourceConnectorBaseType. The required FreeFormDescription can be directly used as a debugkey label:
Entities/Classes/Class/@key
Entities/ClassHierarchies/ClassHierarchy/@key
Entities/Objects/Object/@key
Resources/Agents/Agent/@key
Resources/Publications/Publication/@key
Resources/Geography/Locality/@key
Resources/MediaResources/MediaResource/@key

Nodes in trees

The SDD model has several keyed objects that are nodes in trees:
In Keys/Key the lead nodes have a key: .//Lead/@key
In Terminology/ConceptTrees/ConceptTree the concept nodes have a key: .//Concept/@key

These objects pose special difficulties. Node labels are defined through the context of the tree, i. e. multiple nodes may have the same label if they have a different path. A unique key would be a concatenation of all node labels. In the case of dichotomous keys this is impractical, however, because each key statement in itself may be rather long. In the case of the concept nodes, these are not required to be labeled.

As a start, perhaps the best debugkey label is a fixed string ("ConceptNode: " or "KeyNode: ") plus the label of the node if it exists, plus the key itself in brackets. This provides reasonable uniqueness. Better methods can later be developed.
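
The provisional node rule above can be written down in a few lines. This is a sketch; the function name is hypothetical, and round parentheses are assumed for "brackets", in line with the state example given under Special cases.

```python
def node_debugkey(prefix, key, label=None):
    """Fixed prefix, then the node label if present, then the key in parentheses."""
    if label:
        return f"{prefix}{label} ({key})"
    return f"{prefix}({key})"


print(node_debugkey("ConceptNode: ", 42, "leaf"))
print(node_debugkey("KeyNode: ", 17))
```

Including the key itself keeps the result unique even when several nodes share a label or carry none.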

Special cases

(Note that most debugkey labels here will be created based on previously created debugref attributes. As a consequence, these keys can be created only after both debugkeys and debugrefs for the previous objects are created!)

Descriptions/CodedDescription/@key and
Descriptions/NaturalLanguageDescription/@key
It is doubtful whether it will be helpful to provide a debugkey label for Description objects at all. One method would be to use the debugref of the class or object designation, plus the debugref of the citation, plus the description key itself. Note that currently no keyref exists to this key.

Terminology/Characters/Character/Categorical/States/StateDefinition/@key
The labels of locally defined states are unique only within each character. Although this may be sufficient for debugging purposes, it is recommended to use the LabelRule to obtain a locally unique label and add a character identifier in brackets. As character identifier the debugkey for the parent character may be used, but it is recommended to simply use the key value instead. Example: "lanceolate (char: 328663)". Due to the structure of SDD, the debugkey label of the character can easily be found.
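
The recommended form for locally defined states (state label plus the character key value) could be produced as in the sketch below. The sample document and the function name are hypothetical, and the label is read from the first representation only rather than through the full LabelRule.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the Character/Categorical/States path above.
SAMPLE = """
<Characters>
  <Character key="328663">
    <Categorical>
      <States>
        <StateDefinition key="1">
          <Label><Representation audience="en1"><Text>lanceolate</Text></Representation></Label>
        </StateDefinition>
      </States>
    </Categorical>
  </Character>
</Characters>
"""


def add_state_debugkeys(characters):
    """Set debugkey on each labeled state as '<label> (char: <character key>)'."""
    for character in characters.findall("Character"):
        char_key = character.get("key")
        for state in character.findall("Categorical/States/StateDefinition"):
            label = state.findtext("Label/Representation/Text")
            if label is not None:
                state.set("debugkey", f"{label} (char: {char_key})")


root = ET.fromstring(SAMPLE)
add_state_debugkeys(root)
state = root.find("Character/Categorical/States/StateDefinition")
print(state.get("debugkey"))
```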

A closely related case is the reference within characters to the generic states or statistical measures:
  Terminology/Characters/Character/Categorical/States/StateReference/@key
  Terminology/Characters/Character/Numerical/StatisticalMeasures/StatisticalMeasure/@key
Both these objects carry a key as well as a keyref. The debugkey label for these objects is easily constructed by using the debugref and adding the character identifier in brackets, as with the locally defined states discussed above.

A related case is found in the generic state definitions that depend on the concept nodes:
.//Concept/GenericStates/StateDefinition/@key.
The labels of generic states at a concept node are already unique within each node. Consequently, a good debugkey is a combination of the label of the state (follow "LabelRule") plus the debugkey of the node.

debugrefs

(Ideally a program that creates debugrefs would go into the schema, read out the path to keyref identity constraints, find the corresponding key identity constraints and use this to understand the relationship. Unfortunately, because many references are embedded into complex types that are used multiple times, this is not a simple task. Therefore it is probably necessary to hard-code in the xsl the path to all keyref objects in the instance file.)
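
To illustrate what reading the relationships out of the schema would involve, the sketch below pairs each xs:keyref with the xs:key it refers to. The tiny schema and its constraint names are invented for illustration and are not taken from the SDD schema. Selector paths are relative to the element that declares the constraint, which is one reason the approach becomes difficult with reused complex types.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment with one key and one keyref.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Document">
    <xs:complexType/>
    <xs:key name="CharacterKey">
      <xs:selector xpath="Characters/Character"/>
      <xs:field xpath="@key"/>
    </xs:key>
    <xs:keyref name="CharacterRef" refer="CharacterKey">
      <xs:selector xpath="Descriptions/CharacterRef"/>
      <xs:field xpath="@keyref"/>
    </xs:keyref>
  </xs:element>
</xs:schema>
"""


def keyref_relations(schema_root):
    """Map each keyref selector path to the selector path of the key it refers to."""
    key_selectors = {
        key.get("name"): key.find(f"{XS}selector").get("xpath")
        for key in schema_root.iter(f"{XS}key")
    }
    relations = {}
    for keyref in schema_root.iter(f"{XS}keyref"):
        refer = keyref.get("refer").split(":")[-1]  # drop a namespace prefix, if any
        selector = keyref.find(f"{XS}selector").get("xpath")
        relations[selector] = key_selectors.get(refer)
    return relations


relations = keyref_relations(ET.fromstring(SCHEMA))
print(relations)
```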


Audience keys

Although Audiences [@@add link to doc@@] are user definable, they form an exception in that string (xs:Name) rather than numeric keys are used. It is recommended that audience keys be composed of the language code, the expertise level digit, and, if multiple audiences are defined for the same combination, a differentiating lower case letter.

As a result they are easily recognized during debugging and no "debugref" attribute has been defined for these.
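
A check for the recommended audience key pattern might look like the sketch below. The exact shape is an assumption: a two-letter lower-case language code is assumed here, although the text only says "language code", and the function name is hypothetical.

```python
import re

# Assumed shape: two-letter lower-case language code, one expertise digit,
# and an optional differentiating lower-case letter.
AUDIENCE_KEY = re.compile(r"[a-z]{2}[0-9][a-z]?")


def is_recommended_audience_key(value):
    """Return True if `value` follows the assumed recommended pattern."""
    return AUDIENCE_KEY.fullmatch(value) is not None


for candidate in ("en1", "de3b", "english", "en"):
    print(candidate, is_recommended_audience_key(candidate))
```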


Conclusion

The different requirements of object identifiers for machine interaction, human consumption, and the process of debugging have led us to a model where objects are identified through three complementary identifiers:


Gregor Hagedorn; Vers. 1; 20. Nov. 2003



First published 2003-11-20, last update: 2003-11-21.
