Comments on SDD proposal: "Special states declaring missing data"

Posted to the SDD discussion list by Steve Shattuck on Tue, 18 Mar 2003 15:43:01 +1100
in response to "SDD document: Special states declaring missing data" Vers. 1.1 by G. Hagedorn.
(See also the revised current version of that document.)

TDWG working group: Structure of Descriptive Data (SDD)

Gregor: Below are Steve Shattuck's comments on the Special states document Vers. 1.1 by G. Hagedorn in normal text; and Gregor's comments/replies indented and colored like this text.

One note beforehand: When writing a proposal I usually expect it to be less a draft of what is to become a standards document than a clarifying discussion, containing material that may or may not become part of a standard document.


Steve: At some level the details of the SDD structure are less important than its content. This is because few (if any) consuming applications will use SDD other than as a source of information, this information being translated into application-specific formats for processing and analysis. This requires mapping SDD items to local items and this can be done for any given SDD structure.

Gregor: I disagree with the notion that "SDD items can always be mapped to local items". I think this is not true if data quality is an issue. If I have a format with limited structure (e. g. plain xhtml) there are good chances of capturing the essence of information. If I have a highly structured format (e. g. a database with a detailed information model), I will usually not be able to import low-structured information without either heavily revising it first, or losing a significant amount of information. For example, importing DELTA into DeltaAccess requires some very complicated heuristics to make sense of certain situations. In many cases I have to degrade data by moving data into comments. There may be better ways than those chosen by DeltaAccess, but I believe the problem is a fundamental one.
In my mind there is a distinct difference between information I want to look at, perhaps cite, and then throw away again (for which an ABCD-like polymorphic data representation is perhaps the best solution) and information that should form the basis of the next revision. I envisage SDD for the latter purpose.

Taking this approach, I would make only one suggestion: generalize, Generalize, GENERALIZE! What I mean is that the current Special States document attempts to define all of the possible Special States that are to be part of SDD and even goes so far as to exclude some potential Special States. I would strongly suggest that SDD have a general method for handling these, with at most a list of suggested Special States rather than trying to be too restrictive. I firmly believe that it will be extremely difficult (and short sighted) to try to define these as part of the SDD Standard. The list of useful Special States is much longer than the few suggested in the current document, and an extension is even suggested at the end of the document, a "ToDo" Special State. Many more can be thought of with little effort.

Gregor: I fully agree that there may be more special states and we can add more in the discussion. We can also add more in an "SDD version 2" as soon as the demand arises. I disagree with you insofar as I believe that the current structure is especially suitable to allow this without any problems in future versions. It is even possible to open it up (no longer constrain the list of missing data indicators present in an SDD document through the xml-schema) so that in future versions project designers can add their own special states.

I believe with generalization you may mean that the "special states" have a set of properties that identify them. Such a mechanism is in fact proposed for statistical parameters in SDD, where in the globally defined shared parameter set each parameter has a set of attributes which, for example, define a group of parameters like arithmetic mean, median, or mode as a "central value", allowing processors to rely on this generalizing attribute if this is sufficient for their purposes (e. g. report formatting).

Ideally the current fixed list of special states would only be an intermediate step to later defining a list of properties that define the semantics of these states. For special states nobody in the group so far had a bright idea how to generalize them, but it would be a welcome addition that can be easily accommodated, similar to how it is done in statistical parameters.
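The idea of a generalizing attribute can be sketched as follows. This is a minimal illustration in Python, not SDD syntax: the dictionary layout, the attribute name `role`, the "range limit" entries, the sample measurements and the function `central_value` are all assumptions made for the example. Only the parameter names (mean, median, mode) and the label "central value" come from the discussion above.

```python
# Hypothetical shared parameter set: each statistical parameter carries a
# generalizing attribute ("role"). The layout is illustrative, not SDD syntax.
PARAMETERS = {
    "mean": {"role": "central value"},
    "median": {"role": "central value"},
    "mode": {"role": "central value"},
    "min": {"role": "range limit"},   # assumed extra entries for contrast
    "max": {"role": "range limit"},
}


def central_value(measurements):
    """Return (name, value) of the first parameter whose role is 'central value'.

    A report formatter can rely on the role alone and does not need to know
    whether the author supplied a mean, a median, or a mode.
    """
    for name, value in measurements.items():
        if PARAMETERS.get(name, {}).get("role") == "central value":
            return name, value
    return None


# Invented sample data for one measured character.
leaf_length = {"min": 12.0, "median": 18.5, "max": 25.0}
print(central_value(leaf_length))
```

A processor that only formats reports would call `central_value` and ignore which concrete parameter was recorded; a statistical processor could still inspect the parameter name.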

This same difficulty (trying to be too specific) has arisen in related activities within TDWG and has resulted in an unfortunate split among participants. There are two camps within the specimen-based standards group, one arguing for a complete description of specimens, the other looking for a set of common items across most groups. The first group has developed a "standard" with some 400 items while the second has selected in the neighbourhood of 20 or so items. My impression is that SDD is taking the first approach: study the problem long enough and the entire universe of data items and their relationships can be documented. I'm not confident this is possible. Biology is too diverse and needs change too quickly to document all of life in any stable, meaningful and useful way. We can provide a framework and make suggestions, but we won't be successful if we try to dictate too narrowly.

Gregor: I agree completely for the biological terminology. However, does that apply to the knowledge management intended by the "special states"/"coding status"?

So my suggestion is that we focus on general needs, such as support for Special States, but that we don't try to define other than at most a few basic data items (such as general Special States). What we provide is a mechanism to support this functionality with the ability to extend it in a controlled and flexible way.

Gregor: I basically agree with this, see above. I thought an agreed list of states for "version 1" might be a simpler mechanism than atomizing the semantics into machine readable properties. We thought in Brazil that perhaps the first version of the SDD format could live with a list of supported states and define the semantics in English in the documentation, but I admit that it would be much better to present a complete solution like in the statistical parameters if we believe we later need it anyways. In the revised version of the document I have added columns to the table in "Summary of all proposed missing data indicators" as proposals of two properties that could be used to describe generalized missing data indicators.


OK, this is where it gets ugly. I have a few specific comments on the current Special States document, some relating to general issues, some to specific comments. These are less important than the message above and largely represent a slightly different perspective on SDD from the one given in the current document.

First, the scope of SDD. Currently SDD is trying to address three separate and non-overlapping issues: (1) data, (2) business practices and (3) application development. Data is the specific information items under consideration, business practices are how we think of these items and how we handle them, and applications are specific pieces of software used to manage data items. The current Special States document mixes these three areas, talking about data items, how people interpret them and how they might be displayed by a computer program. The first area is clearly within the scope of SDD, the second possibly but only in a very general, non-specific way, and the third is clearly outside SDD (and well it should be). Bob made essentially this same comment when he said that he thought "project management problems were not part of SDD." We need to focus on DATA and nothing more.

Gregor: I basically agree on application development or best practices. However, I would like to keep comments on application behavior in the proposals I write, because I find them very useful. My solution is to mark them as recommendation, and perhaps this is not the right term in English. What would be the right term, to distinguish an author's recommendation from an, in the future, "official SDD recommendation"? Also I probably occasionally failed to mark them properly, and any help is appreciated. To a large extent you are correct in your criticism, except that in a way we need to know about how things are used ("business practices") to prioritize items.

However, there is a deeper issue here: I do realize that my notion of what are data is in contrast to the comments I received from Bob, Guillaume, and Steve. Perhaps I really need some help here, I just don't see how to come around. To me the boundary between data and practices appears blurred, because I believe we are not modeling properties of objects, but concepts that people superimpose on objects to communicate about their properties. The properties of biological objects are DNA-sequences, proteins, arrangements of cells defined in a physiological or mathematically traceable way. Descriptive data very rarely operate on this level.

Gregor: With regard to the question of "project management problems": In my mind, the information whether data are missing because they are unfinished, positively unknown, purposely neglected is vital data for anybody whose intention it is to revise an existing descriptive data set, extending its scope, removing errors, etc. This revision process is embedded in any scientific publication and part of the sluggishness of doing taxonomic revisions is that the revising person lacks significant information from the previous revision. I consider moving the process of describing taxa into a data format to be at the core of SDD.

Moreover, I think that even the confusingly presented (and perhaps inherently confusing) "use default" situation is informative. Note that this is in fact proposed to be functional rather than data by not coding it. However, having Characters in a description that are missing will always be informative in data exchange. The only question is: Does SDD define semantics for this situation, and I think it should. If I get a database with Null values, I indeed often have a problem because I am lacking the semantics of the Null value. Without associated documentation, many data sets can only be guessed about.

The second serious problem I have with Special States is that they aren't "states" in the normal, taxonomic sense. Taxonomic "states" are observations made about biological objects (be they taxa or specimens) that are grouped together as "characters." We use "states" to describe items. However, Special States aren't observable and aren't used to describe items.

What are "Special States"? They are a mix of two separate sets of information. They document the status of data coding (for example "unknown", "not interpretable" and "unfinished work") and they document application-specific business rules (for example "use character default state" - get the data based on a flag set for a specific state of this character). These are fundamentally different things and show the danger of mixing data, business practices and application development.

Gregor: No problem, I agree. I was in fact unhappy about the presentation of use-default state myself, only I did not have the bright idea how to restructure this and thought I would publish it to the group rather than sitting on it. "Use default state" is a "special situation", not a "special state". The document already notes that the entire term "Special state" is up for reconsideration; I was never happy with it... (Steve in later email to list proposed "Coding status" as a better term than "special states".) I agree that there is a difference in the unknown etc. special states to coding categorical information. However, they are similar, in that they relate to characters (... if we arbitrarily and for I believe good practical reasons keep the character as the atomic reorganization level). They are probably as different from categorical states as the multiple statistical or measurement "states" are. I am really missing better words here.

Lack of appreciation for these differences has resulted in "use default" being treated as a case of "data have not yet been entered" when in fact a state has been clearly indicated (through a pointer): it's the default state (however the data or application defines this) and is certainly known. It's a user-interface shortcut used by an application to make data entry quicker (in the opinion of an application developer!). The end result is exactly the same as if the state had been entered directly in the description (the taxon by character intersection, or what DELTA calls the Attribute).

Gregor: I disagree here (and I see that my presentation is not clear enough here). "Use default" is not a pointer to speed up data entry, but rather an explicit decision that information implied from higher taxa is appropriate. The point is: Is the pointer to default explicit or not? This is a structural decision that has to be made in SDD. I propose it to be implicit, which means that adding new data to the terminology (which is beyond the control of the holder of the descriptive data in federated situations!) is also treated as "use default state". In fact, by doing this I propose that explicit pointers to defaults in the terminology are an internal matter of applications and not represented in data exchange through SDD.

This thinking results in some very complex processing being required by consumers of SDD documents. For example, it is suggested that if a description is null, then the "special state" called "use default" should be inserted and that this "state" be set to "not yet evaluated." This means that to interpret an SDD document you not only have to understand the individual data items but also know the rules for processing them (to know that if data is missing you need to insert a special value and then look this value up in another list to translate it into something meaningful). There must be a cleaner way to do this!

(Gregor: I realize that I have to revise the paragraphs you are referring to.)

The "unknown", "not interpretable" and similar "special states" are another problem altogether. These "states" have nothing to do with characters and everything to do with descriptions. This is why Mike Dallwitz didn't include them as part of the Character List. I can understand why you might want to treat these as "states" of characters as this is the basic DELTA paradigm. However, there is no need to do this and doing so only complicates things. For example, to tell if a description has not been coded you'll need to get its state(s) and then check their translations to see if any are "unknown", "not interpretable" or any one of who knows how many other special conditions.

Gregor: I am very open not to call these "semaphores" states, but I find the statement "have nothing to do with characters and everything to do with descriptions" misleading. Below you call the "semaphores" flags. However, they do not apply to a description as a whole, but they inform about characters in the object that is being described.

I am not sure what you mean with translations. In the schema the key values for the global special states are fixed in an enumerated list, which informs the processor about the semantics.

A much better way to implement this functionality would be to store an "uncoded" flag with the description along with an (encoded or text-based) explanation ("unknown", "not interpretable", "too lazy to code this", "don't have proper specimens", "To Do" or whatever). This is both direct and allows the explanation to change in a simple and flexible way.

Gregor: No problem calling these "things" "special flags" or anything else rather than "special states" to avoid confusion. However, they still have the scope of a single character when used in a description.
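The two positions in this exchange can be combined in one small sketch: an "uncoded" flag with an explanation (Steve's proposal), attached to a single character within a description (Gregor's scope). This is an illustrative Python data structure only. The class name `CharacterEntry`, its field names and the character names are hypothetical; the explanations are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CharacterEntry:
    """One character as used in one description (hypothetical structure)."""
    character: str
    states: List[str] = field(default_factory=list)
    uncoded: bool = False  # the proposed "uncoded" flag
    reason: str = ""       # encoded or free-text explanation


def uncoded_characters(description):
    """List the characters of a description that carry the uncoded flag."""
    return [entry.character for entry in description if entry.uncoded]


# Invented example description; character names are placeholders.
description = [
    CharacterEntry("leaf shape", states=["ovate"]),
    CharacterEntry("petal colour", uncoded=True,
                   reason="don't have proper specimens"),
    CharacterEntry("seed surface", uncoded=True, reason="To Do"),
]
print(uncoded_characters(description))
```

With this layout a consumer can tell whether a character was coded without interpreting the explanation, and the explanation can change freely without affecting that test.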


These are the most serious and fundamental problems I have with the current document. I would also add the following minor comments (some of these expand on the above comments as well).

The meaning of the DELTA "U" in attributes: It's stated in the document that this means "attempts to research the information were made but failed" and that "the state U in DELTA is also used for cases where data are present, but the author is unable to interpret them in current terminology." The DELTA documentation states simply that "A missing attribute is equivalent to an attribute with pseudo-value U" and that "If the state of the character is unknown, then the character is omitted, or the state value coded as U." I can't find anything in the DELTA documentation that gives a reason for the use of U or suggests that "attempts to research the information were made but failed" or anything similar. It would seem that DELTA treats nulls and U as the same, nothing more.

Gregor: Thanks for the clarification concerning the CSIRO documentation, I clearly was wrong. However, I believe that using the U in the way mentioned (differentiated from not-coding the character) describes common practice.
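Read literally, the two sentences quoted from the DELTA documentation make an omitted attribute and an explicit U indistinguishable to a consumer. The following Python sketch shows that reading. It is not a DELTA parser: representing an item as a dictionary from character number to value, and the function name `attribute_value`, are assumptions made for the illustration.

```python
UNKNOWN = "U"  # DELTA pseudo-value for an unknown character state


def attribute_value(item, character_number):
    """Return the coded value, or the pseudo-value U if the attribute is missing."""
    return item.get(character_number, UNKNOWN)


# Invented item: character 2 is omitted, character 3 is explicitly coded as U.
item = {1: "2", 3: "U"}

# Under the literal reading, both cases look the same to a consumer.
print(attribute_value(item, 2) == attribute_value(item, 3))
```

The common practice Gregor describes (U meaning "researched but not found", as distinct from not coding at all) would need a second marker, which this representation deliberately does not have.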

Remove all sections dealing with "performance", "user-interfaces" and referring to "applications." These deal with specific software developments and should not influence the SDD standard.

Great care should be taken when developing "business rules" for interpreting an SDD document. For example see the discussion starting with "One frequent situation is that characters are added to the terminology. Possible solutions to achieve a synchronization between descriptions and a separate terminology in this case are:". It may be important for SDD to document this situation but HOW it is dealt with is up to individual applications and has little to do with the data itself (which is the focus of SDD). The same problem occurs with the statement "The omission of character coding in a description should be used to express the "use default" state." This mixes data with processing. The lack of data is simply that - it implies nothing more and NO assumptions should be made about it. If the default state should be used then the author needs to state this.

Gregor: I see a need for business rules, and especially asking whether the data format will still function in a federated, disconnected environment, where definition of terminology and application of terminology are decoupled. However, you are correct that the examples you give are indeed misleading and I have to revise them.

What does "terminology" mean? It seems to be the character/character state list but this is unclear. It might be good to develop a glossary of terms so we all know what we're talking about.

Gregor: I agree about the glossary. Terminology defines the semantics (partly machine readable) in which terms the descriptions are expressed; see Brazil (SDD/TDWG 2002) minutes.

We also need to define "schema." My understanding is that a "schema" is a model of the structure of the data and not the data itself. If you add data you don't change the schema (assuming that this addition is permitted by the schema - it's like adding rows to a database table - the schema doesn't change, only the data). But at several points the document says "... the case that the terminology is changing ("schema evolution") is ..." or similar. This sounds like changes to the data change the schema. Is this the general view of schema or am I on the wrong track?

Gregor: We have two types of schema: one defined by SDD, the other by the biologist in the project. In contrast to other data areas, only part of descriptive data are expressed directly (and above you have been arguing against this) in the SDD-schema. Much of the relevant information in a description is expressed indirectly through reference to a terminology (Character list in DELTA). The character list itself defines a schema for the description. If in DELTA you delete a character in the character list, all descriptions still referring to this character are invalid, just like in a database structure where a row in a table has been deleted, but the data are still in the import file.
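The sense in which a character list acts as a schema for descriptions can be illustrated with a short check. This is a sketch under assumed names: the identifiers `C1` and `C2`, the dictionary layout and the function `dangling_references` are hypothetical and are not part of DELTA or SDD.

```python
def dangling_references(description, character_list):
    """Return the characters a description uses that the terminology lacks."""
    return sorted(set(description) - set(character_list))


# Invented terminology and description.
character_list = {"C1": "leaf shape", "C2": "petal colour"}
description = {"C1": "ovate", "C2": "red"}

print(dangling_references(description, character_list))  # nothing dangling yet

# Deleting a character changes the project-level "schema":
# the description itself is unchanged but no longer valid against it.
del character_list["C2"]
print(dangling_references(description, character_list))
```

In a federated setting the holder of the description may not control the terminology, so a check of this kind would have to run on the consumer's side.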


The section discussing "Data have not yet been entered", "Data cannot be entered" and "Data could have been entered but a deliberate decision was made not to enter them" seems to mix unrelated cases and is a mess. The options listed are (followed by my interpretation of their meaning):

1) "Use character default state": Data is coded and is not missing (see above).
2) "Unfinished work": Data exists but has yet to be captured.
3) "Not applicable": Data does not logically exist.
4) "Unknown": Data exists but has yet to be captured.
5) "Not interpretable": Data exists but has yet to be captured.
6) "Out of scope": Data exists but has yet to be captured.
7) "Do not need to score": Data is coded and is not missing.

The mix involves data that is being pointed to through a process (No. 1 and 7), data that is impossible to collect (No. 3) and data that can be collected but hasn't (for a variety of reasons) (No. 2, 4, 5 and 6). Keep it simple by recording if data is absent and if it is, the reason. Don't confuse it with application-specific short cuts ("Use default", "Inherit/Compile from Parent/Children") or personal decisions made by an author ("it's too hard to code", "this character is unimportant here", "appropriate specimens are unavailable", etc). These are distinct cases.

Gregor: In a later email Steve Shattuck proposes the term "Coding Status" with an enumerated list of:
- "Coded" (meaning the taxon has been coded for this character),
- "Not yet coded" (meaning it will be coded when I get around to it),
- "Not to be coded" (meaning it can be coded but I'm not going to) and
- "Can't be coded"
I find this list very useful and the terms are more concise than my Data entry status headings in the original document. In the revised version of the document I have added a column to the table in "Summary of all proposed missing data indicators" to use this as a proposal to generalize missing data indicators through an attribute.
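The proposed "Coding Status" list maps naturally onto an enumeration. The sketch below uses the four labels exactly as quoted above; the Python names (`CodingStatus`, `has_data`, `parse_status`) are assumptions made for the example, and nothing here is taken from the SDD schema.

```python
from enum import Enum


class CodingStatus(Enum):
    """The four values of the proposed 'Coding Status' list."""
    CODED = "Coded"
    NOT_YET_CODED = "Not yet coded"
    NOT_TO_BE_CODED = "Not to be coded"
    CANT_BE_CODED = "Can't be coded"


def has_data(status):
    """True only if the taxon has been coded for the character."""
    return status is CodingStatus.CODED


def parse_status(label):
    """Map a label to a status; raises ValueError for labels outside the list."""
    return CodingStatus(label)


print(parse_status("Not yet coded"))
print(has_data(parse_status("Coded")))
```

A closed enumeration gives processors fixed semantics, while the finer-grained reasons can be carried separately as text.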

There are a number of cases where attempts are made to develop lists of conditions or situations. For example, it is asked "Should a general 'cannot score' (for any kind of reason) be differentiated into" and then three situations are listed. Again, generalise this into "uncoded" with a text-based explanation ("observation method failed", "incomplete specimen" or any one of a thousand other reasons). All the information is captured and the system is rich enough to handle unforeseen situations. This approach also fits perfectly with the previous example.

There is a note that "a general method is planned (@but not yet formalized!@) in SDD" and that "SDD is considering introducing computed characters." We've been at this for at least 2 years now and there are things that are "planned" or are "being considered" for SDD that we don't know about?!?!? Or is there an overall planning document that I'm not aware of? WE are SDD, SDD can't be "considering" things in isolation.

Gregor: I can't remember who exactly was pushing it most (it was not me), but I do remember we had intensive discussions about computed characters both in Brazil and in Paris. Hopefully it can be found in the minutes... I agree that the chapter was too condensed to be intelligible, especially to people not present in Brazil and I have revised it. To me it is a minor point in the current situation, unless somebody pushes the issue of computed characters so that it can be included in the first version.

The entire discussion of inheritance and compilation (or whatever we call it) needs to be thought of in a fresh light. Again, the document confuses data with process. For example, it's stated that "it is desirable that the assumption [of inheritance/compilation] rather than the inherited data are recorded". The author of the SDD MUST present real data, not give a process to get to that data. Yes, the author can tell us how they collected the data ("inherited from direct parent", "compiled using Gouldes Statistic from all coded children", "by reference to the default state for this character") but they also must give us the data, not make us go look for it. The process of inheritance/compilation is too complex to leave to chance and trying to define exactly what it means is too error prone. I would suggest that inheritance/compilation is an application-based activity, not a fundamental aspect of the data. Again, give us the data and separately tell us how it was derived, don't make us derive it ourselves.

Gregor: I agree that inheritance and compilation need discussion, and this was said repeatedly in the document. I apologize for not being bold enough to remove all references from the special states document. I disagree on the notion that the author (who is that in a world-wide, federated collaboration?) must give us the data in a flat non-hierarchical way. As SDD is currently developing, the data are there: If the family description states that member taxa generally have 3 nuclei during germination of the pollen, this is all the information that is there. I believe it should NOT be coded into each species, even though it has never been observed in that species.

This exact same problem exists with "computed data." I don't see how or why an SDD document can contain the value "CouldNotCompute" because the "generator couldn't compute this". The "generator" is a piece of software, not a data representation (which is what SDD is). "CouldNotCompute" is an error message returned from an application and it should never end up in an SDD document. Yet again, if you want to explain how the data was generated, fine, do it, but don't make us calculate it, we don't know how.

Gregor: The underlying assumption is: If computed characters are desired (I am rather reluctant to deal with that now, but both in Brazil and Paris several members said that this is very fundamental to their work!), then we
- need a language for coding the expressions
- this should be very flexible and extensible
- many processors of SDD documents will not be able to compute this information
Therefore the idea is to recommend that applications supporting computed characters should write the results (as cached data) into the SDD document. All that is required from a non-computing processor would be to recognize computed information and which characters it depends upon. If this processor changes data that require recalculation, it would replace the computed value with the "could not compute" flag.
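The behaviour expected of a non-computing processor can be sketched in a few lines. This is only an illustration of the recommendation as described here: the dictionary layout, the character names and the function `set_value` are hypothetical, and the marker string simply reuses the "CouldNotCompute" value quoted above.

```python
COULD_NOT_COMPUTE = "CouldNotCompute"  # marker quoted in the discussion


def set_value(description, computed, character, value):
    """Store an edited value and flag computed characters that depend on it.

    `computed` maps each computed character to the characters it depends on.
    The processor cannot evaluate the expression itself; it only knows the
    dependencies. Chains of computed characters are not followed here.
    """
    description[character] = value
    for target, depends_on in computed.items():
        if character in depends_on:
            description[target] = COULD_NOT_COMPUTE


# Invented example: a ratio cached by an application that can compute it.
computed = {"length/width ratio": ["length", "width"]}
description = {"length": 10.0, "width": 4.0, "length/width ratio": 2.5}

# A non-computing processor edits one of the inputs.
set_value(description, computed, "width", 5.0)
print(description["length/width ratio"])
```

The stale cached value is replaced rather than left in place, so a later consumer cannot mistake it for a current result.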

The discussion about "Not supported as special states, but supported through modifiers" is good but, as above, describes descriptions and not characters. That's why inheritance/compilation is included here.

The statement "Special states express knowledge about why data for a given character are missing in a description and thus make a statement about the entire character" is interesting. It says that special states are about a character in a description. The point I would make is that they are primarily about the description, not the character. They say nothing about the character's use for other items, and only relate to this character in this taxon, and this is through the "description." So let's focus this document on the description and only discuss the character when it's appropriate. This is supported by the statement that "DELTA does not define the special states in the "character list" directive: they are implicitly present in each character." They are not in the character list because they have little to do with characters. They are in the DELTA attributes because they have everything to do with descriptions. I'm sorry to say that the next statement shows the danger of letting past developments influence current work (and I'm as guilty as any): "When a new character is created, DeltaAccess automatically creates the full set of special states." In BioLink we attach this information to the description: you do it when you're coding the data, not developing the character list. I don't know which way is better and as noted above, it probably doesn't matter as each application will translate the information into a local format anyway. What we DO need to do is make sure the SDD format is expressive, flexible and as application-neutral as possible (and ideally simple).

Gregor: I am confused about your discussion of characters versus description. To me a character exists both in a description where it is applied to an object, and in the terminology (DELTA "character list"). Give me some help how I can avoid the confusion you see. I never intended to write about character definition, as you seem to understand it.

However, the real point where we differ is covered in "Should missing data indicators be implicitly present in each character?" My presentation may be confusing, but my basic question is: should all special flags/states/whatever be always available, or should the designer of the terminology keep control which ones are available in the description? Should this decision be made on a per-character basis (as in DeltaAccess) or globally only?

Most of the discussion under "Relations between declarative character dependency and special states" seems to deal with developing applications rather than data representation and may well be outside SDD. Sure, document that these dependencies exist and use them to produce high-quality and internally consistent data, but do this at application-level, not SDD-level.

The same applies to "Responsibility for validation." SDD should be about representing data in a standardized way, not about enforcing taxonomic business rules or data quality standards. As SDD will be represented in XML then this XML document must be well-formed and pass a check with a DTD or XML-Schema checker. But SDD shouldn't be responsible for checking the content of individual data items, that's the job of authors and their applications.
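The minimal structural check described here can be shown with Python's standard library. The sketch covers only well-formedness: the standard library does not provide DTD or XML Schema validation, so that step would need an external validator. The element names in the example are invented and are not SDD elements.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Invented documents; the element names are placeholders only.
good = "<Description><Character ref='C1'/></Description>"
bad = "<Description><Character></Description>"  # unclosed element

print(is_well_formed(good), is_well_formed(bad))
```

Note that a check of this kind says nothing about whether the coded states are biologically sensible, which is the division of responsibility argued for above.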


So, to summarize:

SDD is about data and only data. These data include characters and descriptions and (optionally) how these descriptions were developed and their current status.

SDD is not about process. Process information can be included but it is optional and can be safely ignored without data loss. [I shouldn't have to process a "use default data" statement to access a complete description, the appropriate state(s) should be inserted into the description when the SDD document is prepared with a note on how it was derived.]

Gregor: It may be in the nature of information that processing has to occur for specific analysis purposes. I believe it is appropriate NOT to code information implied from the taxonomic hierarchy (but not observed directly) in a lower taxon. This information is only available in the higher taxon, and to know its state in a given lower taxon one has to follow certain rules about resolving the path to the implied information. However, processors not doing this are not reduced in their understanding of the validity of the data.
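What "resolving the path to the implied information" might involve can be sketched as a walk up the taxonomic hierarchy. The rule used here (the nearest ancestor that codes the character wins) is an assumption for illustration; the discussion does not define the actual rules. The taxon names, the dictionary layout and the function `resolve_state` are hypothetical; the pollen example follows the one given earlier in this exchange.

```python
def resolve_state(taxon, character, parents, descriptions):
    """Return (source_taxon, state) from the nearest taxon coding the character.

    Starts at `taxon` and follows `parents` upward. Returns None if no taxon
    on the path codes the character. Assumes the hierarchy has no cycles.
    """
    current = taxon
    while current is not None:
        state = descriptions.get(current, {}).get(character)
        if state is not None:
            return current, state
        current = parents.get(current)
    return None


# Invented hierarchy: the character is recorded only at family level.
parents = {"Species a": "Genus A", "Genus A": "Family X"}
descriptions = {
    "Family X": {"nuclei at pollen germination": 3},
    "Species a": {"petal colour": "red"},
}

print(resolve_state("Species a", "nuclei at pollen germination",
                    parents, descriptions))
```

Returning the source taxon together with the state keeps the distinction between observed and implied information visible to whoever uses the result.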

SDD is not about applications. Discussions concerning user-interfaces or processing methods should not be included. [How an application manages and represents these data is completely independent of how SDD represents the data.]

Gregor: If we had time and resources to fully define our requirements in an abstract way and have examples for all requirements: Yes, yes, yes! However, we have not, so applications are used to inform us which requirements are considered to have priority.


Finally, I want to reconfirm that Gregor has done a fantastic job with getting us this far and I support his efforts. My comments are more about a different perspective rather than any serious flaw in logic. Most of what is in the document is well presented and relevant to managing taxonomic descriptions, I'm just not sure it is directly related to the goals of SDD.

Gregor: Thanks for this, and also many thanks for the pains you took in responding in detail. I do appreciate the criticism a lot, regardless of whether we agree or not!

Steve Shattuck; 18. March 2003


Separately, from a later message

Posted to the SDD discussion list by Steve Shattuck on Wed, 26 Mar 2003 08:24:17

[...] A specific example - "Use default state." We are looking hard at this in BioLink and will probably stay away from it, using different functionality to achieve the same thing (rapid data entry and global data changes). We obviously support this when importing from DELTA (and possibly from SDD) but it won't be core functionality for us and we won't export it. To enshrine this in SDD simply because DELTA had it (or some other application currently supports it) may not be the best approach (it may be, but I'm not willing to make that assumption unless we have to).

The same seems to be true for describing uncertainty. Kevin suggests that there are only three "special states":

I only see two here:

PLUS a reason:

To overcome the "machine processing" problem you'll need to enumerate this list and that's fine. But don't build a structure that makes it hard to change that enumeration at any time or as any set of users sees fit. The problem with "special states" is that it overloads "states", which have nothing to do with the uncertainty of the data.

I think this is a priority issue. It is FAR more important to know if this character has been scored for this taxon than to know the reason why it was or wasn't. If you want to then tell me the reason, great. But keep it simple and flexible. It also separates data from metadata along slightly different lines. I consider the state to be data, the reason for coding/not coding to be metadata and want to separate these as much as possible.

Kevin's discussion of attaching uncertainty to states breaks Gregor's SDD model more fundamentally than what I had proposed. As Gregor suggests, this would result in fundamental changes to the current SDD model. The use of "modifiers" might work, but again, they are fairly tightly associated with natural language representation (as I understand it) and may not generalize well to fulfill Kevin's needs. [...]

Steve Shattuck; 26. March 2003



First published 2003-03-20, last update: 2003-08-28.