SDD document: Choice of appropriate term for inferring information on object or class trees
TDWG working group: Structure of Descriptive Data (SDD)
Introduction
An important consideration in descriptive data is that data in the object or class tree (specimen, subspecies, species, genus) are related for evolutionary as well as for logical reasons (a genus is defined as a set of species). The process of combining object descriptions to produce a parent description can be manual, as is common in genus and family descriptions, but it can also be performed by a machine (computer program). The latter may be a dynamic process (as implemented in BioLink, as far as I know) or may generate a static copy of the data (as is done in the CSIRO DELTA programs and in DeltaAccess).
In the discussion of some decisions about the proposed general structure of character and state information we need to refer to this process. Which term shall we use?
Current usage:
- Kevin Thiele uses "data collation" and "data inheritance".
- BioLink uses "Compile From Children" / "Inherit from Parent" (see below, in earlier discussions it was called "inferred from below" / "inferred from above"; S. Shattuck, pers. comm. 2002).
- DeltaAccess uses "Summarize"
- CSIRO DELTA uses @@sorry, could not find the correct term in the manual, but there is a functionality present@@
- Microsoft Excel uses "consolidate data" to create a spreadsheet that combines information from multiple spreadsheets.
In the SDD minutes and other documents so far I have largely adopted the term "collation" with a sprinkling of my old "summarize". Bob Morris thinks that "collation" is inappropriate because it means "put in order". After looking the term up in a dictionary, I tend to agree. Bob proposes "aggregation".
Available terms
Quotes from Collins English Dictionary:
- Aggregate: adj. 1. formed of separate units collected into a whole; collective; corporate. [...] noun. 3. a sum or assemblage of many separate units; sum total. vb. 7. to combine or be combined into a body, etc.
- Collate: 1. to examine and compare (texts, statements, etc.) in order to note points of agreement and disagreement.
[C16: from Latin collatus brought together (past participle of conferre to gather)]
- Consolidate: 1. to form or cause to form into a solid mass or whole; unite or be united. 2. to make or become stronger or more stable.
- Infer: 1. to conclude (a state of affairs, supposition, etc.) by reasoning from evidence; deduce. 2. to have or lead to as a necessary or logical consequence; indicate. 3. to hint or imply.
- Summarize: to make or be a summary of; express concisely. Summary: a brief account giving the main points of something
Quotes from Oxford English Dictionary (OED Online):
- Consilience: The fact of jumping together or agreeing; coincidence, concurrence; said of the accordance of two or more inductions drawn from different groups of phenomena.
- Collation: 2. The action of bringing together and comparing; comparison.
Comments
Gregor Hagedorn:
- Aggregate: term seems to be appropriate, used for similar processes in databases (aggregation functions).
- Collate can be misunderstood and bears an element of active interpretation, choice and comparison. If the process of inferring data from child elements is highly automated, it would be inappropriate. In current computer use, "collation" on the printer menu only refers to the order. In databases, "collation sequence" refers to the method by which characters are ordered, especially non-ASCII characters like ü (order after u or after z? If after u, before or after úùû?). Also, the term "collate" is probably unknown to most non-native English speakers.
- Consolidate seems to be suitable and is used in programs like Excel. It has not yet been used in biological or descriptive terminology, though.
- Infer: Very general term, easily misunderstood. An inference process can refer to any analysis performed on descriptive data. However, this seems to be the only term that is able to capture the fact that, parallel to aggregating/collating information from child objects, it is also possible to infer information from parent objects. If a character is scored on the family level, it can be assumed that it applies to a member of the family. Note that whereas inference from below is a necessary relationship, inference from the parent is not necessarily possible. A family characteristic may be missing in rare objects classified in that family (i. e. the family description is somewhat "fuzzy").
- Summarize implies an abbreviation process. Appropriate insofar as consolidating/collating/aggregating 10 specimen descriptions into a single species description is a summarization of information. Bears an active element of being selective, however, like "collation".
Bryan Heidorn:
Martin Grube:
- Consilience might serve as a plurilingual word.
Bob Morris:
- Consilience: I don't think this conveys that several concepts are being collated (which I don't like because it often implies a natural order). Rather it would seem to imply that several things mean or imply the same thing.
- Second meaning for 'collation' in the OED might in fact meet our needs. So maybe I should withdraw my dislike.
Kevin Thiele:
- Consider this in the real world. If I have a number of descriptions of species, and from them I want to create a genus description, then I would collate the genus description from the species descriptions. Compile and aggregate would also work, but I think collate works better.
Guillaume Sauvenay:
- Couldn't we call that induction? Here is a definition: "Induction is the process of inference employed in 'inductive logic'. A mode of reasoning that starts with specific facts and concludes general hypotheses or theories". Here, species descriptions are observed facts and the genus description is inducted. And this term is less general than infer.
Guillaume Rousse:
- For me inference, induction, deduction sound like specific methodologies, whereas collate, aggregate and compile correspond to the generic process. In other words, you can collate/aggregate/compile either by inference, by induction, or by deduction.
- BTW, if the collate/aggregate/compile process can be formalized and at least expressed as a recommendation in the standard, it becomes useless to store its result as it can be reproduced.
Ben Moretti:
- I would call it "abstract". Rather like one creates an abstract data type in programming, or one can also create objects that are abstract representations, or models, of real-world things.
Gregor Hagedorn:
- In response to Guillaume: I feel uneasy about calling the inheritance or deduction process an "aggregation" or "compilation". That may be a language point. The other way round (induction, up in the tree) sounds ok to me. Also I would think:
- induction = parent properties inferred from multiple children
- deduction = child properties inferred from (direct or remote) parents
- What then would the definition of "inference" as a third specific methodology be?
- I think we need to explore this up/down topic on the taxonomic tree. Some information you want to inherit (i.e. child inherits from parent), other characters you don't because the character is too variable. You also want to contradict, but that is probably automatic if you enter data. However, if you enter data for a character further down the tree, you probably want the inheritance process from the parent to stop: the new data resets the collection of states present in the entire character.
- In induction: You may have data about 100 specimens. 99 have states 1 or 2, 1 has state 3. You may want to remove state 3 from the aggregation process. Also: you may have a family where states 1 and 2 can be inferred by aggregation/induction/collation from below. You know, however, that this family also has members with state 4 (which are otherwise not in your SDD). Should the aggregation break, or should it be possible to add state 4 and keep 1 and 2 by aggregation/induction?
- Perhaps the process of making informed scientific decisions should be called the collation process. The algorithmic methods could be called:
- aggregation = induction of knowledge
- inheritance = deduction of knowledge
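The two questions raised above for induction (dropping a rare state, and adding a state known from outside the data set) can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the function name aggregate_states and the parameters min_fraction and manual_additions are hypothetical, and a simple frequency threshold is just one possible rule, not something agreed on in this discussion.

```python
from collections import Counter


def aggregate_states(child_states, min_fraction=0.0, manual_additions=frozenset()):
    """Induce a parent's set of states from its children's sets of states.

    child_states:     one set of states per child description.
    min_fraction:     hypothetical threshold; states scored in a smaller
                      fraction of the children are left out.
    manual_additions: states known for the parent that no child in the
                      data set shows (the "state 4" case).
    """
    children = [set(states) for states in child_states]
    if not children:
        return set(manual_additions)
    # Count in how many child descriptions each state occurs.
    counts = Counter(state for states in children for state in states)
    kept = {s for s, n in counts.items() if n / len(children) >= min_fraction}
    # Manual additions are kept alongside the aggregated states.
    return kept | set(manual_additions)


# 100 specimens: 99 have state 1 or 2, one has state 3.
specimens = [{1}] * 50 + [{2}] * 49 + [{3}]

plain = aggregate_states(specimens)
filtered = aggregate_states(specimens, min_fraction=0.05)
extended = aggregate_states(specimens, min_fraction=0.05, manual_additions={4})
print(sorted(plain), sorted(filtered), sorted(extended))
```

In this sketch the aggregation does not "break" when state 4 is added by hand; the manual states are stored separately from the aggregated ones, so states 1 and 2 can still be regenerated from below.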
Peter Rauch:
- If "inducing" (not "inducting") a genus concept ("description") is unacceptable, then how about "synthesizing" a genus description ("concept") from the data garnered from the to-be-included species definitions (concepts)? (And, do other "related" genera/species influence one's thinking in the formulation of this new genus's concept/description too?)
Steve Shattuck:
- In BioLink, we have the following options/commands:
Compile, with the options to:
- Compile Once
- Always Compile From Children
- Refresh Compile
Inherit, with the options to:
- Inherit Once
- Always Inherit From Parent
- Refresh Inherit
- We do this independently for each cell or attribute (that is, for each character for each taxon/item, or for each character by taxon/item intersection). This way you can build a full description by compiling some characters from children while inheriting other characters from the parent.
- We allow you to either regenerate these data automatically (by selecting the "Always Compile/Inherit from Children/Parent" command) or to create a static copy of the data so you can fine-tune it by manually inserting comments and clarifications (with the danger that changes to children/parents will not be reflected).
- Our hope is that by including the phrases "from children" and "from parent" in the commands we can make the actions of these commands clear without having to resort to looking these terms up in the help files or, worse, a dictionary.
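The per-cell choice between a regenerated value and a static copy, as described above, can be sketched as follows. This is not BioLink code: the classes Mode, Cell and Taxon and the functions read and compile_once are hypothetical names chosen for illustration, and states are modelled as plain sets.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class Mode(Enum):
    STATIC = auto()          # stored copy, may be edited by hand
    ALWAYS_COMPILE = auto()  # regenerated from the children on every read
    ALWAYS_INHERIT = auto()  # regenerated from the parent on every read


@dataclass
class Cell:
    mode: Mode = Mode.STATIC
    states: set = field(default_factory=set)


@dataclass(eq=False)
class Taxon:
    name: str
    parent: Optional["Taxon"] = field(default=None, repr=False)
    children: list = field(default_factory=list)
    cells: dict = field(default_factory=dict)  # character name -> Cell

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child


def read(taxon, character):
    """Return the states of one character-by-taxon cell."""
    cell = taxon.cells.get(character, Cell())
    if cell.mode is Mode.ALWAYS_COMPILE:
        return set().union(*(read(c, character) for c in taxon.children))
    if cell.mode is Mode.ALWAYS_INHERIT and taxon.parent is not None:
        return read(taxon.parent, character)
    return set(cell.states)


def compile_once(taxon, character):
    """Store a static copy compiled from the children (call again to refresh)."""
    states = set().union(*(read(c, character) for c in taxon.children))
    taxon.cells[character] = Cell(Mode.STATIC, states)


genus = Taxon("genus")
sp1 = genus.add_child(Taxon("species 1"))
sp2 = genus.add_child(Taxon("species 2"))

# One character compiled from the children, another inherited from the parent.
sp1.cells["leaf shape"] = Cell(Mode.STATIC, {"ovate"})
sp2.cells["leaf shape"] = Cell(Mode.STATIC, {"lanceolate"})
genus.cells["leaf shape"] = Cell(Mode.ALWAYS_COMPILE)
genus.cells["habit"] = Cell(Mode.STATIC, {"tree"})
sp1.cells["habit"] = Cell(Mode.ALWAYS_INHERIT)

# A static copy does not follow later changes to the children.
sp1.cells["margin"] = Cell(Mode.STATIC, {"entire"})
compile_once(genus, "margin")
sp2.cells["margin"] = Cell(Mode.STATIC, {"serrate"})
stale = read(genus, "margin")
```

Here stale still holds only the state that was present when the copy was made, which is the danger mentioned above; calling compile_once again plays the role of a refresh.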
Discussion in Lisbon (TDWG 2003)
The issue was taken up at the meeting in Lisbon. The following proposals were added to the list above:
Martin Pullan:
- use abstraction for turning species description into genus descriptions etc.
Bob Morris:
- use generalization instead of abstraction.
Nico Franz:
- Synthesis
- ascendence (up), descendence (down), aggregation (same level) could be a 3-tuple of terms
Conclusion
A vote on which terms are preferred was taken in Lisbon and resulted in the following terminology:
Inference is the preferred head term to refer to the following processes:
- aggregation = induction of knowledge from several specimen descriptions to a species or subspecies (= class description). This is considered as producing a new description on the same level (since specimens are identified as taxa).
- generalization = induction of knowledge from a lower level class description (e. g. species) to a higher level class description (e. g. genus). The process is thought to be an active process that changes the data and has to resolve contradictions.
- inheritance = deduction of knowledge from higher level to lower level descriptions (e. g. from family description to genus, or from species description to specimen properties)
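As an illustration of the inheritance direction only, here is a minimal sketch in which a character is looked up at the nearest level that scores it, so that data entered lower in the tree stops inheritance from above. The function inherit, the parent_of and scored dictionaries, and all taxon and character names are hypothetical; the nearest-ancestor rule is an assumption, not part of the vote.

```python
def inherit(taxon, character, parent_of, scored):
    """Deduce the states of a character from the nearest scored level.

    Walks from the taxon itself up through its parents and returns
    (states, source); source is None if no level scores the character.
    """
    node = taxon
    while node is not None:
        states = scored.get((node, character))
        if states is not None:
            return set(states), node
        node = parent_of.get(node)
    return set(), None


# Hypothetical tree: specimen -> species -> genus -> family.
parent_of = {
    "specimen 17": "species A",
    "species A": "genus X",
    "genus X": "family Y",
}
scored = {
    ("family Y", "leaves"): {"present"},
    ("species A", "leaves"): {"reduced"},  # local data overrides the family
}

from_species = inherit("specimen 17", "leaves", parent_of, scored)
from_family = inherit("genus X", "leaves", parent_of, scored)
unscored = inherit("genus X", "flower colour", parent_of, scored)
```

The source returned with the states shows where a deduced value comes from, which may help when a family-level statement turns out not to hold for one of its members.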
As a minority vote Gregor Hagedorn maintained that inferring information from specimen to species is not structurally different from inferring information from species to genus. Consequently the same term should be used. A higher taxon description results from the knowledge about which lower classes are contained; it is not independent information, nor is it derived from external principles. Although during the identification process the higher taxon description may operationally be treated as an independent logical expression to determine whether a lower taxon is a member of a higher taxon (e. g. in a key in a Flora), it is in fact not independent. If phylogenetic analysis shows that a genus so far only known to contain leaf-bearing plants also contains a species in which leaves have been reduced, the conclusion is that the genus description must be changed, not that the species belongs to a different taxon. This is fundamentally different from ontological statements or definitions for languages, where if an object does not fit the agreed definition of, e. g., a chair, then it simply is not a chair.
Request for discussion
Please send your criticism or suggestions to the SDD mailing list or to any of the authors.
Gregor Hagedorn; Vers. 3; 29. October 2003
First published 2003-03-11, last update: 2003-10-29.
