SDD proposal: Categorical data types

TDWG working group: Structure of Descriptive Data (SDD)

CHECK whether integrated:

Categorical (= nominal or ordinal)
nominal, ordinal: linear/branched (tree)/ digraph (= directed graph, incl. cyclical)
states: red, yellow, blue
Nominal
Shape-type
leaf-shape-type
Ordinal:
Decided on additional state:
nominal / ordinal-discrete / ordinal-interval

nominal = finite set of unordered categories (equivalent to a Java "Set")

ordinal-discrete = ordered categories, no intermediate states possible (equivalent to an "Enumeration" in programming languages). In contrast to cardinal states (e.g. counts), ordered categories have no defined distances between them.
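
The two definitions above can be illustrated with a minimal Python sketch. The state names for the nominal set are taken from the notes above; the "Stage" enumeration and its members are hypothetical and only serve to show an ordering that has no distances.

```python
from enum import Enum

# Nominal: a finite set of unordered categories.
PETAL_COLORS = frozenset({"red", "yellow", "blue"})


class Stage(Enum):
    """Ordinal-discrete (hypothetical example): ordered, but without distances."""

    BUD = "bud"
    FLOWER = "flower"
    FRUIT = "fruit"

    def __lt__(self, other):
        # Order follows the definition order of the members, not a number,
        # so no distance between two states is implied.
        if not isinstance(other, Stage):
            return NotImplemented
        order = list(Stage)
        return order.index(self) < order.index(other)


ordered = sorted([Stage.FRUIT, Stage.BUD, Stage.FLOWER])
```

Membership is the only meaningful question for the nominal set, whereas the enumeration supports comparison and sorting. Subtracting two states is deliberately undefined and raises a TypeError.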

ordinal-interval = the underlying values can take many intermediate stages, but for the purpose of the description the value space has been classified into (more or less arbitrarily) distinct classes. Examples are color ("red, orange, yellow", where an intermediate color between orange and yellow can be observed) or "glabrous, with few hairs, hairy, very hairy". Such observations would best be expressed as measurements, which, however, is often not practical. Ordinal-interval data should be analyzed as ordered, without assuming equidistant intervals.

In these cases, an observation "orange to yellow" can signify either:
the value is relatively constant and lies at a point intermediate between orange and yellow,
or the values are variable and range from orange to yellow.

Mathematical expression: maps onto a partitioned continuous interval
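
A rough sketch of such a partitioned interval follows. The 0..1 scale and the class boundaries are invented for illustration only; real color terms do not have agreed numeric limits. The sketch also shows why "orange to yellow" is ambiguous: a nearly constant value close to the boundary and a widely varying value yield the same categorical observation.

```python
from bisect import bisect_right

# Hypothetical partition of a continuous 0..1 scale into three classes.
BOUNDS = [0.33, 0.66]                  # assumed boundaries between classes
LABELS = ["red", "orange", "yellow"]   # one label per sub-interval


def classify(value):
    """Map a single continuous value onto its class."""
    return LABELS[bisect_right(BOUNDS, value)]


def classes_covered(low, high):
    """Return all classes touched by the observed range low..high."""
    first = bisect_right(BOUNDS, low)
    last = bisect_right(BOUNDS, high)
    return LABELS[first:last + 1]


near_boundary = classes_covered(0.64, 0.68)   # nearly constant value
wide_range = classes_covered(0.40, 0.90)      # strongly variable value
```

Both calls return the classes orange and yellow, so the categorical record alone cannot distinguish the two situations.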

-------------------------

Introduction

Categorical data is data with values from a discrete and finite set. Categorical data can either be based on a naturally discrete feature, or can be derived from a categorization ("partition") of a continuous numerical variable. The categories of a categorical character should ideally have no elements in common (i. e. be disjoint).

Occasionally the term "qualitative data" is used to represent categorical data. However, the term is also used for verbal pieces of data (interviews, open-ended questionnaire items) and other less structured situations. The term "categorical data" should therefore be preferred.

Categorical data are closely related to list data (e. g. a list of host plants, pollinators, etc.), except that in list data the value range is very large and therefore cannot be defined in advance (unlimited). Ultimately, such data should be retrieved (as "look-up" or "pick list" values) from data providers that specialize in the management of these data.

The majority of descriptive data today belongs to categorical data. Some reasons for this are:

DELTA defines two types of categorical data: unordered multistate (UM) and ordered multistate (OM). Additional types such as cyclical have been proposed for DELTA-2.

Proposal: Categorical data types

1. Basic types (ordinality)

Three distinct classes of categorical characters can be distinguished: nominal, ordinal-discrete, and ordinal-interval.

[Illustration: leaf venation terms]

Clarification: The "ordering" of an ordered character applies to the biological semantics of ordered states, not to their appearance in object descriptions. In an object description a character with two states may appear as "1 or rarely 2" or as "2 or rarely 1", regardless of the ordering defined in the terminology. See "Scoring sequence of states in descriptions" for further information.

2. Subtypes of the "ordinal-discrete" categorical data type

As mentioned above, the ordinal-interval is in practice restricted to linear ordering. For the discrete ordered type, linearly ordered data are the most frequent type as well, and the only one supported in DELTA. However, other orderings (branched tree or cyclical graphs) are useful in many cases. A data type CYCLICAL was proposed in the DELTA-2 proposal, and the NEXUS phylogenetic data standard supports branched trees like

                 3
                 |
               1-2-4

A general proposal for describing states arranged as the most useful kinds of directed graphs (trees, cycles, and, more generally, directed acyclic graphs) is available: Character state graphs (Bob Morris).
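
As a minimal sketch (not the representation used in the Character state graphs proposal), the branched ordering 1-2-4 with state 3 attached to state 2 can be held as an adjacency map. The function names are hypothetical. Counting edges on the shortest path gives the number of steps between two states.

```python
from collections import deque

# Edges of the branched ordering drawn above: 1-2, 2-3, 2-4.
EDGES = [("1", "2"), ("2", "3"), ("2", "4")]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of state pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def steps_between(adjacency, start, goal):
    """Breadth-first search; returns the edge count, or None if unconnected."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        for nxt in adjacency.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


ADJACENCY = build_adjacency(EDGES)
```

In this tree, states 1 and 2 are one step apart, while any two of the outer states 1, 3 and 4 are two steps apart.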

In Brazil a consensus was reached to represent every kind of ordering with a graph. Note: The import routines in an application need to provide for cycle detection. An advisory Boolean attribute "acyclic" could be added to the standard, but this may have been set erroneously, so any import process needs to do cycle detection anyway.
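
The cycle detection mentioned in the note could look roughly like the following depth-first search. This is a sketch under assumptions: the graph is given as a plain map from each state to its successor states, and "import_state_graph" and "declared_acyclic" are hypothetical names standing in for an import routine and the advisory attribute.

```python
def has_cycle(successors):
    """Return True if the directed graph contains at least one cycle."""
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    status = {}

    def visit(node):
        status[node] = ACTIVE
        for nxt in successors.get(node, ()):
            state = status.get(nxt, UNSEEN)
            if state == ACTIVE:          # back edge: nxt is on the current path
                return True
            if state == UNSEEN and visit(nxt):
                return True
        status[node] = DONE
        return False

    return any(
        status.get(node, UNSEEN) == UNSEEN and visit(node)
        for node in list(successors)
    )


def import_state_graph(successors, declared_acyclic):
    """Verify the advisory flag instead of trusting it; return the real value."""
    actually_acyclic = not has_cycle(successors)
    if declared_acyclic and not actually_acyclic:
        raise ValueError("graph is declared acyclic but contains a cycle")
    return actually_acyclic


linear = {"a": ["b"], "b": ["c"]}
cyclical = {"spring": ["summer"], "summer": ["autumn"],
            "autumn": ["winter"], "winter": ["spring"]}
```

The recursive form is adequate for the small graphs expected in character state definitions; very deep graphs would need an iterative traversal.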

Current practice in DELTA-like applications (CSIRO DELTA programs, DeltaAccess, Pankey, etc.) is to treat non-linearly ordered ordinal characters as unordered nominal characters (DELTA "UM"). In my experience, the most frequent case is probably not that all state relations are known and branched, but that a plausible or even proven ordering hypothesis exists for some of the states and not for others.

[Illustration: leaf margin terms]

Example: leaf margin = "entire", "crenate", "dentate", "serrate", and "lobed" are all related. Only some ordering is present, e. g.

          entire -- lobed
           |  \     /
           |  crenate
           |  /
         dentate --- serrate

This mixture of ordered and unordered assumptions could be expressed in a graph. In this example the effort would probably not be worth it, but this remains the decision of the designer of the terminology.

3. State ordering with variable distances

NEXUS allows the definition of state transition matrices to define transition probabilities. This is desirable, for example, when inferring a phylogenetic hypothesis from DNA sequence data. Any of the four nucleotide states (A, C, G, T) can mutate directly to any other. However, the mutation frequencies are not identical (transitions versus transversions), a fact used in distance methods to calculate a Kimura distance. In parsimony analysis (e.g. in PAUP) a similar effect is produced by directly entering all transition likelihoods (as "steps") between the nucleotide bases.
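
To make the idea of a step matrix concrete, the sketch below builds one for the four nucleotides. The cost values (1 for a transition, 2 for a transversion) are assumptions chosen for illustration, not values prescribed by NEXUS or PAUP, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = "ACGT"


def substitution_kind(a, b):
    """Classify a change between two bases."""
    if a == b:
        return "identical"
    pair = {a, b}
    # A transition stays within purines or within pyrimidines.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def step_matrix(transition_cost=1, transversion_cost=2):
    """Build a nested dict: matrix[from_base][to_base] -> number of steps."""
    cost = {
        "identical": 0,
        "transition": transition_cost,
        "transversion": transversion_cost,
    }
    return {a: {b: cost[substitution_kind(a, b)] for b in BASES} for a in BASES}


MATRIX = step_matrix()
```

With these assumed costs a change between A and G counts as one step and a change between A and C as two, which is one way of expressing state ordering with variable distances.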

Note that any finite graph, with or without weighting of edges, can be represented either as a matrix or in the proposed XML graph representation (Morris, pers. comm., 2003). The NEXUS matrix representation and an SDD XML graph should therefore be equivalent and able to document the same data.
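
The equivalence can be sketched as a round trip between a weighted edge list and a matrix. The XML serialization itself is not modeled here; a plain dictionary of edges stands in for the graph, and all names and weights are hypothetical. An unweighted graph would simply use the weight 1 on every edge.

```python
def edges_to_matrix(states, edges):
    """Convert {(source, target): weight} into a square matrix (None = no edge)."""
    index = {state: i for i, state in enumerate(states)}
    matrix = [[None] * len(states) for _ in states]
    for (source, target), weight in edges.items():
        matrix[index[source]][index[target]] = weight
    return matrix


def matrix_to_edges(states, matrix):
    """Convert the matrix back into {(source, target): weight}."""
    return {
        (states[i], states[j]): weight
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight is not None
    }


STATES = ["A", "C", "G"]
EDGES = {("A", "G"): 1, ("G", "A"): 1, ("A", "C"): 2, ("C", "A"): 2}

as_matrix = edges_to_matrix(STATES, EDGES)
round_trip = matrix_to_edges(STATES, as_matrix)
```

Because the conversion loses nothing in either direction, the two forms can document the same data, provided that a missing edge and a zero weight are kept distinct.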

An option to define weights on the graph edges should be added to the SDD proposal.

Request for discussion

Please send your criticism or suggestions to the SDD mailing list or to the author.

Gregor Hagedorn; Vers. 1; 6. Feb. 2003




First published 2002-11-30, last update: 2003-03-11.
