SDD proposal: Numeric data types

TDWG working group: Structure of Descriptive Data (SDD)

Content

  Introduction
  Multiple statistical "measures"
  Numeric precision
  Xml representation of numeric values
  Measurement units

Introduction

A fundamental categorization of observations distinguishes between nominal, ordinal, cardinal, and interval measurement data. The first two are categorical data and are discussed in "Categorical data types". The latter two numerical types are discussed here.

Cardinal data are used where both an order and an equal distance between the possible values are present (ordinal data define only a ranking of values, but no distance). Countable features are the most frequent use of this data type. It corresponds to DELTA type "IN", integer numeric.

The interval scale data type is used for continuous variables where any number of intermediate values are potentially present in the data. Note that occasionally it is important to define the data type based on data representation rather than the nature of the underlying process. For example, length measurements are potentially always on the interval scale, but if represented through coarse classification ("plant < 2 m, 2-5 m, > 5 m") they are ordinal data, and if rounded to the nearest meter they may be more appropriately handled as cardinal data.

In numeric as well as in categorical data, data can be represented in 3 forms:

Data type:    nominal      ordinal       cardinal    interval

Uncompiled (original measurement list in order of observations, = "raw data")
              red          third         3           1.6
              green        first         1           2.4
              red          third         3           1.6
              blue         second        2           5.1
              green        first         1           1.2

Compiled, with frequencies (ordered list)
              2 x green    2 x first     2 x 1       1 x 2.4
              1 x blue     1 x second    1 x 2       1 x 5.1
              2 x red      2 x third     2 x 3       2 x 1.6
                                                     1 x 1.2

Compiled
              green        first         1           2.4
              blue         second        2           5.1
              red          third         3           1.6
                                                     1.2

Footnote 1: the ordering of nominal data is operational (as defined in terminology), not inherent in the data type.

Footnote 2: frequencies are rarely useful in interval scale. If frequencies seem to be useful, one should become suspicious about the nature of the data: for example, 23 x 1.691, 12 x 2.382, 4 x 4.073 probably are the result of a cardinal process of counts 1, 2, and 3 with a fractional scaling factor!

Footnote X: Maintaining the order of observations is relevant only for internal purposes, esp. proofreading against lab books, etc. More relevant is the fact that observations may be made together; this is discussed separately.
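The three forms can be derived from each other in one direction only: compiling loses the order of observations, and dropping frequencies loses the counts. The following is a minimal sketch using the values from the table above; the function names are hypothetical and not part of SDD.

```python
from collections import Counter

# Raw (uncompiled) observations, in order of observation.
raw_nominal = ["red", "green", "red", "blue", "green"]
raw_cardinal = [3, 1, 3, 2, 1]
raw_interval = [1.6, 2.4, 1.6, 5.1, 1.2]


def compile_with_frequencies(observations):
    """Map each distinct value to its frequency; observation order is lost."""
    return Counter(observations)


def compile_distinct(observations):
    """Return the distinct values, sorted; frequencies are lost as well."""
    return sorted(set(observations))


nominal_freq = compile_with_frequencies(raw_nominal)
cardinal_freq = compile_with_frequencies(raw_cardinal)
interval_values = compile_distinct(raw_interval)
```

Sorting is applied here only to numeric values. For nominal data the order would have to come from the terminology, as footnote 1 explains.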

I do not know whether a commonly used term exists for the three presentation levels shown in the table above (please tell me if you know one!). For the purpose of this discussion it shall be called "compilation status".

Repeatability, subobjects: single, countable (2 eyes, 10 feet), uncountable (leaves on tree). There can be multiple concurrent and competing object hierarchies: morphological partition, anatomical or functional partition (vascular bundles, skin), method-dependent partition (margin of culture on Petri dish after 7 days at 20 °C). However, these always have properties about countability, extent, delimitation (sharp or fuzzy delimitation?)... data type (e.g. a length , e. g. where the selected a measurement

Cardinal and interval data correspond to the technical data types signed integer and real numeric, respectively.

The distinction between cardinal and interval data is interesting for certain types of analysis, but may be ignored in implementations, since the mean, standard deviation, etc. of integer variables are real numeric values. Only in the case of singular original data (the count of bristles on a single individual) would the data type actually be integer.

Multiple statistical "measures"



In Brazil, Paris, and Lisbon the correct name for univariate statistics was considered. @@ADD links@@ Frequent terms are measure, value, statistical parameter, statistic. One problem is that statistical terminology differentiates between summary information about the entire population itself (= "parameter") and summary information about a sample from the population (= "parameter estimate" or "statistic"). The neutral term covering both cases is "measure" (see J. H. Zar 1984. Biostatistical Analysis, 2nd ed., Prentice Hall, NJ, USA: p. 16).

In SDD we want to be able to express both sample statistics, like a variance with df=n-1, and population parameters, like a variance with df=n. The term "measure" is, however, so wide that it is not restricted to univariate descriptive statistics of numerical variables, but also includes single categorical observations. The term "statistical measure" seems adequate, even though it is long.
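The difference between the two variances is only the divisor. As a small illustration (the data values are made up for this sketch), Python's standard `statistics` module offers both forms:

```python
import statistics

# Five hypothetical counts; mean = 2, sum of squared deviations = 4.
counts = [3, 1, 3, 2, 1]

# Sample statistic: divides by n - 1 (df = n - 1).
sample_variance = statistics.variance(counts)

# Population parameter: divides by n (df = n).
population_variance = statistics.pvariance(counts)
```

Both are legitimate "statistical measures" of the same data, which is why the measure definition, and not only the number, has to be recorded.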



Numerical values in descriptions have both a data type (e. g. integer or real numeric) and a property defining their "statistical semantics". Examples are: "mean", "sample size", "minimum", "standard deviation", "lower border of 5%-confidence interval". This definition of a numeric state within a character is very similar to categorical states. However, the generic statistical semantics need to be expressed separately, since these "states" are constantly reused and have identical semantics in all numeric characters.

A generally accepted term for these "numeric states" is not known to the members of the SDD group. Proposals for terms were:

"Encyclopedia Britannica" uses the term "statistical measures" for "mean", "standard deviation", etc. The measure concept does not fully apply to single values or sample size, but the similarity was considered sufficient.

From "New Oxford Dictionary of English":

parameter noun technical: a numerical or other measurable factor forming one of a set that defines a system or sets the conditions of its operation. Mathematics: a quantity whose value is selected for the particular circumstances and in relation to which other variable quantities may be expressed. Statistics: a numerical characteristic of a population, as distinct from a statistic of a sample. (in general use) a limit or boundary which defines the scope of a particular process or activity: the parameters within which the media work.

statistic noun a fact or piece of data obtained from a study of a large quantity of numerical data

measure verb [with obj.] 1 ascertain the size, amount, or degree of (something) by using an instrument or device marked in standard units or by comparing it with an object of known size

Request: please notify us if you can make an argument for a better term!

Some measures (ranges, confidence intervals, or percentiles) have paired values. These are treated as separate measures in the SDD model and it is up to the application to handle the pairwise-relationship where necessary.

At the meeting in Brazil we agreed to introduce MeasureDefinitions in the Terminology section, providing a global declaration of available statistical measures.

In addition, these global measure definitions must be enabled in the terminology definition of each individual character. It will be possible to define sets that can be enabled with a single action. It is considered important that the character definition can constrain which measures are enabled. For example, if the character definition enables only the minimum and maximum, the mean cannot be entered in any item description. This is not the case in current DELTA, where the 5 numerical measures are always implicitly present without being defined. Already in DELTA this causes problems with the creation of dynamic editing forms, and with undesired input in local applications. If two users enter data into a numeric character without being properly trained, chances are that one user enters "5-8" as minimum and maximum, the other as lower and upper range value. DeltaAccess therefore imports DELTA data by making all 5 measures explicitly available, and then allows the user to freely remove or add measures.

Each measure can only be enabled a single time for each character. It is not possible to define two "means" in a single numeric character.

Within the item description, the measure is identified through a keyref to the key of the definition within the character (e. g. keyref="leafwidthmean0001" or keyref="327798129"), not towards the key of the global definition (e. g. keyref="mean").
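The relation between global measure definitions, per-character enabling, and keyref resolution can be sketched as follows. This is an illustrative in-memory model only; the class names, method names, and all keys other than those quoted above are assumptions, not part of the SDD schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MeasureDefinition:
    """A globally declared statistical measure (hypothetical structure)."""
    key: str    # global key, e.g. "mean"
    label: str


@dataclass
class Character:
    """A numeric character that enables a subset of the global measures."""
    key: str
    # Maps the character-local key to the global measure it enables.
    enabled: dict = field(default_factory=dict)

    def enable(self, local_key, measure):
        # Each global measure may be enabled only once per character.
        if any(m.key == measure.key for m in self.enabled.values()):
            raise ValueError(f"measure {measure.key!r} is already enabled")
        self.enabled[local_key] = measure

    def check_keyref(self, keyref):
        # Item descriptions must reference the character-local key.
        if keyref not in self.enabled:
            raise KeyError(f"{keyref!r} is not enabled for {self.key!r}")
        return self.enabled[keyref]


MIN = MeasureDefinition("min", "minimum")
MAX = MeasureDefinition("max", "maximum")
MEAN = MeasureDefinition("mean", "mean")

# This character enables only minimum and maximum, not the mean.
leaf_width = Character("leafwidth")
leaf_width.enable("leafwidthmin0001", MIN)
leaf_width.enable("leafwidthmax0001", MAX)
```

With this model, `leaf_width.check_keyref("leafwidthmin0001")` resolves to the minimum, while a reference to the global key `"mean"` is rejected because the item description may only point at definitions enabled within the character.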

Regarding the relation between the SDD standard and a terminology instance, the following options were discussed:

The last was favored, but deferred until xinclude is widely available to allow testing.

@@@ unfinished @@@ A basic problem is, again, that XML does not provide any mechanism to inherit content data from a standard list. The SDD group would like to predefine certain measures to maximize application interoperability, while still allowing the user to extend this list. Thus predefined and user-definable key values are in the same domain referenced by keyref statements.

@@@ unfinished @@@ Discussion of how many measures to predefine in the standard and how much to rely on extensibility. Extensibility of measures should be planned, but has not been further discussed. Providing 1% steps in percentiles and confidence intervals would be possible, but would bloat the measure list and is probably unnecessary. A compromise was sought, see @@@@@@@@@@@@

Numeric precision

As decided at the TDWG 2002 (Brazil) meeting, the SDD standard will only support the highest numeric precisions currently in normal use.

For real numeric values the minimum precision is double precision (8 byte, value range -1.79769313486231E+308 to -4.94065645841247E-324, 0, 4.94065645841247E-324 to 1.79769313486231E+308). Applications importing and re-exporting SDD data should not reduce the numeric precision below these values (e. g. using single precision real numeric variables to save space).
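The loss caused by a single precision round trip is easy to demonstrate. The sketch below assumes a platform where Python floats are IEEE 754 doubles (true for all common platforms); the helper name is hypothetical.

```python
import struct


def round_trip(value, fmt):
    """Pack and unpack a float using a struct format ('<d' or '<f')."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 23.8
as_double = round_trip(value, "<d")  # 8-byte double precision
as_single = round_trip(value, "<f")  # 4-byte single precision
```

The double precision round trip returns the value unchanged, whereas the single precision round trip returns a slightly different number, so a re-export would no longer match the imported data.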

For integer values the minimum precision is a signed 32 bit integer with a range of -2147483648 to 2147483647. A given implementation may store these values in a larger (e.g. 64 bit) integer, but should not export values beyond this range. We currently do not expect to require integer counts in biology with a larger range and which are not suitable to be handled as real numeric values.

Note: The need to support integer values at all is currently an open issue (means of integer are real numeric values). Also, if only a limited amount of calculation is performed within the application and proper rounding is implemented, it is possible to handle integer values with double precision real numeric values internally.
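Both points can be checked mechanically: whether an integer may be exported, and whether it survives internal storage as a double. This is a minimal sketch with hypothetical function names.

```python
INT32_MIN = -2147483648
INT32_MAX = 2147483647


def is_exportable_integer(value):
    """True if the value lies within the signed 32 bit range."""
    return INT32_MIN <= value <= INT32_MAX


def survives_double(value):
    """True if the integer is exactly representable as a double."""
    return int(float(value)) == value
```

Every signed 32 bit integer passes `survives_double`, because a double represents all integers up to 2**53 exactly; only beyond that magnitude do values such as `2**53 + 1` get rounded.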

Xml representation of numeric values

The rendering of numerical values varies between different languages and cultures. The most important difference is that some cultures use a decimal point, and others a decimal comma. Unfortunately, most cultures use the decimal separator of the other culture to separate thousands. Thus, without knowledge about the value domain or the culture in which it was written, it is impossible to interpret the two values "2.639" and "2,639". Either may be "2639" or "2 + (639 x 1/1000)".

The consequence of this is that numeric data should not be treated as "character string" in XML files, as in "<Measure>23.8</Measure>". The form "<Measure value="23.8" />" should be preferred.
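The ambiguity, and the attribute form, can be illustrated as follows. The sketch assumes that values in the XML file are always written with a decimal point and no thousands separator, as in the example above; `parse_localized` is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET


def parse_localized(text, decimal_sep, thousands_sep):
    """Parse a rendered number under an explicitly stated convention."""
    return float(text.replace(thousands_sep, "").replace(decimal_sep, "."))


# The same string yields different values depending on the assumed culture.
point_as_decimal = parse_localized("2.639", decimal_sep=".", thousands_sep=",")
point_as_thousands = parse_localized("2.639", decimal_sep=",", thousands_sep=".")

# Write the value into an attribute and read it back as a number.
element = ET.Element("Measure", value=repr(23.8))
xml_text = ET.tostring(element, encoding="unicode")
read_back = float(ET.fromstring(xml_text).get("value"))
```

Here `point_as_decimal` is 2.639 and `point_as_thousands` is 2639.0, which is exactly the interpretation problem described above. Culture-specific rendering then becomes a task for the report generator, not for the data file.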

Measurement units

A unique property of numeric data is the definition of a measurement unit for each measurement character in the terminology section of the project. DELTA provides a general mechanism for specifying units, which is, however, not restricted to numerical characters. Consequently, it is often used to express a general "wording after states" rather than an explicit measurement. In DELTA the distinction between a true measurement unit and a wording for natural language reporting is not intelligible for report processors. This limits the output options to those implemented in the application used by the producer of the data. If another application offers, for example, a report with a tabular arrangement, the output may look obscure. DeltaAccess therefore restricts the measurement unit to characters with a numerical type and provides a separate wording after states attribute ("Wording2").

The SDD standard similarly separates the measurement unit from the wording after states. For example, to express "petals 2-4 mm wide", "petals" is defined as wording for the character, "mm" as measurement unit, and "wide" as wording after states.

Note that numeric measures differ in whether the measurement unit is applicable to them or not. The unit is applicable to any central value statistic (mean, mode, median) and to range/interval statistics (min./max., confidence interval, etc.). It does not apply to variance statistics or sample size.
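Keeping the three parts separate lets a report processor assemble the natural language phrase itself. A minimal sketch, with a hypothetical function and parameter names:

```python
def render(character_wording, minimum, maximum, unit, wording_after):
    """Assemble a natural language phrase from separately stored parts."""
    parts = [character_wording, f"{minimum}-{maximum}", unit, wording_after]
    # Empty parts (e.g. no wording after states) are simply skipped.
    return " ".join(part for part in parts if part)


sentence = render("petals", 2, 4, "mm", "wide")
```

Because the unit is stored on its own, the same data could equally be placed in a tabular report with "mm" in a column header, which is not possible when unit and wording are fused in one string.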

We probably need an attribute in the numeric measure definitions to define this? Not yet done!
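One conceivable shape for such an attribute is a simple flag per measure definition. The sketch below is purely hypothetical: neither the flag nor the measure keys are defined by SDD.

```python
# Hypothetical flag: does the character's measurement unit apply?
UNIT_APPLIES = {
    "mean": True,
    "median": True,
    "mode": True,
    "min": True,
    "max": True,
    "variance": False,
    "sample size": False,
}


def format_value(measure, value, unit):
    """Append the unit only where the measure definition allows it."""
    if UNIT_APPLIES[measure]:
        return f"{value} {unit}"
    return str(value)
```

For example, `format_value("mean", 3.1, "mm")` yields "3.1 mm", while `format_value("sample size", 12, "mm")` yields "12".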

Recommendation and measurement unit examples: It is recommended that the user interface offers a pick list with frequent measurement units to guide the user in the selection of appropriate units and to highlight the method provided to code superscripts or subscripts. However, the user should be able to freely enter any measurement unit (i. e. this should be implemented as a combination of text and pick, which unfortunately is not available in html forms!).

Common measurement units (from pick list in DeltaAccess version 1.8): m; dm; cm; mm; µm; kg; g; mg; µg; L; ml; µl; nl; °C; M; mM; µM; nM; pM; Mol; mMol; µMol; g/ml; Pa; hPa; bar; mbar; mm Hg; mS; µS; °; '; "; in; ft; µm<sup>2</sup>; mm<sup>2</sup>; cm<sup>2</sup>; m<sup>2</sup>; km<sup>2</sup>; µm<sup>3</sup>; mm<sup>3</sup>; cm<sup>3</sup>; m<sup>3</sup>; km<sup>3</sup>


In the future SDD could refer to units defined through URI names. This is especially important where units are context sensitive (minute and second as angular units or units of time), or where multiple definitions exist (miles: nautical mile, English land mile, US survey mile). An attribute UnitDefinitionURL could be added to SDD as soon as the evolving standards have stabilized. For more information on units see: Units in MathML. W3C Working Group Note 10 November 2003

Gregor Hagedorn; Vers. 1; 7. Febr. 2003




First published 2002-11-28, last update: 2003-03-08.
