Unified Biosciences Information Framework (UBIF) XML schema library. See the main UBIF.XID file for complete information, copyright and licensing. This library has no targetNamespace or namespace; it can be included into UBIF as well as into other schemata and acquires the target namespace of the including schema (chameleon pattern, see http://www-106.ibm.com/developerworks/library/x-flexschema/ or http://www.xfront.com/ZeroOneOrManyNamespaces.html).

=== Public objects carrying a key also generally provide for developer annotations/comments (undefined language), version extensions for future versions of UBIF, and custom extensions (= "application annotations").
Version extension (Ext), CustomExtensions, and Annotation/comment free-form text.
Internal notes/management comments (not multilingual). Annotations should be displayed only in a 'designer' or 'revision' mode and are expected to be invisible to users who only want to consume or apply the data. They are appropriate for rough, unedited comments, but should not contain confidential information.
Extension mechanism to implement forward compatibility in a new version of the standard (i. e. old applications can process newer data versions; compare backward compatibility using optional elements anywhere).
Community extension mechanism, e. g., for application-specific data.
To allow forward-compatible extensions of UBIF and derived schemata, an extension container for the target namespace is provided for use by the designers of the schema. Only the developers of the standard namespace may place elements here!
This provides an extension mechanism to the standard model that may be used, e. g., to store application-specific data. Recommendation: UBIF applications that both import and export data may implement lossless round-tripping of data. The information of all imported custom extensions, even of those that are not interpreted, should be preserved as a string and later exported unchanged.
Each custom extension contains xml content defined in another namespace. This may either be application-specific, or several applications may agree on common custom extensions. [ATTR: name, version] The content of a CustomExtension is not further validated against a schema by validating xml processors. However, it must be well-formed xml, and it is not possible to directly store a text string (content model mixed="false").
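The round-tripping recommendation above can be illustrated with a minimal sketch. It assumes, purely for illustration, un-namespaced element names `CustomExtensions` and `CustomExtension`, a hypothetical root element `Object`, and a made-up application namespace `http://example.org/app`; in a real instance document the elements would carry the target namespace of the including schema. The helpers `stash_custom_extensions` and `restore_custom_extensions` are hypothetical names, not part of UBIF.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance fragment; all names and the namespace URI are assumptions.
SAMPLE = """<Object>
  <CustomExtensions>
    <CustomExtension name="ExampleApp" version="1.0">
      <x:Setting xmlns:x="http://example.org/app">42</x:Setting>
    </CustomExtension>
  </CustomExtensions>
</Object>"""


def stash_custom_extensions(root):
    """Serialize every CustomExtension to a string without interpreting it."""
    return [ET.tostring(ext, encoding="unicode") for ext in root.iter("CustomExtension")]


def restore_custom_extensions(root, stash):
    """Re-attach previously stashed extensions to an object that is being exported."""
    container = root.find("CustomExtensions")
    if container is None:
        container = ET.SubElement(root, "CustomExtensions")
    for text in stash:
        # The stored string must be well-formed xml, as the schema requires.
        container.append(ET.fromstring(text))
    return root


imported = ET.fromstring(SAMPLE)
stash = stash_custom_extensions(imported)
exported = restore_custom_extensions(ET.Element("Object"), stash)
```

Note that ElementTree may rewrite namespace prefixes on serialization, so this sketch preserves the content of an extension rather than its exact bytes; an application that needs byte-identical output would have to keep the original text.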
Identifier chosen by the target application(s) for which the content in the extension container is intended. The only purpose of this attribute is that the application(s) generating a type of custom extension recognize the target identifier, while other applications just pass it through.
Optional information about which version of the custom extension definition has been used.

=== Key/ref infrastructure for linking within a data set:
This allows the value type for keys and keyrefs to be defined (and redefined). Note: the use of attribute groups instead of globally defined and referred attributes is a work-around for problems occurring with attribute definitions in included library schemata. The use of global attributes by ref caused validation or namespace problems, even though this library has no target namespace (chameleon pattern); Spy 2004.4 says, e. g., ... attributes that need to be qualified because your schema uses attributeForm = qualified or global attributes. You must specify a prefix for your schema namespace.
An optional attribute to add a human-readable equivalent to the numeric primary identity key, intended to simplify debugging. The attribute can be discarded or updated at any time. Applications should not produce exports containing this attribute; instead, it can be generated using xslt (based on labels/abbreviations).
An optional attribute to add a human-readable equivalent to the numeric ref to simplify debugging. A debugref always points to the associated debugkey. The attribute can be discarded or updated at any time. Applications should not produce exports containing this attribute; instead, it can be generated using xslt (based on labels/abbreviations reached through key/ref).

=== Options to link using URLs or GUID + resolving mechanisms (used especially for UBIF data proxies):
The object linking mechanisms used by the ProxyBase type may also be used by other objects!
LifeScience ID (without the constant prefix 'urn:lsid:'). 
3 to 4 parts separated by colon: the 1st part is the url of a life science authority service that provides metadata on how to obtain the object referenced in parts 2 (namespace = data collection), 3 (object ID) and 4 (optional object version). Example: lsid.gbif.org:DataCollectionID:ID/1§31~b+:v2
Digital object identifier (an ID scheme advanced by the library community).
A URL directly providing an object representation. In contrast to the URN types LSID or DOI this should resolve directly. The URL may be a query string (with ID embedded), for example: "http://x.y.fr/pub/au=smith?yr=1998". In the case of URLs, multiple URLs may be given to reduce the likelihood of failure. [The element sequence in instance documents is informative and should be preserved.]

=== Basic type library:
=== Basic generic types
normalized string required to contain at least 1 character (this removes the xml string anomaly, i. e. either element/attribute may be optional, but if they are required the content may not be an empty string)
normalized string restricted to 1..50 character length to be used for abbreviations (the recommended length of abbreviations is usually much shorter, but 50 characters should be
normalized string restricted to 1..255 character length (i. e. required, may not be empty string)
Double precision numeric value in the range of [0..1]
Colors defined as RGB (red-green-blue) values, hex-encoded and combined into a string, as in html. Example: #EE88FF. Colors may also be expressed as HSV (hue-saturation-value), but this is convertible to RGB. RGB is preferred because it is used in HTML. Html also allows a shortened version with only 3 hexadecimal digits. A pattern supporting both would be: #(([0-9]|[a-f]|[A-F]){3}|([0-9]|[a-f]|[A-F]){6})
Derived string type with restricting patterns
Life Science ID (= string restricted by a regular expression pattern). Annotation of the pattern: 5 to 6 parts separated by colon 1. The string URN (case-insensitive) 2. 
The string LSID (case-insensitive) 3. AuthorityID = DNS token with at least 2 parts plus a top-level domain with 2-5 characters (case-insensitive) (In earlier LSID specs this was assumed to be a DNS name; the final spec. however says: "The authority identification is usually an Internet domain name. In this case it is recommended that it be owned by the organization that assigns an LSID in question. Such organization is responsible for ensuring the uniqueness of the string created from the namespace, object and revision identifications. In the case where the authority identification string is not an Internet domain name, the authority should take care to ensure that it is a unique string and if possible, register that unique string with the organization that is currently the authority for the URN Namespace Identifier (NID) "lsid"" 4. Data collection identifier/namespace: non-whitespace characters except colon (case-sensitive) 5. Object ID: non-whitespace characters except colon (case-sensitive) 6. Object version (optional): non-whitespace characters except colon (case-sensitive) Earlier, more specific specs at http://www.i3c.org/wgr/ta/resources/lsid/docs/LSIDSyntax9-20-02.htm had more restrictions on authority (DNS name!) and fewer characters beyond US-ASCII and digits. A pattern matching the earlier spec. was (extended with "local"): pattern value="[Uu][Rr][Nn]:[Ll][Ss][Ii][Dd]:((local)|(([0-9A-Za-z\-]+\.){2,}[A-Za-z]{2,5}))(:[0-9A-Za-z][0-9A-Za-z\(\)\+,\.=;$" _!\*'\-]+){2,3}" Example: urn:LSID:www.gbif.net:DataCollection.Namespace:ID/$17+731_b:v2.1 Compare LSID, this omits the prefix 'urn:lsid:' Digital Object Identifier (standalone, not embedded into URI syntax) Pattern based on http://www.doi.org/handbook_2000/enumeration.html#2.2 which states that all DOIs start with "10." then a free prefix, then "/", then suffix. 
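The LSID and DOI rules described above can be sketched as regular expressions. This is only an illustration under stated assumptions, not the pattern used in the schema: the LSID expression follows the six-part annotation (authority as a DNS-like token, remaining parts as non-whitespace characters except colon), and the DOI expression encodes nothing beyond "10.", a prefix, "/", and a suffix. The names `is_lsid` and `is_doi` are hypothetical.

```python
import re

# Authority: at least two dot-terminated labels plus a top-level domain of 2-5 letters.
_AUTHORITY = r"(?:[0-9A-Za-z\-]+\.){2,}[A-Za-z]{2,5}"
# Namespace, object ID and optional version: non-whitespace characters except colon.
_PART = r"[^\s:]+"

# "urn" and "lsid" are matched case-insensitively, the remaining parts as written.
LSID = re.compile(rf"[Uu][Rr][Nn]:[Ll][Ss][Ii][Dd]:{_AUTHORITY}(?::{_PART}){{2,3}}")

# Assumed minimal DOI shape: "10.", a prefix without slash, "/", then a non-empty suffix.
DOI = re.compile(r"10\.[^\s/]+/\S+")


def is_lsid(value: str) -> bool:
    """True if the whole string has the 5- or 6-part LSID shape described above."""
    return LSID.fullmatch(value) is not None


def is_doi(value: str) -> bool:
    """True if the whole string is a standalone DOI (not embedded in URI syntax)."""
    return DOI.fullmatch(value) is not None
```

The variant of the LSID type that omits the prefix 'urn:lsid:' could be checked the same way by dropping the first two parts from the expression.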
An additional constraint, not expressed here but possible to implement, is that according to the Appendix the pattern "\S/" (single character followed by slash) for the suffix (i.e. after the first slash) is reserved for future extensions.
String containing a format pattern of the type used in the xslt format-number function
A generic or higher taxon name (monomial) under the bacteriological, botanical, viral, and zoological code, with a pattern to fulfill the following rules: a) First character must be upper case [A-Z]; b) Second and following characters must be lower case [a-z], i.e. without accentuation but with e diaeresis ("ë") being allowed as an exception in botany; c) From third character on, a hyphen may occur as well. Note that Genus hybrid flags are expected to be stored separately! Based on ABCD, S.Blum 12/2002. W.Berendsohn 12/2003. The rules above should apply to generic names under all codes; if an exception is discovered, the change in constraints should be implemented as an extension [SB]. Note that a maximum length of 255 characters is stipulated to simplify the design of persistent databases [GH]. Notes regarding the admission of ë and hyphen (only for botany):
ICBN St. Louis: Art. 60.6. Diacritical signs are not used in Latin plant names. In names (either new or old) drawn from words in which such signs appear, the signs are to be suppressed with the necessary transcription of the letters so modified; for example ä, ö, ü become, respectively, ae, oe, ue; é, è, ê become e, [...]. The diaeresis, indicating that a vowel is to be pronounced separately from the preceding vowel (as in Cephaëlis, Isoëtes), is permissible.
Bacteriology: Diacritic signs are not used in names or epithets in bacteriology [Rule 64].
ICZN, Article 11: "Mandatory use of Latin alphabet... a scientific name must when first published have been spelled only in the 26 letters of the Latin alphabet; the presence of diacritic marks, apostrophes, diphthongs or the additional letters of the Scandinavian alphabet does not render the name unavailable, but marks must be removed, diphthongs separated and the Scandinavian letters transliterated." Also: digits or symbols must be spelled out in Latin, hyphenation must be contracted.
The pattern should prevent a hyphen as the last character! Two hyphens in a row are still possible, but considered irrelevant. Example: "Epichloë".
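The rules for generic names can be sketched as a single regular expression. This is an illustrative reading of rules a) to c) plus the "no trailing hyphen" and 255-character constraints, not the pattern from the schema itself; the minimum length of two characters and the NFC normalization step are assumptions, and `is_generic_name` is a hypothetical helper.

```python
import re
import unicodedata

# a) one upper-case letter, b) lower-case letters or "ë" from the second character on,
# c) hyphens allowed from the third character on, but never as the last character.
GENERIC_NAME = re.compile(r"[A-Z][a-zë](?:[a-zë\-]*[a-zë])?")


def is_generic_name(name: str) -> bool:
    """Check a generic or higher taxon name against the rules sketched above."""
    # Normalize so that "e" + combining diaeresis is treated like precomposed "ë".
    name = unicodedata.normalize("NFC", name)
    return len(name) <= 255 and GENERIC_NAME.fullmatch(name) is not None
```

As stated in the notes, two hyphens in a row are not rejected by this expression.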
A specific or infraspecific epithet name string under the bacteriological, botanical, viral, and zoological code, with a pattern to fulfill the following rules: a) contains only lower case characters [a-z] or e-diaeresis (ë). Note that this data type cannot be used for cultivar names, which may contain blanks and accented or other letters. The pattern should prevent a hyphen as the last character! Two hyphens in a row are still possible, but considered irrelevant. Example: "vitis-ideae".

=== The following Range, Date, and Coordinate types describe frequently recurring simple type combinations in an element with attributes

-- Element with 2 attributes to define a range:
Lower and upper value as required attributes (no default values)
Lower and upper probability value as required attributes (no default values)
Contains lower/upper estimate attributes; used, e. g., for certainty and frequency! The default values are 0 and 1, indicating that no estimate was possible.

-- RGB color polygon expressed as a list of RGB values (these should form a single polygon when connected, which is not validated in the schema!)
A single color value or a color polygon defining an area in color space (i. e. not a spatial polygon having a color!)
A single point in color space, or multiple points forming vertices of a polygon area in color space. When using a polygon this defines an estimated color range into which the single or variable true color values of the object fall.

-- Types for composite gregorian calendar date/time (points in time where parts may be missing), following the seven property model described, e. g., in xml Schema 1.1 (http://www.w3.org/TR/2004/WD-xmlschema11-2-20040716/#theSevenPropertyModel). Instead of gYear, gMonth, gDay, integer types with constraining facets are used for two reasons: a) each of them may have a timezone, which may lead to inconsistent data with multiple timezones; b) the lexical representation seems to be occasionally poorly implemented (e.g. 
where '31', or '---5' are accepted, whereas valid examples are '---31', '---05', and '---05+02:00'). In addition to the seven property model, text attributes for either unsharp additions or complete verbatim dates are added. Note that incomplete dates in most cases are calendar specific and incomplete non-gregorian dates cannot be expressed. Furthermore, for complete dates it may be unclear whether a reformed or unreformed date has been used (e.g. in Russia in the 19th century).
Date separated into attributes so that any part of the date may be missing [ATTR: year = four digit year; month = two digit month of year; day = two digit day of month; verbatim = unparsed textual date representation; supplement = text in addition to or modifying the exact dates, e. g., 'end of summer', 'first half of year', 'first decade of month', '1888-1892'; timezone = expressed as integer according to the xml schema seven property model]
The four digit year in the Gregorian calendar (in Western cultures usually without a suffix or with 'AD/Anno Domini', 'CE/Common Era'; negative years with 'BC/Before Christ', 'BCE/Before Common Era'). Whether a year 0 is used or not differs between a true Gregorian calendar and recent astronomic usage; xml schema is likely to change its position, see xml schema draft 1.1. Thus database designers should not use 0 as a missing value representation for year.
two digit day
Text in addition to or modifying the exact date components, e. g., 'end of summer', 'first half of year', 'first decade (of month)', '1888-1892'.
An uninterpreted text representation of the original date information (date range, 'summer', perhaps unreformed Russian dates, etc.); as close as possible to the (digital/printed/handwritten) information source.
Timezone expressed in minutes. In the seven property model (http://www.w3.org/TR/2004/WD-xmlschema11-2-20040716/#theSevenPropertyModel) the timezone has a range of +/- 14 hours (14 * 60 = 840 minutes).
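The timezone range and the "any part may be missing" idea can be illustrated with a small sketch. The function names `timezone_from_minutes` and `to_datetime` and the parameter `tz_minutes` are hypothetical; the code only handles the case where year, month and day are all present, and Python's `datetime` supports years 1 to 9999 only, so negative (BC) years and a year 0 are outside this sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def timezone_from_minutes(minutes: int) -> timezone:
    """Convert a timezone offset in minutes, limited to +/- 14 hours (840 minutes)."""
    if not -840 <= minutes <= 840:
        raise ValueError("timezone must be within +/- 840 minutes")
    return timezone(timedelta(minutes=minutes))


def to_datetime(year: Optional[int] = None,
                month: Optional[int] = None,
                day: Optional[int] = None,
                tz_minutes: Optional[int] = None) -> Optional[datetime]:
    """Return a datetime for a complete date, or None if any date part is missing."""
    if year is None or month is None or day is None:
        return None  # incomplete dates stay as separate attributes
    tz = timezone_from_minutes(tz_minutes) if tz_minutes is not None else None
    # datetime() itself rejects impossible dates such as 31 February.
    return datetime(year, month, day, tzinfo=tz)
```

For example, `to_datetime(1888, 7, 31, 120)` yields a date with a +02:00 offset, while `to_datetime(1888)` returns `None` because month and day are missing.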
Date + Time separated into attributes so that any part of the date may be missing. [ATTR: see CompositeDate type, plus: time]
'24' may only occur if both minute and second are zero (http://www.w3.org/TR/2004/WD-xmlschema11-2-20040716/#theSevenPropertyModel).
The normal range should be 0-59, but 60 may occur for UTC leap-seconds (http://www.w3.org/TR/2004/WD-xmlschema11-2-20040716/#theSevenPropertyModel). An additional validator may choose to validate this. The simplest validation would attempt to convert those composite date instances that contain all seven properties to an xs:dateTime value.

-- Types for geographical coordinates
Latitude of geographical coordinates in decimal degrees (e.g. 30° 30' would be expressed as 30.5)
Longitude of geographical coordinates in decimal degrees (e.g. 30° 30' would be expressed as 30.5)
ATTR: latitude, longitude (in decimal degrees), geodeticdatum (esp. if different from a Greenwich-based datum). Longitude is expressed from -180 to 180°, East longitude being plus and West longitude being minus. Where knowledge of the geodetic datum is readily available it should be passed on. However, in most situations no undue resources should be invested into researching the geodetic datum when this is unknown. Many geodetic datum systems result in differences of only up to 100 m, some up to several hundred meters. For many purposes in biodiversity sciences such differences are acceptable. The 'World Geodetic System 1984 (WGS-84)' is the most commonly used geodetic datum. It is used, e. g., by the 'Global Positioning System (GPS)'. Other important systems are used (e. g., ITRF, ETRS89, NZGD2000, OSGB36, ED50, see also http://www.ncgia.ucsb.edu/education/curricula/giscc/units/u015/tables/table03.html or http://www.colorado.edu/geography/gcraft/notes/datum/edlist.html). 
The differences between WGS-84 and International Terrestrial Reference Frame (ITRF) are in the centimeter range worldwide, and ETRF 89 and NAD 83 are identical to WGS84 for Europe and North America, respectively. As an exception to the above, historical coordinates (for most countries up to ca. 1900, much later for France) may be based on a prime meridian other than Greenwich/Airy (e. g., the NTF datum uses Paris as its prime meridian, 2.33723° east of Greenwich).
An uninterpreted text representation of the coordinate data (latitude/longitude, UTM, TRS, etc.), as close as possible to the (digital/printed/handwritten) information source.

=== Various complex types
Three attributes provide options to express sex as code (enumerated vocabulary), free-form text (perhaps interpreted), or verbatim (uninterpreted original version). At least one attribute should be present; this cannot be validated by the schema.
Controlled vocabulary to express sex status for clinical human or biological purposes.
The string present in the source database, either in addition to or instead of code (especially if no mapping to the controlled vocabulary has been implemented yet, or if a specific value cannot be mapped). This differs from verbatim in that it claims no special status and may contain any amount of interpretation relative to the original source (e. g., a specimen label)
An uninterpreted text representation of the original sex information; as close as possible to the (digital/printed/handwritten) information source.
Telephone, fax, etc. number ATTR: number = should be provided in the ITU Recommendation E.164 international format ("+CountryCode AreaCode Number") (vCard:Tel.Number) ATTR: devicetype = voice, fax, mobile, pager, modem (identical with vCard:Tel.Voice etc.; if several flags apply to a single phone number list the phone number multiple times!) ATTR: usagenote = free-form text for constraints on use e. g. 
"weekdays only" or "home number" (partly: vCard:Tel.Home/Work flags) ATTR: preferred = preferred number, may occur multiple times for different device types (vCard:Tel.Pref) Numbers should be provided in the ITU Recommendation E.164 international format ("+CountryCode AreaCode Number"). Note that telephone device types are not necessarily exclusive (voice/fax, mobile/modem, etc.) and vCard 3.0 allows multiple for a single number. However, in UBIF this can be represented by adding a single number multiple times for each device type. This attribute should not have a default value voice, even though this is the most likely case. However, an exporting database may not have properly reported the type, or the type may be indicated only in the usage note. Free-form text for constraints on use e. g. "weekdays only" or "home number" (partly: vCard:Home/Work flags) === Extension of xs:language and a reference element using Language Union of xs:language with '-' for language-neutral (e.g. scientific names) and '?' for unknown. Language follows RFC 3066 'Tags for the Identification of Languages': a two-letter code taken from ISO 639 part 1 or a three-letter code taken from ISO 639 part 2, followed optionally by a two-letter country code taken from ISO 3166. (Notes: When a language has both a two-letter and three-letter code, use the two-letter code. RFC 3066 replaces RFC 1766.) Defines an element with a required 'language' attribute Complex types that add attributes 'language' or 'preferred' to the simple types String, String255, anyURI: Note: the use of attribute groups instead of globally defined and referred attributes is a work-around for problems occurring with attribute definitions in included library schemata. (single 'language' attribute) Attribute for Language, used by-reference (single 'language' attribute) Attribute for Language, used by-reference (single 'preferred' attribute) Elements with preferred = true indicate recommendation by the data provider. 
The consumer may have reasons to make a different choice. Note on current usage: these types are used by ABCD and UBIF, but not by SDD (which uses mostly audiences instead of language) String (i. e. xs:string with minimum length=1) extended with *optional* language attribute String255 (i.e. xs:string with length 1-255), extended with *optional* language attribute String (i. e. xs:string with minimum length=1) extended with *optional* preferred attribute String255 (i.e. xs:string with length 1-255), extended with *optional* preferred attribute String (i. e. xs:string with minimum length=1) extended with *optional* language and preferred attributes String255 (i.e. xs:string with length 1-255), extended with *optional* language and preferred attributes xs:anyURI extended with *optional* Preferred attribute === Some text data support limited xhtml. (Could appropriate elements from xhtml be imported and encapsulated here?) Collection of language-specific label representations Language-specific label representation [ATTR: language] Language-specific simple label, using simple formatted text Label text in a specific language. Restricted to 50 characters maximum length, including blanks (recommended to be shorter!). Label abbreviations are especially important when displaying information in a tabular format. Collection of language-specific label representations Language-specific label representation [ATTR: language] LabelRepr with short inherited Text extended with longer Details text. Optional text of unconstrained length, elaborating details of the ShortText Text with primary language plus multiple optional translations; used, e. g., in PublicationProxy type. A string, e. g. the title of a publication, having a single primary language. [ATTR: language] Translations from the primary language [ATTR: language] === Statements are a special form of complex text expressions Text, optional Details (both free-form text) and optional URI. 
A concise representation of a statement (copyright, acknowledgement, etc.). Recommended to be as short as possible, but actual length is unconstrained.
Optional text of unconstrained length, elaborating details of the ShortText
An optional resource on the net providing details on the statement (may be used as an alternative to the long text).
A sequence of various intellectual property right (= IPR) statements, with a language attribute on the entire sequence.
Other forms of IPR declaration not yet covered (e.g., database rights); also used in cases where an automatic converter cannot decide whether a statement is copyright, licence, etc.
Copyright may include the information that the data has been released to the public domain.
To be used if data are placed under a public license (GPL, GFDL, OpenDocument). Placing data under a public license while maintaining copyright is recommended! (= DC.Rights.Licence; new 2004)
Defines conditions under which the data may be analyzed, distributed or changed. "Terms of use" includes concepts like "Usage conditions" and "Specific Restrictions".
Disclaimer statement, e. g. concerning responsibility for data quality or legal implications.
A free form text acknowledging support (e. g. grant money, help, permission to reuse published material, etc.)

=== The following types are currently unused (August 2004), but may be used in the future or by other standards.
[Unused!] Valid states are true, false, and default. A name whose only value is "default", used for union definitions.
[Unused!] Valid states are true, false, and default. A name whose only value is "default", used for union definitions.