Character Collation Concept Dictionary

Character Collation

The collation sequence definition shall be used by regular expressions, pattern matching, and sorting.

collation - the lexicographic sorting of characters.

lexicographic sorting - sorting of characters based on a dictionary or alphabetic ordering.

A common method for adding characters to an alphabet is to use diacritical marks. In some languages, this creates a completely new character, collated differently from the Latin base. In other languages, these accented characters are collated as variants of the Latin base letter, i.e., they have the same relative order; they are equivalent.

collation value - a number, corresponding to a character, which determines its lexicographic position relative to other characters.

unique collation value - a collation value for a character which is guaranteed to be unique among the collation values of the characters for a given codeset. This concept applies mainly when the number of weight levels is two.

primary collation value - the collation value of a character applied during the first iteration of the process of comparing two strings.

secondary collation value - the collation value of a character applied during the second iteration of the process of comparing two strings.

equivalence class - set of characters all having the same primary collation value. In general, the characters in the same equivalence class will have secondary collation values that vary.

Collation order is expressed in terms of collation values. This does not imply that implementations shall assign such values, but that ordering of strings using the resultant collation definition in the locale shall behave as if such assignment is done and used in the collation process.
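As a rough illustration of the "as if" assignment described above, the sketch below uses a made-up two-level weight table and groups characters into equivalence classes by primary collation value. The table, the weights, and the function name are assumptions for illustration only; they are not taken from any real locale.

```python
# Hypothetical weight table: character -> (primary, secondary) collation value.
WEIGHTS = {
    "a": (1, 1),
    "á": (1, 2),
    "à": (1, 3),
    "b": (2, 1),
    "c": (3, 1),
}

def equivalence_classes(weights):
    """Group characters that share a primary collation value.

    Classes are returned in primary order; members of a class are
    ordered by their secondary collation value.
    """
    classes = {}
    for ch, (primary, _secondary) in weights.items():
        classes.setdefault(primary, []).append(ch)
    return [
        sorted(members, key=lambda ch: weights[ch][1])
        for _primary, members in sorted(classes.items())
    ]

print(equivalence_classes(WEIGHTS))
```

Here "a", "á", and "à" form one equivalence class because they share a primary value, while their secondary values differ.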

extended collation - either N-to-1 collation or 1-to-N collation.

ordinary collation - collation which is not extended collation.

N-to-1 collation - collation in which a sequence of N characters is grouped together as a single unit for the purposes of collation.

1-to-N collation - collation in which 1 character is converted into a sequence of N other characters, for the purpose of collation.
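A minimal sketch of both kinds of extended collation follows. The rules are hypothetical, loosely modelled on the traditional Spanish treatment of "ch" and "ll" as single letters and the German treatment of "ß" as "ss"; the names N_TO_1, ONE_TO_N, and collating_elements are invented for this example.

```python
# Hypothetical rules, not taken from any real locale definition.
N_TO_1 = ["ch", "ll"]          # sequences treated as one collating element
ONE_TO_N = {"ß": ["s", "s"]}   # one character expanded into several elements

def collating_elements(text):
    """Split text into collating elements, applying both extended rules."""
    elements = []
    i = 0
    while i < len(text):
        for seq in N_TO_1:
            if text.startswith(seq, i):
                elements.append(seq)   # N characters become 1 element
                i += len(seq)
                break
        else:
            # 1 character becomes N elements, or stays as itself
            elements.extend(ONE_TO_N.get(text[i], [text[i]]))
            i += 1
    return elements

print(collating_elements("chico"))
print(collating_elements("straße"))
```

Weights would then be looked up per collating element rather than per character.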

collating sequence - the set of all collation values for a given codeset.

AIX 3.1 had functions, NCcollate(), NCcoluniq(), _NCxcol(), _NLxcol(), which returned collation or unique collation values for various representations of the codeset. These functions were obsolescent in AIX 3.2, and may no longer be supported.

At least five different levels of increasingly complex collation rules can be distinguished: byte/machine code order, character order, string order, text search order, semantic order.

byte/machine code order - the collation order is determined by the character encoding.

character order - collation which uses only the primary collation value of characters. Equivalence classes are supported.

string order - collation which uses the primary, secondary, etc., collation values of characters. Characters of the same equivalence class sort identically when subsequent characters of the strings sort differently; the lower-level weights decide the order only when the strings are otherwise equal. Supports forward and backward comparison at different weight levels. (For example, ba'ch sorts before bane, and bach before ba'ck, where "a'" means an accented a.)

text search order - a further refinement of string order in which homonyms are collated together and numbers are collated as if spelled out in words.

semantic order - words and strings are collated based on their meaning; entire words such as "the" are eliminated.

dictionary order - same as string order.

telephone book order - same as text search order.

POSIX.2 mandates string ordering.

Backward and forward ordering is a requirement defined in the Canadian and other proposed standards for collation.
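The effect of backward ordering can be sketched with the usual French example, in which accents are compared from the end of the word. The weights below are invented for illustration, and sort_key is a hypothetical helper, not part of any library.

```python
# Hypothetical weights: (primary, secondary). Accented and unaccented
# letters share a primary weight and differ only at the secondary level.
WEIGHTS = {
    "c": (1, 1), "e": (2, 1), "é": (2, 2),
    "o": (3, 1), "ô": (3, 2), "t": (4, 1),
}

def sort_key(word, secondary_direction):
    """Primary weights scanned forward; secondary in the given direction."""
    primary = [WEIGHTS[ch][0] for ch in word]
    secondary = [WEIGHTS[ch][1] for ch in word]
    if secondary_direction == "backward":
        secondary.reverse()
    return (primary, secondary)

WORDS = ["côté", "coté", "côte", "cote"]
forward = sorted(WORDS, key=lambda w: sort_key(w, "forward"))
backward = sorted(WORDS, key=lambda w: sort_key(w, "backward"))
print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

All four words are equal at the primary level, so only the direction of the secondary pass changes the result.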

Far East collations often require contextual information or pronunciation rules which fall outside of the POSIX.2 goals. However, stroke/radical and "most common pronunciation" collation rules can be supported by POSIX.2.

weight level - a particular iteration through the string during the collation process. Each iteration corresponds to applying the collation weights at that level to the characters and comparing those weights.

In string ordering, collation shall behave as if, for each weight level, first substitution is performed (unless the no-substitute keyword applies), and IGNOREd elements are removed. Then each successive pair of elements shall be compared according to the relative weights for the elements. If the two strings compare equal, the process shall be repeated for the next weight level, up to the limit COLL_WEIGHTS_MAX.
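The level-by-level process can be sketched as follows. This is only an illustration under stated assumptions: the weight table is invented, the substitution step is omitted, every level is scanned forward, and COLL_WEIGHTS_MAX is fixed at 2 for the example.

```python
IGNORE = None          # marker: element is removed at this weight level
COLL_WEIGHTS_MAX = 2   # number of weight levels in this sketch

# Hypothetical weights: one entry per weight level.
WEIGHTS = {
    "a": (1, 1), "á": (1, 2),
    "b": (2, 1), "c": (3, 1), "e": (4, 1),
    "h": (5, 1), "k": (6, 1), "n": (7, 1),
    "-": (IGNORE, IGNORE),
}

def level_key(text, level):
    """Weights of text at one level, with IGNOREd elements removed."""
    weights = (WEIGHTS[ch][level] for ch in text)
    return [w for w in weights if w is not IGNORE]

def compare(s1, s2):
    """Return -1, 0, or 1; move to the next level only on a tie."""
    for level in range(COLL_WEIGHTS_MAX):
        k1, k2 = level_key(s1, level), level_key(s2, level)
        if k1 != k2:
            return -1 if k1 < k2 else 1
    return 0

print(compare("bách", "bane"))   # -1: decided at the primary level
print(compare("bach", "bách"))   # -1: tie at primary, decided at secondary
print(compare("ba-ch", "bach"))  # 0: the hyphen is IGNOREd at both levels
```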

COLL_WEIGHTS_MAX - maximum number of weight levels used in ordering by string.

collation sequence definition - the LC_COLLATE portion of a locale definition file.

collation order statement - statement in a collation sequence definition consisting of a collating symbol name and optional collation weights.

collating element entry - same as collation order statement.

collating symbol - the name of a character enclosed in angle brackets used to identify a character in a collation order statement.

collation keyword - a collation sequence definition keyword in a locale definition file. One of the following: copy, collating-element, collating-symbol, substitute, order_start, order_end.

copy - specifies the name of an existing locale to be used as the source for the definition of this category. If this keyword is specified, no other keyword shall be specified.

collating-element - defines a collating element symbol representing a multicharacter collating element. This keyword is optional and may be repeated.

collating-symbol - defines a collating symbol for use in collation order statements. This keyword is optional.

substitute - keyword in a locale definition file which specifies a basic regular expression and a replacement consisting of one or more characters and possibly backreferences. During sorting, when a string matching the regular expression is encountered, it is replaced with the replacement string.

order_start - in a collation sequence definition, one or more collation order statements follow the order_start keyword.

order_end - keyword that terminates a collation sequence definition.

If collation values are not specified in one or more collation order statements, then 1) only a primary weight is assigned to the corresponding character, 2) the primary weight is determined by the relative position of the character in the sequence.
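A small sketch of this default follows: each symbol listed without weights receives a single primary weight equal to its position in the sequence. The entry list and the function name positional_weights are assumptions made for the example.

```python
# Hypothetical collation order statements with no weights specified.
ENTRIES = ["c", "a", "b"]

def positional_weights(entries):
    """Assign each symbol one primary weight: its 1-based position."""
    return {symbol: (position,) for position, symbol in enumerate(entries, start=1)}

weights = positional_weights(ENTRIES)
print(weights)  # {'c': (1,), 'a': (2,), 'b': (3,)}
print(sorted("abcabc", key=lambda ch: weights[ch]))
```

Because "c" is listed first in this made-up sequence, it collates before "a" and "b".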

The substitute collation feature permits such things as sorting alphabetic month names in logical order.
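The month-name case can be sketched as below. This only imitates the idea of substitution before comparison: it uses Python regular expressions rather than POSIX basic regular expressions, and the rule list and the substitute helper are invented for the example.

```python
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical substitution rules: (regular expression, replacement).
# Each month abbreviation is replaced by its two-digit number.
SUBSTITUTIONS = [
    (re.compile(name), f"{number:02d}")
    for number, name in enumerate(MONTHS, start=1)
]

def substitute(text):
    """Apply every substitution rule before the text is compared."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text

dates = ["Mar 14", "Jan 30", "Dec 01", "Feb 02"]
print(sorted(dates))                  # plain alphabetic order
print(sorted(dates, key=substitute))  # calendar order
```

The strings themselves are unchanged; only the form used for comparison is substituted.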

order_start operand - an operand of the order_start keyword which specifies sort rules. If no operand is specified, one forward operand is assumed. The nth operand is the rule applied when comparing strings using the nth weight. Operands shall be separated by semicolons. The possible operands are: forward, backward, no-substitute, position.

forward - for this weight level, sort from beginning of the string to the end.

backward - for this weight level, sort from end of the string to the beginning.

no-substitute - for this weight level, do not perform substitution.

position - for this weight level, comparison operations shall consider the relative position of non-IGNOREd elements in the string, such that, if strings compare equal, the element with the shortest distance from the starting point of the string shall collate first. For example, if hyphen is IGNOREd on the first pass, "o-ring" and "or-ing" will compare equal, and the position of the hyphen is immaterial. On the second pass, all characters except the hyphen are IGNOREd, and in the normal case the two strings would again compare equal. By taking position into account, "o-ring" collates before "or-ing".

ellipsis symbol - "...". The ellipsis symbol specifies that a sequence of characters shall collate according to their encoded character values.
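Returning to the position operand, the "o-ring" example can be sketched as below. This is one possible reading of the rule, not a definitive implementation: the weights are invented, and pairing each weight with its index in the string is an assumption about how "distance from the starting point" might be modelled.

```python
IGNORE = None  # marker: element is removed at this weight level

def weights(ch):
    """Hypothetical two-level weights: the hyphen is IGNOREd at level 1,
    and every other character is IGNOREd at level 2."""
    if ch == "-":
        return (IGNORE, 1)
    return (ord(ch), IGNORE)

def level_key(text, level, position=False):
    """Weights at one level; with position, each weight carries its index."""
    key = []
    for index, ch in enumerate(text):
        w = weights(ch)[level]
        if w is IGNORE:
            continue
        key.append((index, w) if position else w)
    return key

def sort_key(text, position):
    return (level_key(text, 0), level_key(text, 1, position))

print(sort_key("o-ring", False) == sort_key("or-ing", False))  # True
print(sorted(["or-ing", "o-ring"], key=lambda s: sort_key(s, True)))
```

Without position the two strings tie at both levels; with position the earlier hyphen makes "o-ring" collate first.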

In a collating element entry, weights shall be expressed as characters, collating symbols, collating elements, an ellipsis, or the special symbol IGNORE.

IGNORE - symbol in a collating element entry which indicates that the corresponding character is removed from the string for the purposes of collation at this weight level.

EQUIV_CLASS_MAX -

multicharacter collating element - a collating element consisting of two or more characters.

collating element - one or more characters that collate together as a single entity.

Collating elements which are specified by a locale definition file may not duplicate a symbolic name in the current charmap file.

collation weight - same as collation value.

multiple collation weights - the set of collation values for a given character including the primary collation value, the secondary collation value, etc.

strcoll() - the equivalent of invoking strxfrm() on each of its arguments and then performing a strcmp().

strxfrm() - transforms an input string into a form suitable for later numerical comparison via strcmp(). The transformation is governed by LC_COLLATE and supports ordering by string.

strcmp() - compares two byte strings numerically. For locale-dependent collation, use strxfrm() to prepare the strings before calling strcmp(), or use strcoll() to perform the collation comparison.

The functions strxfrm() and strcoll() are expensive. If many comparisons are to be done, transform each string initially with strxfrm(), and then repeatedly compare with strcmp().
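The transform-once pattern can be sketched with Python's locale module, which wraps the C library functions of the same names. The "C" locale is chosen here only because it is always available; the word list is made up for the example.

```python
import locale
from functools import cmp_to_key

# Assumption: the "C" locale, which every platform provides.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["pear", "Apple", "banana"]

# One strcoll() call per comparison made by the sort.
by_strcoll = sorted(words, key=cmp_to_key(locale.strcoll))

# One strxfrm() call per string; the sort then compares the
# transformed keys directly.
by_strxfrm = sorted(words, key=locale.strxfrm)

print(by_strcoll)
print(by_strxfrm)
```

Both approaches give the same order; the second transforms each string only once, which is the saving described above when many comparisons are made.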

NLCOLMAX - number of characters in the codeset associated with the current locale.