Coded Character Sets

A Technical Primer for Librarians

Unicode

Unicode is a coded character set that endeavors to provide a unique code point for every character in every language [1]. The goal is to store and share information in any language, or in several languages at once, without regard to platform or application.

MARC Unicode environment then (c.1998-2007)

Initial adoption of Unicode in MARC records had some significant limitations. In essence, the MARC 21 UCS/Unicode environment [2] was simply the MARC-8 character repertoire translated into the equivalent Unicode code points. The rationale behind this approach was to preserve the ability to translate MARC data back and forth (i.e. "round trip") between the MARC-8 and Unicode character sets:

"The restrictions in these specifications are intended to optimize the interchange of data encoded using the MARC-8 character sets and UCS/Unicode during the period of transition from a largely 8-bit environment to the 16-bit UCS/Unicode environment. The specifications are built around enabling round trip movement of MARC data between MARC-8 and UCS/Unicode with as little loss as possible."
-- from MARC 21 Specifications [2]
The trade-off for being able to round-trip MARC record encodings was that only a subset of Unicode was valid in MARC records:
"MARC 21 has established a subset of the full repertoire of characters in UCS/Unicode that is permitted in MARC 21 records at this time. This subset is made up of the UCS characters that correspond to the over 16,000 characters defined in the separate MARC-8 character sets for MARC 21."
-- from MARC 21 Specifications [2]
There were two "classes" of Unicode characters that were not valid (at that time) in MARC records. The first class consisted of characters that were not included in any of the MARC-8 character sets [3]. An example would be characters in the Thai alphabet. The other class was the Unicode "precomposed" versions of characters that MARC-8 represented with base and combining characters (i.e. characters with diacritics):

"Modified letters (that is, letters with associated diacritical marks or vocalization marks) would continue to be encoded as a base-letter with an accompanying combining character;"
-- from MARBI [4] Proposal No. 97-10 [5]

"97-10 proposed that MARBI "establish that USMARC records [using UCS characters] use only those listed in the USMARC to UCS mapping." This precludes the use of nearly all precomposed characters in favor of sequences of base character plus combining character(s)."
-- from MARBI Proposal No. 98-18 [6]
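
The difference between a precomposed character and a base-plus-combining sequence can be sketched with Python's standard `unicodedata` module. This is only an illustration: the helper name `find_precomposed` is hypothetical, and treating "has a canonical decomposition" as a stand-in for "not permitted" is an assumption. The actual rule was defined by the MARC-8 to UCS mapping tables, not by a Unicode normalization form.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

def find_precomposed(text):
    """Return the characters in text that have a canonical decomposition.

    Hypothetical helper: only approximates the early MARC restriction.
    """
    flagged = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        # Compatibility mappings start with a tag such as "<compat>"; skip them.
        if mapping and not mapping.startswith("<"):
            flagged.append(ch)
    return flagged

# NFD splits the precomposed form into base letter + combining mark.
assert unicodedata.normalize("NFD", precomposed) == decomposed
# NFC recombines the sequence into the single precomposed code point.
assert unicodedata.normalize("NFC", decomposed) == precomposed

print(find_precomposed("caf\u00e9"))   # the precomposed e-acute is flagged
print(find_precomposed("cafe\u0301"))  # nothing is flagged
```

Both strings render identically on screen, which is why a byte-level or code-point-level check is needed to tell them apart.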

While the Unicode Standard is very specific as to the order in which combining characters should appear [7], the MARC specification has taken a more casual approach:

"The Task Force favors storing combining characters in the prescribed Unicode order in Unicode encoded records, but recognizes that conversion of existing [MARC] records may not result in a correct Unicode sequence in certain cases. [...] Two things reduce the importance of the sequencing of multiple combining characters: the infrequent occurrence of characters modified by multiple combining characters, and the variance of practice in existing data."
-- from MARBI Proposal No. 98-18 [6]
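
As a rough illustration of what the "prescribed Unicode order" means, Unicode assigns each combining mark a canonical combining class, and normalization sorts adjacent marks with different non-zero classes into ascending order. The sketch below uses Python's standard `unicodedata` module. The sample string and the helper name `combining_classes` are chosen for illustration and are not taken from MARC data.

```python
import unicodedata

# "a" + COMBINING DIAERESIS (above) + COMBINING DOT BELOW.
# The mark above is stored before the mark below, which is not canonical order.
unordered = "a\u0308\u0323"

def combining_classes(text):
    """Return the canonical combining class of each code point (0 = base)."""
    return [unicodedata.combining(ch) for ch in text]

# NFD applies canonical ordering: the lower class (dot below) moves first.
ordered = unicodedata.normalize("NFD", unordered)

print(combining_classes(unordered))  # [0, 230, 220]
print(combining_classes(ordered))    # [0, 220, 230]
print(ordered == "a\u0323\u0308")    # True
```

Records converted from MARC-8 without this reordering step still display the same way, but they can fail to match canonically ordered data when compared code point by code point.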

MARC Unicode environment now (post 2007)

"To facilitate the movement of records between MARC-8 and Unicode environments, it was recommended for an initial period that the use of Unicode be restricted to a repertoire identical in extent to the MARC-8 repertoire. [...] however, such a restriction is no longer appropriate. The full UCS repertoire, as currently defined at the Unicode web site, is valid for encoding MARC 21 records subject only to the constraints described [in the current MARC 21 Specifications]." (emphasis mine)
-- from MARC 21 Specifications (revised December 2007) [8]

Notes

  1. Although there are some distinctions between the Unicode Standard and the Universal Character Set (UCS) Standard, the terms are frequently used interchangeably. UCS is defined by ISO/IEC 10646. Specifications for the Unicode Standard are available at the Unicode Consortium home page.

  2. MARC 21 Specifications, Character Sets: Part 2, UCS/Unicode Environment (January 2000, Updated June 2003) (archive copy)

  3. An exception was the Unified Canadian Aboriginal Syllabic character set, which was not defined in MARC-8 but was permitted in the MARC UCS/Unicode environment.

  4. MARBI is the Machine-Readable Bibliographic Information Committee, a joint committee of three constituent groups of the American Library Association: ALCTS, LITA, and RUSA. It works closely with the Library of Congress regarding changes to the MARC standard. In addition to a Character Set Subcommittee, MARBI currently has a Unicode Encoding and Recognition Technical Issues Task Force and an East Asian Character Set Task Force.

  5. MARBI Proposal No. 97-10 "Use of the universal code character set in MARC records"

  6. MARBI Proposal No. 98-18 "Unicode Identification and Encoding in USMARC records"

  7. Unicode Standard Annex #15: Unicode Normalization Forms, Decomposition

  8. MARC 21 Specifications, Character Sets and Encoding Options: Part 3, Unicode Encoding Environment (December 2007)