Coded Character Sets

A Technical Primer for Librarians

Character Set Choices for MARC [1] Records

The MARC 21 Specifications allow for character set encodings in either the MARC-8 or the Unicode environments [2]. There are, in addition, MARC record standards promulgated by the Online Computer Library Center (OCLC) [3] and by the Research Libraries Group (RLG), which has since merged with OCLC [4]. It is also possible to encounter MARC records that have been encoded in the Latin-1 character set. And of course, foreign (non-U.S.) MARC standards may specify entirely different character sets.

MARC-8

The MARC-8 character set that most librarians are familiar with is the default "Basic and Extended Latin" character set. The "Basic Latin" component of this default character set is the 128-character ASCII that we examined in the introduction. The "Extended Latin" component of the default character set consists of the American National Standard for Extended Latin (ANSEL) character repertoire [5].

The ASCII character set uses 7 bits per character, so it can define a maximum of only 128 characters. Because Latin-1 utilizes a full 8 bits per character, it can define up to 256 characters. Although the MARC-8 default character set only assigns 256 code points, it can expand beyond the 256-character repertoire (depending on how you define a repertoire) by the use of non-spacing or "combining" graphic characters. These combining characters (hex code points E0-FE) represent diacritics and must be used in combination with a base character. Although MARC-8 is nominally an 8-bit character set, the use of combining characters makes it, in essence, a variable-width, multibyte character set.
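The arithmetic behind those limits can be checked in a few lines of Python. This is only a sketch: it uses Python's built-in "ascii" and "latin-1" codecs, and the helper name `fits` is made up for this illustration.

```python
# Size of the code space for 7-bit and 8-bit character sets.
ascii_size = 2 ** 7    # 128 code points
latin1_size = 2 ** 8   # 256 code points


def fits(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# "ñ" lies outside the 128-character ASCII repertoire but inside Latin-1,
# where it occupies a single byte.
print(fits("ñ", "ascii"))      # False
print(fits("ñ", "latin-1"))    # True
print("ñ".encode("latin-1"))   # b'\xf1'
```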

MARC-8 Basic and Extended Latin (the default character set)

This will make more sense with an example: the "lower case n with a tilde" belongs to the Latin-1 character repertoire and is assigned a single code point (hex "F1"). In MARC-8 there is no single code point to represent that character; instead it is represented by the code point for a tilde (hex "E4"), the combining character, followed by the code point for the small letter "n" (hex "6E"), the base character.
            Latin-1          MARC-8   
            -------          ------   
            F1 => ñ        E46E => ñ
The advantage of this method is that diacritics can be applied to any base character (as appropriate), and more than one diacritic may be associated with a base character, thus expanding the repertoire beyond what can be represented with precomposed characters. The disadvantage is that virtually nobody outside of the library world uses MARC-8 character encoding, and thus only specialized library software can properly render the MARC-8 repertoire beyond the ASCII characters. Web-based online catalogs must translate between the MARC-8 encoding of bibliographic records and encodings such as Latin-1 or Unicode UTF-8 that are more standard on the internet.
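To give a feel for what such a translation involves, here is a minimal sketch in Python. It is not a complete MARC-8 converter. It assumes the data uses only ASCII base characters plus a handful of combining diacritics, the `COMBINING` table is deliberately partial, and the function name `marc8_latin_to_unicode` is made up for this illustration. The key step is the reordering: MARC-8 places the diacritic before its base character, whereas Unicode places combining marks after it.

```python
import unicodedata

# Partial, illustrative table: MARC-8 combining code point -> Unicode
# combining mark. The full ANSEL repertoire has more entries than this.
COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut / diaeresis
}


def marc8_latin_to_unicode(data: bytes) -> str:
    """Convert a simple MARC-8 byte string to precomposed (NFC) Unicode."""
    out = []
    pending = []  # diacritics seen but not yet attached to a base character
    for byte in data:
        if 0xE0 <= byte <= 0xFE:
            if byte not in COMBINING:
                raise ValueError(f"diacritic {byte:#04x} not in this sketch's table")
            pending.append(COMBINING[byte])
        elif byte < 0x80:
            # ASCII base character: emit it, then the diacritics that preceded it.
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"byte {byte:#04x} is outside the scope of this sketch")
    if pending:
        raise ValueError("diacritic without a following base character")
    # NFC composes base + combining mark into a single code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))


text = marc8_latin_to_unicode(b"Espa\xe4\x6ea")
print(text)                    # España
print(text.encode("latin-1"))  # b'Espa\xf1a'
```

After conversion, the two MARC-8 bytes E4 6E have become the single code point U+00F1, which Latin-1 encodes as the single byte F1, matching the table above.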

The MARC-8 Great Escape(s)

The MARC-8 environment allows alternate character sets to be invoked, thus further expanding its character repertoire. Alternate 8-bit sets include Arabic, Cyrillic, Greek, and Hebrew. A 24-bit East Asian ideograph character set can also be accommodated.

MARC-8 default and alternate character sets

There are two techniques specified in MARC-8 for switching from one character set to another, with both utilizing escape sequences to signal a change [6].
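As a rough illustration, the sketch below scans a byte string for the escape character (hex 1B) and classifies what follows. The specific byte values are assumptions based on one reading of the MARC 21 specification and should be checked against it: Technique 1 is taken to use a single final byte (g, b, p, or s), and Technique 2 is taken to begin with one of the ISO 2022 intermediate bytes. The function name `find_escapes` is made up for this illustration.

```python
ESC = 0x1B

# Assumed Technique 1 final bytes (verify against the MARC 21 specification).
TECHNIQUE_1 = {
    ord("g"): "Greek symbols",
    ord("b"): "subscripts",
    ord("p"): "superscripts",
    ord("s"): "return to ASCII",
}

# Assumed Technique 2 intermediate bytes, in the ISO 2022 style.
TECHNIQUE_2 = set(b"(,)-$")


def find_escapes(data: bytes) -> list:
    """Return (offset, technique, detail) for each escape sequence found."""
    found = []
    for i, byte in enumerate(data):
        if byte != ESC or i + 1 >= len(data):
            continue
        nxt = data[i + 1]
        if nxt in TECHNIQUE_1:
            found.append((i, 1, TECHNIQUE_1[nxt]))
        elif nxt in TECHNIQUE_2:
            # Only the intermediate byte is reported; the set being
            # designated is not decoded in this sketch.
            found.append((i, 2, chr(nxt)))
    return found


# "H", switch to subscripts, "2", return to ASCII, "O"
print(find_escapes(b"H\x1bb2\x1bsO"))
# [(1, 1, 'subscripts'), (4, 1, 'return to ASCII')]
```

A real decoder would also have to track which set is currently active and interpret each following byte accordingly; this sketch only locates the switch points.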

The MARC-8 Character Set(s) Versus the Real World

The pioneers of library automation had an impossible (and undoubtedly thankless) task. Libraries may potentially contain books written in any and all languages of the world, so how could that diversity be accommodated within a standard for machine-readable cataloging records? The American National Standard for Extended Latin (ANSEL) was the solution for recording all of the languages written in the Latin script and for transliteration of non-Latin languages. The technique for escaping to additional non-Latin character sets further expanded the range of languages that could be used in bibliographic descriptions.

The promise of MARC-8 was that an almost limitless array of characters (over 16,000 if you include the East Asian ideographs) could be used in MARC records. In that sense, the MARC-8 standard was a great success story. However, the flip side of the coin was that, in the real world of software applications and hardware platforms, that abundance could seldom be properly utilized. It was rare that a library application could adequately cope with the text input, processing, and display of non-default MARC-8 characters. When MARC data dared venture outside the specialized sphere of ILS applications, it found itself in a world that, for the most part, had never heard of MARC-8, much less made any accommodations for it.

The library community was fully aware of these problems and was making progress towards replacing MARC-8 with Unicode. The road to Unicode, however, took a while and was not without a few bumps. Things will be clearer once we look at how Unicode in MARC has been implemented.

Notes

  1. "MARC is the acronym for MAchine-Readable Cataloging. It defines a data format which emerged from a Library of Congress led initiative begun thirty years ago. MARC became USMARC in the 1980s and MARC 21 in the late 1990s. It provides the mechanism by which computers exchange, use and interpret bibliographic information and its data elements make up the foundation of most library catalogs used today."
    http://lcweb.loc.gov/marc/faq.html#definition

  2. MARC 21 Specifications, Character Sets and Encoding Options
  3. OCLC-MARC Records
  4. The RLIN MARC Record: Description and Interpretation (archive copy)
    • After February 24, 2001, RLIN MARC records exported from the RLG union catalog were translated into the MARC-8 character set, with one exception: the dagger symbol ("†"), which is not a valid MARC-8 character, was exported as "[dagger]".
    [Note: RLG merged with OCLC in July 2006]

  5. ANSEL is short for "American National Standard for Extended Latin"; however, the official name for the ANSI/NISO Z39.47 standard is "Extended Latin Alphabet Coded Character Set for Bibliographic Use".

  6. Technique 1 is unique to MARC-8 and provides access to a small number of Greek symbols, subscripts, and superscripts. Technique 2 is based on the ANSI X3.41 (ISO 2022) "Code Extension Techniques for Use with 7-bit and 8-bit Character Sets" standard. See the MARC 21 Specification for details on accessing alternate graphic character sets.