Coded Character Sets

A Technical Primer for Librarians

Characters

The first letter of the English alphabet is the letter "A". The letter "A" is a character, which in this context refers to a graphic symbol used in a writing system. The characters making up a writing system may include letters of an alphabet, ideographs, numbers, punctuation, diacritics, special characters, and other writing marks.

Computers and Characters

On the level of computer hardware (things like CPUs, memory chips, disk drives, SCSI interfaces, and ethernet cables), data is processed, stored, and transferred as a sequence of 1s and 0s. These 1s and 0s are binary numerical code. This includes textual data such as the text you are reading now. So the uppercase "A" character on your monitor is not stored in your computer's memory as an "A" shape, but rather as binary code such as "01000001". The "01000001" is a stand-in for the abstract concept of "A".
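
As a small illustration of this point, the Python sketch below looks up the number behind an uppercase "A" and shows it as eight binary digits. The variable names are only illustrative; `ord`, `chr`, `format`, and `int` are Python built-ins.

```python
# Look up the numeric code behind a character and show it as 8 binary digits.
char = "A"
code = ord(char)              # 65
bits = format(code, "08b")    # "01000001"

print(char, code, bits)

# Going the other way: read the binary digits as a number, then as a character.
round_trip = chr(int(bits, 2))
print(round_trip)
```

The shape of the letter never appears in this exchange; only the number does.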

Character Sets

Because computers use numerical code and humans read and write characters, there must be some convention to associate the one with the other. That convention is a coded character set, which assigns numerical values (codes) to a collection of characters. The term coded character set is often shortened to character set or code set, or abbreviated as charset. In the Windows operating environment, character sets are known as code pages.
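
To make the idea of a convention concrete, here is a minimal sketch of a made-up coded character set in Python. The five-character repertoire, the codes assigned to it, and the names `TOY_CHARSET`, `toy_encode`, and `toy_decode` are all hypothetical and do not correspond to any real standard.

```python
# A toy coded character set: a hypothetical five-character repertoire,
# each character assigned an arbitrary numeric code.
TOY_CHARSET = {"A": 1, "B": 2, "C": 3, " ": 4, "!": 5}
TOY_DECODE = {code: char for char, code in TOY_CHARSET.items()}


def toy_encode(text):
    """Turn text into a list of codes; raises KeyError for characters outside the set."""
    return [TOY_CHARSET[ch] for ch in text]


def toy_decode(codes):
    """Turn a list of codes back into text using the same convention."""
    return "".join(TOY_DECODE[c] for c in codes)


codes = toy_encode("CAB!")
print(codes)
print(toy_decode(codes))
```

Both sides must agree on the same table for the text to survive the round trip, which is why standard character sets matter.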

The ASCII Character Set

It might help at this point to look at a character set, and a good example is the American Standard Code for Information Interchange (ASCII). ASCII is a 7-bit character set, which limits it to 128 characters (2^7 = 128). These 128 characters (with their corresponding code points) are also the foundation of many other character sets.
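
The 7-bit limit can be checked with a short Python sketch. It relies on Python's built-in "ascii" codec; the sample word and variable names are only illustrative.

```python
# Seven bits give 2**7 distinct values, numbered 0 through 127.
ascii_size = 2 ** 7

# Every ASCII character encodes to a single byte whose value is below 128.
encoded = "Hello".encode("ascii")
values = list(encoded)

# A character outside the set ("\u00e9" is e with an acute accent)
# cannot be encoded as ASCII.
try:
    "\u00e9".encode("ascii")
    outside_ok = True
except UnicodeEncodeError:
    outside_ok = False

print(ascii_size, values, outside_ok)
```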

Coded character sets are composed of two distinct subsets of characters: graphic characters that are elements of a writing system, and control characters (control codes) that are meant to convey computer-specific information (such as a line feed or backspace). The term character repertoire is often used to refer to the subset of graphic (or printable) characters. Below is the ASCII character repertoire:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . , ? ! : ; ' " ( ) { } [ ] < > 
* # $ % & @ / \ | ` ~ ^ _ + - = 

We can see the numeric code points assigned to the ASCII graphic and control characters by examining the relevant coded character set chart. An alternate way of looking at a character set is to view it superimposed on a code matrix.
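
As a rough stand-in for such a chart, the Python sketch below splits the 128 ASCII code points into control and graphic characters and prints a few rows of character-to-code-point mapping. Treating the space character (code point 32) separately is a choice made here so that the graphic count matches the 94 visible characters in the repertoire listed earlier.

```python
# Control characters in ASCII: code points 0-31 and 127 (DEL).
control = [cp for cp in range(128) if cp < 32 or cp == 127]

# Graphic characters: code points 33-126. Code point 32 is the space character,
# which is printable but leaves no visible mark, so it is counted separately here.
graphic = [chr(cp) for cp in range(33, 127)]

print(len(control), len(graphic))

# A small slice of the code chart: character, decimal, hexadecimal.
for ch in "ABC":
    print(ch, ord(ch), format(ord(ch), "02X"))
```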

The Latin (ISO 8859) Character Sets

The ISO 8859 standard defines a series of Latin character sets. These character sets incorporate the ASCII character set as the first 128 code points, but then extend the set by an additional 128 code points by using a full octet (8 bits) of data per character (2^8 = 256). Let's take a look at Latin-1, which was designed for Western European languages (including English) and is in common use as an internet charset.
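
The relationship between ASCII and Latin-1 can be sketched in Python using its built-in "ascii" and "latin-1" codecs. The sample strings and variable names are only illustrative.

```python
# ASCII text produces identical bytes under both encodings,
# because Latin-1 keeps ASCII as its first 128 code points.
same = "Library".encode("ascii") == "Library".encode("latin-1")

# Characters beyond ASCII use the upper half of the octet (128-255).
# "\u00e9" is e with an acute accent.
e_acute = "\u00e9".encode("latin-1")

# One octet gives 2**8 possible values.
latin1_size = 2 ** 8

print(same, list(e_acute), latin1_size)
```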

Below is the Latin-1 character repertoire. Notice that only 96 of the additional 128 code points are used for graphic characters; the other 32 are reserved for control characters.
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . , ? ! : ; ' " ( ) { } [ ] < >
* # $ % & @ / \ | ` ~ ^ _ + - = 
À Á Â Ã Ä Å È É Ê Ë Ì Í Î Ï Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý
à á â ã ä å è é ê ë ì í î ï ò ó ô õ ö ø ù ú û ü ý ÿ
Æ æ Ç ç Ð Ñ ñ þ Þ ß ð « » º ¹ ² ³ ¼ ½ ¾ ÷ × ± ¿ ¡ 
¢ £ ¤ ¥ ¦ § ¨ © ® ª ¬ ¯ ° ´ µ ¶ · ¸

We can see the Latin-1 character-to-code-point mapping via the coded character set chart or view it on a code matrix.
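
The 96/32 split in the upper half of Latin-1 can be checked with a short Python sketch. It assumes Python's "latin-1" codec, which maps each byte to the Unicode code point of the same value, and uses the Unicode general category "Cc" from the standard `unicodedata` module as the test for a control character.

```python
import unicodedata

# Decode each upper-half byte value (128-255) as Latin-1.
upper_half = [bytes([b]).decode("latin-1") for b in range(128, 256)]

# Unicode general category "Cc" marks control characters.
control = [ch for ch in upper_half if unicodedata.category(ch) == "Cc"]
graphic = [ch for ch in upper_half if unicodedata.category(ch) != "Cc"]

print(len(control), len(graphic))

# The control characters occupy one contiguous block of code points.
print(hex(ord(control[0])), hex(ord(control[-1])))
```

Note that the 96 non-control code points include the no-break space and the soft hyphen, which do not normally show a visible mark.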

Glyphs and Fonts

Each character has an abstract form, and that form assumes an actual concrete shape in a glyph image. Below are six glyphs that, although different looking, are all recognizable as a representation of an uppercase "A" character.

A       A       A       A       A       A

A font is a collection of glyphs used to depict character data. Fonts often have parameters such as size and weight associated with them.