Michael Doran Home Page

MARC to Latin

a charset conversion routine in Perl



What is MARC to Latin?

It is a Perl routine that takes a string encoded in the MARC-8, modified MARC-8, or ISO-6937-2 character set and converts it to Latin-1 (ISO-8859-1). MARC-8 is the character set used in MARC 21 (formerly USMARC) records, and ISO-6937-2 is the character set used in FinMARC records. The modified MARC-8 character set is used for a Finnish version of MARC 21.

Why use it?

The main purpose is to correctly render diacritics when displaying MARC record data via the web. The data within MAchine Readable Cataloging (MARC) records can be encoded in a variety of character sets. Because most of these MARC charsets are based on the American Standard Code for Information Interchange (ASCII), the "basic" characters (i.e. A-Z, a-z, 0-9) display just fine in a web environment without any massaging. Characters with diacritics, however, will not display properly. In the case of MARC-8, where a diacritic is stored as a separate non-spacing character alongside its base letter, those non-spacing characters tag along as unsightly baggage.

Background

This function was written to enable the New Books List (NBL) to operate in a diacritics-rich environment. The NBL extracts data directly from the underlying database of our integrated library management system. The original NBL solution to MARC-8 encoding was to simply strip out all non-ASCII characters. That work-around eliminated the confusing non-spacing characters, but it also eliminated the diacritics, leaving only the base letter (for example, å would appear as a). This solution was better than no massaging of the data, but was obviously unacceptable for libraries with non-English collections. See some examples.
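The strip-everything work-around described above can be sketched in a few lines of Perl. This is only an illustration, not the original NBL code: the subroutine name strip_non_ascii and the sample string are assumptions made for the example. It relies on the fact that in MARC-8 a non-spacing diacritic byte (here 0xEA for the ring and 0xE8 for the umlaut) is stored in front of the base letter it modifies.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from the NBL or MARCtoLatin.pl):
# delete every byte outside the 7-bit ASCII range.
sub strip_non_ascii {
    my ($string) = @_;
    (my $ascii = $string) =~ tr/\x80-\xFF//d;
    return $ascii;
}

# "Ångström" as MARC-8 bytes: ring (0xEA) before "A", umlaut (0xE8) before "o".
my $marc8 = "\xEAAngstr\xE8om";

print strip_non_ascii($marc8), "\n";    # prints "Angstrom"
```

The non-spacing bytes are gone, but so are the diacritics, which is exactly the limitation noted above.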

The MARCtoLatin.pl routine converts any MARC-8 character that has an equivalent in the Latin-1 extended character set. Since not all MARC-8 characters have a Latin-1 equivalent, some non-spacing characters are still stripped out, but those are rare diacritic combinations, and of course the base letter remains. Latin-1 was chosen because of its ubiquity in the web environment. This whole problem will be moot when all MARC records have been converted to Unicode (and all web clients are configured to handle Unicode), but until then...
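To make the idea concrete, here is a minimal sketch of that kind of conversion. It is not the code in MARCtoLatin.pl: the subroutine name marc8_to_latin1_sketch, the hash name, and the deliberately tiny lookup table are all assumptions for illustration, and it only handles non-spacing diacritics (MARC-8 bytes 0xE0-0xFE) followed by an ASCII letter. A diacritic/letter pair that appears in the table becomes the single Latin-1 byte; any other pair falls back to the bare base letter.

```perl
use strict;
use warnings;

# Illustrative, partial table (an assumption, not the real routine's data):
# MARC-8 non-spacing byte + base letter => single Latin-1 byte.
my %latin1_for = (
    "\xE1e" => "\xE8",    # grave   + e => e-grave
    "\xE2e" => "\xE9",    # acute   + e => e-acute
    "\xE4n" => "\xF1",    # tilde   + n => n-tilde
    "\xE8a" => "\xE4",    # umlaut  + a => a-umlaut
    "\xE8o" => "\xF6",    # umlaut  + o => o-umlaut
    "\xE8u" => "\xFC",    # umlaut  + u => u-umlaut
    "\xEAa" => "\xE5",    # ring    + a => a-ring
    "\xEAA" => "\xC5",    # ring    + A => A-ring
    "\xF0c" => "\xE7",    # cedilla + c => c-cedilla
);

# Hypothetical name; converts a MARC-8 byte string to a Latin-1 byte string.
sub marc8_to_latin1_sketch {
    my ($string) = @_;
    # One or more non-spacing bytes followed by a letter: use the table entry
    # if the whole sequence is known, otherwise keep only the base letter.
    $string =~ s{([\xE0-\xFE]+)([A-Za-z])}
                {exists $latin1_for{$1 . $2} ? $latin1_for{$1 . $2} : $2}ge;
    return $string;
}

# "Ångström" in MARC-8; show the converted bytes in hex.
my $latin1 = marc8_to_latin1_sketch("\xEAAngstr\xE8om");
print unpack("H*", $latin1), "\n";    # prints "c56e67737472f66d"
```

A macron has no precomposed Latin-1 form, so a string such as "T\xE5oky\xE5o" comes back as plain "Tokyo" under this sketch: the diacritic is dropped and the base letter remains.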

Re-inventing the wheel?

I had a strong feeling that there were already tools out there to do what I wanted to do with this function. I looked at some Perl CPAN modules as well as utilities available on the Library of Congress MARC website. The Perl modules handled strings the way I wanted, but didn't appear to offer MARC-8 or MARC8-fin charset choices. The Library of Congress utilities knew all about MARC-8, but were programs for converting complete MARC records to SGML/XML. I just wasn't quite smart enough to figure out how to apply either of them to my particular need. Plus, creating a function was an opportunity to learn a little something about character sets and MARC.

Future enhancements

I am aware that Endeavor uses a legacy-RLIN mutation they call VRLIN. If and when Endeavor is willing to share some information on it, I will likely add support for it as well.

Feedback wanted

This is a first version of the routine, so any feedback regarding bugs, lapses in logic, or improvements would be greatly appreciated.

I will also offer the disclaimer that I am a novice regarding this topic, and so what I have written here may be partly (or wholly) incorrect. Feel free to comment on the content of the page as well as the function itself. :-)