Transliteration Rule Tutorial

Rough Draft 2001-08-21

This is an informal section that describes the process of building a custom transliterator from rules. It does not describe in detail the features of transliteration; instead, it walks through the process of building rules, discussing features needed to perform different tasks. The focus is on building a script transliterator.

Source

The first thing to decide is which system of transliteration to use as a model. There are dozens of different systems for each language and script.

ISO uses a strict definition of transliteration, which requires it to be reversible. Although the goal for ICU script transliterators is reversibility, they don't have to be. And in general, most transliteration systems in use out in the world are not reversible. In this tutorial we will build a reversible transliterator, since it illustrates more of the issues involved in the rules. (For guidelines in building transliterators, see "Guidelines for Designing Script Transliterations" in the ICU User Guide.)

Since most transliteration systems are not reversible, that means that even once you pick a source, you may have to make some modifications for reversibility. Even the very excellent Japanese standard for transliteration requires a few tweaks to make it reversible.

There are various collections of transliteration systems available on the web.

For our example, we'll start with a set of rules for Greek, since it provides a real example that most people are somewhat familiar with from mathematics. We will use rules that do not use the pronunciation of Modern Greek; instead, we will aim for rules that correspond to the way that Greek words were adopted into English. For example, we will transliterate "Βιολογία-Φυσιολογία" as "Biología-Physiología", not as "Violohía-Fisiolohía". To illustrate some of the trickier cases, we will also transliterate the Greek accents that are no longer in use in modern Greek.

Note: Some of the characters may not be visible on your screen unless you have a Unicode font with all the Greek letters. If you are licensed for a copy of Microsoft Office, you can use the Arial Unicode MS font, or you can download the Code2000 font for free. For more information see Display Problems? on the Unicode site.

We will also make sure that the mapping is complete; that is, that every Latin letter maps back to some Greek letter. This ensures that when reversing the transliteration, all Latin letters are handled. Notice that this direction (from Latin back to Greek) is not necessarily reversible. So in summary, the goal is:

Source-Target    Reversible                      φ → ph → φ
Target-Source    Not (Necessarily) Reversible    f → φ → ph

Basics

In the simplest cases, we just have a one-to-one relationship between letters in Greek and letters in Latin.

The simplest rules just map between source string and a target string; for example:

π <> p;

This rule says that when you are transliterating from Greek to Latin, convert π to p; when you are going the other way, convert p to π. The syntax is just

string1 <> string2 ;

So, we will start by adding a whole batch of simple mappings. These won't actually work yet, but we will start with them. For now, we won't worry about the uppercase versions.

α <> a;
β <> b;
γ <> g;
δ <> d;
ε <> e;
ζ <> z;
η <> ē;
θ <> th;
ι <> i;
κ <> k;
λ <> l;
μ <> m;
ν <> n;
ξ <> x;
ο <> o;
π <> p;
ρ <> r;
σ <> s;
τ <> t;
υ <> y;
φ <> ph;
χ <> ch;
ψ <> ps;
ω <> ō;
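
To make the behaviour of two-way rules concrete, here is a minimal Python sketch. It is not ICU; it only emulates a handful of <> mappings with a list of pairs that can be read in either direction. The names SIMPLE_RULES, apply_rules, to_latin, and to_greek are illustrative assumptions, and trying longer source strings first is a simplification of ICU's ordered rule matching.

```python
# A small subset of the two-way mappings, written as (greek, latin) pairs.
SIMPLE_RULES = [("π", "p"), ("α", "a"), ("θ", "th"), ("τ", "t"), ("ι", "i")]


def apply_rules(text, pairs):
    """Replace each source string with its target, trying longer sources first."""
    ordered = sorted(pairs, key=lambda pair: len(pair[0]), reverse=True)
    out, i = [], 0
    while i < len(text):
        for src, dst in ordered:
            if text.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            out.append(text[i])  # no rule matched: copy the character unchanged
            i += 1
    return "".join(out)


def to_latin(text):
    """Forward direction: read each pair as greek > latin."""
    return apply_rules(text, SIMPLE_RULES)


def to_greek(text):
    """Reverse direction: read each pair as greek < latin."""
    return apply_rules(text, [(latin, greek) for greek, latin in SIMPLE_RULES])
```

Because every pair is used in both directions, text built from these letters survives a round trip in this sketch, which is the property the <> operator is meant to give.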

Quoting

All of the ASCII characters except numbers and letters are reserved for use in the rule syntax. Normally, you won't need to convert these characters. If you ever do, you can use either single quotes or a backslash. For example, to convert from two less-than signs to the phrase "much less than", you can use either of the two rules:

\<\<   >   much\ less\ than ;
'<<'   >   'much less than' ;

Notice that where we needed a real space in the rules, we had to quote it. If you want a real backslash, you can either double it \\, or quote it '\'. If you want a real single quote, you can double it '', or backslash it \'. Thus each of the following means the same thing:

'can''t go'
'can\'t go'
can\'t\ go
can''t' 'go

Any text starting with a hash mark, out to the end of a line is a comment. This helps to document how your rules work in interesting cases.

Notice that you can insert spaces almost anywhere with no effect on the rules. Using extra space allows you to separate items out for clarity without worrying about the effects. This feature is particularly useful with combining marks; it is handy to put some spaces around them to separate them from the surrounding text. Here is an example:

  ͅ> i ; # an iota-subscript diacritic turns into an i.

If you want to, you can use \u notation instead of any letter. So instead of the Greek π, you could write:

\u03C0 <> p ;

If you find it easier, you can also define and use variables, such as:

$pi = \u03C0 ;

$pi <> p ;

One-way Rules

For completeness, we need to also map back the Latin letters that are not produced by the Greek rules. For example, to have both c and k map to Greek KAPPA (κ), and KAPPA map back to k, one can use two rules:

κ <> k ;
κ < c ;

The first rule is reversible, while the second (from target to source) is not. Internally, the κ <> k rule is actually just equivalent to a pair of one-way rules, so the above could be written equivalently as:

κ > k ;
κ < k ;
κ < c ;

Context

Once we have done the simple one-to-one cases, and the few rules for completeness, it's time to tackle some trickier items. The first is context. In Greek, for example, a γ gets converted to an n if it is before any of the characters γ, κ, ξ, or χ, while otherwise it is converted to a g. Let's take this one step at a time. We could just list all of the possibilities, like this:

γγ > ng;
γκ > nk;
γξ > nx;
γχ > nch;
γ > g;

All rules are evaluated in the order you provide, so this means that the transliterator will first try matching the first four rules, then if all of them fail, use the last one.

However, this method quickly becomes tiresome, especially when you consider all the possible combinations of upper and lower case. An alternative is to use two additional features: contexts and ranges. We'll start by explaining context. Since we already have rules for converting γ, κ, ξ, and χ, the only thing we have to do is to convert the γ differently when it is followed by one of those, and otherwise let those characters be handled by their own rules. This is done with the following:

γ } γ > n;
γ } κ > n;
γ } ξ > n;
γ } χ > n;
γ > g;

A right curly brace marks the start of a following context. That context will be taken into account when matching the rules against the source text, but won't itself be converted. So if we had a sequence γγ, the first γ is converted into an n by the first rule; then the second γ, not matching any of the first four rules, matches the last and is converted into a g. Thus the resulting text is ng.

So far, it seems that we haven't gained much -- we have the same number of rules. Now we will use ranges to clean them up. We can collapse the first four rules into one, as follows.

{γ}[γκξχ] > n;
γ > g;

Any list of characters within square brackets will match any one of the characters. We can then add the uppercase variants for completeness, to get:

γ } [ΓΚΞΧγκξχ] > n;
γ > g;

Remember that we can use spaces for clarity, so we could also write this as the following, if we find it easier to read.

γ } [ Γ Κ Ξ Χ   γ κ ξ χ ] > n ;
γ > g ;

If a range of characters happens to have adjacent code numbers, we can just use a hyphen to abbreviate it. That is, instead of [a b c d e f g m n o], we can write [a-g m-o].
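
The contextual rule above can be emulated outside ICU with a regular-expression lookahead, which behaves much like a following context: it must match, but the text it covers is not replaced. The following is a minimal Python sketch under that assumption. The names GAMMA_BEFORE_CONTEXT and convert_gamma are illustrative, and the two rules are applied as two passes rather than position by position as ICU does.

```python
import re

# Emulates:  γ } [ΓΚΞΧγκξχ] > n ;   γ > g ;
# The lookahead (?=...) plays the role of the following context: it has to
# match, but the characters it covers are neither consumed nor replaced.
GAMMA_BEFORE_CONTEXT = re.compile(r"γ(?=[ΓΚΞΧγκξχ])")


def convert_gamma(text):
    """Convert γ to n before γ, κ, ξ, χ (either case), and to g elsewhere."""
    text = GAMMA_BEFORE_CONTEXT.sub("n", text)  # contextual rule first
    return text.replace("γ", "g")               # fallback rule
```

In this sketch the context characters other than γ are left untouched, just as the tutorial leaves them to be handled by their own rules.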

Styled Text

There is another reason to use context, instead of just enumerated rules. Transliterators will work on styled text. When they do, they copy the style for the replaced text to the replacement text. But they can only do that on whole replacements, since there is no way to know how any boundaries within the source text would line up within the replacement text.

Thus here are the effects of the two types of rules on some sample text. Notice that contexts preserve the styles at a much finer granularity.

Source    Rules
          Brute Force    Context
          γγ > ng        γ } γ > n;
γγ        ng             ng

Case

Let's look at another example where we need context. When converting from Greek to Latin, we can just convert θ to and from th. But what happens with the uppercase theta (Θ)? Sometimes it needs to convert to TH, and sometimes to Th. We can choose between these based on the letters before and afterwards: if there is a lowercase letter after, we can choose Th, otherwise we'll use TH. While this rule is not perfect, it gives satisfactory results in practice.

We could go through and manually list all the lowercase letters, but there is a far easier way to do it. Ranges not only list characters explicitly; they also give you access to all the characters that have a given Unicode property. The abbreviations are a bit arcane, but allow us to easily specify common sets of characters, like all the lowercase letters.

Θ } [:Ll:] <> Th;
Θ <> TH;

That allows words like Θεολογικές to map to Theologikés and not THeologikés!

Note: you can either specify properties with POSIX-style syntax, as [:Ll:], or with Perl-style syntax, as \p{Ll} -- whichever you find more readable.
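
As a rough illustration of how a property-based context works, the Python sketch below tests the Unicode general category of the following character, which is what [:Ll:] selects. It covers only the Greek-to-Latin direction of the two theta rules; the function names are illustrative assumptions and this is not how ICU implements the rules.

```python
import unicodedata


def is_lowercase_letter(ch):
    """True if ch has the Unicode general category Ll (the [:Ll:] set)."""
    return unicodedata.category(ch) == "Ll"


def convert_theta(text):
    """Emulate  Θ } [:Ll:] > Th ;  Θ > TH ;  leaving all other characters alone."""
    out = []
    for i, ch in enumerate(text):
        if ch == "Θ":
            following = text[i + 1] if i + 1 < len(text) else ""
            if following and is_lowercase_letter(following):
                out.append("Th")  # a lowercase letter follows
            else:
                out.append("TH")  # uppercase, non-letter, or end of text follows
        else:
            out.append(ch)
    return "".join(out)
```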

Properties and Variables

A Greek sigma is written as ς  if it is at the end of a word (but not completely separate) and as σ otherwise. When converting from Greek to Latin, this is not a problem, but when converting back it is. We need to convert an s depending on the context. While we could list all possible letters in a range, there is an easier way -- with a character property. The range [:L:] stands for all letters, so we can use that. But what we really want are all the characters that aren't letters. That can be done with a negated range: [:^L:]. Here is what we get:

σ < [:^L:] { s } [:^L:] ;
ς < s } [:^L:] ;
σ < s ;

These rules say: if an s is surrounded by non-letters, convert it to a σ. Otherwise, if it is followed by a non-letter, convert it to a ς. If all else fails, convert it to σ.

Note: Negated ranges [^...] will also match before the start of a string, and after the end of a string, which makes the rules much easier to write.

Now, you may find the above rules a bit ugly. If you want to make the rules clearer, you can use variables. Instead of the above, we can write:

$nonletter = [:^L:] ;

σ < $nonletter { s } $nonletter ;
ς < s } $nonletter ;
σ < s ;
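
The ordering of the three sigma rules can be traced with a short Python sketch. It is an emulation only, not ICU: it inspects the original Latin text on both sides of each s, treats the start and end of the string as non-letters, and uses the hypothetical helper names is_letter and convert_s.

```python
import unicodedata


def is_letter(ch):
    """True if ch is in [:L:], that is, its general category starts with 'L'."""
    return unicodedata.category(ch).startswith("L")


def convert_s(text):
    """Convert each Latin s to σ or ς using the three rules, in order."""
    out = []
    for i, ch in enumerate(text):
        if ch != "s":
            out.append(ch)
            continue
        # The start and end of the string count as non-letters.
        before_nonletter = i == 0 or not is_letter(text[i - 1])
        after_nonletter = i + 1 == len(text) or not is_letter(text[i + 1])
        if before_nonletter and after_nonletter:
            out.append("σ")  # σ < [:^L:] { s } [:^L:] ;
        elif after_nonletter:
            out.append("ς")  # ς < s } [:^L:] ;
        else:
            out.append("σ")  # σ < s ;
    return "".join(out)
```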

There are many more properties available, and you can use combinations of them. For example:

Combination       Example                  Description: all code points that
Union             [[:Greek:] [:L:]]        are either in the Greek script or are letters
Intersection      [[:Greek:] & [:L:]]      are both Greek and letters
Set Difference    [[:Greek:] - [:L:]]      are Greek but not letters
Complement        [^[:Greek:] [:L:]]       are neither Greek nor letters

For more on properties, see UnicodeSet Properties.

Repetition

Elements in a rule can also repeat. For example, in the following rules, an iota-subscript is converted into a capital I if the preceding base letter is uppercase. Otherwise it converts to a lowercase i.

[:Lu:] {  ͅ } > I;
  ͅ > i;

However, this is not sufficient, since the base letter may be optionally followed by non-spacing marks. To capture that, we can use the * syntax, which means repeat zero or more times.

[:Lu:] [:Mn:] * {  ͅ } > I ;
  ͅ > i ;

There are three operators that can be used for this, as in the table below.

Repetition Operators
X * zero or more X's
X + one or more X's
X ? zero or one X

These operators can also be used with sequences, with parentheses for grouping. For example, "a ( b c ) * d" will match against "ad" or "abcd" or "abcbcd".

Technical Note: There is a current limitation on repetition operators. They are always greedy with no backup. What this odd jargon means is that any repetition will cause the sequence to match as many times as allowed, even if that causes the rest of the rule to fail. For example, suppose we have the following (contrived) rules:

a [:L:]* { e } > æ ;
e > é ;

Clearly the intent was to transform sequences like "able blue" into "ablæ blué". It doesn't work, however, and just produces "ablé blué". The problem is that when the left side is matched against the text in the first rule, the [:L:]* matches all the way back through the "abl". Then there is no "a" left to match. To have it match properly, you have to rework the rules a bit, to subtract the 'a'.

a [[:L:]-[a]]* { e } > æ ;
e > é ;
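
The effect of greedy repetition with no backup can be reproduced with a small hand-written matcher. The Python sketch below is an emulation under stated assumptions, not ICU code: convert_e and its repeat_set_excludes_a flag are hypothetical names, and str.isalpha stands in for the [:L:] set.

```python
def convert_e(text, repeat_set_excludes_a):
    """Emulate  a <set>* { e } > æ ;  e > é ;  with greedy, no-backup repetition.

    <set> is [:L:] when repeat_set_excludes_a is False,
    and [[:L:]-[a]] when it is True.
    """
    out = []
    for i, ch in enumerate(text):
        if ch != "e":
            out.append(ch)
            continue
        j = i - 1
        # Greedy repetition: consume every preceding character the set allows.
        while j >= 0 and text[j].isalpha() and not (
            repeat_set_excludes_a and text[j] == "a"
        ):
            j -= 1
        # No backup: the literal 'a' must sit exactly where the repetition stopped.
        if j >= 0 and text[j] == "a":
            out.append("æ")
        else:
            out.append("é")
    return "".join(out)
```

With the unrestricted set the repetition swallows the "a" in "able", so the first rule never matches; subtracting "a" from the set lets the repetition stop in the right place.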

Accents

You could handle each accented character by itself, with rules such as:

ά > á;
έ > é;
...

This gets complicated if you consider all the possible combinations of accents, and the fact that the text might not be normalized. One feature of ICU 2.0 that helps a great deal is the ability to add other transliterators as rules, either before or after all the other rules. The syntax uses a double colon. With this, you can have the rules:

::NFD;
α <> a;
...
ω <> ō;
:: NFC;

What this does is first separate all accents from their base characters and put them in a canonical order. Your rules can then deal with the individual components, as desired. You then use NFC at the end to put the entire result into standard canonical form.

If desired, a filter can also be used with the transliterator, so you could say

:: [[:Greek:][:Inherited:]] NFD;
α <> a;
...
ω <> ō;
:: [[:Latin:][:Inherited:]] NFC;

This will cause NFD to only be applied to Greek characters plus inherited (which are combining marks), and the final NFC to only be applied to letters that are either Latin or inherited, so as to disturb other scripts less. However, this would still disturb the Latin characters that were originally in the text. To limit the actions even more, you can have a global filter at the start. That will cause all of the rules and transliterators to be limited in scope, and only apply to the characters that matched the filter. This would look like:

:: [[:Greek:][:Inherited:]] ;
:: NFD;
α <> a;
...
ω <> ō;
:: NFC;
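
The decompose, map, recompose pattern can be tried outside ICU with Python's standard unicodedata module. The sketch below assumes a tiny, illustrative mapping (BASE_LETTERS) and a hypothetical function name; it applies no filter, so it corresponds to the unfiltered ::NFD; ... ::NFC; form.

```python
import unicodedata

# Tiny illustrative subset of the base-letter rules.
BASE_LETTERS = {"α": "a", "ε": "e", "ο": "o"}


def transliterate_with_normalization(text):
    """Decompose, map base letters one by one, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)   # :: NFD ;
    mapped = "".join(BASE_LETTERS.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFC", mapped)       # :: NFC ;
```

Because the accent is split off before the mapping runs, a single rule for the base letter covers every accented form, and the final NFC step recombines the Latin letter with the accent that was carried along.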

Disambiguation

Let's revisit what we did with γ, sometimes turning it into an n. Here we hit a little gotcha. If the transliteration is to be completely reversible, what would happen if we happened to have the Greek combination νγ? Since ν also goes to n, we have an ambiguity. Now, normally this sequence does not occur in Greek. However, for consistency -- and especially to aid in mechanical testing -- we still want to handle this case. (There are other cases in this and other languages where both sequences do occur.)

To handle this, we use the mechanism recommended by the Japanese and Korean transliteration standards, and insert an apostrophe or hyphen to disambiguate the results. So we add a rule that inserts an apostrophe after an n whenever the reverse transliteration would otherwise be ambiguous.

ν } [ΓΚΞΧγκξχ] > n\';

If you look at the Greek rules in ICU, you will see quite a number of these. The ICU rules undergo some fairly rigorous mechanical testing to ensure reversibility. Adding these disambiguation rules ensures that they can pass these tests, and handle all possible sequences of characters correctly.

There are some forms that normally never occur in some context (in normal text). By convention, we use "~" for such cases to allow a reversible transliteration. Thus if you had the text "Θεολογικές (ς)", it would transliterate to "Theologikés (~s)". Using this character allows the reverse transliteration to detect it, and convert correctly back to the original: "Θεολογικές (ς)". Similarly, if we had the odd phrase "Θεολογικέσ", it would transliterate to "Theologiké~s". These are called anomalous characters.
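
To see why the apostrophe restores reversibility, here is a minimal Python sketch that handles only γ and ν in lowercase. The forward rules follow the tutorial; the reverse rules and both function names are assumptions made for illustration, and the passes run one after another rather than position by position as in ICU.

```python
import re


def gamma_nu_to_latin(text):
    """Forward direction, restricted to lowercase γ and ν."""
    text = re.sub(r"ν(?=[γκξχ])", "n'", text)  # ν } [γκξχ] > n\' ;
    text = re.sub(r"γ(?=[γκξχ])", "n", text)   # γ } [γκξχ] > n ;
    return text.replace("γ", "g").replace("ν", "n")


def gamma_nu_to_greek(text):
    """Reverse direction (assumed rules; the sketch only covers n before g)."""
    text = text.replace("n'", "ν")       # an apostrophe marks a real ν
    text = re.sub(r"n(?=g)", "γ", text)  # n before g came from γ
    return text.replace("n", "ν").replace("g", "γ")
```

Without the apostrophe both γγ and νγ would produce "ng", and the reverse direction could not tell them apart.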

Revisiting

Rules allow for characters to be revisited after they are replaced. For example, the following converts C back to S in front of E, I or Y, and to K otherwise. The vertical bar means that the character will be revisited, so that the S or K rules in a Greek transliterator will be applied to the result, eventually producing a sigma (Σ, σ, or ς) or kappa (Κ or κ).

$softener = [eiyEIY] ;

| S < C } $softener ;
| K < C ;
| s < c } $softener ;
| k < c ;

The ability to revisit is surprisingly powerful. It is particularly useful in reducing the number of rules required for a given language. For example, in Japanese there are a large number of cases that follow the same pattern: "kyo" maps to a large hiragana for "ki" (き) followed by a small hiragana for "yo" (ょ). This can be done with a small number of rules with the following pattern.

First, the ASCII punctuation mark "~" is used to represent characters that never normally occur in isolation. This is a general convention for anomalous characters within the ICU rules in any event.

'~yu' > ゅ;
'~ye' > ぇ;
'~yo' > ょ;

Secondly, any syllables that use this pattern are broken into the first hiragana, followed by letters which will form the small hiragana.

by > び|'~y';
ch > ち|'~y';
dj > ぢ|'~y';
gy > ぎ|'~y';
j > じ|'~y';
ky > き|'~y';
my > み|'~y';
ny > に|'~y';
py > ぴ|'~y';
ry > り|'~y';
sh > し|'~y';

With these rules, "kyo" is first transformed into "き~yo". Since the "~yo" is then revisited, this produces the desired final result "きょ". Thus a small number of rules (3 + 11 = 14) provide for a large number of cases. If all of the combinations of rules were used instead, it would require 3 × 11 = 33 rules.
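
The revisit mechanism can be emulated with a very small rule engine. The Python sketch below is an illustration only, not ICU's implementation: each rule carries the index in its replacement at which matching resumes (the position of the vertical bar), and only two of the eleven syllable rules are included. RULES and transliterate are hypothetical names.

```python
# Each rule is (source, replacement, cursor), where cursor is the index in the
# replacement at which matching resumes (the position of "|" in the rule).
RULES = [
    ("~yu", "ゅ", 1),
    ("~ye", "ぇ", 1),
    ("~yo", "ょ", 1),
    ("ky", "き~y", 1),  # ky > き|'~y' ;
    ("sh", "し~y", 1),  # sh > し|'~y' ;
]


def transliterate(text):
    """Apply the first matching rule at each position, honouring the cursor."""
    pos = 0
    while pos < len(text):
        for source, replacement, cursor in RULES:
            if text.startswith(source, pos):
                text = text[:pos] + replacement + text[pos + len(source):]
                pos += cursor  # resume inside the replacement when cursor < its length
                break
        else:
            pos += 1  # no rule matched here; leave the character as it is
    return text
```

Tracing "kyo" through the loop gives "き~yo" after the first match, with the cursor just after き, and then "きょ" once the "~yo" rule fires on the revisited text.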

You can set the new revisit point (called the cursor) anywhere in the replacement text. You can even set the revisit point before or after the replaced text. The at-sign is used as a filler to indicate the position, for those cases. For example:

[aeiou] { x > | @ ks ;
ak > ack ;

The first rule will convert x, when preceded by a vowel, into ks. It will then back up to before the vowel and continue. In the next pass, the "ak" will match, and the second rule will be applied. Thus if the source text is "ax", the result will be "acks".

Technical Note: Although you can move the cursor forward or backward, it is limited in two ways: (a) to the text that is matched, (b) within the original substring that is to be converted. For example, suppose you have the rule "a b* {x} > |@@@@@y", and it matches in the text "mabbx". The result will be "m|abby", where | represents the cursor position. That is, even though there are five @ signs, the cursor will only back up to the first character that is matched.

Copying

You can copy part of the matched string to the replacement text. You do this by grouping the text you want copied with parentheses, and using $n (where n is a number from 1 to 99) to indicate which grouping you want. For people who know regular expressions, this should be familiar. For example, here is a case from Korean. What happens is that any vowel that doesn't have a consonant before it gets the null consonant (ᄋ) inserted before it.

([aeiouwy]) > ᄋ| $1 ;

But then you want to revisit the vowel again, so the easiest way is to insert the null consonant, then the vowel, but then back up before the vowel to reconsider it. Similarly, we have a rule that inserts a null vowel (ᅳ), if no real vowel is found after a consonant:

([b-dg-hj-km-npr-t]) > | $1 eu;

In this case, since we are going to reconsider the text again we put in the Latin equivalent of the Korean null vowel, which is eu.

Order Matters

Two rules overlap when there is some text that they both could match. For example, the first of the following rules does not overlap either of the other two, but the second two overlap.

β > b;
γ } [ Γ Κ Ξ Χ   γ κ ξ χ ] > n ;
γ > g ;

When rules don't overlap, they will produce the same result no matter what order they are in. It doesn't matter whether we have:

β > b;
γ > g ;

or

γ > g ;
β > b;

When rules do overlap, order is important. In fact, a rule could be rendered completely useless. Suppose we have:

β } [aeiou] > b;
β } [^aeiou] > v;
β > p;

In this case, the last rule is masked; there is no text that will match it that would not already be matched by previous rules. If a rule is masked, then a warning will be issued when you attempt to build a transliterator with the rules.

Combinations

Let's take a look at a trickier example that combines a few of these features. In Greek, a rough breathing mark on one of the first two vowels in a word represents an H. (It is invalid to occur anywhere else, so we won't worry about other cases.) In normalized (NFD) form, the rough-breathing mark will be the first accent after the vowel (with perhaps other accents following). So, we will start with the following variables and rule. The rule transforms a rough breathing mark into an H, and moves it to before the vowels.

$gvowel = [ΑΕΗΙΟΥΩαεηιουω];

($gvowel + ) ̔ > H | $1;

So a word like ὍΤΑΝ is transformed into HOTAN. So far, so good. But this doesn't work with a lowercase word like ὅταν. To handle that, we insert another rule: when we move the H over lowercase vowels, we change it to lowercase.

$gvowel = [ΑΕΗΙΟΥΩαεηιουω];
$lcgvowel = [αεηιουω];

($lcgvowel +) ̔ > h | $1;  # fix lowercase
($gvowel + ) ̔ > H | $1;
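
The two rules above use a capture group to copy the vowels past the inserted letter. The Python sketch below emulates just that step with a regular expression and a replacement function; it is not ICU, the names are illustrative assumptions, and it leaves the Greek letters themselves untransliterated so that only the movement of the breathing mark is visible.

```python
import re
import unicodedata

ROUGH_BREATHING = "\u0314"  # COMBINING REVERSED COMMA ABOVE
GVOWEL = "ΑΕΗΙΟΥΩαεηιουω"
LCGVOWEL = "αεηιουω"

# The lowercase alternative comes first, mirroring the order of the two rules.
PATTERN = re.compile(
    "([" + LCGVOWEL + "]+)" + ROUGH_BREATHING
    + "|([" + GVOWEL + "]+)" + ROUGH_BREATHING
)


def _insert_h(match):
    if match.group(1) is not None:
        return "h" + match.group(1)  # ($lcgvowel +) ̔ > h | $1 ;
    return "H" + match.group(2)      # ($gvowel + ) ̔ > H | $1 ;


def move_rough_breathing(text):
    """Replace a rough breathing with h or H placed before the vowel run."""
    return PATTERN.sub(_insert_h, unicodedata.normalize("NFD", text))
```

Running this sketch on a titlecase word yields a capital H followed by a capital vowel, which is exactly the remaining problem the tutorial turns to next.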

This gives us the correct results for lowercase: ὅταν is transformed into hotan. That handles the lowercase situation. But there is a third possibility, a titlecase word like Ὅταν. For that, we need to actually lowercase the uppercase letters as we pass over them, and we need to do that in two circumstances: (a) the breathing mark is on a capital letter followed by a lowercase, or (b) the breathing mark is on a lowercase vowel.

$gvowel = [ΑΕΗΙΟΥΩαεηιουω];
$lcgvowel = [αεηιουω];

{Ο    ̔  } [:Mn:]* [:Ll:] > H | ο;  # fix Titlecase
{Ο ( $lcgvowel * )    ̔  } > H | ο $1;  # fix Titlecase

( $lcgvowel + )    ̔ > h | $1 ;  # fix lowercase
($gvowel + )    ̔ > H | $1 ;

This gives us the correct results for titlecase: Ὅταν is transformed into Hotan. We'll have to copy the above insertion and modify it for each of the vowels, since each has a different lowercase.

That leaves one last tricky situation: a single-letter word like Ὁ. In that case, we would need to look beyond the word, either forward or backward, to know whether to transform it to HO or to Ho. Unlike the case of a capital theta (Θ), there are cases in Greek of single-vowel words with rough breathing marks. This last one is left for the reader. (Hint: you'll probably use several rules, to match either before or after the word, ignoring certain characters like punctuation and space. Watch out for combining marks also.)