rickharrison.com | Artificial Language Lab

On the unsuitability of "logical languages"
for use as interlinguas in machine translation

by Rick Morneau

extracted, with the author’s permission, from messages posted to the Conlang mailing list

Ultimately, all languages are computer-tractable. Some are just easier to deal with than others. In my opinion, Lojban is much less tractable than it should be, especially since it claims to be a serious candidate for use as an interlingua (IL) in machine translation (MT). Also, as I discuss below, Lojban has features that actually make it unsuitable for use as an MT IL.

Before proceeding, I’d like to emphasize that my comments below apply only to Lojban’s claim that it is suitable for use as an MT IL. It is not my intent to criticize any of Lojban’s other claims, such as its suitability for testing the Sapir-Whorf hypothesis, for use as a world interlingua, for encouraging speakers to speak more logically, etc.

Here are some of the reasons why I feel that Lojban is poorly suited for use as an interlingua in machine translation.

required pauses

Lojban claims that its words are self-segregating. Obviously, this feature is not needed for analysis of written language, but it can greatly simplify the analysis of continuous speech. Unfortunately, Lojban requires the use of pauses in certain places in order to fully implement this feature. Enforced pauses are unnatural and, should Lojban ever attain a community of native speakers, these pauses will be one of the first things to disappear. (It has been argued that these pauses could be replaced by glottal stops. In some cases this is true, especially when one word ends in a vowel and the next starts with a vowel. However, there are some cases where a glottal stop will not work: (1) if a word ends in a voiced nasal and the following word starts with an unvoiced fricative, stop or affricate; (2) if a word ends in an unvoiced fricative and the following word starts with an unvoiced fricative, stop or affricate. In these situations a glottal stop will either be impossible to detect, or will be eliminated through normal phonological processes.)

complex syntax

Lojban syntax is too complex. Regardless of the syntactic formalism you swear by (transformational grammar, government-binding theory, generalized phrase structure grammar, lexical functional grammar, homebrew grammar, ad nauseam), a natural language minus its idioms and irregularities can be represented with an equivalent of fewer than three dozen production rules. Lojban syntax has 600-odd rules, about twenty times more than minimally necessary. Lojbanists claim that it is machine-parsable, and I’m willing to take their word for it. However, an MT IL should have a syntax that is as simple as possible. A simple syntax not only makes it easier to parse the IL, but more importantly, it makes translating from the source language to the IL much easier.
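To make the notion of a small rule set concrete, here is a minimal sketch of a toy grammar with five production rules and a recognizer for it. Everything in it is an assumption made for illustration: the category names, the tiny lexicon, and the helper functions `parse` and `accepts` are hypothetical and are not taken from any actual grammar of Lojban or of a natural language. The point is only that the parser itself stays the same size while the rule table grows, so the cost of a large syntax is carried by the rules.

```python
# Toy grammar: each non-terminal maps to a list of productions.
# Five productions in total; all symbols are hypothetical.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "NP"], ["V"]],
}

# Toy lexicon mapping each word to a single preterminal category.
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "Anna": "Name",
    "sees": "V", "sleeps": "V",
}


def parse(symbol, tokens, pos):
    """Yield every end position reachable by parsing `symbol` from tokens[pos:]."""
    if symbol not in GRAMMAR:
        # Preterminal: consume one token if its category matches.
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        ends = [pos]
        for part in production:
            # Extend every partial parse by one more symbol of the production.
            ends = [e2 for e in ends for e2 in parse(part, tokens, e)]
        yield from ends


def accepts(sentence):
    """Return True if the whole sentence can be parsed as an S."""
    tokens = sentence.split()
    return len(tokens) in parse("S", tokens, 0)


def rule_count():
    """Count the productions in the grammar."""
    return sum(len(productions) for productions in GRAMMAR.values())
```

With these definitions, `accepts("the dog sees a cat")` and `accepts("Anna sleeps")` succeed, while `accepts("dog the sees")` fails. The sketch avoids left-recursive rules, which this simple recognizer could not handle.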

place structure

Lojban’s predicate logic is not very "logical" in the way it is used to represent natural language. (It may be "logical" for testing the Sapir-Whorf hypothesis, but this has no bearing on its use as an MT IL.) Its assignment of place structures is too arbitrary and inflexible for use as an MT IL. In most natural language processing applications, a sentence is represented using case frames or a close equivalent. (In brief, case frames are a practical and elegant implementation of basic X-bar theory, which, in my opinion, gives tremendous credibility to its claim of cross-linguistic applicability.) Lojban’s inflexible place structures and BAI bandaids are not only counter-intuitive, but they force the computer to treat language structures differently when they are essentially the same. Each predicate, in effect, has a built-in irregularity which requires extra processing by the computer.
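The contrast between positional place structures and case frames can be sketched in a few lines. This is only an illustration under stated assumptions: the role names (`agent`, `patient`, `source`, `goal`), the two English-glossed predicates, and their slot orders are hypothetical and are not actual Lojban place structures. With positional arguments, the meaning of each slot has to be looked up per predicate; with a case frame, the same question ("what is the goal?") is answered the same way for every predicate.

```python
from dataclasses import dataclass, field


@dataclass
class CaseFrame:
    """A predicate plus its arguments, keyed by role name."""
    predicate: str
    roles: dict = field(default_factory=dict)

    def get(self, role):
        """Return the filler of `role`, or None if the role is absent."""
        return self.roles.get(role)


# Hypothetical positional layouts: the role of each slot differs by predicate.
PLACE_STRUCTURES = {
    "go": ("agent", "goal", "source"),
    "give": ("agent", "patient", "goal"),
}


def positional_to_frame(predicate, args):
    """Convert positional arguments to a case frame via a per-predicate lookup."""
    layout = PLACE_STRUCTURES[predicate]
    filled = {role: arg for role, arg in zip(layout, args) if arg is not None}
    return CaseFrame(predicate, filled)


def goal_of(frame):
    """One uniform accessor works for every predicate once roles are named."""
    return frame.get("goal")


go = positional_to_frame("go", ["Anna", "Paris", "Rome"])
give = positional_to_frame("give", ["Anna", "book", "Ben"])
```

Here the goal sits in the second slot of the hypothetical "go" but in the third slot of "give". The `PLACE_STRUCTURES` table is the extra per-predicate knowledge the computer needs before it can treat the two alike.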

limited reversibility of compounds

In a machine translation interlingua, compound words must be assembled in a completely reversible manner. The computer must be able to break up the word into an equivalent phrase or clause. In other words, the computer must be able to generate a paraphrase of the relationship between the more primitive components of the compound. You can, of course, put this information in the dictionary, but this solution is not at all practical if you want to keep your dictionaries simple, and if you want to have one dictionary per natural language usable for both source and target translation. An IL designed for use in MT must be maximally and reversibly compositional.

A compromise between this reversibility and viability as a speakable human language can be achieved. For example, the English compound "houseboat" can be decomposed as "boat which functions as a house." The compound "windowpane" can be decomposed as "pane which is part of a window." In a machine translation application, the relationships "which functions as" and "which is part of" must be explicitly stated in the compound word. This can always be done with the addition of a single morpheme which would, in effect, link the component morphemes and indicate the relationship that exists between them. This would normally mean an additional syllable (and, of course, an appropriately designed morphology), and apparently many people would object to this for aesthetic reasons. However, one additional syllable is a small price to pay when the potential reward is so high. Without this simple "sacrifice," an artificial language (AL) will be useless as an MT IL.
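The linking-morpheme idea can be sketched as follows. The linker forms `fu` and `pa`, the hyphen used as a morpheme boundary, and the function names are all hypothetical, invented purely for illustration. A real design would rely on a self-segregating morphology rather than hyphens. What the sketch shows is that once the relationship is stated inside the compound, the paraphrase can be generated mechanically, with no dictionary entry for the compound itself.

```python
# Hypothetical linking morphemes and the relationship each one expresses.
LINKERS = {
    "fu": "which functions as a",
    "pa": "which is part of a",
}


def compose(head, linker, modifier):
    """Build a compound of the form modifier-linker-head."""
    if linker not in LINKERS:
        raise ValueError(f"unknown linker: {linker}")
    return f"{modifier}-{linker}-{head}"


def decompose(compound):
    """Split a compound back into (head, linker, modifier)."""
    modifier, linker, head = compound.split("-")
    if linker not in LINKERS:
        raise ValueError(f"unknown linker: {linker}")
    return head, linker, modifier


def paraphrase(compound):
    """Generate the equivalent phrase from the compound alone."""
    head, linker, modifier = decompose(compound)
    return f"{head} {LINKERS[linker]} {modifier}"


houseboat = compose("boat", "fu", "house")
windowpane = compose("pane", "pa", "window")
```

Because `decompose` is the exact inverse of `compose`, the same table of linkers serves both directions of translation, which is the reversibility described above.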

deviation from natural grammar

Finally, a logical language is inherently unsuitable for representing natural language. Lojban is called a logical language for good reason: it forces a speaker to express himself according to various rules of logic. Natural languages do not require a speaker to be logical in the same way. As a result, when translating from a natural language into Lojban, the computer will often have to fully understand what the speaker is saying (to "fill in the gaps," so to speak), which is well beyond the capabilities needed for normal disambiguation. It is also well beyond the capabilities of computers.

Lojbanists reply that languages are not always ambiguous in the same ways, and that the "neutral framework of predicate logic which Lojban employs, being equally foreign to all natural languages, forces ambiguity to be squeezed out before a correct translation can be generated." How can "being equally foreign to all natural languages" be anything but an impassable barrier? An interlingua designed for use in machine translation must be, as much as humanly possible, a reductive and fundamental distillation of the essential features of natural language. Not even the slightest degree of "foreignness" can be tolerated.

In summary, I do not feel that Lojban (or Glosa, or Esperanto, or Vorlin) is suitable for use as a machine translation interlingua, in spite of claims to the contrary. Most importantly, I see nothing in Lojban that would facilitate the most difficult aspect of machine translation: translating from a natural language to the interlingua. What I do see is an AL that has so little in common with natural languages that translating between it and a natural language will be considerably more difficult than translating directly between natural languages. And this does not surprise me at all, considering that Loglan/Lojban was designed to test the Sapir-Whorf hypothesis. Such a language, by its very nature, would be the antithesis of what is needed for an MT IL, no matter how "logical" it is.