Example-Based Machine Translation??

The nature of blogs is that they are semi-off-the-cuff, so don’t hold me too closely to this one. But the methodology of The Bible Translator’s Assistant (TBTA) is probably closer to a smart example-based machine translation system than to a regular machine translation system. Example-based MT typically uses a huge corpus of bilingual texts to “train” its translation engine. For example, it might use the English Wall Street Journal corpus with an “aligned” Spanish corpus, where the paragraphs and sentences are more-or-less in the same order in the two corpora. So a computer program (by the way, I was an innovator in this field back in the 1990s at Carnegie Mellon) goes through and figures out that “Microsoft acquired ABC” is translated as “Microsoft adquiere ABC” in Spanish. It’s easy to see that if the computer ever needed to translate “Microsoft acquired ABC” in a different document at some later date, then the computer would know how to do it. Really easy, right?

The trickier part is using the same input pair of sentences and a little common sense (which is in short supply for computers) to know that “Microsoft acquired DEF” is probably translated “Microsoft adquiere DEF”, even though that wasn’t the exact sentence in the original corpus. And even trickier, given two pairs of sentences:

Microsoft acquired ABC                                      Microsoft adquiere ABC

And

John went to the store yesterday                     Juan fue a la tienda ayer.

then if in a new document the computer needs to translate “Microsoft acquired DEF yesterday” the computer can guess with some confidence that a good translation would be “Microsoft adquiere DEF ayer.”

So, the whole example-based process is: 1) learn the translations for small phrases (and even words) from the bilingual corpus, and 2) use a smart algorithm to stitch those bits of translation together into the translation for a whole sentence.
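Here is a toy sketch of step 2, just to make the idea concrete. It assumes step 1 has already happened: the little phrase table below is written by hand from the two sentence pairs above, standing in for what a real system would learn from a huge aligned corpus. The names `PHRASE_TABLE` and `translate` are made up for this illustration, and the "smart algorithm" here is only a greedy longest-match lookup that copies unknown words (like "DEF") straight through.

```python
# Toy example-based translation: stitch known phrase translations together.
# ASSUMPTION: this phrase table is hand-written from the two example pairs;
# a real system would learn it from a large aligned bilingual corpus.
PHRASE_TABLE = {
    ("microsoft",): "Microsoft",
    ("acquired",): "adquiere",
    ("john",): "Juan",
    ("went", "to"): "fue a",
    ("the", "store"): "la tienda",
    ("yesterday",): "ayer",
}


def translate(sentence, table=PHRASE_TABLE):
    """Greedy longest-match stitching; unknown tokens are copied unchanged."""
    tokens = sentence.rstrip(".").split()
    max_len = max(len(key) for key in table)
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + n])
            if key in table:
                output.append(table[key])
                i += n
                break
        else:
            # No known phrase starts here (e.g. a new company name).
            output.append(tokens[i])
            i += 1
    return " ".join(output)


print(translate("Microsoft acquired DEF yesterday"))
```

A real example-based system has to do much more than this (reordering words, choosing between competing translations, and so on), but the basic move of reusing pieces of known translations is the same.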

TBTA takes this one step further. One BIG step. We have defined an unambiguous, straightforward semantic language and are encoding the whole Bible into it. Additionally, since we designed the semantic language, we know EVERYTHING that needs to be said in it. In fact, we know everything that CAN be said. For example, we know we can say these kinds of things:

John built the building.

John will build the building.

John might build the building.

John finished building the building.

Etc.

[Image: girl with crossed arms. Caption: You want to know how to say what?]

So what we in TBTA do is go to a target language (that needs an Old Testament translation), and we build our own specialized and highly targeted bilingual corpus. We figure out how to say all the above sentences in the target language, and about 300 other sentences too, which, in their entirety, will let us say ANYTHING WE WANT TO SAY in that target language!! TBTA then utilizes a really smart algorithm to stitch bits and pieces of translations together. For example, if we want to translate “Jesus finished speaking to the crowd”, we can take bits of translations that tell us how to handle “finished”, “Jesus”, “speak”, “crowd”, etc., and weave them all together to make a good translation for the whole sentence.
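To give a rough feel for what "weaving together" could look like, here is a toy sketch. It is NOT TBTA's actual semantic language or algorithm: the `Proposition` class, the lookup tables and the `render` function are all invented for this illustration, and Spanish stands in for a target language only because it is easy to check. The idea is that a small set of elicited sentences tells us how the target language expresses each noun, each verb, and each "mode" (built, will build, might build, finished building), and those pieces get combined.

```python
from dataclasses import dataclass


@dataclass
class Proposition:
    """A hypothetical, heavily simplified semantic encoding of one sentence."""
    agent: str
    event: str
    patient: str
    mode: str  # "past", "future", "possible" or "completive"


# ASSUMPTION: these tables stand in for what would be elicited from speakers
# of the target language using a small, targeted set of sentences.
NOUNS = {
    "John": "Juan",
    "Jesus": "Jesús",
    "building": "el edificio",
    "crowd": "la multitud",
}
VERBS = {
    "build": {"infinitive": "construir", "past": "construyó",
              "future": "construirá", "link": ""},
    "speak": {"infinitive": "hablar", "past": "habló",
              "future": "hablará", "link": "a"},
}
MODE_PATTERNS = {
    "past": "{past}",
    "future": "{future}",
    "possible": "podría {infinitive}",
    "completive": "terminó de {infinitive}",
}


def render(prop):
    """Stitch the noun, verb and mode pieces into one target sentence."""
    verb = VERBS[prop.event]
    verb_phrase = MODE_PATTERNS[prop.mode].format(**verb)
    parts = [NOUNS[prop.agent], verb_phrase, verb["link"], NOUNS[prop.patient]]
    return " ".join(part for part in parts if part) + "."


print(render(Proposition("Jesus", "speak", "crowd", "completive")))
print(render(Proposition("John", "build", "building", "possible")))
```

Notice that "Jesus finished speaking to the crowd" was never in the tables as a whole sentence. It comes out of combining what we know about "Jesus", "speak", "crowd" and "finished", which is the point of eliciting a small but carefully chosen set of sentences.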

How does that compare to regular machine translation? First of all, “regular” translation will try to translate from, for example, the NIV or KJV (or Greek or Hebrew) scriptures. Any kind of human language is extremely ambiguous and almost impossible for a computer to understand reliably. So that’s the first problem, and it’s a huge one. In fact, it makes any kind of regular machine translation of the Bible into smaller languages impossible to do accurately (using “regular” techniques). The second problem is that the linguists and programmers need to try to figure out hundreds of rules that translate English into the target language. This is really only possible for major language pairs (like English and Spanish) for which millions of dollars and hundreds of man-years of research have been invested. Other machine translation approaches try to use statistical techniques similar to example-based translation, but they are impractical for small languages because there are few or no bilingual corpora available between, say, English and the small target language.

OK – I hope that wasn’t too boring. I think it’s cool and I especially think that it is cool that we can use this methodology to accurately and quickly translate the Old Testament for potentially thousands of languages.

 
