Doubling Up

September 30, 2008

Posted by Franz Josef Och



Machine translation is hard. Natural languages are so complex and have
so many ambiguities and exceptions that teaching a computer to
translate between them turned out to be a much harder problem than
people thought when the field of machine translation was born over 50
years ago. At Google Research, our approach is to have the machines
learn to translate by using learning algorithms on gigantic amounts of
monolingual and translated data. Another knowledge source is user
suggestions. This approach allows us to constantly improve the
quality of machine translations as we mine more data and
get more and more feedback from users.

A nice property of the learning algorithms that we use is that they
are largely language independent -- we use the same set of core
algorithms for all languages. So this means if we find a lot of
translated data for a new language, we can just run our algorithms and
build a new translation system for that language.

As a result, we were recently able to significantly increase the number of
languages on translate.google.com. Last week, we launched eleven new
languages: CatalanFilipinoHebrewIndonesianLatvianLithuanianSerbian,
SlovakSlovenianUkrainianVietnamese. This increases the
total number of languages from 23 to 34.  Since we offer translation
between any of those languages this increases the number of language
pairs from 506 to 1122 (well, depending on how you count simplified
and traditional Chinese you might get even larger numbers). We're very
happy that we can now provide free online machine translation for many
languages that didn't have any available translation system before.

So how far can we go with adding new languages in the future? Can we
go to 40, 50 or even more languages?  It is certainly getting harder,
as less data is available for those languages and as a result it is
harder to build systems that meet our quality bar.  But we're working
on better learning algorithms and new ways to mine data and so even if
we haven't covered your favorite language yet, we hope that we will have
it soon.