Dirty Data, Fictional Languages And Syntactic Features

Anton Dvorkovich, a machine translation developer at Yandex, on the challenges that lie ahead for machine translation. Finding a fast, free online translator is easy today, but few people realize that not every company can afford to build such a service. Worldwide, there are only a handful of such companies.

Among them are Google, Microsoft, Baidu and Russia's Yandex. All of them started out with a search service in their portfolio, and therefore have access to a huge body of data on which a translator can be trained. However, data is only half the story. The translator also has to be taught to work with that data, and this is where machine learning comes into play, which is impossible without a serious scientific foundation.

Research, new solutions and their subsequent rollout into services are everyday work at the largest IT companies. Thanks to this, the quality of machine translation improves noticeably from year to year, and there is every reason to expect further progress. From a scientific point of view, machine learning is one of the most interesting areas of applied mathematics.

But even among research on hot topics such as self-driving vehicles, computer vision and the many applications of neural networks, work devoted to machine translation stands apart. For example, one of the most cited recent papers in machine learning is devoted to machine translation and was written by a former Yandex intern, Dmitry Bogdanov. This is a highly competitive industry and a real proving ground for high technology, and it is not only about fixing existing errors and gradually improving quality, but about the concept of machine translation in a broader sense.

Let us look at the challenges facing developers today and how they can be solved. For a machine translator, translating from English into French and vice versa is easy. The results for this language pair can be called close to human-level translation. The languages belong to the same family, the Internet holds plenty of texts available in both languages, and entering text into the translator poses no problem.

But let's imagine ourselves in China, with a completely different writing system that an ordinary tourist can hardly reproduce. How, then, do you look up the translation of a phrase? The input hints that a machine translator usually offers when you type foreign words are clearly not enough here.

Several ways of solving this problem have been devised, for example photo and voice translation. Both confront the developers of a translator with a whole set of scientific problems. With photo input, when we photograph, say, a sign and send it to the translator, computer vision technologies help recognize the text in the image.

Largely thanks to neural networks, the quality of computer vision is now very high, and translating text from a picture hardly surprises anyone today. Things are more complicated with speech recognition, which neural networks also help to perform. Here it is not enough just to display the text the program has “heard”: people expect the translator to speak it aloud immediately in the desired language.

The problem is that voice translation has to reach acceptable quality in real time, or at least with minimal delay. If you look at Siri, it displays your words immediately. The same is expected of a voice machine translator, but today it is not yet fast enough for real conversation. And speed is not the worst of it: there is also the problem of translation quality.

We are willing to forgive a machine translator its errors when we read the translated text in a browser window, but if we try to speak such text aloud without prior correction, we risk getting a well-pronounced but not always comprehensible string of words. The situation is complicated by the fact that spoken language has its own specifics: we do not talk quite the way we write. The problem could be solved if a machine translator could be built directly on audio data. But there is simply far less parallel material that has been translated and voiced in different languages than material that has been translated and printed, which is what text machine translation relies on.

So the quality of such a translator would be unsatisfactory. For now, then, we first have to recognize what was said, convert it to text, translate it into the desired language and only then voice it. But there is good news. All of the problems above are solvable, so in the near future we are likely to see the long-awaited high-quality voice translation in real time.

Because machine translation is trained by analysing the same texts translated into different languages, this sometimes leads to curious incidents. For example, a machine translator may decide to translate the word “dollar” as “ruble”. This happens because some of the texts on which machine translation is trained are, for example, product description pages in different languages from online stores.

Such pages are often nearly complete translations of each other, except that prices on the English pages are in U.S. dollars and on the Russian pages in rubles. The algorithm learns this correspondence, mistakenly taking it for a translation. In general, this is called the problem of “dirty” data, and in machine translation it is considered quite serious. There is no universal way to solve it, but most often users help to correct such errors, which is why feedback is so important to a machine translator.
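One common-sense way to catch pairs like the dollar and ruble pages is to check that both sides of a candidate pair carry the same numbers. The sketch below is only an illustration of that idea, not the method used by Yandex; the function names and the sample sentences are invented for the example.

```python
import re

# Matches integers and decimals such as "100", "6500" or "19.99".
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def numbers_in(text: str) -> list[str]:
    """Return the numeric tokens of a sentence, normalised and sorted."""
    return sorted(tok.replace(",", ".") for tok in NUMBER.findall(text))

def looks_clean(source: str, target: str) -> bool:
    """Heuristic: a true translation should carry the same numbers on both sides."""
    return numbers_in(source) == numbers_in(target)

def filter_pairs(pairs):
    """Keep only the sentence pairs whose numbers agree."""
    return [(s, t) for s, t in pairs if looks_clean(s, t)]

# Hypothetical training pairs (English, Russian).
pairs = [
    ("Price: 100 dollars", "Цена: 6500 рублей"),            # converted price, not a translation
    ("Delivery takes 3 days", "Доставка занимает 3 дня"),   # numbers agree
]
clean = filter_pairs(pairs)
```

A filter like this is deliberately crude: it would miss a page where the price happens to be the same number in both currencies, and it can discard legitimate pairs that format numbers differently, so it can only complement user feedback rather than replace it.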

For example, we recently noticed that our service translated the phrase “bye-bye” as “bye-bye”. Digging into the problem, we found the cause. The culprits are sites that were translated long ago by a not very good machine translator and are still indexed by the search engine.

Apparently, Google has faced the same problem. In its translator the bug was fixed with the help of users, as evidenced by the checkmark next to the translation. Another “dirty data” problem is partially parallel sentences, or near-duplicates. Analysing a large number of documents, a machine translator picks out parallel texts and builds its translation on them.

Very often such texts are documents that appear similar but are not completely so. These can be, for example, hotel descriptions: they are all based on one template but differ in the details. The algorithm mistakenly treats them as fully parallel and learns from them. The fight against such near-duplicates goes on in machine translation every day.
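Templated texts such as the hotel descriptions above can be flagged by measuring how much two documents in the same language overlap. A minimal sketch, assuming word trigrams and Jaccard similarity as the measure (a standard technique chosen for illustration, with a hypothetical threshold and made-up sentences):

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Split a text into overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Share of shingles that two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Similar enough to share a template, yet not identical."""
    return threshold <= jaccard(a, b) < 1.0

first = "The hotel is located in the city centre and offers free parking"
second = "The hotel is located in the city centre and offers free breakfast"
flagged = is_near_duplicate(first, second)
```

Documents flagged this way differ in details, so a sentence from one should not be paired blindly with the translation of another. The threshold of 0.5 is an arbitrary starting point and would need tuning on real data.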

Polysemy is another serious problem faced by machine translation. For example, the English word “fly” can be translated either as the verb “to fly” or as the noun for the insect. There are several ways to deal with this. The easiest is to buy dictionaries from the companies that own them and use them to correct the machine translation.

Such dictionaries include, for example, ABBYY or Oxford. Buying a license can be costly, but almost all companies involved in machine translation are forced to do so. Yandex decided to go a different way and developed its own dictionaries.

The algorithms of the machine dictionary service extract translations for words from parallel texts and automatically organize them into rich dictionary entries with examples and transcription and, for some languages, with antonyms and synonyms. This is not a trivial task, if only because a machine dictionary, unlike plain translation, is expected to be of very high quality; from a technological point of view, it is an explosive mixture of machine learning and linguistics. Besides the savings on licenses, machine dictionaries have another important advantage.
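To give a feel for how word translations can be pulled out of parallel texts, here is a toy sketch that scores word pairs by how often they appear in the same sentence pair. It uses the Dice coefficient, a classical association measure chosen purely for illustration; the article does not say which method Yandex uses, and the function name, the threshold and the tiny corpus are assumptions.

```python
from collections import Counter
from itertools import product

def extract_dictionary(pairs, min_score=0.5):
    """Build a word-to-translations map from aligned sentence pairs."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)                   # sentences containing each source word
        tgt_count.update(tgt_words)                   # sentences containing each target word
        joint.update(product(src_words, tgt_words))   # sentence pairs containing both
    dictionary = {}
    for (s, t), both in joint.items():
        score = 2 * both / (src_count[s] + tgt_count[t])  # Dice coefficient
        if score >= min_score:
            dictionary.setdefault(s, []).append((t, score))
    for s in dictionary:
        dictionary[s].sort(key=lambda item: -item[1])     # best candidate first
    return dictionary

# Hypothetical English-French corpus.
pairs = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]
lexicon = extract_dictionary(pairs, min_score=0.9)
```

On this toy corpus the strict threshold leaves one candidate per word, for instance “cat” maps to “chat”. A real dictionary would also need inflected forms, multi-word expressions, usage examples and far more data.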

They quickly “learn” new words, which can take months or even years to be added to a traditional dictionary. There is a huge number of languages, and, of course, they all differ. For example, the order of words in a sentence can be quite different for a given pair of languages, and the machine must be able to rearrange the words correctly so that the translation does not come out sounding like Master Yoda.

Of course, such speech can be understood with some effort, but it can hardly be called a high-quality translation, especially if we want to live in a future with real-time voice translation. Take, for example, a Turkish sentence and its correct translation into English: (1) Bize (2) katılsan (3) harika (4) olacaktı → (4) It would be (3) great (2) if you join (1) us.
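The numbering in the Turkish example can be read as a permutation applied to the translated pieces. The snippet below is only a toy illustration of that reading, with an invented helper name; a real system has to learn the ordering rather than receive it.

```python
def reorder(segments, order):
    """Join translated segments in the order the target language needs.

    `segments` maps a 1-based source position to its translated phrase;
    `order` lists source positions in target-language order.
    """
    return " ".join(segments[position] for position in order)

# Phrase-by-phrase glosses taken from the Turkish example.
glosses = {1: "us", 2: "if you join", 3: "great", 4: "It would be"}

monotone = reorder(glosses, [1, 2, 3, 4])   # source word order kept
reordered = reorder(glosses, [4, 3, 2, 1])  # order given in the example
```

Keeping the source order yields “us if you join great It would be”, which is exactly the Yoda-like output the text warns about, while the reversed order gives the correct English sentence.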

As we can see, the word order in the translation is quite different, and getting it right can be a problem for a machine translator. Distinctive or unique constructions that occur in particular languages are also hard to translate. For example, Russian has verbs with the prefix “до-” (literally “to-”) that mean “to do something to such an extent that it leads to bad consequences”. Let's take an example from the Russian National Corpus.

An excerpt from Turgenev's “Faust”, rendered here from the original Russian, reads: And yet it seems to me that, despite all my experience of life, there is still something in the world, friend Horatio, that I have not experienced, and this “something” is almost the most important thing. Oh, what have I written myself into! Goodbye!

And here is the German translation: Trotz all meiner Lebenserfahrung jedoch scheint mir, Freund Horatio, es gebe noch etwas in der Welt, was ich nicht erfahren, und dieses Etwas möchte vielleicht das Wichtigste sein. Doch wo bin ich hingeraten. Lebe wohl!

The German “hingeraten” conveys, literally, “where have I ended up”; the fact that this is about writing a letter is clear only from context. English, for its part, has compact expressions such as “danced into”, which roughly means “came in dancing”. And French has the double negative: “he's not coming” is “il ne viendra pas”.

For machine translators, finding the correct translation in such cases is very, very difficult. Individual cases are now translated with varying degrees of success (for example, the expression “subscribe” does not cause much difficulty), but much remains to be done. A separate problem is lack of data, because any translation requires building a large parallel corpus from already translated texts.

Developers often face this problem when the task is translation from a rare language. To solve it, Yandex has learned to apply knowledge about how languages influence each other. For example, on three Caribbean islands (Aruba, Bonaire and Curacao) people speak Papiamento, which is a mix of Spanish, Portuguese and English, seasoned with various African and Jewish dialects brought there during the Age of Discovery.

Yandex.Translate understands which words were borrowed and from where, which makes it possible to use the translation models not only of Papiamento but also of its “neighboring” languages. The same goes for other languages: Yiddish, for example, is in many respects similar to German, and a rare variety of Mari was influenced by the similar but more widely spoken Meadow Mari and by Russian. So if you understand how a language is put together and correctly identify the right pieces of this language “construction kit”, you can learn to translate even fictional languages, for which the author described only a few words but for which we can imagine what new words might look like. Using this same logic, Yandex last year taught its translator to work with Elvish.

While machine translation helps to grasp the meaning of a text in a foreign language or to translate a simple phrase, it is not yet up to producing a finished translation on its own. However, even in this scenario, machine translators are a great help to human translators. It is much easier to correct inaccuracies in a ready machine-translated text than to translate the text from scratch.

This is how professional computer-assisted translation platforms such as MateCat work. Incidentally, the data on corrections made by human translators is invaluable to machine translators and plays a big role in improving them. Thanks to this data, as well as user feedback and the tireless work of researchers introducing new technologies, machine translation as a whole is gradually improving. It is not a quick process, but today we are closer than ever to a future in which there will be one barrier fewer to communication between people.
