
This is the third blog in the series Machine Translation: An Overview by Dr. S. K. Srivastava.

As in the international scenario, the developments in India are discussed in three phases, though the periods differ. Researchers in India started working on machine translation in the eighties, which is why the Early Phase starts from the eighties and continues till 2005, when a mission on language technology was launched under the TDIL Programme covering all the related fields, including machine translation. The Mid-Phase started in 2005 and continued till around 2016-17, when researchers started using neural machine translation (NMT).

Early Phase

The Government of India has been sponsoring R&D projects in this area since the eighties. In 1986, when MeitY (erstwhile Department of Electronics) launched the Knowledge-based Computer Systems (KBCS) Programme, two nodal centers for R&D in language technology were established at TIFR and NCST Bombay (now C-DAC Mumbai). While TIFR primarily focused on speech processing, NCST worked on natural language processing, including machine translation. During the late eighties and early nineties, it worked on the development of MaTra, a system for translation from English to Indian languages in the news domain. When C-DAC Pune was established in 1989, another center under the KBCS Programme was created there, and one of its focus areas was natural language processing for Indian languages. In 1991, MeitY initiated the Technology Development for Indian Languages (TDIL) Programme to support R&D on language technologies, including machine translation. Since then, most machine translation projects have been supported by the TDIL Programme.

Research groups at several academic institutions in the country have been working on machine translation since the eighties. At IIT Kanpur, a strong machine translation group has existed since the eighties and has worked on several major projects. In the AnglaBharati project undertaken by the group, a system was built for translation from English to Indian languages. The system used rule-based machine translation (RBMT) but was later augmented with modules based on an example-based approach. The system was built in a way that it could be customized for any Indian language. The AnuBharati system, also developed at IIT Kanpur, complements AnglaBharati: it translates text from Indian languages to English. It, too, uses example-based machine translation, along with rule-based techniques in some of its modules.
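The example-based approach mentioned above keeps a memory of past translation pairs, retrieves the stored example whose source side is closest to the input, and reuses its target side. A minimal Python sketch of that retrieval step, with an invented toy translation memory (the sentence pairs and the similarity measure are illustrative, not drawn from AnglaBharati or AnuBharati):

```python
from difflib import SequenceMatcher

# Toy translation memory of (English, romanized Hindi) pairs.
# A real system stores many thousands of such examples.
MEMORY = [
    ("the train arrives at noon", "train dopahar ko aati hai"),
    ("the bus arrives at night", "bus raat ko aati hai"),
    ("the shop opens at noon", "dukaan dopahar ko khulti hai"),
]

def translate_ebmt(source: str) -> str:
    """Return the target side of the stored example whose source
    side is most similar to the input (word-level similarity)."""
    _, best_target = max(
        MEMORY,
        key=lambda pair: SequenceMatcher(
            None, source.split(), pair[0].split()
        ).ratio(),
    )
    return best_target

print(translate_ebmt("the bus arrives at night"))  # bus raat ko aati hai
```

A full EBMT system goes further, recombining matched fragments from several examples and adapting the mismatched words; this sketch shows only the nearest-example retrieval on which that process rests.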

The Anusaaraka project was another pioneering machine translation project initiated at IIT Kanpur in the late eighties; it later moved to the University of Hyderabad and then to IIIT Hyderabad. The main aim of the project was to develop a language accessor rather than a machine translation system in the true sense. It provides information in the target language but follows the grammar of the source language, so the translated text may be grammatically incorrect. It was able to translate text from Bengali, Kannada, Marathi, Tamil, and Telugu to Hindi. Though it was a domain-independent system, it was primarily tested on children’s stories. Anusaaraka uses principles of Paninian grammar and exploits the similarity among Indian languages in translating texts. The other important projects at IIIT Hyderabad are Shiva and Shakti, which were jointly developed by CMU, IIIT Hyderabad, and IISc Bangalore for translation from English to three Indian languages, viz. Hindi, Marathi, and Telugu. Shiva is an example-based machine translation system, whereas Shakti uses a combination of rule-based and statistical machine translation techniques.

C-DAC Pune has been working in the natural language processing (NLP) area since the late eighties. The group initially focused on analysis tools for Sanskrit, the origin of several Indian languages, and later worked on the development of MT systems using the Tree Adjoining Grammar (TAG) approach promoted by the NLP group at the University of Pennsylvania. The MANTRA translation system developed by C-DAC Pune translates text from English to Hindi. There are two versions: MANTRA-Rajbhasha, built for the translation of official documents of the government, and MANTRA-Rajya Sabha, developed for the Rajya Sabha Secretariat to translate materials used in the House, such as the list of business. The development of these systems was supported by the respective offices, i.e., the Department of Official Language and the Rajya Sabha Secretariat. MANTRA has been included in “The 1999 Innovation Collection” on information technology at the Smithsonian Institution’s National Museum of American History, Washington, DC.

The NLP group at C-DAC Mumbai has been working on machine translation since the eighties. It developed the MaTra system, mentioned above, for the translation of texts from English to Hindi in domains such as news and annual reports. The system uses a transfer-based technique for machine translation. Later, the group developed a statistical machine translation engine as part of the Anuvadaksh system, which combines translation engines developed using different techniques.

A group at IIT Bombay has been exploring interlingua-based translation. The institute is a member of the Universal Networking Language (UNL) project, an international initiative aiming to develop an interlingua for all major human languages. The group has developed machine translation systems for the English-Hindi, English-Marathi, and English-Bengali language pairs using the UNL formalism.
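The appeal of an interlingua is that analysis and generation are decoupled: the source text is analysed once into a language-neutral representation, and each target language is generated from that representation alone, with no direct source-target transfer rules. A toy Python sketch of the idea (the concept labels and all three lexicons are invented for illustration, and are far simpler than UNL's graph-based representation):

```python
# Invented lexicons mapping words to language-neutral concept labels.
EN_TO_CONCEPT = {"hot": "HIGH_TEMP", "water": "LIQUID_H2O"}
CONCEPT_TO_HINDI = {"HIGH_TEMP": "garam", "LIQUID_H2O": "paani"}
CONCEPT_TO_BENGALI = {"HIGH_TEMP": "gorom", "LIQUID_H2O": "jol"}

def analyze(text: str, lexicon: dict) -> list:
    """Map each source word to its interlingua concept."""
    return [lexicon[word] for word in text.split()]

def generate(concepts: list, lexicon: dict) -> str:
    """Realize the interlingua concepts in a target language."""
    return " ".join(lexicon[c] for c in concepts)

concepts = analyze("hot water", EN_TO_CONCEPT)   # one analysis...
hindi = generate(concepts, CONCEPT_TO_HINDI)     # ...many generations
bengali = generate(concepts, CONCEPT_TO_BENGALI)
```

Adding an nth language to such a system needs only one new analyser and one new generator, rather than transfer rules for every language pair; that economy is what motivates the UNL effort.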

Mid-Phase

In 2005, MeitY launched a mission for language technology development to deliver deployable solutions. It provided R&D funds to institutions working in areas of language technology such as machine translation, speech recognition and synthesis, and optical character recognition for Indian languages. It was decided to fund consortia of institutions rather than individual institutions, and several consortia were formed for different areas: machine translation (English to Indian languages), machine translation (Indian languages to Indian languages), speech technology (both recognition and synthesis), and optical character recognition for Indian languages. On the completion of the mission period, a second phase of the mission was launched to continue the support to language technology development through the same consortia.

Under the TDIL mission on language technology, the SAMPARK system was developed by a consortium of institutions led by IIIT Hyderabad for translation among Indian languages. The system uses the computational Paninian grammar formalism for analyzing Indian languages, and its components are implemented with modules that use rule-based techniques as well as statistical machine learning. The system has been developed for 18 language pairs across 9 Indian languages. The base system has been kept general-purpose so that it can be optimized for any domain. According to the SAMPARK team, the effort involved in customizing it to a new domain is primarily limited to a new domain dictionary, rules for handling domain-specific grammatical structures, and retraining of the part-of-speech tagger and named-entity recognizer modules.

The second major translation project under the TDIL mission is Anuvadaksh, which translates from English to a set of Indian languages, viz. Bengali, Bodo, Gujarati, Hindi, Marathi, Oriya, Tamil, and Urdu. The system was developed by a consortium of institutions led by C-DAC Pune and works in the domains of agriculture, healthcare, and tourism. It uses multiple translation engines based on TAG, statistical machine translation (SMT), example-based machine translation (EBMT), and AnalGen. Its pre-processing module uses several sub-modules, viz. a morphological analyzer, part-of-speech tagger, named-entity recognizer (NER), word-sense disambiguator, noun-phrase chunker, and clause identifier. The post-processing module uses sub-modules for morph synthesis, synonym selection, etc.

Some projects were initiated to develop the language resources for Indian languages that machine translation systems need in order to work. One such project is the Indian Language Corpora Initiative (ILCI), developed by a consortium of institutions led by Jawaharlal Nehru University (JNU). The languages covered under the initiative are Hindi, English, Bangla, Gujarati, Konkani, Tamil, Telugu, Kannada, Malayalam, Marathi, Punjabi, Odia, Urdu, Assamese, Nepali, Manipuri, and Bodo. A parallel annotated corpus of 100,000 sentences has been created in the domains of tourism, health, agriculture, and entertainment for each of these languages, with Hindi as the source language.
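Parallel corpora such as ILCI's are commonly distributed as sentence-aligned text files, one sentence per line, with line n of the source file corresponding to line n of the target file. A minimal loader for that layout (the layout described here is a common convention, not necessarily ILCI's specific distribution format):

```python
def load_parallel(src_path: str, tgt_path: str) -> list:
    """Read two sentence-aligned text files into (source, target) pairs."""
    with open(src_path, encoding="utf-8") as f:
        src = [line.strip() for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt = [line.strip() for line in f]
    if len(src) != len(tgt):
        raise ValueError("corpus files are not aligned line-for-line")
    # Drop pairs where either side is empty.
    return [(s, t) for s, t in zip(src, tgt) if s and t]
```

Checking that the two files have the same number of lines is the first sanity test for any parallel corpus; a silent misalignment corrupts every training pair that follows it.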

The machine translation R&D group at JNU collaborated in the development of an English-Urdu system as part of a consultancy with Microsoft; the system was released on the Bing platform in 2013. Baseline SMT systems for English-Sindhi and Sanskrit-Hindi, both built on the Microsoft Translator Hub platform, were developed as part of doctoral research at the School of Sanskrit & Indic Studies, JNU.

Another important language resource developed under the mission is IndoWordNet, a lexical database for Indian languages on the pattern of WordNet and EuroWordNet. In IndoWordNet, the Hindi wordnet is at the root, and the wordnets of the other Indian languages are linked to it through an expansion approach. The resource has been built with the contributions of several researchers from different institutions, led by IIT Bombay.
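Under the expansion approach, each language's wordnet reuses the synset (concept) identifiers of the root Hindi wordnet, so a word can be looked up across languages through the shared identifier. A toy Python illustration (the synset IDs and romanized words are invented, and real IndoWordNet synsets also carry glosses and semantic relations):

```python
# Each language lists its words against synset IDs shared with Hindi.
WORDNETS = {
    "hindi":   {101: ["ped", "vriksh"], 102: ["paani", "jal"]},
    "marathi": {101: ["jhaad"],         102: ["paani"]},
    "bengali": {101: ["gaachh"],        102: ["jol"]},
}

def cross_lingual_synonyms(word: str, src_lang: str, tgt_lang: str) -> list:
    """Find the words expressing the same concept in another language,
    by locating the shared synset ID of `word` in the source language."""
    for synset_id, words in WORDNETS[src_lang].items():
        if word in words:
            return WORDNETS[tgt_lang].get(synset_id, [])
    return []
```

Because every language wordnet is linked through the same identifiers, any pair of the linked languages can be bridged without building pairwise dictionaries.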

These were good prototype systems but could not be deployed at user organizations, for several reasons. The only exception is the MANTRA system for the Rajya Sabha, which has been in use for quite some time. However, it works in a specific context where the terminology is limited and most sentences are similar in structure, and its use has remained limited to one of the more than ten sections where translation work is done. Nevertheless, these activities have created a strong R&D base in the country, and several students who worked on these projects are now working on machine translation at renowned IT companies.

Recent Phase

In the recent past, several groups have started exploring neural machine translation techniques for translation from English to Indian languages and among Indian languages. For instance, a group at IIT Patna has been working on an NMT-based system for translation from English to Hindi in the legal domain.

On the industry side, there have been some good developments in the recent past, with several startups working on machine translation. For instance, Devnagari, a startup based in Noida, offers translation services to several clients in the corporate sector using an NMT-based system. The company has several thousand translators on its panel, who review the machine-translated output and make the necessary corrections. Reverie is another startup providing translation systems for Indian languages.

The scenario of MT as a service has transformed during the last 2-3 years due to the availability of Indian language pairs on the MT platforms provided by major IT companies like Microsoft, Google, IBM, and Amazon. Table 1 shows the Indian languages supported by the popular MT platforms. Several private entrepreneurs provide translation as a service to the corporate sector using these platforms. In most cases, they use the MT services to produce a first translation draft, which is then vetted by empaneled human translators. These platforms also provide the flexibility to customize the baseline systems using a corpus of the domain; as the parallel corpus of a domain grows, the accuracy of the systems will also improve.

Some public institutions have already developed MT systems or are in the process of getting them developed. The All India Council for Technical Education (AICTE) has signed an MoU with IIT Bombay to develop systems for the translation of technical education textbooks from English to Indian languages. These textbooks will be used in institutions that plan to offer technical courses in Indian languages. Similarly, the Supreme Court's AI Committee has developed SUVAS (Supreme Court Vidhik Anuvaad Software) with technical assistance from the IT industry. The system translates text between English and 9 major Indian languages, and it has been installed in several High Courts in the recent past.

Table 1: Indian languages supported on commercial translation systems

1. Google Translate: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Odia, Punjabi, Sindhi, Tamil, Telugu, Urdu

2. Microsoft Translator: Assamese, Bangla, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu

3. Amazon Translate: Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, Telugu, Urdu

4. IBM Watson Language Translator: Bengali, Gujarati, Hindi, Malayalam, Nepali, Tamil, Telugu, Urdu

5. Facebook Translator: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sindhi, Tamil, Telugu, Urdu

Private sector institutions use machine translation to a significant extent, primarily by availing themselves of the services provided by translation companies. Most public institutions, on the other hand, still translate manually or use freely available translators for a first-level draft, with some exceptions such as the two mentioned above. The obvious question is how public institutions could be supported in adopting machine translation. The prototype systems developed by academic institutions are not in a form that user agencies can directly use. At the same time, public institutions cannot, on their own, use open-source MT platforms and public datasets to develop MT systems customized for their domains; they need the support of specialized technical teams.

At this point, there are two directions for moving forward; they are not mutually exclusive and should be pursued in parallel. One is undertaking further R&D to make use of new technologies, especially neural machine translation and the several advances made during the recent past. Though several teams have reported improvements in the quality of machine translation for Indian languages, the area is yet to be explored fully. The other direction involves developing systems for Indian languages using open-source NMT tools and datasets already available in the public domain.

We are in a fortunate position in machine translation, as we can leverage several recent developments. Many top universities, such as Harvard and Edinburgh, which have been working in this area for a long time, have released their translation platforms as open source. These have already been used by some Indian companies to develop machine-aided translation systems, and some startups have begun providing translation services to user organizations.

Further, a significant step has been taken within the country by some organizations in making large Indian language resources available in the public domain. IIT Madras, in collaboration with EkStep Foundation, AI4Bharat, and Tarento, has created Samanantar, a large parallel dataset for Indian languages, and has made it available in the public domain. While a part of the corpora was built by collating existing corpora from several institutions, a significant part was built by mining texts from the Internet. Though datasets have been made available in the public domain by several institutions in the past as well, the creation of such a large language dataset at a single point is seen as a milestone.

As per the report published in April 2021, Samanantar contains corpora (parallel to English) in 11 languages, viz. Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. Of these, nine languages (all except Assamese and Oriya) have more than 2 million sentence pairs, which is the baseline size for training NMT systems. Apart from the dataset, the team has also trained various language models and created several benchmarks, all of which have been made available in the public domain for the benefit of the natural language processing community.

Other blogs in this series:
Summary and Abstract
1.    Machine Translation: Introduction
2.    Machine Translation: The International scenario
3.    The National Language Translation Mission

Acknowledgement

While preparing the article, several experts were consulted for their views and suggestions. Their contributions are gratefully acknowledged. In particular, I would like to thank Dr. P. K. Saxena and Prof. G. N. Jha for their suggestions.

Disclaimer

The views presented in the article are those of the author and not of the Office of the Principal Scientific Adviser to the Government of India. Any comments/suggestions may be sent to the author at sks@meity.gov.in.
