Abstract
Machine translation (MT) from English to other languages is a fast-developing area of research, and various translation techniques are discussed in the literature. However, translation from English to Malayalam, a Dravidian language, is still at an early stage, and work in this field has not yet flourished to a great extent. The main reason for this shortcoming is the non-availability of linguistic resources and translation tools for the Malayalam language. An aligned parallel corpus is one such resource that is essential for an MT system. This paper focuses on a technique that enables the automatic construction of a verb-aligned parallel corpus by exploring the internal structure of the English and Malayalam languages, which in turn facilitates the task of machine translation from English to Malayalam.
1 Introduction
Machine translation (MT), a subfield of computational linguistics, aims at automatically translating text from one language to another with the aid of computer software. There are different approaches to MT; among them, statistical machine translation (SMT) deserves special mention, as it uses statistical models to translate the source language into the target language [4]. SMT has been a preferred approach to MT over the past two decades, and during this time, considerable advances have been made in the development of SMT models [11]. SMT is a general approach that uses machine learning techniques to automatically learn translations from data. Parallel corpora are used to train statistical translation models, which learn probabilistic translations between linguistic entities like words or phrases. The translation of unseen data is typically formulated as a search problem, where the probabilistic model is used to search for the best translation of the complete input sentence. MT systems for new language pairs are developed by applying SMT models to the new pair and training them on that pair's training data. MT achieves good translation accuracy on languages that have similar grammatical structures, such as English and French [3], but the results are not convincing for languages that belong to two different language families, like English and Malayalam. SMT has shown remarkable performance on English paired with many foreign languages; however, the quality of the translations is not commendable when SMT is applied to English and Indian languages. Many works on Indian languages have been reported in the literature [1, 2], but a reliable and full-fledged machine translator, especially for the Malayalam language, has not yet been developed.
There are challenges that are specific to SMT models. When the differences between the two languages are considerably large, some of the translation patterns become difficult or sometimes impossible to learn in SMT. Typically, larger amounts of data are required in order to learn these translation patterns, due to their higher complexity. Owing to the fact that English and Malayalam belong to two different language families, similar issues may be encountered when English is translated into Malayalam using SMT.
An aligned parallel corpus is of great importance in any MT task, especially when translation relies on machine learning techniques. A decrease in the alignment error rate (AER) significantly improves the translation performance of an MT system [7]. Various alignment models for SMT, their comparison, and methods to assess the quality of alignments are discussed in the literature [16]. Methods to improve the performance of SMT, such as incorporating morphological and syntactic knowledge in the training phase [15, 20] and annotating the corpora with features or parameters that help in generating translation units [17], lead to better machine learning results.
Alignments are done at different levels, starting from sentence alignment in a document, extending to aligning phrases, and finally aligning words in the phrases. Word alignments are formed by exploring the correspondences between single words, and context-level information is completely ignored in this approach. Word alignment was the first approach attempted in translation models, and various word alignment techniques are mentioned in the literature [4]. Because of the lack of contextual information and the difficulty in identifying an exact word-to-word match between the source and target languages, the outcomes of a word-to-word translation model require refinement. An extension of this approach results in phrase alignments, where groups of adjacent words in the source language are aligned to groups of words in the target language [9].
Alignments are of different types, and the nature of the alignments depends on the characteristics of the source and target languages. One-to-one alignment between a source and target language pair is very rare. The alignment between an English and Malayalam pair is never a word-to-word alignment. The agglutinative nature of the Malayalam language always results in fewer words in a Malayalam sentence than in its equivalent English sentence. The alignments between English and Malayalam are therefore mostly many-to-one or many-to-many.
The problem addressed in this paper is to build a parallel corpus in which English verb phrases are automatically aligned with Malayalam verb phrases using a set of handcrafted verb mapping rules. Building an English–Malayalam parallel corpus with phrase-based alignments plays a major role in the MT task. A large amount of parallel text is subjected to machine learning as part of translation, and all such work requires the effort of setting alignments. Developing an annotated aligned corpus by observing the structure and semantics of a language involves considerable manual effort. Many works on automatically aligning sentences, and phrases within a sentence, have been reported in the literature [13]. The expectation-maximization algorithm, a method for estimating the parameter values of a statistical model, is a popular training approach in SMT that extracts hidden alignments from non-aligned parallel corpora. In this approach, a Malayalam word has an equal chance of being aligned with any English word in its corresponding parallel sentence. A drawback of this approach is that it leads to a number of insignificant alignments and increases training time. It is observed that, along with the techniques discussed in the literature, some language-specific features need to be considered when designing alignments between English and a highly inflected and agglutinative language like Malayalam [19]. The agglutinative nature of Malayalam also demands a preprocessing phase prior to any machine learning approach. A mechanism to automatically align corresponding verb units in English–Malayalam parallel text by exploring their syntactic and semantic similarities is discussed here. The knowledge of inflections, suffixes, and sandhi rules in Malayalam and their association with grammatical units in English is employed for this task.
The remaining sections of the paper are organized as follows. Section 2 covers the issues in setting alignments between an English and Malayalam pair. Section 3 discusses the methodology and the algorithm in detail. The results and observations are presented in Section 4. Section 5 concludes the paper by summarizing the work and mentioning the future scope.
2 Alignments Between an English Malayalam Pair
2.1 Motivation
Developing a translator from English to Malayalam is a challenging task, as the availability of a parallel corpus with alignments is still an open area of research. A framework to create parallel corpora for Indian languages from scanned images of paper documents with parallel text has been discussed in the literature [18]. However, even a standard sentence-aligned parallel corpus is hard to find for the English–Malayalam pair. Therefore, the training data collection process involves retrieving parallel text from online resources and cleaning it manually. Setting phrase alignments manually is again difficult and is not practical on a huge corpus. Thus, the motivation behind the work discussed here is the need for a technique that automatically aligns verb phrases in a parallel text and creates verb phrase-annotated parallel corpora in English–Malayalam. The complexity of training in the translation task is reduced with this verb-aligned corpus, and by-products of this work, like suffix separators and dictionaries, can be useful in many other natural language processing (NLP) tasks in Malayalam.
2.2 Issues in Translation
The major issue arises when the SMT technique is adapted to translate languages that have differences in their language families. This difference is reflected in the structure of the languages and the order of the words in their sentences. Again, when these differences between two languages are considerable, it becomes difficult for the SMT system to identify and learn the translation patterns.
As English and Malayalam belong to two different language families, setting a correspondence between them requires a deep semantic understanding of both languages. Again, a close structural analysis of the sentences in English and Malayalam reveals the complexity of finding a match between corresponding units. Re-ordering the phrases is an inevitable component in English and Malayalam pair as they differ in the structural ordering by following subject–verb–object (SVO) and subject-object-verb (SOV) word orders, respectively.
In any NLP task that involves the use of a parallel corpus, the quality of the alignments in the corpus is an important factor determining the quality of the final outcome. The results of the training phase in SMT depend on the number of alignments that exist between the source and target sentences. The number of alignments present in the training data is a matter of concern when SMT is tried on languages with scarce resources. For such languages, the available corpus is not sufficient to provide meaningful alignments, which is a crucial aspect that affects the quality of translation.
2.3 Factors Influencing the Quality of Translation
Though MT is an automatic process of translating text from one language to another with the help of machine learning algorithms, there are various factors that affect the quality of the translation. One among them is the translation divergence problem, which arises from the diversity between the languages in a pair [5]. Translating a text from one language to another requires knowledge of both languages, including the similarities and the divergences that exist between them. The same content may be expressed differently in two languages in many respects, which results in language divergence. While similarities between the languages improve the results of the translation process, divergence widens the gap between the translated output and its real translation [5]. The existence of such cross-linguistic distinctions between the source and target languages makes the transfer from one language to another nearly impractical. Therefore, language divergence is an important factor that affects the results of the MT task.
Another approach adopted to improve the alignment model is to bridge the gap between the parallel texts and make it suitable for setting up alignments. By reducing the language difference existing between the source and target text, a better alignment model can be constructed. The problem of translation divergence and its solution is, therefore, an interesting area for researchers handling MT development tasks. The first step to deal with language divergence problem is to identify the patterns of divergence among two languages. On observing the task of identifying patterns of divergence closely, it is understood that it is difficult to come up with a generalized strategy that can be applied to any language pair. Each language pair has to be analyzed intensively in order to identify the patterns, and the strategy applied may vary from one language to another. Table 1 shows some observations made to understand the divergence that exists between the English and Malayalam pair.
Table 1. Comparison of an English–Malayalam Sentence Pair.
Characteristics | English | Malayalam |
---|---|---|
Word endings for identifying object | Not present | Object has many suffixes, Ex: ![]() |
Nouns | No gender differentiation, Ex: “student” | Gender applicable, Ex: ![]() |
Article | No ending to denote number or gender, Ex: ‘the’ | No article corresponding to “the”, resulting in null word in translation |
Word order | Very important, word order determines subject and object, SVO, Ex: Ram likes Sita | Not important, free order, usually SOV, ![]() |
Verb phrase | Combination of auxiliary verbs and ordinary verbs, auxiliary verb followed by ordinary verb, Ex: was dancing, shall be dancing | No auxiliary word in Malayalam, results in endings and depends on the tense and auxiliary verb |
Prepositions | Use of preposition, Ex: of, on, at, etc | Changes to suffixes, no equivalent word in Malayalam |
3 Methodology
The process of aligning phrases of a text in English with its corresponding Malayalam units involves different steps.
3.1 Extraction of Key English–Malayalam Sentences
Word frequency count and sentence length are the two criteria applied to extract the most relevant English–Malayalam sentence pairs from the parallel corpus.
A collection of high-frequency English words is extracted from the English sentences after removing stop words, duplicates, and numbers; these form the most common words in the corpus. English sentences that contain more than two of these common words are selected as key sentences, and together with their corresponding Malayalam sentences they form the bilingual pairs. In a second approach, sentence length is taken as another criterion for selecting sentences from the corpus. Separate samples are taken for different sentence lengths, and experiments are conducted on these samples.
The algorithm of the key English_Malayalam extraction module is discussed below:
Input: English_Malayalam parallel sentences
Output: Eng_Sent{} and Mal_Sent{} with key English_Malayalam pairs
Algorithm:
for each sentence ei in the English_Malayalam corpus, where 1<=i<=n
    if (min(sent_length) <= len(ei) <= max(sent_length)) or
       (|words(ei) ∩ high_frequency_wordset| > threshold)
        Assign ei to Eng_Sent{}
        Assign mi to Mal_Sent{}
The high_frequency_wordset contains the top 25 words, listed in decreasing order of frequency count. The min(sent_length) and max(sent_length) are initialized to 2 and 12 words, respectively. The threshold variable is set to 2 or 3 when choosing the key sentences from the original raw corpus.
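The extraction step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the tokenizer, and the function names are hypothetical, and the Malayalam side is represented by placeholder strings.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the paper does not say which list it uses.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "of", "in", "on", "to", "and"}


def tokenize(sentence):
    """Lowercase and keep alphabetic tokens only, which also drops numbers."""
    return re.findall(r"[a-z]+", sentence.lower())


def high_frequency_words(english_sentences, top_n=25):
    """Return the top_n most frequent non-stop words in the corpus."""
    counts = Counter(
        word
        for sentence in english_sentences
        for word in tokenize(sentence)
        if word not in STOP_WORDS
    )
    return {word for word, _ in counts.most_common(top_n)}


def extract_key_pairs(pairs, min_len=2, max_len=12, threshold=2, top_n=25):
    """Keep (English, Malayalam) pairs that pass the length or overlap test."""
    frequent = high_frequency_words([en for en, _ in pairs], top_n)
    eng_sent, mal_sent = [], []
    for en, ml in pairs:
        length_ok = min_len <= len(en.split()) <= max_len
        overlap_ok = len(set(tokenize(en)) & frequent) > threshold
        if length_ok or overlap_ok:
            eng_sent.append(en)
            mal_sent.append(ml)
    return eng_sent, mal_sent


# "m1" and "m2" stand in for the Malayalam sentences of each pair.
pairs = [("Mary is reading a book", "m1"), ("x", "m2")]
key_english, key_malayalam = extract_key_pairs(pairs)
```

The two criteria are combined with a logical "or", as in the algorithm listing; running them separately, as the text describes for the two approaches, only requires disabling one of the two tests.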
3.2 English Sentence Structure Analysis for Translation
The English language has a well-defined grammatical structure with very few inflections when compared to other languages [6]. The grammatical units present in the English language are divided into categories like words, phrases, clauses, and sentences. Table 2 gives the description of the grammatical units and the entities that belong to each unit. An intensive analysis of these grammatical units is required in a translation task from English to any other target language and vice versa.
Table 2. Grammatical Units Present in the English Language.
Grammatical units | Entities |
---|---|
Word classes | Noun, verb, adjective, adverb, preposition, determiner, pronoun, conjunction |
Phrases | Noun phrase, verb phrase, adjective phrase, adverb phrase, prepositional phrase |
Clauses | Independent clause, dependent clause |
Sentence elements | Subject, verb, object, complement, adverbial |
Sentences in English are broadly classified into simple sentences, compound sentences, complex sentences, and compound-complex sentences based on the structural difference that arises with the presence of clauses, which are a group of words with a subject and a verb.
The clause information is an essential feature required to generate the English sentence structure templates. An independent clause is a group of words with subject and verb that expresses a complete thought, whereas a dependent clause has to be related to an independent clause for making sense. The knowledge of English sentence structure and type of clause is essential in differentiating various cases of verb phrases in the Malayalam language. Table 3 demonstrates certain samples of patterns corresponding to independent and dependent clauses.
Table 3. Patterns Corresponding to the Clause Structure.
Type of clause | Patterns |
---|---|
Independent clause | IC_1:Subject_Verb |
 | IC_2:CompoundSubject_Verb |
 | IC_3:Subject_Compound Verb |
 | IC_4:Subject_Verb_Compound Direct Object |
Subordinate clause | SC_1:RelativePronoun_AdjectiveClause |
 | SC_2:SubordinatingConjunction_AdverbClause |
 | SC_3:NounClausemarkers_NounClause |
 | SC_4:Elipticalmarkers_NounClause |
The English sentence is classified based on the presence of independent and dependent clauses. Table 4 discusses the classification of English sentences based on the clause structure and the mapping of independent clauses and dependent clauses with each type of sentence.
Table 4. Mapping of Clause Structure to Sentences.
Type of sentences | Independent clause | Subordinate clause | Pattern |
---|---|---|---|
Simple | One | No | IndependantClause_n |
Compound | Two or more | No | {IndependantClause_n}+{Coordinating Conjunction|;} {IndependantClause_n}+ |
Complex | One | One | IndependantClause_n SubordinateClause_n |
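The clause-count mapping in the table can be expressed as a small classifier. This is an illustrative sketch only: the function name is hypothetical, and two points go beyond the table and are assumptions, namely that a complex sentence may have more than one subordinate clause and that two or more independent clauses plus a subordinate clause give the compound-complex type mentioned in the text.

```python
def classify_sentence(independent_clauses, subordinate_clauses):
    """Classify a sentence from its clause counts (counts assumed precomputed)."""
    if independent_clauses == 1 and subordinate_clauses == 0:
        return "Simple"
    if independent_clauses >= 2 and subordinate_clauses == 0:
        return "Compound"
    if independent_clauses == 1 and subordinate_clauses >= 1:
        return "Complex"  # assumption: one or more subordinate clauses
    if independent_clauses >= 2 and subordinate_clauses >= 1:
        return "Compound-complex"  # assumption: not listed in the table
    return "Unknown"  # no independent clause found
```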
3.3 Pattern Identification Process
Patterns have to be identified and extracted from each sentence in order to set up the phrase alignments between the bilingual texts. The part-of-speech information of a sentence is the key element that decides the patterns in it. Pattern identification is done by applying finite state automata (FSA) [8] at various levels of the sentence structure, so that the various components of a sentence are identified effectively. The input symbols of the FSA are the tags corresponding to each word in a sentence.
Several English sentences are thoroughly analyzed to identify the various constituents that build up the language. Tagging an English sentence is the first step in identifying the constituents, and the tags that represent the part-of-speech information form the input for designing the FSA that recognizes each linguistic component. The tagset used for POS tagging is the Penn Treebank [12] tagset, and the tagger used is the Natural Language Toolkit (NLTK) [10] tagger. Table 5 highlights certain mapping rules between the phrase entities and the patterns of tags used for identifying those phrase entities.
Table 5. Patterns for Identifying Phrase Entities.
Phrases | Phrase patterns |
---|---|
Verb phrase | {Verb, Aux_Verb/Verb} |
Adjective phrase | {Adjective, Adverb_of_Degree Adjective} |
Adverb phrase | {Adverb, Adverb_of_Degree Adverb} |
A grammar-based chunk parser is used to chunk the English sentence into different entities like verb phrases, noun phrases, adjective phrases, adverbial phrases, etc., by considering all the possible tag patterns specific to each entity. The NLTK RegexpChunkParser, which uses regular expressions over tags to chunk a given text, is employed for this purpose. RegexpChunkParser chunks a single kind of phrase, such as noun phrases or verb phrases, from sentences. The parser works on sentences using a set of regular expression patterns that define the chunk rules. A different set of rules is defined for each type of entity, and the rules are implemented using regular expression matching and substitution. Noun phrases and verb phrases have to be collected separately for pattern identification; therefore, the phrases are parsed in parallel and extracted into different sets for further processing. An example of the chunking rules set for noun patterns is given below.
NP: {<DT|PP$>?<JJ>*<NN>}, {<NNP>+}, {<NN>+},
{<NNP> <CC> <NNP>}, {<NNS>+},{<PRP>}...
The algorithm for identifying entities from the English sentence is given below:
Input: Key English sentences in Eng_Sent{}
Output: Phrase entities identified
Algorithm:
Initialize the Chunk_Grammar_Rules_set as {gr1, gr2...grn}
Initialize the Phrase_Entity_list as {NP, VP, AP...}
for each sentence ei, 1<=i<=n, in Eng_Sent{}
    Tokenized(ei)←Split ei into tokens
    POS_tagged(ei)←find_POS_tag_of[Tokenized(ei)]
    for each grammar rule grk in Chunk_Grammar_Rules_set
        ParsedTree_of(ei)=RegexpChunkParser(grk)[POS_tagged(ei)]
        for each SubTree_of(ei) in ParsedTree_of(ei)
            for each Phrase_Entity_list[j] in Phrase_Entity_list
                if (label of SubTree_of(ei) == Phrase_Entity_list[j])
                    Entity(ei)={SubTree_of(ei), Phrase_Entity_list[j], start_of(SubTree_of(ei)), end_of(SubTree_of(ei))}
                    sentence_phrase_table.append_by_position(Entity(ei))
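The entity identification step can be illustrated with a standard-library stand-in for the NLTK chunk parser. The sketch below is not NLTK's RegexpChunkParser and not the paper's rule set: it assumes the sentence is already POS-tagged with Penn Treebank tags, encodes the tag sequence as a string, and applies two simplified, hypothetical patterns (one for NP, one for VP). Positions are 1-based and inclusive.

```python
import re

# Simplified, assumed tag patterns; the paper's full rule set is larger.
CHUNK_RULES = [
    ("NP", re.compile(r"(?:<DT>|<PRP\$>)?(?:<JJ[RS]?>)*(?:<NNP?S?>)+|<PRP>")),
    ("VP", re.compile(r"(?:<MD>|<VB[DGNPZ]?>)+")),
]


def chunk(tagged):
    """Return (label, start, end, words) entities ordered by position."""
    tags = [f"<{tag}>" for _, tag in tagged]
    # Map character offsets in the joined tag string back to token indexes.
    offset_to_index, offset = {}, 0
    for index, tag in enumerate(tags):
        offset_to_index[offset] = index
        offset += len(tag)
    offset_to_index[offset] = len(tags)  # end-of-sentence sentinel
    tag_string = "".join(tags)

    entities = []
    for label, pattern in CHUNK_RULES:
        for match in pattern.finditer(tag_string):
            start = offset_to_index[match.start()]
            end = offset_to_index[match.end()]
            words = [word for word, _ in tagged[start:end]]
            entities.append((label, start + 1, end, words))
    entities.sort(key=lambda entity: entity[1])
    return entities


tagged_sentence = [("Mary", "NNP"), ("is", "VBZ"), ("reading", "VBG"),
                   ("a", "DT"), ("book", "NN")]
phrase_table = chunk(tagged_sentence)
```

Because every pattern is built from whole `<TAG>` units, each match starts and ends on a tag boundary, which is what makes the offset-to-index lookup safe.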
The entities detected are used to identify the underlying structure of the sentence. Two lists are created for each sentence that gives the details of the sentence elements and probable structure of the sentence. The mapping functions are given below:
sent_element_list=Map(sent_phrase_table, sent_elements_table)
sent_structure_list= Map(sent_element_list, sent_type_list)
The sentences are classified based on the mapping rules into the categories mentioned in the sentence structure table. Table 6 denotes sentence structures that follow various word orders in English language. The mapping rules that map phrase entities with sentence structure elements are listed in Table 7.
Table 6. Sentence Structures.
Type | Structure |
---|---|
SV | Subject verb |
SVO | Subject verb object |
SVC | Subject verb complement |
SVA | Subject verb adverbial |
Table 7. Mapping of Phrases and Sentence Elements.
Sentence element | Phrase entity |
---|---|
Subject | Noun phrase |
Verb | Verb phrase |
Object | Noun phrase |
Complement | Adjective phrase |
Adverbial | Adjective phrase, noun phrase, adverb phrase |
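The two mapping steps (phrase entities to sentence elements, then sentence elements to a structure type) can be sketched for simple sentences as below. The function names and the chunk labels `ADJP` and `ADVP` are assumptions made for illustration. The sketch also simplifies the mapping table: a noun phrase after the verb is always read as an object and an adjective phrase after the verb as a complement, although the table also allows both as adverbials.

```python
STRUCTURE_CODES = {
    "Subject": "S", "Verb": "V", "Object": "O",
    "Complement": "C", "Adverbial": "A",
}


def sentence_elements(phrase_sequence):
    """Map chunk labels to sentence elements, using the first VP as pivot."""
    if "VP" not in phrase_sequence:
        return []
    verb_at = phrase_sequence.index("VP")
    elements = []
    for position, label in enumerate(phrase_sequence):
        if position == verb_at:
            elements.append("Verb")
        elif position < verb_at and label == "NP":
            elements.append("Subject")
        elif position > verb_at and label == "NP":
            elements.append("Object")
        elif position > verb_at and label == "ADJP":
            elements.append("Complement")
        elif position > verb_at and label == "ADVP":
            elements.append("Adverbial")
    return elements


def sentence_structure(phrase_sequence):
    """Return a structure code such as "SVO" for a chunk-label sequence."""
    return "".join(STRUCTURE_CODES[e] for e in sentence_elements(phrase_sequence))
```

For the sequence NP, VP, NP this yields the SVO type; sentences with several verb phrases would need the clause-level analysis described earlier and are outside this sketch.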
3.4 Feature Structures
The sentences are annotated with entity labels and represented in a form that describes the features present in each sentence, such as the length of the sentence, the phrase entities, their positions, and their sequence. The feature structure describes the sentence in a measurable form, and a detailed analysis of the feature structure is done to recognize the translation units in a sentence. An example of a feature structure is given below.
Feature Structure:
S: Mary is reading a book
S: [Mary]NP1 [is reading ]VP1 [a book]NP2
Feature Vector F=[
sentence id: e_id;
sentence length: 5 ;
sequence:NP1-VP1-NP2
np1: position=1, length=1;
vp1: position=2,3 length=2;
np2: position=4,5 length=2;
...
]
A feature structure has attributes and values that differentiate it from other sentences. The attributes are sentence id, length, sequence order, sentence element position, and length, etc. After investigating the post_segment and pre_segment of the verb phrase, the sentence type is identified, and it is integrated into the feature vector using the parameters sent type and word order.
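One possible in-memory representation of such a feature structure is sketched below. The class and field names are hypothetical; only the attributes shown in the example (sentence id, length, sequence, and per-entity position and length) are modeled, and positions are 1-based as in the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str    # e.g. "np1", "vp1"
    start: int   # 1-based position of the first word
    length: int  # number of words in the entity


@dataclass
class FeatureVector:
    sentence_id: str
    sentence_length: int
    sequence: str
    entities: list = field(default_factory=list)


def build_feature_vector(sentence_id, words, chunks):
    """chunks: (label, start, end) triples with 1-based inclusive positions."""
    counters, names, entities = {}, [], []
    for label, start, end in chunks:
        counters[label] = counters.get(label, 0) + 1
        name = f"{label}{counters[label]}"
        names.append(name)
        entities.append(Entity(name.lower(), start, end - start + 1))
    return FeatureVector(sentence_id, len(words), "-".join(names), entities)


vector = build_feature_vector(
    "e_id",
    "Mary is reading a book".split(),
    [("NP", 1, 1), ("VP", 2, 3), ("NP", 4, 5)],
)
```

Additional parameters such as the sentence type and word order could be added as further fields once they have been identified.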
3.5 Internal Phrase Structure Analysis With Finite State Automata and State Sequence Matrix
The chunks of phrases extracted from the sentence vary in their internal structure, and a structural analysis of the tag patterns is carried out to differentiate them case by case. This is accomplished by designing an FSA for each entity and applying the FSAs in a cascaded manner, in the order given by the sequence attribute of the feature vector.
The feature vector is analyzed, and the FSA corresponding to each parameter in the sequence attribute is executed to identify the state sequence of the patterns. There is an FSA corresponding to each sentence entity, and it is used to construct the state sequence of that entity. The FSAs are designed such that they end in a final state, and each of these different final states has significance in determining the form of the phrase, for example, the various forms of noun phrases.
The state sequences identified are represented in a state sequence matrix, which is a two-dimensional array in which each row denotes a sentence entity and the columns represent the sequence of states traversed to reach the final state. For example, for the sentence mentioned earlier, the state sequence matrix has three rows representing its three sentence entities, and the column elements of each row denote the states traversed in the FSA. A delimiter “#” denotes the end of the transitions in each row.
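A minimal sketch of how a state sequence matrix could be built is given below. The transition tables, state names, and function names are all hypothetical and cover only a handful of tag patterns; the paper's actual automata are not given in the text. For brevity, the sketch does not check that the last state reached is a designated final state.

```python
# Hypothetical transition tables: (current_state, tag) -> next_state.
NP_FSA = {
    ("q0", "NNP"): "n1",    # proper noun, e.g. "Mary"
    ("q0", "DT"): "n_dt",   # determiner seen, noun expected
    ("n_dt", "NN"): "n2",   # e.g. "a book"
    ("q0", "NN"): "n2",
}
VP_FSA = {
    ("q0", "VBZ"): "v1",    # e.g. "sings"
    ("q0", "VBD"): "v1",    # e.g. "was"
    ("v1", "VBG"): "v2",    # e.g. "is reading"
    ("q0", "MD"): "v_md",   # modal seen, e.g. "will"
    ("v_md", "VB"): "v3",   # e.g. "will be"
    ("v3", "VBG"): "v4",    # e.g. "will be singing"
}
FSAS = {"NP": NP_FSA, "VP": VP_FSA}


def state_sequence(label, tags):
    """Run the FSA for one entity; return its states followed by "#"."""
    state, row = "q0", []
    for tag in tags:
        state = FSAS[label].get((state, tag))
        if state is None:
            return None  # tag pattern not covered by this automaton
        row.append(state)
    return row + ["#"]


def state_sequence_matrix(entities):
    """entities: (label, tags) pairs in the order of the sequence attribute."""
    return [state_sequence(label, tags) for label, tags in entities]


matrix = state_sequence_matrix([
    ("NP", ["NNP"]),         # Mary
    ("VP", ["VBZ", "VBG"]),  # is reading
    ("NP", ["DT", "NN"]),    # a book
])
```

The last state before the delimiter in each row plays the role of the final state that later distinguishes one phrase form from another.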
3.6 Verb Phrases and Malayalam Inflections
When translating an English sentence into Malayalam, identifying the verb phrase structure is an essential step, as the verb phrase has to change position owing to the difference between the SVO order of English and the SOV order of Malayalam. The translation of English verb phrases into Malayalam is carried out with the knowledge of inflections in Malayalam. Many English verb phrases result in an agglutinated Malayalam verb phrase on translation. Some simple English sentences with subject and verb entities, along with their Malayalam translations, are listed in Table 8. The sentences represent different verb forms that indicate various tenses in English. In all the sentences, the English verb phrase contains between one and four words, whereas its Malayalam translation has a single word as its verb.
Table 8. English Sentences in Different Tenses and Their Malayalam Translations.
English | Equivalent Malayalam translation |
---|---|
Alice sings | ![]() |
Alice was singing | ![]() |
Alice will be singing | ![]() |
Alice will have been singing | ![]() |
3.7 Mapping the Verb Patterns With Malayalam Inflections
The verb phrases in English extracted have to be mapped with the verb inflections in Malayalam. A finite state transducer (FST) [14] that acts as a translator is employed for this purpose. The input to the FST is the tagged verb phrase, and the output indicates the rule for forming the inflections. The verb phrase “will be singing” is taken as input into an FST with labeled rule sets.
The elements, (base_verb, -1) together with the suffix terms, are pushed onto a stack for the phrase “will be singing”, and they are reordered by identifying the operator that appears as a prefix to the Malayalam inflections, so that (base_verb, -1) comes first and the suffix terms follow. The first term indicates the alterations that need to be made to the base Malayalam verb: a “−1” after base_verb denotes that the last character of the base verb has to be deleted. The suffix terms denote the suffixes that are added to the Malayalam base verb, and the number of “+” operators on a term denotes the order in which the suffixes are added.
The suffixes are added based on the sandhi rules of the Malayalam language by using a sandhi rule joiner. An English–Malayalam dictionary is used to find the Malayalam word corresponding to the English word “sing” in the earlier example. If the word is not found in the built-in dictionary, the rule set is reduced to the two suffix terms without the base_verb term, and the addition is done. The algorithm is given below:
Input: English verb phrases
Output: Inflections to search in Malayalam text
Algorithm:
EnglishVP_list←Entity(ei) if (Phrase_Entity_list[j]==VP)
Initialize k as zero
for each verb phrase vpi in EnglishVP_list in ei
    Stack←push FST_inflections(vpi)
    while (Stack is not empty)
        inflection_list.append(Stack.pop())
    suffix_keys[k]←baseverbform_of(vpi)
    if (Eng_dict(suffix_keys[k]))
        m_word=Malayalam_of(suffix_keys[k])
        suffix_keys[k++]←ReorderRules_on(inflection_list, m_word)
    else
        suffix_keys[k++]←ReorderRules_on(inflection_list)
UpdateFeatureStructure_of(ei) with suffix_keys{}
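The way a rule list is turned into a search key can be sketched as follows. Everything concrete here is a placeholder and an assumption: the rule table, the suffix strings `s1` and `s2`, and the dictionary entry `VERBX` are not real Malayalam, the function names are hypothetical, and plain concatenation stands in for the sandhi rule joiner.

```python
# Placeholder rule table: tag pattern of the English verb phrase -> rule terms.
# ("base_verb", -1) means "delete the last character of the base verb";
# the number of "+" signs gives the order in which suffixes are added.
FST_RULES = {
    ("MD", "VB", "VBG"): [("++", "s2"), ("base_verb", -1), ("+", "s1")],
}
# Placeholder bilingual dictionary: English base verb -> target base verb.
DICTIONARY = {"sing": "VERBX"}


def reorder(rules):
    """Put the base-verb term first, then suffixes by their "+" count."""
    return sorted(rules, key=lambda r: 0 if r[0] == "base_verb" else len(r[0]))


def search_key(vp_tags, english_base):
    """Build the inflected form (or suffix-only key) to look for."""
    rules = reorder(FST_RULES.get(tuple(vp_tags), []))
    suffixes = "".join(value for op, value in rules if op != "base_verb")
    base = DICTIONARY.get(english_base)
    if base is None:
        return suffixes  # verb not in dictionary: fall back to suffixes only
    for op, value in rules:
        if op == "base_verb" and value < 0:
            base = base[:value]  # delete trailing characters of the base verb
    return base + suffixes


key_with_base = search_key(["MD", "VB", "VBG"], "sing")
key_suffix_only = search_key(["MD", "VB", "VBG"], "dance")
```

The fallback branch mirrors the case described above in which the dictionary lookup fails and only the suffix terms remain as the search key.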
3.8 Alignment Function Using Position Vectors
Aligning the phrases in English with Malayalam starts with tracking the sentence entities and their sequence order. Analysis of the sentence type and word order are the inputs into the alignment phase. Transformation happens at different levels starting from reordering the sentence based on the word-order sequence. The subject of a sentence is mapped as noun phrases that are positioned before the verb phrases in an English sentence. The noun phrases identified from the feature vector of a sentence are subjected to internal structure analysis in this phase. The state sequence matrix gives the final state in the state sequence, and every final state denotes the pattern of the noun phrase present as the subject.
An alignment structure is formed from the feature vector where each field describes the features of the subject phrase. Sub attribute information is also stored in the alignment vector along with other features. An example of a sub attribute is the presence of the determiner “the”. The word “the” in Malayalam translation results in a null word and, therefore, is ignored in the alignment phase. The position of words in the Malayalam sentence to which the subject of the English sentence is aligned is indicated as the target position mapping.
Aligning verbs is an important step, as the verb forms the post segment of a Malayalam sentence because Malayalam follows the SOV word order. The verbs in the Malayalam phrase are aligned by analyzing the features in the feature vector of the English sentence. Along with other features, the alignment vector holds the Malayalam inflections corresponding to the English verb phrase, as identified using the FST. The position vector identifies the words in the Malayalam text, and the inflections in the words are analyzed to find a match with the Malayalam inflections present in the alignment structure. Table 9 shows an example of the alignment structure corresponding to a noun phrase and a verb phrase that act as subject and verb, respectively.
Table 9. Example of an Alignment Structure Corresponding to Sentence Entities.
Attributes | Noun | Verb |
---|---|---|
Sentence entity | np1 | vp1 |
Sentence elements | SUBJECT | VERB |
Entity length | 1 | 2 |
Value | “MARY” | “IS READING” |
Final state | STATE_n1 | STATE_v2 |
Sub_Attributes (Det) | ABSENT | - |
Target position | M0 | Mn-1 |
The algorithm for aligning the verb phrases is given below:
Input: Malayalam sentence, inflection search keys
Output: Target phrase positions for alignment
Algorithm:
for each verb phrase vpi in ei in Eng_Sent{}
    mi=ParallelMalSent_of(ei)
    mal_tokens=Tokenized(mi)
    for each mal_token in mal_tokens
        if (any item of suffix_keys[] matches mal_token)
            add mal_token to candidate_mwords{}
    for each candidate_mwords[i] in candidate_mwords{}
        if (candidate_mwords[i] in FeatureStructure(ei))
            target_position[]←find_position(FeatureStructure_of(ei), candidate_mwords[i])
    AlignPhrase(vpi, mi[target_position])
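The search for target positions can be sketched as below. The function name is hypothetical, the target sentence is made of placeholder tokens rather than real Malayalam, and "matching" is assumed to mean that a token ends with one of the search keys, which is only one plausible reading of the inflection match described above.

```python
def align_verb_phrase(vp_span, target_sentence, suffix_keys):
    """Return (vp_span, positions) of target tokens ending with a search key.

    vp_span is the (start, end) position of the English verb phrase;
    positions are 0-based indexes into the tokenized target sentence.
    """
    tokens = target_sentence.split()
    keys = [key for key in suffix_keys if key]  # ignore empty keys
    positions = [
        index for index, token in enumerate(tokens)
        if any(token.endswith(key) for key in keys)
    ]
    return vp_span, positions


# Placeholder target sentence in SOV order: subject, object, inflected verb.
alignment = align_verb_phrase((2, 4), "SUBJ OBJ VERBs1s2", ["VERBs1s2", "s1s2"])
```

In this toy example the match falls on the last token, which agrees with the target position Mn-1 shown for the verb in the alignment structure table.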
4 Results and Observations
Experiments were conducted to analyze the parallel corpus and add alignments between the parallel sentences using the verb phrase alignment algorithm.
The parallel English–Malayalam corpus used for experiments is provided by Technology Development for Indian Languages Programme (TDIL), India, and the Corpora created by Amrita, CEN NLP group [http://nlp.amrita.edu/nlpcorpus.html]. This corpus was built and used in the Shared task and workshop on Machine Translation in Indian languages (MTIL-2017) conducted at CEN, Amrita Vishwa Vidyapeetham, Coimbatore jointly with LDC-IL, CIIL, Mysore [https://www.amrita.edu/event/shared-task-cum-workshop-machine-translation-indian-languages-mtil].
The characteristics of the corpus used for the experiments are given in Table 10. The parallel corpus has a variety of sentences differing in length and structure, ranging from short one-word sentences to very long sentences with more than 60 words.
Table 10. Characteristics of Parallel Corpus.
Number of sentences in the corpus | 102,574 |
---|---|
Number of sentences after removing duplicates | 99,487 |
Samples of sentences grouped by length are taken to study the behavior of the corpus. The verb patterns identified in each sample set are given in Table 11, and the details of the verb phrases extracted from the corpus are given in Table 12. A sample rule set containing rules for verb structures across different tenses was created for testing. The parallel sentences have been classified by the number of verb phrases they contain. The chart in Figure 1 shows the distribution of sentences by the number of verb phrases present in them. This classification uses the verb phrases extracted with the rules in the rule set. The rule set is being extended with new verb rules, so more verb phrases are likely to be identified and extracted from this set.
Sample of Patterns Identified.
Samples | Sentence length | Number of sentences | Number of patterns |
---|---|---|---|
Sample 1 | 1 ≤ L ≤ 5 | 5,497 | 3,312 |
Sample 2 | 6 ≤ L ≤ 10 | 24,632 | 22,998 |
Sample 3 | 11 ≤ L ≤ 15 | 25,992 | 25,788 |
Sample 4 | 16 ≤ L ≤ 20 | 18,360 | 18,322 |
Sample 5 | 21 ≤ L ≤ 25 | 11,157 | 11,153 |
Verb Phrases Extracted from the Corpus.
Number of sentences | 102,071 |
Number of tag patterns after removing duplicates | 99,157 |
Number of different tag patterns | 95,066 |
Number of verb phrases in the corpus | 239,711 |
Number of verb phrases identified with rule set | 207,610 |

Sentence Distribution Based on Verb Count.
Figure 1 shows the distribution for sentences with verb counts up to five; sentences with verb counts of one, two, and three form around 65% of the total distribution. The main experiments are done on sentences with verb counts of one and two, classified as set 1 and set 2, respectively, as together they form more than 50% of the whole distribution.
The quality of the alignments obtained with these sets is analyzed using evaluation metrics such as Precision, Recall, and AER. The raw alignments are refined further using the knowledge in the feature vector, and the verb dictionary also helps in choosing the correct alignment. The results obtained are summarized in Table 13.
Summary of Results.
Alignments | Alignment error rate (AER) |
---|---|
Raw alignments | 0.5718 |
Alignments refined with verb dictionary and position vectors | 0.2212 |
A set of raw alignments is obtained by applying the suffix matcher method to the sentences. The suffix matcher returns a number of unwanted phrases along with the correct phrase for alignment. The alignments obtained are refined by applying the feature structures, which eliminates the unwanted alignments. For evaluation with the metrics, the alignments are classified as sure alignments and possible alignments; this classification of the aligned phrases is done manually.
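For reference, the three metrics can be computed from sure and possible alignments as in the sketch below. The source does not spell out its formulas, so this assumes the commonly used definitions of Och and Ney, in which every sure link is also a possible link: precision = |A ∩ P| / |A|, recall = |A ∩ S| / |S|, and AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The link pairs are toy values and the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[int, int]  # (English phrase index, Malayalam token position)


def alignment_scores(predicted: Iterable[Link],
                     sure: Iterable[Link],
                     possible: Iterable[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, AER) for a set of predicted alignment links."""
    a: Set[Link] = set(predicted)
    s: Set[Link] = set(sure)
    p: Set[Link] = set(possible) | s  # sure links count as possible links
    if not a or not s:
        raise ValueError("predicted and sure alignments must be non-empty")
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    aer = 1 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, aer


# Toy example: one correct link and one unwanted link from the suffix matcher.
precision, recall, aer = alignment_scores(
    predicted={(0, 2), (0, 1)},
    sure={(0, 2)},
    possible={(1, 2)},
)
print(precision, recall, round(aer, 4))  # 0.5 1.0 0.3333
```

Removing the unwanted link `(0, 1)` from the prediction brings the AER of this toy case down to 0, which mirrors the effect of refinement on the raw alignments.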
A discussion on factors affecting the quality of alignments is given below.
Length of the sentence: The results were influenced by many factors, starting with the length of the sentences. The quality of the alignments decreases as sentence length increases. Long sentences increased the rate of unwanted alignments, which increased the need for refinement. An increase of around 45% in the number of alignments was observed when compound sentences in the corpus were split into simple sentences.
Number of verb phrases: Alignment quality decreases as the number of verb phrases increases, because it becomes difficult to differentiate the verbs. More attributes have to be incorporated into the feature vector to deal with such phrases. Splitting the verbs also helps to reduce the ambiguity. In addition, the verb phrase that matches the longest pattern has to be retrieved using the rule set. For example, among the tag patterns “VBD VBN VBG”, “VBD VBN”, “VBD”, etc., “VBD VBN VBG” has to be extracted from the sentence in order to get the correct alignment.
Sentence structure: Improper structures of English and Malayalam sentences have an adverse effect on the alignments. Sentences are expected to be grammatically correct. The Malayalam sentence has to be in SOV form, as Malayalam sentences with free word order respond poorly to the application of position vectors. For example, the sentence pair given below needs correction in its structure.
ENG_SENT: Karan was a Roman Catholic nun
MAL_SENT:
Feature vector attributes: Embedding more attributes into the feature vector set has made it possible to reduce the ambiguity and pick the correct Malayalam verb phrase. A close analysis of sentence structures is required to acquire this attribute knowledge.
Mapping rules: The number of alignments depends on the number of rules present in the mapping rule set. The handcrafted mapping rules for identifying the sentence structures are written and implemented incrementally. The categories of sentences addressed so far are subject–verb (SV), SVO, subject–verb–complement (SVC), subject–verb–adverbial (SVA), and their combinations. The mapping rule set extends from simple to complex forms, and it can be extended further with knowledge of the linguistic features of both languages. The ambiguities that exist in the language are reduced by adding supporting rules.
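The longest-pattern requirement described under "Number of verb phrases" above can be sketched as a greedy longest-match scan over a POS tag sequence. This is an illustrative sketch only: the three-rule set is taken from the example in the text, the function names are hypothetical, and the tag sequence is a hand-tagged toy sentence using Penn Treebank tags.

```python
from typing import List, Optional, Sequence, Tuple


def longest_verb_pattern(tags: Sequence[str], start: int,
                         rules: Sequence[str]) -> Optional[Tuple[str, ...]]:
    """Return the longest rule that matches the tags beginning at `start`."""
    best: Optional[Tuple[str, ...]] = None
    for rule in rules:
        pattern = tuple(rule.split())
        window = tuple(tags[start:start + len(pattern)])
        if window == pattern and (best is None or len(pattern) > len(best)):
            best = pattern
    return best


def extract_verb_phrases(tags: Sequence[str],
                         rules: Sequence[str]) -> List[Tuple[int, int]]:
    """Scan left to right and return (start, end) spans of the longest matches."""
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tags):
        match = longest_verb_pattern(tags, i, rules)
        if match is not None:
            spans.append((i, i + len(match)))
            i += len(match)  # skip past the matched verb phrase
        else:
            i += 1
    return spans


rules = ["VBD", "VBD VBN", "VBD VBN VBG"]
# Hand-tagged toy sentence: "He had been reading a book"
tags = ["PRP", "VBD", "VBN", "VBG", "DT", "NN"]
print(extract_verb_phrases(tags, rules))  # [(1, 4)]
```

Taking the first matching rule instead of the longest one would return the span `(1, 2)` for "had" alone and leave "been reading" unaligned.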
The results obtained can be improved further by enhancing the rule set with more rules and by embedding more attributes into the feature vector. A rule-wise analysis is done to decide which attributes to incorporate into the feature vector. Many of the rules, such as those representing simple tenses, show promising results and cover more than 80% of the alignments.
5 Conclusions
A phrase-aligned parallel corpus is an indispensable component in MT as well as in any other NLP task where the internal structural analysis of parallel text is essential. A technique to automatically align verb phrases of English and Malayalam in a sentence-aligned parallel corpus is discussed here. The English sentence structure is analyzed, and issues in aligning English–Malayalam sentence pairs are presented. Feature structures of the English sentence are identified to extract the corresponding translation units in Malayalam using finite state transducers and position vector mapping. Handcrafted rules are developed for various types of sentences, and phrase-level alignment is done for English–Malayalam sentence pairs. The quality of the alignments is evaluated using the Precision, Recall, and AER metrics, and the results obtained are promising. As future work, this framework will be extended by enriching the rule set with additional verb rules and by embedding more attributes into the feature vector set along with position vectors.
Acknowledgment
A major part of this work was done using the HPC facility at Sunya Labs, Rajagiri School of Engineering and Technology, Kochi, Kerala.
Bibliography
[1] M. Anand Kumar, V. Dhanalakshmi, K. P. Soman and S. Rajendran, Factored statistical machine translation system for English to Tamil language, Pertanika J. Soc. Sci. Humanit. 22 (2014), 1045–1061.
[2] P. J. Antony, Machine translation approaches and survey for Indian languages, Int. J. Comput. Linguist. Chin. Lang. Process. 18 (2013), 47–78.
[3] P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer and P. S. Roossin, A statistical approach to machine translation, Comput. Linguist. 16 (1990), 79–85.
[4] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra and R. L. Mercer, The mathematics of statistical machine translation: parameter estimation, Comput. Linguist. 19 (1993), 263–311.
[5] B. J. Dorr, Machine translation divergences: a formal description and proposed solution, Comput. Linguist. 20 (1994), 597–633.
[6] J. Eastwood, Oxford guide to English grammar, Oxford University Press, New York, 1994.
[7] A. Fraser and D. Marcu, Measuring word alignment quality for statistical machine translation, Comput. Linguist. 33 (2007), 293–303. doi:10.1162/coli.2007.33.3.293
[8] D. Jurafsky, Speech and language processing, Pearson Education, Inc., India, 2000.
[9] P. Koehn, F. J. Och and D. Marcu, Statistical phrase-based translation, in: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pp. 48–54, Edmonton, Canada, 2003. doi:10.3115/1073445.1073462
[10] E. Loper and S. Bird, NLTK: the natural language toolkit, in: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Volume 1, Association for Computational Linguistics, Philadelphia, Pennsylvania, 2002. doi:10.3115/1118108.1118117
[11] A. Lopez, Statistical machine translation, ACM Comput. Surv. (CSUR) 40 (2008), 8. doi:10.1145/1380584.1380586
[12] M. P. Marcus, M. A. Marcinkiewicz and B. Santorini, Building a large annotated corpus of English: the Penn Treebank, Comput. Linguist. 19 (1993), 313–330. doi:10.21236/ADA273556
[13] J. Martin, R. Mihalcea and T. Pedersen, Word alignment for languages with scarce resources, in: Proceedings of the ACL Workshop on Building and Using Parallel Texts, pp. 65–74, Ann Arbor, Michigan, 2005. doi:10.3115/1654449.1654460
[14] M. Mohri, Finite-state transducers in language and speech processing, Comput. Linguist. 23 (1997), 269–311.
[15] S. Nießen and H. Ney, Improving SMT quality with morpho-syntactic analysis, in: Proceedings of the 18th conference on Computational linguistics, Volume 2, Association for Computational Linguistics, Saarbrücken, Germany, 2000. doi:10.3115/992730.992809
[16] F. J. Och and H. Ney, A comparison of alignment models for statistical machine translation, in: Proceedings of the 18th conference on Computational linguistics-Volume 2, pp. 1086–1090, Saarbrücken, Germany, 2000.
[17] F. J. Och, D. Gildea, S. Khudanpur, A. Sarkar, K. Yamada, A. Fraser, S. Kumar, L. Shen, D. Smith, K. Eng, Z. Jin and D. Radev, A smorgasbord of features for statistical machine translation, in: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Rochester, NY, USA, 2004.
[18] B. Premjith, S. Sachin Kumar, R. Shyam, M. Anand Kumar and K. P. Soman, A fast and efficient framework for creating parallel corpus, Indian J. Sci. Technol. 9 (2016), 1–7. doi:10.17485/ijst/2016/v9i45/106520
[19] A. R. Raja Raja Varma, Kerala panineeyam, DC Books, India, 5 2006.
[20] K. Toutanova, H. Suzuki and A. Ruopp, Applying morphology generation models to machine translation, in: ACL, pp. 514–522, Columbus, Ohio, USA, 2008.
© 2019 Walter de Gruyter GmbH, Berlin/Boston
This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Articles in the same Issue
- Frontmatter
- Neural Network-Based Architecture for Sentiment Analysis in Indian Languages
- Sentiment Polarity Detection in Bengali Tweets Using Deep Convolutional Neural Networks
- Neural Machine Translation System for English to Indian Language Translation Using MTIL Parallel Corpus
- Deep Learning-Based Language Identification in English-Hindi-Bengali Code-Mixed Social Media Corpora
- Composite Sequential Modeling for Identifying Fake Reviews
- Deep Learning Based Part-of-Speech Tagging for Malayalam Twitter Data (Special Issue: Deep Learning Techniques for Natural Language Processing)
- Machine Translation in Indian Languages: Challenges and Resolution
- MTIL2017: Machine Translation Using Recurrent Neural Network on Statistical Machine Translation
- An Overview of the Shared Task on Machine Translation in Indian Languages (MTIL) – 2017
- Neural Machine Translation for Indian Languages
- Verb Phrases Alignment Technique for English-Malayalam Parallel Corpus in Statistical Machine Translation Special issue on MTIL 2017
- Development of Telugu-Tamil Transfer-Based Machine Translation System: An Improvization Using Divergence Index