Some linguistic comments concerning the obtained output

Theodor Hristea

 

We would like to mention, from the very beginning, that, in most cases, the computer programs implementing the proposed WordNet algorithms work correctly, and that, when the obtained results are not the best possible ones, it is mainly because of the imperfection of the existing bilingual dictionaries. The project web page shows primarily those situations in which the programs make mistakes, or in which they propose more than one Romanian synset, leaving it up to the linguist to choose the most adequate one, mainly according to the gloss. In what follows, we shall try to comment on the main types of mistakes which can occur as a result of automatic processing, and to briefly analyze the causes of these mistakes.

We would like to point out especially the following three types of situations: those in which the program has generated more than one Romanian synset, out of which one is correct, those in which no Romanian synset has been generated, and those, very rare cases, in which one or more synsets have been generated, none of them being correct.

In those cases when two or more Romanian synsets have been generated, among which the correct one occurs, finding it according to the gloss was generally an obvious operation for the linguist.

Much more interesting, in our view, are those situations in which no Romanian synset was generated. Most frequently, the cause is the imperfection of the bilingual dictionaries, which simply do not include the words in question. Sometimes only one of the dictionaries is to blame, usually the Romanian-English one, which is relatively poor both in the number of entries and in the number of English words used as translations. As a result, there are many cases in which only unlabeled e-sets are obtained via the proposed algorithm. No Romanian synset is generated in such cases.

No Romanian synset is therefore generated in two different kinds of situations. Either the word was not found in the English-Romanian dictionary, which directly affects the translation of English synsets containing a unique word (these are frequent enough in WordNet), or it was found but only unlabeled e-sets were generated for it. The latter situation is the more frequent. It is, for instance, the case of crook, having the meaning "a long staff with one end being hook shaped", or the case of wreckage, having the meaning "the remains of something that has been wrecked".

Sometimes the Romanian synset generated by the program is incorrect because of the evaluation function which was implemented. Other evaluation functions should be implemented and tested in future studies. Most frequently, however, the evaluation function taken into consideration now does not work correctly again because of the incompleteness of the existing bilingual dictionaries. It is, for instance, the case of the synset formed with the unique word rule having the meaning "directions that define the way a game or sport is to be conducted", translated into Romanian by [rigla], as well as of the synset [convention], having the meaning, coming from diplomacy, "an international agreement". It was translated into Romanian by the synset [adunare, intrunire, congres], denoting the concept of "congress", instead of the correct [conventie, acord, contract, invoiala, intelegere, pact, tratat].

As we have already mentioned, the situation in which a Romanian word occurring in the English-Romanian dictionary is not found in the Romanian-English one is quite frequent. This is especially the case of nouns derived from verbs and meaning "the action of...". Important and frequent Romanian words like organizare (from "a organiza", "to organize") or respingere (from "a respinge", "to reject") occur as translations of various English words but are not to be found in the Romanian-English dictionary. This can cause the algorithm for the evaluation of e-sets to fail, since the absence of a word from the Romanian-English dictionary leads to a lower value of the corresponding e-set.
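The effect of a missing Romanian-English entry can be illustrated with a minimal sketch. The evaluation function actually implemented in the project is not described here, so the scoring rule below (the share of words in a candidate set that the Romanian-English dictionary translates back into the English synset) is purely an assumption, as are the name back_translation_score and the toy dictionary entries, which were invented for illustration.

```python
def back_translation_score(candidate, english_synset, ro_en):
    """Hypothetical score: fraction of Romanian words in `candidate` whose
    Romanian-English entry contains at least one word of the English synset.
    A word missing from `ro_en` contributes nothing to the score."""
    if not candidate:
        return 0.0
    english = set(english_synset)
    supported = sum(
        1 for word in candidate if english & set(ro_en.get(word, ()))
    )
    return supported / len(candidate)


# Toy dictionaries (assumed data, not taken from the project's dictionaries).
RO_EN_INCOMPLETE = {"refuz": ["refusal", "rejection"]}
RO_EN_COMPLETE = {
    "refuz": ["refusal", "rejection"],
    "respingere": ["rejection"],
}

candidate = ["respingere", "refuz"]
low = back_translation_score(candidate, ["rejection"], RO_EN_INCOMPLETE)
high = back_translation_score(candidate, ["rejection"], RO_EN_COMPLETE)
print(low, high)  # 0.5 1.0
```

Under this assumed rule, the same candidate set drops from 1.0 to 0.5 merely because respingere has no headword in the Romanian-English dictionary, which is the kind of penalty described above.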

Also due to the incompleteness of existing bilingual dictionaries, many recent borrowings which exist in Romanian (especially in the mass media) will not occur in the generated Romanian synsets.

In those more interesting cases in which the Romanian-English dictionary is not to blame, the cause of the errors which the programs generate is of a completely different nature. One should look for it in connection with concepts. Here one must take into account the fact that English in general, and American English, to which WordNet refers, in particular, is a much richer language than Romanian. Statistically speaking, while Romanian has a maximum of 150,000 words, American English includes approximately 450,000 words (according to information provided by the lexicographer St. Berg Flexner). In comparison with Romanian, however, English is a much more advanced language not only from a grammatical and lexical point of view, where quantitatively it includes more words or lexical units, but also from the semantic point of view, since an English word often has a much richer semantic content than the corresponding Romanian one. Numerous words existing both in English and in Romanian are far more polysemous in English than in Romanian. For instance, the English word feature, having the meaning "an article of merchandise that is displayed or advertised more than other articles", has no correspondent in Romanian: no single word with this meaning exists. We are therefore obliged to translate it by a group of words (a gloss), while the English synset containing the sole word feature which refers to this concept will have no Romanian counterpart. In this case the computer program did not work correctly. It is, once again, a situation which primarily affects English synsets containing a single word. Another example of a polysemous English word is foundation, which attracted our attention through one of its meanings, that of "a woman's undergarment worn to give shape to the contours of the body". This meaning of foundation does not exist in Romanian. The concept to which the synset containing the unique word foundation refers, in this meaning, should be explained in Romanian by means of a gloss. No corresponding Romanian synset should exist. The computer program has again failed in this case, just as it has in the case of the English quiver, having the meaning "a case for holding arrows".

Another situation in which the program did not work correctly concerns certain English nouns used with a negation. This is, for instance, the case of matter with negation, as in "they were friends and it was no matter who won the game". This English noun should be translated into Romanian by a collocation centered around a noun which does not occur in the English-Romanian dictionary among the possible translations of matter. Another possibility is that it does occur, but only within an equivalent of collocational type, which will not be used by the algorithm that the program implements. In such cases the program cannot determine the Romanian (or, in general, the foreign) synset correctly. Specifically, in the case of matter used with a negation, several possible Romanian synsets have been generated. None of them is correct, however, since none of them includes the noun importanta (importance), which occurs in the Romanian collocation corresponding to this meaning. This Romanian collocation represents a loan translation of the French "avoir de l'importance". Loan translations from French are extremely frequent in Romanian. This is why we feel the need for future programs to take into account collocations, both in English and in Romanian, or, more generally, in the target language.

At other times, the unique English noun of a synset is not translated into Romanian by a collocation but by a word having exactly the same form. Even so, the program does not work correctly in some of these situations. It is, for instance, the case of the English synset [act], which denotes the concept "lack of sincerity". It has been wrongly translated into Romanian by the synset [fapta, fapt, act, actiune], which contains, among others, a Romanian word having the same form, act. But this meaning of the English act, "lack of sincerity", does not exist in Romanian. This is an example of what linguists call "false friends". In such cases one deals with English words which exist in an identical or very close form in other languages as well, but without having the typical English meaning. It is also the case of the synset [pattern], having a meaning that does not exist in Romanian, "the path that is prescribed for an airplane that is preparing to land at an airport", or that of the synset [cosmos], having the meaning "any of various mostly Mexican herbs of the genus Cosmos". Many of these meanings are typical of American English. Another example is offered by the synset [circumstances], denoting the concept "the state (usually personal) with regard to wealth", wrongly translated by the Romanian synset [imprejurari, circumstante, conditii]. This meaning of circumstances (plural) exists both in British and in American English, but not in Romanian.

Another source of difficulties was represented by nouns in plural form. Some of the English synsets contain nouns in singular form which should be translated by plurals in Romanian. Examples from this category are foundation translated by the plural fonduri, or knowledge translated by cunostinte. In order to deal with such situations we have decided to include the plural forms of these nouns in the Romanian-English dictionary which was used by the computer program. The program was thus able to take into consideration e-sets containing nouns in plural form as well.
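The dictionary extension described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary is modeled as a plain mapping from a Romanian headword to a list of English translations, the helper name add_plural_entries is hypothetical, and the sample entries are toy data rather than the contents of the dictionary actually used.

```python
def add_plural_entries(ro_en, plurals):
    """Return a copy of the Romanian-English dictionary in which each plural
    form listed in `plurals` is added as a headword of its own.
    `plurals` maps a Romanian plural form to its English translations."""
    extended = {word: list(translations) for word, translations in ro_en.items()}
    for plural, translations in plurals.items():
        entry = extended.setdefault(plural, [])
        for translation in translations:
            if translation not in entry:  # avoid duplicate translations
                entry.append(translation)
    return extended


# Toy data (assumed for illustration only).
RO_EN = {"fond": ["fund"], "cunostinta": ["acquaintance"]}
PLURALS = {"fonduri": ["foundation"], "cunostinte": ["knowledge"]}

extended = add_plural_entries(RO_EN, PLURALS)
print(extended["fonduri"], extended["cunostinte"])
```

Once the plural forms are headwords, a lookup for fonduri or cunostinte succeeds, so e-sets containing these plurals are no longer penalized for a missing entry. The original dictionary is left unchanged, which makes it easy to compare results with and without the added plurals.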

In Romanian, as in other languages, French for instance, the relationship between homonymy and polysemy is an extremely complicated issue, a problem which is not yet solved. In many cases, according to some researchers, one deals with two, three or even more homonymous words, while according to others one deals with a unique polysemous word having two, three or even more fundamental meanings, which are more or less related to one another. An example would be the word bun (good), which in Romanian is primarily an adjective having seven fundamental meanings. Secondly, it is a noun having two different plural forms, which are semantically specialized. The Romanian noun bun (good) having the plural bunuri has four meanings, while the same noun bun with the plural form buni has only one meaning, that of grandfather. These situations occur quite frequently in Romanian. The computer programs designed within the framework of this project will produce better results when using dictionaries which treat possible homonyms, especially the so-called semantic ones, as a single polysemous word. Otherwise the gloss should be taken into account from the very beginning in order to establish the meaning, namely the concept to which the English synset refers.

To conclude, one can say that the main difficulties which occurred when automatically translating the English synsets into Romanian ones were generated by the so-called "false friends", by collocations, by loan translation, and by the fact that the polysemy of many English words is greatly superior to that of the corresponding Romanian words. At the same time, one must notice that most problems occurred when translating English synsets that contain a single word, the algorithm often being unable to decide among meanings. Such synsets should probably be subject to further investigation. On the other hand, we would like to emphasize the fact that, in the absence of truly competitive tools (with reference to paper and electronic dictionaries) the realistic evaluation of the computer programs becomes rather difficult, if not almost impossible.

One cannot conclude without pointing out some of the merits of the proposed algorithms. Let us start by noting, for instance, that, in spite of all the difficulties mentioned, a great number of English synsets containing a unique polysemous word have been correctly translated into Romanian synsets which also contain a single polysemous word. Examples are the synset [art], correctly translated into [arta], and the synset [creation], again correctly translated into [creatie].

As is well known, concepts are language dependent. In many cases an English word covers a very wide concept, being linked to several Romanian words which refer to various related and much more specific concepts. One of the examples given for Bulgarian by Nikolov and Petrova (2001) concerning this aspect holds for Romanian as well. It involves the synset containing the unique word castle. When translating castle into Romanian, words like fortareata (fortress) or citadela (citadel) will occur. They denote related but different concepts. We would like to carry this comment further by noting that this is a situation in which the proposed algorithm works correctly, by producing unlabeled e-sets which will then be rejected. The translation into Romanian of the synset [castle], having the meaning "a large building formerly occupied by a ruler and fortified against attack", is a correct one.

Finally, one should note that, in most cases when the bilingual dictionaries were correct and complete, the implemented algorithm proved to work surprisingly well. Thus, in the case of concepts which are very close to one another in English, the subtle difference in meaning has been captured by the algorithm, which correctly maintains it in the Romanian translation. It is, for instance, the case of the English synsets [banishment, proscription], having the meaning "the act of banishing someone", and [ostracism], having the meaning "the act of excluding someone from society by general consent". The first was translated into Romanian by the synset [exilare, surghiunire, exil, surghiun, expulzare, ostracizare], while the second was translated into the unique [ostracism]. The Romanian ostracism is the only one of all these synonymous words which also refers to consensus in making the banishment decision. Its occurrence in the second synset, as a unique element, points out the subtle difference between the two concepts to which the English synsets refer.

We would like to conclude by saying that such a study concerning the possibility of automatically or semiautomatically generating foreign synsets by starting from the American ones is undoubtedly useful, and seems promising enough. We encourage its continuation in the case of the Romanian language, and we suggest extending it to the study of collocations in the near future. In order to perform a more or less complete study, these should be taken into consideration both in English and in Romanian, or, more generally, in the target language.


IST-2000-26454