The RORIC-LING Bulletin

months 13 - 18

 

General Questions Concerning the Two Discussed Approaches to Morphology

What exactly is a grammatical (or morphological) dictionary? (asked twice)

The grammatical dictionary is a representative summary of all basic word forms in a certain language accompanied by their grammatical characteristics. These features determine the generation of all word forms which are derived from the basic one (its word-formation) and provide the basic information for the results of the text analysis. Grammatical dictionaries are among the first NLP applications and represent a basic tool for collecting and organizing the linguistic data.

The morphological dictionary is a database which provides a wide range of data about the morphological characteristics and the forms of a certain word. It also allows for quick retrieval of grammatical information coming simultaneously from different templates (paradigm tables). The main purpose of a grammatical dictionary is to identify the relations between a concrete word form and its invariant (lemma). The purpose of the morphological dictionary is therefore to identify the word form and its characteristics and to classify it with regard to its lemma.
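As a rough illustration of the lookup described above, a morphological dictionary can be pictured as a table that maps each word form to its lemma and its grammatical characteristics. The Python sketch below is only that: the entry layout, the tag strings and the names `Analysis`, `LEXICON` and `analyze` are assumptions made for illustration, not the format used in the project.

```python
from typing import Dict, List, NamedTuple


class Analysis(NamedTuple):
    """One reading of a word form (hypothetical layout)."""
    lemma: str      # the invariant the form belongs to
    pos: str        # part of speech
    features: str   # free-text tag string, invented for this sketch


# Toy entries written by hand; a real dictionary would hold full paradigms.
LEXICON: Dict[str, List[Analysis]] = {
    "copil": [Analysis("copil", "noun", "sg non-articled")],
    "copilul": [Analysis("copil", "noun", "sg articled")],
    "frumos": [Analysis("frumos", "adjective", "masc sg non-articled")],
    "frumosul": [Analysis("frumos", "adjective", "masc sg articled")],
}


def analyze(word_form: str) -> List[Analysis]:
    """Return every stored reading of a word form, or an empty list."""
    return LEXICON.get(word_form.lower(), [])


if __name__ == "__main__":
    for reading in analyze("copilul"):
        print(reading.lemma, reading.pos, reading.features)
```

Keeping a list of readings per form, rather than a single reading, leaves room for forms that belong to more than one lemma.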

For more information on grammatical (or morphological) dictionaries, see the Bulgarian page of the BALRIC-LING project, at http://www.larflast.bas.bg/balric/index/index_eng.htm

What is the difference between "root" and "stem"? (asked twice)

The terms stem, theme or thema designate the root + affixes/infixes. Therefore the stem represents the inflexional base of a word to which other elements, such as thematic vowels and consonants, inflections etc. are added. Obviously, in many cases, the stem of a word can be identical to its root. For example, in Greek, the root lip underlies the present theme leip to which the inflection ein is added to form the present infinitive of the verb léipein (to leave).

What are the benefits of using the full-form lexicon in comparison to the inflexional approach to morphology? (asked 3 times)

The main benefits are:

  1. One avoids once and for all the painful, endless discussion of inflexion at the morphemic level and focuses directly on words as building blocks (but note that all word forms are considered in isolation). Inflexion concerns the language-specific particularities of word formation. However, since HLT has now moved on to text analysis for most European languages, the focus should be on the relation between word and text, a task that the full-form lexicon supports.

  2. Mapping any text onto such a lexicon allows one to start discussing the POS-disambiguation problems, which today are considered the real problems of text analysis (at "word-level").

What is the basic criterion according to which a specific feature is included in the full-form lexicon? (asked 3 times)

The main criterion for including a feature can be expressed by the following question: "Is the feature to be included important for the production and distinction of paradigm members?"

 

Specific Questions Concerning the Full-Form Lexicon and the Corresponding Implementation

Can a morphological dictionary be used in order to design a spell checker for Romanian? (asked twice)

We don't think that it can be used directly. However, since such a dictionary contains all the inflexional forms of a word, we think that it can serve (once it is complete, rather than a mere sample as in our case) as the source of a word list that could, in turn, be used to design a spell checker. Such a solution would make use of only a small part of the information included in a morphological dictionary.
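The idea of deriving a word list from the dictionary can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `(form, lemma, tags)` entry layout and the names `word_list`, `is_known` and `suggest` are hypothetical, and the suggestion step uses the standard-library `difflib.get_close_matches`, which is a generic string-similarity routine, not anything specific to Romanian.

```python
import difflib
from typing import Iterable, List, Set, Tuple

# Hypothetical entry layout: (word form, lemma, tag string).
Entry = Tuple[str, str, str]

SAMPLE: List[Entry] = [
    ("copil", "copil", "noun sg non-articled"),
    ("copilul", "copil", "noun sg articled"),
    ("frumos", "frumos", "adj masc sg non-articled"),
    ("frumosul", "frumos", "adj masc sg articled"),
]


def word_list(entries: Iterable[Entry]) -> Set[str]:
    """Keep only the word forms; lemma and tags are discarded."""
    return {form.lower() for form, _lemma, _tags in entries}


def is_known(word: str, forms: Set[str]) -> bool:
    """A word counts as correctly spelled if it is in the list."""
    return word.lower() in forms


def suggest(word: str, forms: Set[str], n: int = 3) -> List[str]:
    """Propose up to n similar known forms for an unknown word."""
    return difflib.get_close_matches(word.lower(), sorted(forms), n=n)


if __name__ == "__main__":
    forms = word_list(SAMPLE)
    print(is_known("copilu", forms), suggest("copilu", forms))
```

As the answer notes, only the form column is used here; the lemma and the grammatical tags play no role in this kind of checker.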

What's the difference between a full-form lexicon and a derivational/inflectional one? (asked twice)

The main difference is that in a full-form lexicon one doesn't have the representation of the word's structure. You have only the relevant information about the word plus the form as such.

Why did you choose only newspaper articles for your corpus? (asked twice)

Because the language of newspapers is representative of contemporary Romanian. This is routine practice in corpus work.

Isn't transitivity a feature interesting from a morphological point of view in Romanian? What is the reason for not including it into the set of verb features? (asked twice)

The reason is that, from a morphological point of view, transitivity is not relevant in Romanian since transitive verbs do not show special inflected forms.

Do you intend to extend the dictionary? (asked twice)

We would like to, but this depends on the opportunities regarding a new project.

Are there any other (on-line) contributions to a morphological lexicon for Romanian? (asked twice)

To the best of our knowledge, no. However, I know that various attempts (and ongoing projects) exist in Bucharest (Romanian Academy Institute for Artificial Intelligence) and Cluj.

Is the passive voice a morphological or a lexical category in Romanian? Does your lexicon include passive constructions? (asked twice)

Prescriptive grammars view the passive voice as a morphological category. We have serious doubts concerning this point of view. We preferred not to consider the passive voice a morphological category. We don't have passive constructions in our lexicon.

How do you deal with cases of morphological ambiguity? (asked twice)

I am repeating the explanation given in my web presentation. Suppose we have a word form such as fly. The lemma features help us to disambiguate the part of speech. So we have two lemmas, one for the noun fly and one for the verb fly. As for the verb, it is registered twice, with the following information: fly pr12sg; fly123pl.
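The fly example can be sketched in Python as a list of entries that share one word form but differ in lemma features and tags. The tuple layout, the function name `analyses_by_lemma` and the noun tag `sg` are assumptions made for this illustration; the two verb tags are modelled on those quoted in the answer.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: (word form, lemma, part of speech, tag).
ENTRIES: List[Tuple[str, str, str, str]] = [
    ("fly", "fly", "noun", "sg"),       # noun tag assumed for the sketch
    ("fly", "fly", "verb", "pr12sg"),
    ("fly", "fly", "verb", "123pl"),
]


def analyses_by_lemma(
    form: str, entries: List[Tuple[str, str, str, str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Group the readings of a form by (lemma, part of speech)."""
    grouped: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for entry_form, lemma, pos, tag in entries:
        if entry_form == form:
            grouped[(lemma, pos)].append(tag)
    return dict(grouped)


if __name__ == "__main__":
    for (lemma, pos), tags in analyses_by_lemma("fly", ENTRIES).items():
        print(lemma, pos, tags)
```

The lexicon itself only lists the competing readings; choosing among them in a given sentence is a separate disambiguation step.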

It is unclear to me why you make the difference between proper names and common nouns. (asked twice)

This difference is needed because there are morphological differences between proper names and common nouns and our criterion has been the following: treat as morphologically relevant any feature (be it semantic or syntactic) which has morphological (i.e. inflectional) consequences.

Is the adjective in Romanian a category bearing article, too? Please explain to me the difference between an articled and a non-articled adjective. (asked twice)

Adjectives in Romanian can indeed bear the (definite) article. Roughly speaking, this happens when they are placed in prenominal position. For instance, the adjective frumos (nice, beautiful) is non-articled in postnominal position (copilul frumos, literally child-the beautiful, that is "the beautiful child"), but articled in prenominal position (frumosul copil, literally beautiful-the child).

Wouldn't it be possible to enrich your dictionary in a purely automatic way, that is, by inputting a word manually and by further constructing the rest of the paradigm automatically? (asked twice)

Of course it would, the only problem is building up such a program. This remains one of our tasks for the future.
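To give an idea of what such a program might look like, here is a deliberately small Python sketch that builds a paradigm from a manually entered lemma and a suffix table. Everything in it is an assumption made for illustration: the class name `noun_masc_cons`, the tag strings and the function `generate` are invented, the table covers only four nominative-accusative forms, and it ignores the stem alternations that a real generator for Romanian would have to handle.

```python
from typing import Dict, List, Tuple

# Hypothetical paradigm classes: each maps to (suffix, tag) pairs.
PARADIGM_CLASSES: Dict[str, List[Tuple[str, str]]] = {
    "noun_masc_cons": [
        ("", "sg non-articled"),
        ("ul", "sg articled"),
        ("i", "pl non-articled"),
        ("ii", "pl articled"),
    ],
}


def generate(lemma: str, paradigm_class: str) -> List[Tuple[str, str]]:
    """Attach every suffix of the class to the lemma."""
    suffixes = PARADIGM_CLASSES[paradigm_class]
    return [(lemma + suffix, tag) for suffix, tag in suffixes]


if __name__ == "__main__":
    # "pom" (tree) is a regular masculine noun ending in a consonant.
    for form, tag in generate("pom", "noun_masc_cons"):
        print(form, tag)
```

The hard part, as the answer says, is not this loop but assigning each new lemma to the right class and describing the classes themselves.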

What's the purpose of such a dictionary? (asked twice)

Such a dictionary could serve, for instance, as a tool in second language teaching.

Is your lexicon able to tell me what the inner structure of a word is? (asked twice)

No, it isn't. All it can do is give the word form along with its relevant information (for instance gender, number, case).

As far as I know, Romanian uses analytical ways of expressing the comparison degrees of adjectives. Is this the reason for not specifying the comparison degrees in your dictionary? (asked twice)

Yes. Adjectives with comparison degrees in Romanian are treated as words composed of other words.

Why do you need a distinction between lemma features and word form features? (asked twice)

The difference between lemma features and word form features has been adopted for reasons of uniform description. To a certain extent, it is a theoretical distinction, too, but we believe that a description made without it works equally well.

6678 word forms is too small a sample for a dictionary. Do you intend to further extend the lexicon? (asked twice)

Yes, we do, but this is supposed to be a part of another project.

The category of particles is not convincingly defended in your presentation. Would you like to be more explicit about the reasons for including it in your lexicon? (asked twice)

This category is fairly heterogeneous, indeed, but there is no other way to deal with words that are neither adverbs nor another part of speech. So, the solution we adopted was one of "high emergency".

Are the possessive and the demonstrative articles also represented in your dictionary? I was not able to find such a part of speech. (asked twice)

The two so-called 'articles' do not occur in the lexicon, probably because the corpus does not contain them. But there is no difficulty in extending the lexicon with these categories.

There is something that I am not yet able to understand, as far as the relation between the corpus and the dictionary is concerned. Does the lexicon contain only the word forms already contained in the corpus? Or does it contain more, that is, the full paradigm represented in the corpus by, say, one or two members? (asked twice)

It is very easy to check this relationship (provided, of course, that you know a little Romanian!). To be more specific, the lexicon is richer than the corpus: the corpus contains about 1500 word forms, while the lexicon contains the full paradigm of each word form occurring in the corpus.

The same feature is alternatively registered as a lemma feature and a word form feature. Why?

This is because, in one case it only characterizes the word form, while in the other case it is specific to the lemma itself. For instance, the gender of the noun is a lemma feature, because it does not determine the inflection. Nevertheless, the gender of the adjective does determine the inflection, and so it is a word form feature.
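One possible way to picture the two levels is to store some features on the lemma and others on the individual word form. The Python sketch below assumes hypothetical `Lemma` and `WordForm` records and invented feature names; it is not the representation used in the lexicon itself.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Lemma:
    citation_form: str
    pos: str
    # Features shared by every form of the lemma.
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class WordForm:
    form: str
    lemma: Lemma
    # Features that vary from one form of the lemma to another.
    features: Dict[str, str] = field(default_factory=dict)

    def all_features(self) -> Dict[str, str]:
        """Lemma features first, then the form's own features."""
        merged = dict(self.lemma.features)
        merged.update(self.features)
        return merged


# Noun: gender is stored once, on the lemma.
copil = Lemma("copil", "noun", {"gender": "masc"})
copilul = WordForm("copilul", copil, {"number": "sg", "article": "yes"})

# Adjective: gender is stored on each word form.
frumos = Lemma("frumos", "adjective")
frumos_m = WordForm("frumos", frumos, {"gender": "masc", "number": "sg"})
frumoasa_f = WordForm("frumoasă", frumos, {"gender": "fem", "number": "sg"})

if __name__ == "__main__":
    print(copilul.all_features())
    print(frumoasa_f.all_features())
```

With this layout the same feature name, here `gender`, can appear at either level, depending on the part of speech.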

How does your tokenizer treat the Romanian word / construction am dormit? As two different words or as a single word?

Am dormit is taken to be a compound word - a collocation. This is the analysis provided for all compound verbal forms.

Do the words which are analyzed morphologically exist in a dictionary or are they analyzed automatically?

I'm not sure I understand what you mean. If you are referring to the words you can find in the dictionary, they are there along with the relevant (morphological) information. But if you are referring to the way this information has been assigned to the word form, I have to say that this has been done in an automatic way.

Is the extension of the lexicon performed in an automatic way or in a manual one?

The extension of the lexicon has been performed manually.

How many members are there in the case system of Romanian?

Leaving aside the vocative, there are four cases in Romanian: nominative, accusative, genitive, and dative.

I tried to access the page of the tokenizer and found nothing there. Did the address change in the meantime?

As far as I know, the address is the same. Try again!

What is the utility of the morphological analyzer?

The analyzer provides the information required in connection with a given word form.

Is the tokenizer language-independent?

The tokenizer is language-independent in the sense that, if you give it a training corpus from a language different from Romanian, it will be able to perform the same task as the one for Romanian.

What is the utility of a tokenizer in text processing?

The tokenizer helps you to extract lexical items from a text faster and more easily than by hand.

How do you analyze compound words?

A compound word is considered a single lexical item, albeit one composed of other words. We mark compound words with an underscore: nici_un (no one).
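The underscore convention can be illustrated with a small Python sketch that joins known two-word compounds into a single token. Note that this uses a fixed, hand-written compound list (`COMPOUNDS`) and a simple regular expression, both assumptions made for the example; it is a stand-in technique and not a description of how our tokenizer works.

```python
import re
from typing import List, Set, Tuple

# Hypothetical list of two-word compounds; only "nici un" comes from the text.
COMPOUNDS: Set[Tuple[str, str]] = {("nici", "un")}


def tokenize(text: str, compounds: Set[Tuple[str, str]] = COMPOUNDS) -> List[str]:
    """Split text into lowercase words and join known compounds with "_"."""
    words = re.findall(r"\w+", text.lower())
    tokens: List[str] = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            tokens.append("_".join(pair))  # e.g. nici + un -> nici_un
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("Nici un copil"))
```

Once joined, the compound can be looked up in the lexicon like any other single word form.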

Why do you maintain the distinction between lemma features and word form features?

This is because, in one case a feature only characterizes the word form, while in another case it is specific to the lemma itself. For instance, the gender of the noun is a lemma feature, because it does not determine the inflection. Nevertheless, the gender of the adjective does determine the inflection, so it is a word form feature.

You work with the distinction articled / non-articled noun, but as for the articled ones you leave aside the distinction definite / non-definite. Why?

Good question! We have to incorporate this pair of features, too, because it determines the inflection of the nouns.