Project Statement

 

 

Project Title: Developing a Korean Resource Grammar and Implementing It into the Linguistic Knowledge Building System (LKB)

                  

 

                                                                                                                                                           

 

Background: The arrival of the information era has heightened the importance of processing linguistic information precisely and correctly. Recent developments in artificial intelligence, information science, and related technologies have made it feasible to build computational applications for language processing and understanding. Such applications (e.g., message extraction systems, web-based search engines, machine translation, and dialogue understanding systems) demand increasingly accurate and robust grammars (or parsers), combined with sophisticated statistical processing methods.

 

Given that the basic units of language understanding are sentences, building a reliable syntactic and semantic parser is a prerequisite for language processing. Although several successful morphological analyzers have been developed for Korean, no serious attempt has been made to build a syntactic or semantic parser for the language, partly because of its structural complexity and partly because no reliable grammar-development system has been available. As observed by Kang (1998), research on syntactic and semantic processing in Korea is at the beginning stage and at least 10 to 15 years behind that for English:

 

 

 

                          English                          Korean
Morphological Analyzers   Application Stage                Application Stage
Corpus                    Application Stage                Research and Application Stage
Syntax/Semantic Parser    Research and Application Stage   Research Stage

 

As the table shows, research on English syntactic and semantic parsers has reached a level that allows even real-time applications. For example, the ERG (English Resource Grammar), part of the LinGO project at CSLI (Center for the Study of Language and Information), has been used in the Verbmobil machine translation system and in an NSF-funded project on computer-aided speech generation for people who cannot speak because of disability (cf. Copestake and Flickinger 2000, Flickinger 2002). In Korea, however, few reliable applications exist, in particular for processing between English and Korean in either direction.

 

The urgent need to advance the lagging research on Korean syntactic and semantic parsing is the motivation for this project.

 

Objectives: The objective of this project is to build a general-purpose system for processing the Korean language that will support both research and practical applications. The goal includes building a broad-coverage Korean grammar that can be used both to extract precise meanings from text input and to generate well-formed text output. To achieve this goal, the project will develop a computationally feasible Korean Resource Grammar and implement it in the LKB (Linguistic Knowledge Building) system developed by researchers of the LinGO (Linguistic Grammars Online) Lab at CSLI (Center for the Study of Language and Information).

 

Methodology:

 

Korean Resource Grammar: The Korean Resource Grammar (KRG) to be developed in this project is a computational grammar for Korean that has been under development since October 2002. Its aim is to provide an open-source grammar of Korean. The grammatical framework for the KRG is the constraint-based grammar HPSG (Pollard and Sag 1994, Sag, Wasow, and Bender 2003). HPSG (Head-driven Phrase Structure Grammar) is built upon a non-derivational, constraint-based, and surface-oriented grammatical architecture. Though HPSG shares with the so-called P&P (Principles and Parameters) grammar the idea that interaction between lexical entries and a set of parameterized principles determines grammatical well-formedness, it has one fundamental architectural difference: there are no derivational or transformational operations involved. Unlike the P&P framework, where distinct levels of syntactic structure are sequentially derived by means of the transformational operation Move-α (affecting both phrasal categories and heads), HPSG has no notion of deriving one structure from another. It employs a concrete conception of constituent structure, a limited set of universal principles (e.g., the Head Feature Principle and the Valence Principle), and enriched lexical representations (cf. Pollard and Sag 1994, Kim 2002, Sag, Wasow, and Bender 2003).

 

In addition, HPSG is a constraint-based, lexicalist approach to grammatical theory that seeks to model human languages as systems of constraints on typed feature structures. In particular, the grammar adopts the mechanism of a type hierarchy, in which every linguistic sign is typed with appropriate constraints and hierarchically organized. This property of typed feature structure formalisms facilitates the extension of a grammar in a systematic and efficient way, resulting in linguistically precise and theoretically motivated descriptions of languages, including Korean. HPSG is thus well suited to the task of multilingual development of broad-coverage grammars.
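The idea of a type hierarchy in which each type inherits the constraints of its supertypes can be illustrated with a small sketch. The type names (sign, word, phrase, verb-word) and the features below are hypothetical and chosen only for illustration; they are not the KRG's actual types, and real typed feature structure unification is considerably richer than this dictionary merge.

```python
# Toy type hierarchy: each type names one parent and adds its own constraints.
# All type and feature names here are illustrative assumptions, not KRG types.
HIERARCHY = {
    "sign": (None, {}),
    "word": ("sign", {"LEX": "+"}),
    "phrase": ("sign", {"LEX": "-"}),
    "verb-word": ("word", {"HEAD": "verb"}),
}


def constraints(type_name):
    """Collect constraints from the root down to type_name; clashes are errors."""
    parent, own = HIERARCHY[type_name]
    inherited = constraints(parent) if parent else {}
    for feature, value in own.items():
        if feature in inherited and inherited[feature] != value:
            raise ValueError(f"{type_name}: conflicting value for {feature}")
    return {**inherited, **own}


def is_subtype(type_name, ancestor):
    """True if ancestor is type_name itself or one of its supertypes."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = HIERARCHY[type_name][0]
    return False


print(constraints("verb-word"))
print(is_subtype("verb-word", "sign"))
```

Adding a new type in such a scheme only requires stating what distinguishes it from its parent, which is the sense in which a hierarchy supports systematic extension of a grammar.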

 

The Korean Resource Grammar, developed as an extension of HPSG, will also have broad lexical and syntactic coverage, so that it can be used in application products such as an automatic email response system. In addition, the grammar adopts Minimal Recursion Semantics (MRS), a flat semantic formalism, to represent meaning (Copestake et al. 2001). MRS offers an interface between syntax and semantics using feature structures. The formalism has syntactically flat structures while still allowing scope relations to be handled. The semantics is being developed in close cooperation with the LinGO English Resource Grammar.
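What "flat" means here can be shown with a minimal sketch: a semantic representation is a bag of elementary predications, each carrying a label, and scope is expressed by constraints between handles rather than by nesting. The class names, predicate names, and the English example sentence below are assumptions made for illustration; they are not taken from the KRG or from the LKB's internal data structures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EP:
    """One elementary predication."""
    label: str   # handle that labels this predication
    pred: str    # predicate name (hypothetical)
    args: tuple  # (role, value) pairs; values are variables or handles


@dataclass
class MRS:
    top: str
    eps: list = field(default_factory=list)
    qeqs: list = field(default_factory=list)  # (hole, label) scope constraints


# "Every student slept": a flat list of predications with no nesting.
mrs = MRS(
    top="h0",
    eps=[
        EP("h1", "every_q", (("BV", "x1"), ("RSTR", "h2"), ("BODY", "h3"))),
        EP("h4", "student_n", (("ARG0", "x1"),)),
        EP("h5", "sleep_v", (("ARG0", "e1"), ("ARG1", "x1"))),
    ],
    qeqs=[("h0", "h5"), ("h2", "h4")],
)


def holes(m):
    """Handle-valued argument slots, i.e. the places where scope is resolved."""
    labels = {ep.label for ep in m.eps}
    return sorted(
        value
        for ep in m.eps
        for _, value in ep.args
        if value.startswith("h") and value not in labels
    )


print(holes(mrs))
```

The BODY slot of the quantifier is deliberately left without a constraint in this sketch, which is how a flat representation can leave scope underspecified.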

 

Grammar Writing Tool: The basic tool for writing, testing, and processing the Korean Resource Grammar is the LKB (Linguistic Knowledge Building) system (Copestake 2002). The LKB system is a grammar and lexicon development environment for use with constraint-based linguistic formalisms such as HPSG. Both the grammar and the LKB system are freely available as open source (http://ling.stanford.edu). The LKB also provides an efficient parser and generator.

 

Current Status of the Korean Resource Grammar: The Korean Resource Grammar consists of grammar rules, inflection rules, lexical rules, type definitions, and a lexicon. At this stage it includes a lexicon of some 500 words, whose properties are organized in a hierarchy of about a hundred types, together with 300 sample sentences. The grammar also includes a limited set of phrasal rules and types organized in a type hierarchy, providing coverage of the most familiar phenomena found in ordinary Korean (Kim 2003a, 2003b).

 

One example will suffice to demonstrate the efficiency of the grammar developed so far. One of the most complicated facts about Korean is that it allows free sentence-internal scrambling. For example, consider sentence (1):

 

(1)  mayil      John-un    haksayng-tul-eykey   yenge-lul     [kaluchi-ess-ta]
     everyday   John-Top   students-Pl-Dat      English-Acc   teach-Past-Decl
     John taught English to the students every day.

 

Because the verb stays in final position, the four preverbal elements of this five-element sentence can be ordered in 4! = 24 different ways. A few of these orderings are given here:

 

(2) a. John-i mayil haksayng-tul-eykey yenge-lul kaluchiessta.
    b. haksayng-tul-eykey John-i mayil yenge-lul kaluchiessta.
    c. yenge-lul John-i mayil haksayng-tul-eykey kaluchiessta.
    d. mayil haksayng-tul-eykey John-i yenge-lul kaluchiessta.
    e. ...
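The count of 24 can be checked by enumerating the orderings directly. The following is a minimal sketch that treats the sentence as four preverbal elements plus a verb fixed in final position; it only lists strings and says nothing about how the grammar analyzes them.

```python
from itertools import permutations

# Preverbal elements of sentence (2a); the verb stays in final position.
PREVERBAL = ["John-i", "mayil", "haksayng-tul-eykey", "yenge-lul"]
VERB = "kaluchiessta"


def scrambled_orders(preverbal=PREVERBAL, verb=VERB):
    """Return every verb-final ordering of the preverbal elements."""
    return [" ".join(order + (verb,)) for order in permutations(preverbal)]


orders = scrambled_orders()
print(len(orders))  # 24
for sentence in orders[:3]:
    print(sentence)
```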

 

The most effective grammar would be one that captures all such scrambling possibilities with a minimal processing load. In the KRG at this stage, this flexibility of Korean syntax is captured by the interaction between lexical information and a limited set of phrasal well-formedness conditions. Unlike grammars of English (and unlike the Japanese grammar JACY of Siegel and Bender 2002, Siegel 2000), the KRG assumes just the following informally stated phrasal well-formedness conditions:

 

 

 

(3)  Korean X' Syntax (simplified):

     a. hd-arg-ph:     XP => [1], H[ARG-ST < ... [1] ... >]
     b. hd-mod-ph:     XP => [MOD [1]], H[1]
     c. hd-filler-ph:  XP => [1], H[GAP < [1] >]
     d. hd-word-ph:    X[LEX +] => [word], H

 

(3a) states that when a head combines with one of its arguments, the result is a well-formed phrase. (3b) allows a head to combine with a phrase that modifies it. (3c) allows a head phrase containing a gap to combine with a filler for that gap. (3d) forms a word-level syntactic element by combining a head with a word. This last condition, not found in languages like English, forms the various types of complex predicates frequently found in Korean. This simple X' syntax generates either unary or binary syntactic structures and can thus capture the major syntactic structures, including scrambling, in a straightforward manner.
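How (3a) and (3b) together license every verb-final order can be sketched with a toy recognizer. It builds binary phrases from the verb leftward: each preverbal element either discharges one remaining argument (hd-arg-ph) or attaches as a modifier (hd-mod-ph). The lexicon, the function name, and the saturation check are assumptions for illustration only; the sketch ignores pro drop, gaps, and complex predicates, and it is not the LKB parser.

```python
from itertools import permutations

# Hypothetical toy lexicon: the verb's ARG-ST and one adverbial modifier.
ARG_ST = {"John-i", "haksayng-tul-eykey", "yenge-lul"}
MODIFIERS = {"mayil"}
VERB = "kaluchiessta"


def licensed(words):
    """Check a verb-final string by building binary phrases right to left."""
    *preverbal, head = words
    if head != VERB:
        return False
    remaining = set(ARG_ST)       # arguments the head projection still needs
    for word in reversed(preverbal):
        if word in remaining:     # hd-arg-ph: discharge one argument
            remaining.remove(word)
        elif word in MODIFIERS:   # hd-mod-ph: requirements stay unchanged
            continue
        else:
            return False          # neither an argument nor a modifier
    return not remaining          # saturated: every argument was discharged


orders = [list(p) + [VERB] for p in permutations(sorted(ARG_ST | MODIFIERS))]
print(sum(licensed(o) for o in orders), "of", len(orders), "orders licensed")
```

Because each step combines exactly two daughters, no separate rule is needed per word order; the same two conditions cover all orderings.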

 

More formally, in the implementation of the KRG in the LKB, the phrase condition hd-arg-ph (head-argument-phrase) is written as follows:

 

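; Head-argument rule: the head (second daughter) combines with the first
; member of its ARG-ST list (#1); the mother keeps the remainder (#2).
head-arg-rule-1 := hd-arg-ph &
 [ SYN.VAL.ARG-ST #2,
   ARGS < #1 & [ SYN.HEAD.PRD - ],
          syn-st & [ SYN.VAL.ARG-ST [ FIRST #1,
                                      REST  #2 ] ] > ].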

 

 

This description, a direct translation of the KRG, specifies that there are two elements in the ARGS list. The second element is the head, whose ARG-ST list has #1 as its first member and #2 as the remainder. When this head combines with #1, the resulting phrase still requires only the arguments in #2. This allows the arguments to be discharged one by one, generating binary structures. This condition, combined with the hd-mod-ph, can generate all 24 word order possibilities for sentence (1). The actual parse tree of sentence (2a) in the current KRG system and its MRS semantic representation are as follows.

[Figure: parse tree of sentence (2a) and its MRS representation]

 

 

 

 

 

 

 

As observed here, the LKB is a powerful and efficient tool that allows a hands-on implementation of the Korean Resource Grammar, built upon the typed feature structure formalism of HPSG.

 

Areas to be developed during the proposed project period: The Korean Resource Grammar has been under development since October 2002 in collaboration with a computer scientist at the School of Computer Science, Kangnam University. It currently covers phenomena such as basic clause syntax, free word order (scrambling), case marking, adverb modification, topicalization, relative clauses, auxiliary constructions, light verb constructions, and complex sentences, among others. The results were presented at a LinGO lab weekly meeting in February 2003 and at a linguistics conference in Korea, and they will also be presented at an international conference this coming October. The two previous reports on the ongoing project were very well received by researchers in the field. In particular, attention has centered on the precision of the grammar and its efficient parsing results: the mean number of edges for the 300 sample sentences is only 1.48, much lower than in existing systems. The first phase of the current LinGO Korean Resource Grammar has thus achieved substantial coverage of the major constructions of Korean, providing a promising direction for future work.

 

Even though we have reached an unexpectedly high coverage of real data given the short period of research, many areas of course still require further development. These can be summarized as follows:

 

refining the current Korean Resource Grammar (pro drop, case, relativization, light verb constructions, etc.)

developing fine-grained semantics using MRS that can capture scope, event structures, message types, and the linking between syntax and semantics

incrementally increasing coverage of clause-internal syntax in Korean (e.g., coordination, different types of long-distance dependencies, pro drop, honorification, tense, aspect, etc.)

incorporating the use of default entries for words unknown to the Korean HPSG lexicon

testing against real corpora and further expanding coverage

linking the Korean Resource Grammar and the English Resource Grammar for applications such as machine translation

 

 

Significance: The successful completion of this project would yield the following results:

 

A precise Korean grammar that can be used for general purposes, including research and computational applications. Considering that research on Korean grammar has been dominated by the Principles and Parameters framework, this new constraint-based grammar would broaden theoretical perspectives on the languages involved (English and Korean). In addition, it would boost research on computational implementations of Korean grammar.

 

A syntactic parser paired with semantic representations. Previous research on building parsing systems has focused only on syntactic aspects, paying little attention to robust semantic representations. This has made it difficult to develop real-time applications such as large-scale written and spoken language translation. The syntax-semantics system this project aims for will narrow this gap.

 

A sentence generator that can be used for a speech prosthesis system. One distinctive property of the LinGO system is that it allows the generation of sentences, a capability rarely found in existing systems. The generation system could lead to the development of computer-aided speech generation for people who cannot speak because of disability.

 

 

Evaluation and dissemination: The first step in evaluating our system is the SERI test suite built by researchers at ETRI. Its 600 sample sentences are the key sentences most frequently discussed in the literature on Korean linguistics. The current KRG covers about 60% of these sentences. The project will extend this to full coverage, with proper semantics as well. In addition to this test suite, the project results will be evaluated against the Syntactic Structure Corpus built by the Sejong 21 Project.

 

The project results will be presented at international conferences on linguistics and computational linguistics, such as COLING, PACLIC, and ACL.

 

More importantly, at the end of this project (August 2005), all the results of this research will be made available online (temporary site: web.khu.ac.kr/~jongbok/projects.html). Most previous research results, in particular the source files of Korean parsing systems, have been off-limits to other researchers or even confidential. The open-source policy of this project will help researchers and students in the field and encourage further research.

 

 

 

 

Selected References:

 

 

Copestake, Ann. 2002. Implementing Typed Feature Structure Grammars. Stanford: CSLI Publications.

 

Copestake, Ann and Dan Flickinger. 2000. An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the Second Conference on Language Resources and Evaluation (LREC-2000). Athens, Greece.

 

Copestake, Ann, Dan Flickinger, Ivan A. Sag, and Carl J. Pollard. 2001. Minimal Recursion Semantics: An Introduction. Ms. Stanford University.

 

Flickinger, Dan. 2002. On building a more efficient grammar by exploiting types. In Stephan Oepen, Dan Flickinger, Jun'ichi Tsujii, and Hans Uszkoreit (eds.), Collaborative Language Engineering. Stanford: CSLI Publications, pp. 1-17.

 

Kang, Sung-Sik. 1998. Problems in Korean Language Processing and Methods. Paper presented at the 8th Korean Language Processing Conference (in Korean).

 

Kim, Jong-Bok. 2000. The Grammar of Negation: A Constraint-Based Approach. Stanford: CSLI Publications.

 

Kim, Jong-Bok and Jaehyung Yang. 2003a. Korean Phrase Structure Grammar in Constraint Based Grammar and Building a Syntactic Parser with the Linguistic Knowledge Building System (in Korean). Korean Linguistics 20: 1-40.

 

Kim, Jong-Bok and Jaehyung Yang. 2003b. Korean Phrase Structure Grammar and Its Implementations into the LKB System. Paper to be presented at the 17th Pacific Asia Conference on Language, Information, and Computation, Singapore National University, October 1-3, 2003.

 

Pollard, Carl and Ivan Sag. 1994. Head-driven Phrase Structure Grammar. Chicago: University of Chicago Press.

 

Sag, Ivan, Tom Wasow, and Emily Bender. 2003. Syntactic Theory: A Formal Introduction (2nd ed.). Stanford: CSLI Publications.

 

Siegel, Melanie. 2000. HPSG Analysis of Japanese. In W. Wahlster (ed.), Verbmobil: Foundations of Speech-to-Speech Translation. Springer Verlag.

 

Siegel, Melanie and Emily M. Bender. 2002. Efficient Deep Processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization. Coling 2002 Post-Conference Workshop. Taipei, Taiwan.