Project Title: Developing a Korean Resource Grammar and Implementing It into the Linguistic
Background: The advent of the information era in this century has escalated the importance of processing linguistic information more precisely and correctly. Recent developments in artificial intelligence, information sciences and other high technology activities have made it possible to build feasible computational applications for language processing and understanding. Such applications (e.g. message extraction systems, web-based search engines, machine translation and dialogue understanding systems) demand increasing accuracy and robustness of the grammar (or parsers) combined with sophisticated statistical processing methods.
considering the reality that the basic units in understanding language are
could not miss the fact that building a reliable syntactic and semantic
parser is a prerequisite for language processing. Although several successful morphological analyzers have been developed for Korean, no serious attempt has been made to build a syntactic or semantic parser for it, partly because of the language's structural complexity and partly because no reliable grammar-building system has been available.
As observed by Kang (1998), the research on syntactic and semantic
[Table: Research and Application Stage]
As represented in the table, research on the development of English syntactic and semantic parsers has reached a level that even allows real-time applications. For example, in past projects, the ERG (English Resource Grammar), a part of the LinGO project at CSLI (Center for the Study of Language and Information), was used in the Verbmobil machine translation system and in an NSF-funded project on computer-aided speech generation for people who cannot speak because of disability (cf. Copestake and Flickinger 2000, Flickinger 2002). However, in Korea few reliable applications exist, in particular for processing between English and Korean in either direction.
The urgent need to advance the lagging research on Korean syntactic and semantic parsers provides the very motivation for this project.
Objectives: The objective of this project is thus to build a general-purpose system for processing the Korean language that will support both research and practical applications. The goal includes building a broad-coverage Korean grammar that can be used both to extract precise meanings from text input and to generate well-formed text output. To achieve this goal, the project will develop a computationally feasible Korean Resource Grammar and implement it in the LKB (Linguistic Knowledge Building) system developed by the LinGO (Linguistic Grammars Online) Lab researchers at CSLI (Center for the Study of Language and Information).
Korean Resource Grammar: The Korean Resource Grammar (KRG) to be developed in this project is a computational grammar for Korean that has been under development since October 2002. Its aim is to provide an open-source grammar of Korean. The grammatical framework for the KRG is the constraint-based grammar HPSG (Pollard and Sag 1994, Sag, Wasow, and Bender 2003). HPSG (Head-driven Phrase Structure Grammar) is built upon a non-derivational, constraint-based, and surface-oriented grammatical architecture. Though HPSG shares with the so-called P&P (Principles and Parameters) grammar the idea that interaction between lexical entries and a set of parameterized principles determines grammatical well-formedness, it has one fundamental architectural difference: there are no derivational or transformational operations involved. Unlike the P&P framework, where distinct levels of syntactic structure are sequentially derived by means of the transformational operation Move-α (affecting both phrasal categories and heads), HPSG has no notion of deriving one structure from another. It employs a concrete conception of constituent structure, a limited set of universal principles (e.g., the Head Feature Principle and the Valence Principle), and enriched lexical representations (cf. Pollard and Sag 1994, Kim 2002, Sag, Wasow, and Bender 2003).
In addition, HPSG is a constraint-based, lexicalist approach to grammatical theory that seeks to model human languages as systems of constraints on typed feature structures. In particular, the grammar adopts the mechanism of a type hierarchy in which every linguistic sign is typed with appropriate constraints and hierarchically organized. This property of typed feature structure formalisms facilitates the extension of the grammar in a systematic and efficient way, resulting in linguistically precise and theoretically motivated descriptions of languages, including Korean. HPSG is thus well suited to the task of multilingual development of broad-coverage grammars.
The Korean Resource Grammar, developed as an extension of HPSG, will also have broad lexical and syntactic coverage, so that it can be used in application products such as an automatic email response system. In addition, the grammar adopts a flat semantic formalism, Minimal Recursion Semantics (MRS), for representing semantics (Copestake et al. 2001). MRS offers an interface between syntax and semantics using feature structures. The formalism has syntactically flat structures and at the same time makes it possible to handle scope relations. The semantics is being developed in close cooperation with the LinGO English Resource Grammar.
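To make the idea of a flat representation with separately stated scope more concrete, here is a minimal Python sketch. It is not the KRG or LKB encoding: the class names, role names, predicate names, and the English example "every student read a book" are all assumptions chosen for illustration. Each elementary predication carries a handle label, arguments are only variables or handles (never nested predications), and scope is recorded as a separate list of constraints between holes and labels.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EP:
    """One elementary predication (hypothetical encoding)."""
    label: str            # handle naming this predication
    pred: str             # predicate name
    args: Dict[str, str]  # role -> variable or handle, never a nested EP


@dataclass
class MRS:
    rels: List[EP]
    # (hole, label) pairs: the hole is filled, possibly via quantifiers, by label
    hcons: List[Tuple[str, str]] = field(default_factory=list)

    def is_flat(self) -> bool:
        """Flat means every argument is a plain variable or handle name."""
        return all(isinstance(v, str) for ep in self.rels for v in ep.args.values())

    def open_holes(self) -> List[str]:
        """Handle-valued arguments with no constraint: scope left underspecified."""
        labels = {ep.label for ep in self.rels}
        constrained = {hole for hole, _ in self.hcons}
        holes = [v for ep in self.rels for v in ep.args.values() if v.startswith("h")]
        return [h for h in holes if h not in labels and h not in constrained]


# "every student read a book" (assumed example, not from the KRG)
mrs = MRS(
    rels=[
        EP("h1", "every", {"BV": "x", "RSTR": "h2", "BODY": "h3"}),
        EP("h4", "student", {"ARG0": "x"}),
        EP("h5", "a", {"BV": "y", "RSTR": "h6", "BODY": "h7"}),
        EP("h8", "book", {"ARG0": "y"}),
        EP("h9", "read", {"ARG0": "e", "ARG1": "x", "ARG2": "y"}),
    ],
    hcons=[("h2", "h4"), ("h6", "h8")],
)

print(mrs.is_flat())
print(mrs.open_holes())
```

In this sketch the two quantifier bodies (h3 and h7) stay unconstrained, which is how a single flat structure can stand for both scope readings until something resolves them.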
Tool Writing: The basic tool for writing, testing, and processing the Korean
Resource Grammar is the LKB (
Status Quo of the Korean Resource Grammar: The Korean Resource Grammar consists of grammar rules, inflection rules, lexical rules, type definitions, and a lexicon. At this stage it includes 300 sentences and a lexicon of some 500 words whose properties are organized in a hierarchy of about a hundred types. The grammar also includes a limited set of phrasal rules and types organized in a type hierarchy, providing coverage of the most familiar phenomena found in ordinary Korean (Kim 2003a, 2003b).
One example will suffice to demonstrate the efficiency of the grammar developed so far. One of the most complicated facts about Korean is that it allows free sentence-internal scrambling. For example, observe sentence (1):
(1) mayil John-un haksayng-tul-eykey yenge-lul [kaluchi-ess-ta]
Everyday John-Top students-Pl-Dat English-Acc teach-Past-Decl
‘John taught English to the students every day.’
Of the five syntactic elements here, the verb stays in clause-final position, so the four preceding elements can be reordered freely, which yields 24 (4!) different scrambling possibilities. A few of these orderings are given here:
(2) a. John-i mayil haksayng-tul-eykey yenge-lul kaluchiessta.
b. haksayng-tul-eykey John-i mayil yenge-lul kaluchiessta.
c. yenge-lul John-i mayil haksayng-tul-eykey kaluchiessta.
d. mayil haksayng-tul-eykey John-i yenge-lul kaluchiessta.
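The figure of 24 can be checked by holding the verb in final position and permuting the four elements that precede it. The following is a minimal Python sketch of that enumeration using the word forms of (2a); the variable names are illustrative only and nothing here is part of the KRG itself.

```python
from itertools import permutations

# The four pre-verbal elements of (2a); the verb stays clause-final.
preverbal = ["John-i", "mayil", "haksayng-tul-eykey", "yenge-lul"]
verb = "kaluchiessta"

# Every ordering of the pre-verbal elements, each followed by the verb.
orders = [" ".join(p) + " " + verb for p in permutations(preverbal)]

print(len(orders))  # 24
for sentence in orders[:3]:
    print(sentence)
```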
The most effective grammar would no doubt be one that can capture all such scrambling possibilities with a minimal processing load. In the KRG at this stage, this flexibility of Korean syntax is captured by the interaction between lexical information and a limited set of well-formedness conditions on phrases. Unlike the grammar of English (and the Japanese grammar JACY of Siegel and Bender 2002, Siegel 2000), the KRG assumes just the following informally represented phrasal well-formedness conditions:
(3) Korean X' Syntax (simplified):
a. hd-arg-ph: XP => #1, H[ARG-ST < ... #1 ... >]
b. hd-mod-ph: XP => [MOD #1], #1 H
c. hd-filler-ph: XP => #1, H[GAP < #1 >]
d. hd-word-ph: X[LEX +] => [word], H
(3a) means that when a head combines with one of its arguments, the resulting phrase is well-formed. (3b) allows a head to combine with a phrase that modifies it. (3c) allows a head containing a gap to form a phrase with the filler of that gap. (3d) forms a word-level syntactic element by combining a head with a word. This last condition, not found in languages like English, forms the various types of complex predicates frequently found in the language. This simple X’ syntax generates either unary or binary syntactic structures and thus can capture the major syntactic structures, including scrambling cases, in a straightforward manner.
To be more formal, in the implementation of the KRG into the LKB, the phrase condition of hd-arg-ph (head-argument-phrase) is written as follows:
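; Head-argument rule: the head (second daughter) combines with the first
; member of its ARG-ST list (#1); the remainder (#2) is passed up to the mother.
head-arg-rule-1 := hd-arg-ph &
  [ SYN.VAL.ARG-ST #2,
    ARGS < #1 & [ SYN.HEAD.PRD - ],
           syn-st & [ SYN.VAL.ARG-ST [ FIRST #1,
                                       REST #2 ] ] > ].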
This description, a direct translation of the KRG condition, specifies that there are two elements in the ARGS list. The second element is the head, whose ARG-ST list has #1 as its first member and #2 as the rest. When this head combines with #1, the resulting phrase requires only #2. The arguments are thus discharged one by one, generating binary structures. This condition, combined with the hd-mod-ph, can generate all 24 word order possibilities for sentence (1). The following examples are the actual parsed tree of sentence (2a) in the current KRG system and its MRS semantic representation.
As observed here, the LKB is a powerful and efficient tool that allows a hands-on implementation of the Korean Resource Grammar, built upon the typed feature structure formalism of HPSG.
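The one-by-one discharge of arguments and its interaction with modification can be sketched in a few lines of Python. This is only an illustration of the informal conditions hd-arg-ph and hd-mod-ph, not the LKB's unification machinery: the `Sign` class, the role labels ("nom", "dat", "acc", "mod"), and the contents of the verb's argument list are assumptions made for the example. Following the informal head-argument condition, the sketch lets the head discharge any one of its remaining arguments, whereas head-arg-rule-1 as written covers only the first member of the list.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Sign:
    form: str                     # bracketed surface string
    role: Optional[str] = None    # case of an argument, or "mod" for a modifier
    arg_st: Tuple[str, ...] = ()  # arguments a head still requires


def hd_arg_ph(arg: Sign, head: Sign) -> Optional[Sign]:
    """Discharge one required argument; the remaining ones are passed up."""
    if arg.role not in head.arg_st:
        return None
    rest = list(head.arg_st)
    rest.remove(arg.role)
    return Sign(f"[{arg.form} {head.form}]", arg_st=tuple(rest))


def hd_mod_ph(mod: Sign, head: Sign) -> Optional[Sign]:
    """A modifier attaches without changing what the head still requires."""
    if mod.role != "mod":
        return None
    return Sign(f"[{mod.form} {head.form}]", arg_st=head.arg_st)


def parse(preverbal: Sequence[Sign], verb: Sign) -> Optional[Sign]:
    """Build a right-branching binary structure from the verb leftwards."""
    phrase = verb
    for sign in reversed(preverbal):
        phrase = hd_arg_ph(sign, phrase) or hd_mod_ph(sign, phrase)
        if phrase is None:
            return None
    # A complete sentence has no undischarged arguments left.
    return phrase if not phrase.arg_st else None


verb = Sign("kaluchiessta", arg_st=("nom", "dat", "acc"))
words = [
    Sign("John-i", "nom"),
    Sign("mayil", "mod"),
    Sign("haksayng-tul-eykey", "dat"),
    Sign("yenge-lul", "acc"),
]

parsed = [parse(order, verb) for order in permutations(words)]
print(sum(tree is not None for tree in parsed))  # 24
print(parse(words, verb).form)
```

Under these assumptions every one of the 24 orderings receives a binary structure, while a string that leaves an argument undischarged (for example `parse(words[:2], verb)`) is rejected.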
to be developed during the proposed project period: The Korean Resource Grammar
has been under development since October 2002 as a collaboration work
with a computer scientist at the
Even though we have reached an unexpectedly high coverage of real data, considering the short period of research, many areas of course still require further development. These can be summarized as follows:
► refining the current Korean Resource Grammar (pro drop, case, relativization, light verb constructions, etc.)
► developing fine-grained semantics using MRS that can capture scope, event structures, message types, and the linking between syntax and semantics
► incrementally increasing coverage of clause-internal syntax in Korean (e.g., coordination, different types of long-distance dependencies, pro drop, honorification, tense, aspect, etc.)
► incorporating the use of default entries for words unknown to the Korean HPSG lexicon
► testing with real corpus data and further expanding coverage
► linking the Korean Resource Grammar and the English Resource Grammar for applications such as machine translation
Significance: The successful completion of this project would bring us the following results:
► A precise Korean grammar that can be used for general purposes, including research and computational applications. Considering that research on Korean grammar has been dominated by the Principles and Parameters framework, this new constraint-based grammar would theoretically broaden perspectives on the languages (English and Korean). In addition, it would boost research on computational implementations of Korean grammar.
► A syntactic parser paired with semantic representations. Previous research on building parsing systems has focused only on syntactic aspects, paying little attention to robust semantic representations. This has made it difficult to further develop real-time applications such as large-scale written and spoken language translation. The syntax-semantics system this project aims for will narrow this gap.
► A sentence generator that can be used for a speech prosthesis system. One distinctive property of the LinGO system is that it allows generation of sentences, a capability rarely found in existing systems. The generation system could lead to the development of computer-aided speech generation for people who cannot speak because of disability.
Evaluation and dissemination: The first step in evaluating our system is to use the SERI test suite built by researchers at ETRI. Its 600 sample sentences are key sentences that the literature on Korean linguistics has most frequently discussed. The current KRG covers about 60% of these sentences. The project will extend this toward full coverage, with the proper semantics as well. In addition to this test suite, the project results will be evaluated against the Syntactic Structure Corpus built by the Sejong 21 Project.
The project results will be presented at international conferences on linguistics as well as on computational linguistics, such as COLING, PACLIC, and ACL.
More importantly, at the end of this project (August 2005), all the results of this research will be put on-line (temporary site: web.khu.ac.kr/~jongbok/projects.html). We cannot deny that most previous research results, in particular the source files of Korean parsing systems, have been off-limits to other researchers and even kept confidential. The open source policy of this project will surely help researchers and students in the field and encourage further research.
Copestake, Ann. 2002. Implementing Typed Feature Structure Grammars. Stanford: CSLI Publications.
Copestake, Ann and Dan Flickinger. 2000. An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the Second Conference on Language Resources and Evaluation (LREC-2000).
Copestake, Ann, Dan Flickinger, Ivan A. Sag and Carl J. Pollard. 2001. Minimal Recursion Semantics: An Introduction. Ms. Stanford University.
Flickinger, Dan. 2002. On building a more efficient grammar by exploiting types. In Stephan Oepen, Dan Flickinger, Jun'ichi Tsujii and Hans Uszkoreit (eds.), Collaborative Language Engineering. Stanford: CSLI Publications, pp. 1-17.
Kang, Sung-Sik. 1998. Problems in Korean Language Processing and Methods. Paper presented at the 8th Korean Language Processing Conference (in Korean).
Kim, Jong-Bok. 2000. The Grammar of Negation: A Constraint-Based Approach. Stanford: CSLI Publications.
Kim, Jong-Bok and Jaehyung Yang. 2003a. Korean Phrase Structure Grammar in Constraint Based Grammar and Building a Syntactic Parser with the Linguistic
Kim, Jong-Bok and Jaehyung Yang. 2003b. Korean Phrase Structure Grammar and Its Implementations into the LKB System. Paper to be presented at the 17th Pacific Asia Conference on Language, Information, and Computation,
Pollard, Carl and Ivan Sag. 1994. Head-driven Phrase Structure
Sag, Ivan, Tom Wasow, and Emily Bender. 2003. Syntactic Theory: A Formal Introduction (2nd ed). Stanford: CSLI Publications.
Siegel, Melanie. 2000. HPSG Analysis of Japanese. In W. Wahlster (ed.), Verbmobil: Foundations of Speech-to-Speech Translation. Springer Verlag.
Siegel, Melanie and Emily M. Bender. 2002. Efficient Deep Processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization. Coling 2002 Post-Conference Workshop.