Project Statement
Project Title: Developing a Korean Resource Grammar and
Implementing It into the Linguistic
Background: The advent of the information era in this century has escalated the importance of
processing linguistic information precisely and correctly. Recent developments in artificial
intelligence, information science, and other high-technology fields have made it feasible to build
computational applications for language processing and understanding. Such applications (e.g.,
message extraction systems, web-based search engines, machine translation, and dialogue
understanding systems) demand increasing accuracy and robustness from the grammar (or parser),
combined with sophisticated statistical processing methods.
Given that the basic units in understanding language are sentences, building a reliable syntactic
and semantic parser is a prerequisite for language processing. Although several successful
morphological analyzers have been developed for Korean, no serious attempt has been made to build
a syntactic or semantic parser for it, partly because of the language's structural complexity and
partly because no reliable grammar-building system has been available.
As observed by Kang (1998), the research on syntactic and semantic
processing in
|                         | English                        | Korean                         |
| Morphological Analyzers | Application Stage              | Application Stage              |
| Corpus                  | Application Stage              | Research and Application Stage |
| Syntax/Semantic Parser  | Research and Application Stage | Research Stage                 |
As the table shows, research on English syntactic and semantic parsers has reached a level that
allows even real-time applications. For example, the ERG (English Resource Grammar), a part of
the LinGO project at CSLI (Center for the Study of Language and Information), was used in the
Verbmobil machine translation system and in an NSF-funded project on computer-aided speech
generation for people who cannot speak because of disability (cf. Copestake and Flickinger 2000,
Flickinger 2002). In Korea, however, there are few reliable applications, in particular for
translation from English to Korean or vice versa.
The urgent need to advance the lagging research on Korean syntactic and semantic parsing is the
motivation for this project.
Objectives: The objective of this project is thus to build a general-purpose system for processing
the Korean language that will support both research and practical applications. The goal includes
building a broad-coverage Korean grammar that can be used both to extract precise meanings from
text input and to generate well-formed text output. To achieve this goal, the project will develop
a computationally feasible Korean Resource Grammar and implement it in the LKB (Linguistic
Knowledge Building) system developed by researchers at the LinGO (Linguistic Grammars Online) Lab
at CSLI.
Methodology:
Korean Resource Grammar: The Korean Resource Grammar (KRG) to be developed in this project is a
computational grammar for Korean that has been under development since October 2002. Its aim is
to provide an open-source grammar of Korean. The grammatical framework for the KRG is the
constraint-based grammar HPSG (Head-driven Phrase Structure Grammar; Pollard and Sag 1994, Sag,
Wasow, and Bender 2003), which is built upon a non-derivational, constraint-based, and
surface-oriented grammatical architecture. Though HPSG shares with the so-called P&P (Principles
and Parameters) grammar the idea that the interaction between lexical entries and a set of
parameterized principles determines grammatical well-formedness, it differs in one fundamental
architectural respect: no derivational or transformational operations are involved. Unlike the
P&P framework, where distinct levels of syntactic structure are sequentially derived by means of
the transformational operation Move-α (affecting both phrasal categories and heads), HPSG has no
notion of deriving one structure from another. It employs a concrete conception of constituent
structure, a limited set of universal principles (e.g., the Head Feature Principle and the
Valence Principle), and enriched lexical representations (cf. Pollard and Sag 1994, Kim 2002,
Sag, Wasow, and Bender 2003).
In addition, HPSG is a lexicalist approach to grammatical theory that seeks to model human
languages as systems of constraints on typed feature structures. In particular, the grammar adopts
a type hierarchy in which every linguistic sign is typed with appropriate constraints and
hierarchically organized. This property of typed feature structure formalisms facilitates the
extension of a grammar in a systematic and efficient way, resulting in linguistically precise and
theoretically motivated descriptions of languages, including Korean. HPSG is thus well suited to
the task of multilingual development of broad-coverage grammars.
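As a rough illustration of how a type hierarchy lets a constraint be stated once and inherited by every subtype, the following Python sketch models types with a single parent and a few feature constraints. The type and feature names (sign, word, verb-word, phrase, HEAD, LEX) are assumptions made for this example only, not the KRG's actual definitions, and the typed feature structure unification performed by the LKB is considerably richer than this.

```python
# Toy type hierarchy: each type names one parent and its own constraints.
# All type and feature names here are hypothetical, for illustration only.
HIERARCHY = {
    "sign": (None, {}),
    "word": ("sign", {"LEX": "+"}),
    "verb-word": ("word", {"HEAD": "verb"}),
    "phrase": ("sign", {"LEX": "-"}),
}


def constraints_of(type_name):
    """Collect constraints from the root type down to type_name."""
    parent, own = HIERARCHY[type_name]
    inherited = constraints_of(parent) if parent else {}
    for feature, value in own.items():
        # A subtype may add features but must not contradict a supertype.
        if feature in inherited and inherited[feature] != value:
            raise ValueError(f"{type_name}: conflicting value for {feature}")
        inherited[feature] = value
    return inherited


def subsumes(general, specific):
    """True if `general` is `specific` itself or one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = HIERARCHY[specific][0]
    return False


if __name__ == "__main__":
    print(constraints_of("verb-word"))
    print(subsumes("sign", "verb-word"))
```

Adding a new subtype in this sketch requires stating only what distinguishes it from its parent, which is the sense in which a hierarchy supports systematic grammar extension.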
The Korean Resource Grammar, developed as an extension of HPSG, will also have broad lexical and
syntactic coverage, so that it can be used in application products such as an automatic email
response system. In addition, the grammar adopts a flat semantic formalism, Minimal Recursion
Semantics (MRS), for representing semantics (Copestake et al. 2001). MRS offers an interface
between syntax and semantics using feature structures. The formalism has syntactically flat
structures and at the same time makes it possible to handle scope relations. The semantics is
being developed in close cooperation with the LinGO English Resource Grammar.
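To make "flat" more concrete, here is a minimal sketch assuming a much-simplified picture of an MRS-style representation: an unordered bag of elementary predications, each carrying a label, plus a separate list of scope constraints. The class names, predicate names, and variable names are assumptions for illustration; they do not reproduce the representations actually used in the KRG or the ERG.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EP:
    """One elementary predication: a labelled relation with role arguments."""
    label: str   # handle identifying this predication
    pred: str    # relation name (hypothetical)
    args: tuple  # (role, value) pairs


@dataclass
class FlatSemantics:
    top: str
    eps: list = field(default_factory=list)    # flat bag, no nesting
    hcons: list = field(default_factory=list)  # (hole, label) scope constraints

    def preds_sharing(self, variable):
        """Relations mentioning `variable`; linking is by shared variables."""
        return sorted(ep.pred for ep in self.eps
                      if any(value == variable for _, value in ep.args))


# Simplified semantics for 'John taught English' (all names illustrative).
sem = FlatSemantics(
    top="h0",
    eps=[
        EP("h1", "named_John", (("ARG0", "x1"),)),
        EP("h2", "named_English", (("ARG0", "x2"),)),
        EP("h3", "teach", (("ARG0", "e1"), ("ARG1", "x1"), ("ARG2", "x2"))),
    ],
    hcons=[("h0", "h3")],
)

if __name__ == "__main__":
    print(sem.preds_sharing("x1"))
```

Because the predications are not nested inside one another, scope information lives only in the separate constraint list, which is what allows scope to be left underspecified in a representation of this kind.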
Grammar
Tool Writing: The basic tool for writing, testing, and processing the Korean
Resource Grammar is the LKB (
Status Quo of the Korean Resource Grammar: The Korean Resource Grammar consists of grammar rules,
inflection rules, lexical rules, type definitions, and a lexicon. At this stage it includes a
lexicon of some 500 words, whose properties are organized in a hierarchy of about a hundred
types, and 300 sentences. The grammar also includes a limited set of phrasal rules and types
organized in a type hierarchy, providing coverage of the most familiar phenomena found in
ordinary Korean (Kim and Yang 2003a, 2003b).
One example will suffice to demonstrate the efficiency of the grammar developed so far. One of
the most complicated facts about Korean is that it allows free sentence-internal scrambling.
Consider sentence (1):
(1) mayil John-un haksayng-tul-eykey yenge-lul [kaluchi-ess-ta]
    every.day John-Top student-Pl-Dat English-Acc teach-Past-Decl
    ‘John taught English to the students every day.’
Because the verb stays in sentence-final position, the four remaining elements of the five can be
freely ordered, yielding 24 (4!) different scrambling possibilities. A few of these orderings are
given here:
(2) a. John-i mayil haksayng-tul-eykey yenge-lul kaluchiessta.
    b. haksayng-tul-eykey John-i mayil yenge-lul kaluchiessta.
    c. yenge-lul John-i mayil haksayng-tul-eykey kaluchiessta.
    d. mayil haksayng-tul-eykey John-i yenge-lul kaluchiessta.
    e. ...
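The count of 24 can be checked mechanically. The sketch below, which is only an enumeration and not part of the KRG, assumes that the verb is fixed in final position and that the four pre-verbal elements of (2a) may appear in any order.

```python
from itertools import permutations

# Pre-verbal elements of sentence (2a); the verb stays in final position.
PREVERBAL = ["mayil", "John-i", "haksayng-tul-eykey", "yenge-lul"]
VERB = "kaluchiessta"


def scrambled_orders(preverbal, verb):
    """Return every ordering of the pre-verbal elements, with the verb last."""
    return [" ".join(order + (verb,)) for order in permutations(preverbal)]


if __name__ == "__main__":
    orders = scrambled_orders(PREVERBAL, VERB)
    print(len(orders))
    for sentence in orders[:3]:
        print(sentence)
```

Enumerating strings in this way says nothing about how a grammar licenses them; it only confirms the size of the space a parser has to cover.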
The most effective grammar would no doubt be one that can capture all such scrambling
possibilities with a minimal processing load. In the KRG at this stage, this flexibility of
Korean syntax is captured by the interaction between lexical information and a limited set of
well-formedness conditions on phrases. Unlike the grammar of English (and unlike the Japanese
grammar JACY of Siegel and Bender 2002, Siegel 2000), the KRG assumes just the following
informally represented phrasal well-formedness conditions:
(3) Korean X' Syntax (simplified):
    a. hd-arg-ph:    XP => [1], H[ARG-ST < ... [1] ... >]
    b. hd-mod-ph:    XP => [MOD [1]], H[1]
    c. hd-filler-ph: XP => [1], H[GAP <[1]>]
    d. hd-word-ph:   X[LEX +] => [word], H
(3a) means that when a head combines with one of its arguments, the result is a well-formed
phrase. (3b) allows a head to combine with a phrase that modifies it. (3c) allows a head with a
missing element (a gap) to form a phrase with a filler. (3d) generates a word-level syntactic
element by combining a head with a word. This last well-formedness condition, not found in
languages like English, forms the various types of complex predicates frequently found in the
language. This simple X' syntax generates either unary or binary syntactic structures and can
thus capture the major syntactic structures, including scrambling cases, in a straightforward
manner.
More formally, in the implementation of the KRG in the LKB, the phrase condition hd-arg-ph
(head-argument-phrase) is written as follows:
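; #1 is the argument daughter; #2 is the list of arguments still to be discharged.
; The mother's ARG-ST is #2; the head daughter's ARG-ST is < #1 . #2 >.
head-arg-rule-1 := hd-arg-ph &
[ SYN.VAL.ARG-ST #2,
  ARGS < #1 & [ SYN.HEAD.PRD - ],
         syn-st & [ SYN.VAL.ARG-ST [ FIRST #1,
                                     REST #2 ] ] >
].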
This description, a direct translation of the KRG condition, specifies that there are two
elements in the ARGS list. The second element is the head, whose ARG-ST list has #1 as its first
member and #2 as its remainder. When this head combines with #1, the resulting phrase requires
only #2. This allows the arguments to be discharged one by one, generating binary structures.
This condition, combined with hd-mod-ph, can generate all 24 word order possibilities for
sentence (1). The following are the actual parse tree of sentence (2a) in the current KRG system
and its MRS semantic representation.
As observed here, the LKB is a powerful and efficient tool that allows a hands-on implementation
of the Korean Resource Grammar, built upon the typed feature structure formalism of HPSG.
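The one-by-one discharge of arguments licensed by head-arg-rule-1 can be mimicked outside the LKB in a few lines. The sketch below models only the list bookkeeping of the rule (the daughter matches FIRST, the mother keeps REST). It does not perform unification, and the function name and the order of the ARG-ST members in the usage example are assumptions made for illustration.

```python
def discharge_arguments(head, arg_st):
    """Combine a head with the members of its ARG-ST list one at a time.

    Each step mirrors the rule's bookkeeping: the argument daughter is the
    FIRST member, and the mother keeps only the REST as its ARG-ST.
    Returns the binary-branching tree and a log of (argument, remainder).
    """
    tree = head
    remaining = list(arg_st)
    steps = []
    while remaining:
        first, rest = remaining[0], remaining[1:]
        tree = (first, tree)  # (argument daughter, head daughter)
        steps.append((first, tuple(rest)))
        remaining = rest
    return tree, steps


if __name__ == "__main__":
    # Hypothetical ARG-ST order: the argument nearest the verb comes first.
    tree, steps = discharge_arguments(
        "kaluchiessta", ["yenge-lul", "haksayng-tul-eykey", "John-i"]
    )
    print(tree)
    for argument, remainder in steps:
        print(argument, remainder)
```

Every node built this way has exactly two daughters, and the phrase is saturated once the remainder is empty.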
Areas to be developed during the proposed project period: The Korean Resource Grammar
has been under development since October 2002 as a collaborative effort
with a computer scientist at the
Although we have reached an unexpectedly high coverage of real data given the short period of
research, many areas of course remain that call for further development. These can be summarized
as follows:
►refining the current Korean Resource Grammar (pro drop, case, relativization,
light verb constructions, etc.)
►developing fine-grained semantics using MRS that can capture scope, event structures,
message types, and the linking between syntax and semantics
►incrementally increasing coverage of clause-internal syntax in Korean (e.g.,
coordination, different types of long-distance dependencies, pro drop, honorification,
tense, aspect, etc.)
►incorporating the use of default entries for words unknown to the Korean HPSG lexicon
►testing with real-time corpora and further expanding coverage
►linking the Korean Resource Grammar and the English Resource Grammar for applications
such as machine translation
Significance: The successful completion
of this project would bring us the following results:
►A precise Korean grammar that can be used for general purposes, including research and
computational applications. Considering that research on Korean grammar has been dominated by
the Principles and Parameters framework, this new constraint-based grammar would broaden
theoretical perspectives on the languages (English and Korean). In addition, it would boost
research on computational implementations of Korean grammar.
►A syntactic parser paired with semantic representations. Previous research on building parsing
systems has focused only on syntactic aspects, paying little attention to robust semantic
representations. This has made it difficult to develop real-time applications such as
large-scale written and spoken language translation. The syntax-semantics system this project
aims for will narrow this gap.
►A sentence generator that can be used for a speech prosthesis system. One distinctive property
of the LinGO system is that it allows the generation of sentences, a capability that can hardly
be found in existing systems. The generation system could lead to the development of
computer-aided speech generation for people who cannot speak because of disability.
Evaluation and dissemination: The first step in evaluating our system is the SERI test suite
built by researchers at ETRI. Its 600 sample sentences are key sentences that the literature on
Korean linguistics has most frequently discussed. The current KRG covers about 60% of these
sentences. The project will extend this to full coverage, with proper semantics as well. In
addition to this test suite, the project results will be evaluated against the Syntactic
Structure Corpus built by the Sejong 21 Project.
The project results will be presented at international conferences on linguistics as well as on
computational linguistics, such as COLING, PACLIC, and ACL. More importantly, at the end of this
project (August 2005), all the results of this research will be put online (temporary site:
web.khu.ac.kr/~jongbok/projects.html). Most previous research results, in particular the source
files of Korean parsing systems, have been off-limits to other researchers and even confidential.
The open-source policy of this project will help researchers and students in the field and
encourage further research.
Selected References:
Copestake, Ann. 2002. Implementing Typed Feature Structure Grammars. CSLI Publications.
Copestake, Ann and Dan Flickinger. 2000. An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the Second Conference on Language Resources and Evaluation (LREC-2000).
Copestake, Ann, Dan Flickinger, Ivan A. Sag and Carl J. Pollard. 2001. Minimal Recursion Semantics: An Introduction. Ms. Stanford University.
Flickinger, Dan. 2002. On building a more efficient grammar by exploiting types. In Stephan Oepen, Dan Flickinger, Jun'ichi Tsujii and Hans Uszkoreit (eds.), Collaborative Language Engineering. Stanford: CSLI Publications, pp. 1-17.
Kang, Sung-Sik. 1998. Problems in Korean Language Processing and Methods. Paper presented at the 8th Korean Language Processing Conference (in Korean).
Kim, Jong-Bok. 2000. The Grammar of Negation: A Constraint-Based Approach. Stanford: CSLI Publications.
Kim, Jong-Bok and Jaehyung Yang. 2003a. Korean Phrase Structure Grammar in Constraint Based Grammar and Building a Syntactic Parser with the Linguistic
Kim, Jong-Bok and Jaehyung Yang. 2003b. Korean Phrase Structure Grammar and Its Implementations into the LKB System. Paper to be presented at the 17th Pacific Asia Conference on Language, Information, and Computation,
Pollard, Carl and Ivan Sag. 1994. Head-driven Phrase Structure Grammar.
Sag, Ivan, Tom Wasow, and Emily Bender. 2003. Syntactic Theory: A Formal Introduction (2nd ed). CSLI Publications.
Siegel, Melanie. 2000. HPSG Analysis of Japanese. In W. Wahlster (ed.), Verbmobil: Foundations of Speech-to-Speech Translation. Springer Verlag.
Siegel, Melanie and Emily M. Bender. 2002. Efficient Deep Processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization. Coling 2002 Post-Conference Workshop.