Constraint-based Word Segmentation for Chinese

Research output: Chapter in Book/Report/Conference proceeding › Book chapter › Research

Abstract

Unlike European languages, which separate words with space characters, written Chinese has no separators between words. This creates the Chinese Word Segmentation Problem (CWSP): given a text in Chinese, divide it correctly into segments corresponding to words. Good solutions are in demand for virtually any nontrivial computational processing of Chinese text, ranging from spellchecking through internet search to deep analysis.

Isolating the individual words is usually the first phase in the analysis of a text but, as with many other language analysis tasks, doing it perfectly essentially requires insight into the syntactic and pragmatic content of the text. While this parallelism is easy for a competent human language user, computer-based methods tend to be separated into phases with little or no interaction. Accepting this as a fact means that CWSP becomes a playground for a plethora of different ad hoc and statistically based methods.

In this paper, we present experiments in implementing different approaches to CWSP in the framework of CHR Grammars [Christiansen, 2005], which provide a constraint-solving approach to language analysis. CHR Grammars are based on Constraint Handling Rules, CHR [Frühwirth, 1998, 2009], a declarative, high-level programming language for the specification and implementation of constraint solvers.

Original language: English
Title of host publication: Constraints and Language
Editors: Philippe Blache, Henning Christiansen, Veronica Dahl, Denys Duchier, Jørgen Villadsen
Publisher: Cambridge Scholars Publishing
Publication date: 2014
Pages: 237-251
Chapter: 11
ISBN (Print): 978-1-4438-6052-9
Publication status: Published - 2014

Cite this

Christiansen, H., & Bo, L. (2014). Constraint-based Word Segmentation for Chinese. In P. Blache, H. Christiansen, V. Dahl, D. Duchier, & J. Villadsen (Eds.), Constraints and Language (pp. 237-251). Cambridge Scholars Publishing.