Research on Chinese Word Segmentation and Proposals for Improvement

Bo Li

Student thesis: Master's thesis

Abstract

Unlike English and other Western languages, written Chinese has no explicit word delimiter such as a blank space. Chinese Word Segmentation (CWS), the process of transforming Chinese text from sequences of characters into sequences of words, is therefore a fundamental task and an acknowledged problem in Chinese natural language processing. This research project systematically investigates the state of the art in CWS and analyzes the real difficulties in CWS research. The work mainly targets segmentation ambiguity and proposes a CWS system that includes two ideas covering pre-segmentation, ambiguity detection, and overlapping ambiguity disambiguation. The character-by-character maximum matching method performs better than the bi-directional maximum matching method and the Omni-segmentation method: it can detect both maximum overlapping ambiguity strings (MOAS) and combination ambiguity strings (CAS) while costing less than the Omni-segmentation method. The web-search and rule-based disambiguation method performs reasonably well on MOAS disambiguation: according to the test results, the web-search method alone achieves a precision of about 89.07%, and applying two rules raises the precision by 2.2%.
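
To make the ambiguity problem concrete, the following is a minimal sketch of the bi-directional maximum matching baseline that the abstract compares against. It is not the thesis's character-by-character method. The function names, the `max_len` parameter, and the four-word toy lexicon are assumptions chosen for illustration, and characters with no dictionary match are simply emitted as single-character words.

```python
# Toy lexicon (an assumption for illustration, not taken from the thesis).
LEXICON = {"结合", "合成", "成分", "分子"}


def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation: take the longest lexicon word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def has_segmentation_conflict(text, lexicon, max_len=4):
    """Flag the text as ambiguous when the two scan directions disagree."""
    return forward_max_match(text, lexicon, max_len) != backward_max_match(text, lexicon, max_len)


if __name__ == "__main__":
    sample = "结合成分子"
    print(forward_max_match(sample, LEXICON))          # ['结合', '成分', '子']
    print(backward_max_match(sample, LEXICON))         # ['结', '合成', '分子']
    print(has_segmentation_conflict(sample, LEXICON))  # True
```

In this example the two scans disagree because the lexicon words overlap inside the string, which is the overlapping kind of ambiguity the abstract refers to. Agreement between the two directions does not guarantee that a string is unambiguous, so this check can miss cases.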

Education: Computer Science (Bachelor/Graduate Programme), Master
Language: English
Publication date: 31 Aug. 2011
Supervisors: Henning Christiansen

Keywords

  • MOAS
  • CAS
  • OOV
  • out-of-vocabulary
  • maximum overlapping ambiguity string
  • Chinese Word Segmentation
  • combination ambiguity string
  • CWS