Unlike English and other Western languages, written Chinese has no explicit delimiter, such as a blank space, to mark word boundaries. Chinese Word Segmentation (CWS) is therefore a fundamental task, and an acknowledged problem, in Chinese natural language processing. CWS is defined as the process of transforming Chinese text from sequences of characters into sequences of words. In this research project, the state of the art in CWS is systematically investigated and the real difficulties in CWS research are analyzed. The work mainly targets segmentation ambiguity and proposes a CWS system built on two ideas, covering pre-segmentation, ambiguity detection, and overlapping ambiguity disambiguation. The character-by-character maximum matching method performs better than the bi-directional maximum matching method and the Omni-segmentation method: it is able to detect both maximum overlapping ambiguity strings (MOAS) and combination ambiguity strings (CAS), while costing less than the Omni-segmentation method. The web-search and rule-based disambiguation method performs reasonably well for MOAS disambiguation: according to the test results, the precision of MOAS disambiguation is about 89.07% with the web-search method alone, and it increases by 2.2% when two rules are applied.
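To make the bi-directional maximum matching baseline mentioned above concrete, the following is a minimal sketch and not the system described in this project. The lexicon, the function names, and the example sentence are illustrative assumptions. The sketch segments a string greedily from the left and from the right, and flags the string as potentially ambiguous when the two results differ.

```python
def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len):
    """Greedy right-to-left segmentation: always take the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_segment(text, lexicon):
    """Return both segmentations and whether they disagree."""
    max_len = max(len(word) for word in lexicon)
    forward = forward_max_match(text, lexicon, max_len)
    backward = backward_max_match(text, lexicon, max_len)
    return forward, backward, forward != backward


# Hypothetical toy lexicon, chosen only to illustrate an overlapping ambiguity.
TOY_LEXICON = {"研究", "研究生", "生命", "命", "起源"}

forward, backward, ambiguous = bidirectional_segment("研究生命起源", TOY_LEXICON)
print(forward)    # ['研究生', '命', '起源']
print(backward)   # ['研究', '生命', '起源']
print(ambiguous)  # True
```

In this toy case the two directions disagree on "研究生命", so the string is flagged. Agreement between the two directions does not guarantee that a string is unambiguous, which is one limitation of this baseline as a detector.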
|Educational programmes||Computer Science, (Bachelor's/Master's programme) Master's|
|Publication date||31 Aug. 2011|
- maximum overlapping ambiguity string
- Chinese Word Segmentation
- combination ambiguity string