This package uses the CRFClassifier class (a conditional random field sequence classifier) to perform Chinese word segmentation.
On the Stanford NLP machines, usable properties files can be found at:
/u/nlp/data/chinese-segmenter/Sighan2005/prop
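Usage, for simplified Chinese ($CH_SEG is the directory whose data subdirectory holds the dictionaries, normalization table, and model; $file is the input file; $enc is its encoding):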
java -mx200m edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict $CH_SEG/data -NormalizationTable $CH_SEG/data/norm.simp.utf8 -normTableEncoding UTF-8 -loadClassifier $CH_SEG/data/ctb.gz -testFile $file -inputEncoding $enc
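The command above takes three variable inputs ($CH_SEG, $file, $enc); everything else is fixed. As a rough illustration, the sketch below assembles the same argument list in Java and prints it without running anything. The class name SegmenterCommand, the method buildCommand, and the example values in main are hypothetical and not part of this package; only the flags and file names are taken from the usage line.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the package: it only builds and prints
// the command line from the usage example; it does not run the segmenter.
public class SegmenterCommand {

    // chSeg stands in for $CH_SEG, file for $file, enc for $enc.
    static List<String> buildCommand(String chSeg, String file, String enc) {
        String data = chSeg + "/data";
        return Arrays.asList(
            "java", "-mx200m", "edu.stanford.nlp.ie.crf.CRFClassifier",
            "-sighanCorporaDict", data,
            "-NormalizationTable", data + "/norm.simp.utf8",
            "-normTableEncoding", "UTF-8",
            "-loadClassifier", data + "/ctb.gz",
            "-testFile", file,
            "-inputEncoding", enc);
    }

    public static void main(String[] args) {
        // Assumed example values; substitute your own paths and encoding.
        List<String> cmd = buildCommand("/path/to/segmenter", "input.txt", "UTF-8");
        System.out.println(String.join(" ", cmd));
    }
}
```

If the list is later handed to a process launcher such as java.lang.ProcessBuilder, keeping each flag and value as a separate element avoids shell quoting problems with paths that contain spaces.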
@author Pi-Chuan Chang
@author Huihsin Tseng
@author Galen Andrew