The Way to Pro Full Stack

Do things in a simple way

An implementation of the MMSEG tokenizer algorithm for CJK text, using the Sogou Chinese dictionary.

Implementation Details:

  • The word dictionary is built on a Patricia trie, so lookups are efficient for CJK languages and the dictionary can also be updated dynamically.
  • English words and special characters are handled with Unicode UAX #29, using the implementation in the JDK's java.text package. Given a piece of text, UAX #29 is applied first to split it into English words, special characters and CJK sentences; MMSEG then segments the CJK sentences into words using the Sogou Chinese dictionary.
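
The two points above can be illustrated with a minimal sketch. It is not the library's code: the class name TwoStageSketch, its methods and the four-word toy dictionary are all hypothetical, a plain character trie stands in for the Patricia trie, and stage 2 uses only greedy forward maximum matching rather than the full MMSEG disambiguation rules. Stage 1 uses the JDK's java.text.BreakIterator as described.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of these names come from cjk-mmseg.
class TwoStageSketch {
    // Plain character trie standing in for the Patricia trie (no path compression).
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord;
    }

    final Node root = new Node();

    // Words can be added at any time, which is what allows dynamic dictionary updates.
    void addWord(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.isWord = true;
    }

    // Length of the longest dictionary word starting at index 'from', or 1 if none matches.
    int longestMatch(String text, int from) {
        Node n = root;
        int best = 1;
        for (int i = from; i < text.length(); i++) {
            n = n.next.get(text.charAt(i));
            if (n == null) break;
            if (n.isWord) best = i - from + 1;
        }
        return best;
    }

    // Stage 2 (simplified): greedy forward maximum matching over a run of CJK text.
    List<String> maxMatch(String run) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < run.length(); ) {
            int len = longestMatch(run, i);
            out.add(run.substring(i, i + len));
            i += len;
        }
        return out;
    }

    // Stage 1: split at word boundaries with BreakIterator, dropping whitespace-only pieces.
    static List<String> uaxSegments(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String seg = text.substring(start, end);
            if (!seg.trim().isEmpty()) out.add(seg);
        }
        return out;
    }

    public static void main(String[] args) {
        TwoStageSketch seg = new TwoStageSketch();
        for (String w : new String[] {"我的", "用户名", "邮箱"}) seg.addWord(w);
        System.out.println(uaxSegments("My email address is"));
        System.out.println(seg.maxMatch("我的用户名是"));
        System.out.println(seg.maxMatch("我的邮箱"));
        seg.addWord("我的邮箱"); // a dynamic update changes later segmentation
        System.out.println(seg.maxMatch("我的邮箱"));
    }
}
```

In this sketch the trie walk stops as soon as no longer word can match, so the cost of each lookup is bounded by the longest dictionary word rather than by the size of the dictionary.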

How to use:

Cjk-mmseg is hosted on JCenter:


compile 'com.profullstack:cjk-mmseg:0.0.1'



Then check the test code in

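        CjkMmseg seg = new CjkMmseg();
        String s = "My email address is。我的用户名是christian,我的邮箱是";
        Reader r = new StringReader(s);
        // r still has to be handed to seg before the loop; the call for that is not shown here.
        Word w;
        while ((w = seg.nextWord()) != null) {
            // print w's text, startOffset and endOffset
        }

Output:
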
text: My startOffset: 0 endOffset: 1
text: email startOffset: 3 endOffset: 7
text: address startOffset: 9 endOffset: 15
text: is startOffset: 17 endOffset: 18
text: christian.xiao startOffset: 20 endOffset: 33
text: startOffset: 35 endOffset: 45
text: 我的 startOffset: 47 endOffset: 48
text: 用户名 startOffset: 49 endOffset: 51
text:  startOffset: 52 endOffset: 52
text: christian startOffset: 53 endOffset: 61
text: 我的邮箱 startOffset: 63 endOffset: 66
text:  startOffset: 67 endOffset: 67
text: christian.xiao startOffset: 68 endOffset: 81
text: startOffset: 83 endOffset: 93
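
Judging from the listing above, endOffset appears to be inclusive: "My" spans 0 to 1 and "email" spans 3 to 7. Under that assumption, a token's text can be recovered from the input as sketched below. The OffsetSketch class and its slice helper are hypothetical and not part of cjk-mmseg.

```java
// Hypothetical helper; assumes endOffset is inclusive, as the listing suggests.
class OffsetSketch {
    // Recover a token's text from an inclusive [startOffset, endOffset] pair.
    static String slice(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset + 1);
    }

    public static void main(String[] args) {
        String s = "My email address is";
        System.out.println(slice(s, 0, 1));   // My
        System.out.println(slice(s, 3, 7));   // email
        System.out.println(slice(s, 9, 15));  // address
        System.out.println(slice(s, 17, 18)); // is
    }
}
```

The + 1 is needed because String.substring takes an exclusive end index.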