I’m looking for a reliable Japanese word segmentation algorithm

I’m working on a book-reading application to help me read Japanese light novels more comfortably. For that, I need a reliable word segmentation algorithm that would take a sentence, split it into words, and preferably assign some information about parts of speech, kana reading, and whatnot.

The important part here is that it should split the text into full words rather than morphemes, as morphological analyzers such as MeCab do (“覗き込んでいる” instead of “覗き”, “込ん”, “で”, “いる”).

For example, this sentence:

“抜けるような青空をバックに、才人の顔をまじまじと覗き込んでいる女の子が言った。”

Should be split into:

[ “抜ける”, “ような”, “青空”, “を”, “バック”, “に”, “、”, “才人”, “の”, “顔”, “を”, “まじまじ”, “と”, “覗き込んでいる”, “女の子”, “が”, “言った。”]

I’ve been searching for quite some time now, but perhaps because I’m new to NLP and morphological analysis, I couldn’t find anything that works reliably. The closest solution I found was Ichiran ([ichi.moe](https://ichi.moe)), but it’s written in Lisp and it’s kinda slow. I also noticed that all of these tools are quite old now (especially MeCab), so something modern like a neural-network-based approach would be nice.


If you know of any such algorithm or implementation, that would help heaps. Any suggestions are welcome. Thanks.

2 comments
  1. Are you asking for a website that will do this for you or are you trying to program something? If it’s the latter, I would use yomichan’s approach of matching a segment of text against the longest applicable result in a dictionary.

  2. I mean calling 覗き込んでいる a single word is an opinionated position, tbh. There are pretty good arguments for describing it as two words (覗き込んで、いる), regarding 覗き込む as a compound word, or three (覗き、込んで、いる), splitting it into the smallest independent units that can stand on their own. Personally I’d lean towards two words, not that any of these are necessarily right or wrong. The problem with the idea of words is that linguists don’t have a well-defined definition of a word, so it’s hard to say what is and isn’t a single word. Morphemes, on the other hand, are much better defined, which is presumably why most segmentation algorithms split Japanese text into morphemes.

     (Also, side note: in your example, you’ve split off all the particles as their own words except for ような. Splitting off the particles is fairly reasonable for your intended application, but it feels weird, since even when Japanese is written with spaces (e.g. Pokemon games), people usually don’t put spaces between a thing and its corresponding particle, I don’t think (maybe I’m misremembering).)

    My advice (if you can’t find a word segmentation algorithm that agrees with your intuition/needs) would be to start with a morpheme segmentation and process that to reduce it to a notion of word segmentation that agrees with your intuition.
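The longest-match idea from the first comment can be sketched roughly like this. The tiny dictionary here is a made-up illustration, not a real lexicon; a real implementation would use a full dictionary such as JMdict and smarter handling of unknown runs of text:

```python
def longest_match_segment(text, dictionary, max_len=10):
    """Split `text` by greedily taking the longest dictionary match at each position."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when nothing longer matches.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary covering part of the example sentence.
dictionary = {"青空", "バック", "女の子", "覗き込んでいる", "を", "に", "が"}

print(longest_match_segment("青空をバックに", dictionary))
# → ['青空', 'を', 'バック', 'に']
```

Note that pure greedy longest-match can over-segment or mis-segment where a longer dictionary entry overlaps a word boundary, which is one reason the established analyzers use lattice-based scoring instead.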
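The post-processing advice in the second comment could look something like this. The (surface, part-of-speech) pairs are hand-written to imitate morpheme output for part of the example sentence, and the POS tags are illustrative assumptions, not MeCab's actual tag set; the idea is just to glue inflection morphemes back onto the preceding token:

```python
# Tags treated as "attaches to the previous token" — conjunctive particle
# (te-form) and auxiliaries. These tag names are assumptions for the sketch.
ATTACHABLE = {"接続助詞", "助動詞"}

def merge_to_words(morphemes):
    """Merge inflection morphemes into the preceding token to form 'words'."""
    words = []
    for surface, pos in morphemes:
        if words and pos in ATTACHABLE:
            words[-1] += surface  # attach to the previous token
        else:
            words.append(surface)
    return words

# 覗き込ん / で / いる — three morphemes, rejoined into one word.
morphemes = [
    ("顔", "名詞"),
    ("を", "格助詞"),
    ("覗き込ん", "動詞"),
    ("で", "接続助詞"),
    ("いる", "助動詞"),
]

print(merge_to_words(morphemes))
# → ['顔', 'を', '覗き込んでいる']
```

The nice thing about this approach is that the merge rules encode *your* notion of a word, so you can tune them (e.g. whether ような stays attached) without touching the underlying analyzer.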
