I’m looking for a reliable Japanese word segmentation algorithm

I’m working on a book-reading application to help me read Japanese light novels more comfortably. For that, I need a reliable word segmentation algorithm that would take a sentence, split it into words, and preferably assign some information about parts of speech, kana reading, and whatnot.

The important part here is that it should split the text into full words rather than morphemes, as morphological analyzers such as MeCab do (“覗き込んでいる” instead of “覗き”, “込ん”, “で”, “いる”).

For example, this sentence:

“抜けるような青空をバックに、才人の顔をまじまじと覗き込んでいる女の子が言った。”

Should be split into:

[ “抜ける”, “ような”, “青空”, “を”, “バック”, “に”, “、”, “才人”, “の”, “顔”, “を”, “まじまじ”, “と”, “覗き込んでいる”, “女の子”, “が”, “言った。”]

I’ve been searching for quite some time now, but perhaps because I’m new to NLP and morphological analysis, I couldn’t find anything that works reliably. The closest solution I found was Ichiran ([ichi.moe](https://ichi.moe)), but it’s written in Lisp and it’s kinda slow. I also noticed that all of these tools are quite old now (especially MeCab), so something modern like a neural-network-based approach would be nice.


If you know of any such algorithm or implementation, that would help heaps. Any suggestions are welcome. Thanks.

2 comments
  1. Are you asking for a website that will do this for you or are you trying to program something? If it’s the latter, I would use yomichan’s approach of matching a segment of text against the longest applicable result in a dictionary.

  2. I mean calling 覗き込んでいる a single word is an opinionated position, tbh. There are pretty good arguments for describing it as two words (覗き込んで、いる), regarding 覗き込む as a compound word, or three (覗き、込んで、いる), splitting it into the smallest independent units that can stand on their own. Personally I’d lean towards two words, not that any of these are necessarily right or wrong. The problem with the idea of words is that linguists don’t have a well-defined definition of a word, so it’s hard to say what is and isn’t a single word. Morphemes, on the other hand, are much better defined, which is presumably why most segmentation algorithms split Japanese text into morphemes.

     (Also, side note: in your example, you’ve split off all the particles as their own words except for ような. Splitting off the particles is fairly reasonable for your intended application, but it feels weird, since even when Japanese is written with spaces (e.g. Pokemon games), people usually don’t put spaces between a thing and its corresponding particle, I don’t think (maybe I’m misremembering).)

    My advice (if you can’t find a word segmentation algorithm that agrees with your intuition/needs) would be to start with a morpheme segmentation and process that to reduce it to a notion of word segmentation that agrees with your intuition.
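The longest-match idea from the first comment can be sketched roughly like this. The tiny dictionary here is a made-up illustration, not a real lexicon; a real implementation would use a full dictionary such as JMdict and smarter handling of unknown runs of text:

```python
def longest_match_segment(text, dictionary, max_len=10):
    """Split `text` by greedily taking the longest dictionary match at each position."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when nothing longer matches.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary covering part of the example sentence.
dictionary = {"青空", "バック", "女の子", "覗き込んでいる", "を", "に", "が"}

print(longest_match_segment("青空をバックに", dictionary))
# → ['青空', 'を', 'バック', 'に']
```

Note that pure greedy longest-match can over-segment or mis-segment where a longer dictionary entry overlaps a word boundary, which is one reason the established analyzers use lattice-based scoring instead.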
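The post-processing advice in the second comment could look something like this. The (surface, part-of-speech) pairs are hand-written to imitate morpheme output for part of the example sentence, and the POS tags are illustrative assumptions, not MeCab's actual tag set; the idea is just to glue inflection morphemes back onto the preceding token:

```python
# Tags treated as "attaches to the previous token" — conjunctive particle
# (te-form) and auxiliaries. These tag names are assumptions for the sketch.
ATTACHABLE = {"接続助詞", "助動詞"}

def merge_to_words(morphemes):
    """Merge inflection morphemes into the preceding token to form 'words'."""
    words = []
    for surface, pos in morphemes:
        if words and pos in ATTACHABLE:
            words[-1] += surface  # attach to the previous token
        else:
            words.append(surface)
    return words

# 覗き込ん / で / いる — three morphemes, rejoined into one word.
morphemes = [
    ("顔", "名詞"),
    ("を", "格助詞"),
    ("覗き込ん", "動詞"),
    ("で", "接続助詞"),
    ("いる", "助動詞"),
]

print(merge_to_words(morphemes))
# → ['顔', 'を', '覗き込んでいる']
```

The nice thing about this approach is that the merge rules encode *your* notion of a word, so you can tune them (e.g. whether ような stays attached) without touching the underlying analyzer.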
