Need help tokenizing Japanese sentences for my learning app


Hi guys, I'm trying to tokenize Japanese sentences in order to give my users the best experience, and I'm currently hitting a wall:

For the sentence "捕まえろ-何人いる?"

ChatGPT gives:

"捕まえろ" – This is a Japanese verb meaning "catch" or "arrest".
"-" – This is a punctuation mark used as a separator.
"何人" – This is a Japanese term meaning "how many people".
"いる" – This is a Japanese verb meaning "to be" or "to exist", often used when referring to people.
"?" – This is a question mark, indicating that the sentence is a question.

But I'm trying to use JavaScript libraries. One is native, the other is a package:

The native one gives me:

["捕ま", "えろ", "何人", "いる"].

So it seems to have split the first word, "捕まえろ", in two.
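For anyone who wants to reproduce this: the post does not name the native API, so the sketch below assumes it is `Intl.Segmenter` with word granularity, which is the built-in word segmenter in modern browsers and Node.js. The helper name `segmentWords` and the `"ja"` locale are illustrative choices, and the exact segments depend on the ICU data shipped with the runtime.

```javascript
// Assumption: the "native" tokenizer is Intl.Segmenter (not stated in the post).
// segmentWords is a hypothetical helper name used only for this sketch.
function segmentWords(text, locale = "ja") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for segments such as punctuation and whitespace,
    // so those are dropped here.
    if (isWordLike) words.push(segment);
  }
  return words;
}

console.log(segmentWords("捕まえろ-何人いる?"));
```

This kind of segmenter works from a dictionary and break rules rather than grammar, which may be why a conjugated verb form such as "捕まえろ" is not kept as one unit.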

And the WanaKana package (https://github.com/WaniKani/WanaKana) gives this:

["捕", "まえろ", "何人", "いる"].

So it split the first word after its first character, and then returned the three other pieces.
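A split right after "捕" is what you get when text is cut wherever the character type changes (kanji to hiragana), with no knowledge of words. The sketch below is not WanaKana's code; it is a minimal, self-contained illustration of that idea using Unicode script property escapes, with hypothetical helper names `scriptOf` and `splitByScript`.

```javascript
// Illustration only: split a string into runs of the same script.
// This is an assumption about the general approach, not WanaKana's implementation.
function scriptOf(ch) {
  if (/\p{Script=Han}/u.test(ch)) return "kanji";
  if (/\p{Script=Hiragana}/u.test(ch)) return "hiragana";
  if (/\p{Script=Katakana}/u.test(ch)) return "katakana";
  return "other"; // punctuation, Latin letters, digits, ...
}

function splitByScript(text) {
  const tokens = [];
  let previous = null;
  for (const ch of text) {
    const current = scriptOf(ch);
    if (current === previous) {
      tokens[tokens.length - 1] += ch; // same script: extend the current run
    } else {
      tokens.push(ch); // script changed: start a new run
    }
    previous = current;
  }
  return tokens;
}

const runs = splitByScript("捕まえろ-何人いる?");
// Keep only the kanji and kana runs, dropping punctuation.
console.log(runs.filter((t) => scriptOf(t[0]) !== "other"));
// → ["捕", "まえろ", "何人", "いる"]
```

Because a verb like "捕まえろ" starts with a kanji and continues in hiragana, any script-based split will cut it after the kanji.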

I'm honestly a bit lost. I think the only one that got it right is ChatGPT, and I don't need to tell you that this is bad news, since it is an external service that you have to pay for, etc.

Anyway, thanks in advance, guys!

by Jaedong9
