Hi guys, I’m trying to tokenize Japanese sentences in order to give my users the best experience, and I’m currently hitting a wall.
For the sentence “捕まえろ-何人いる?”, ChatGPT gives:
“捕まえろ” – This is a Japanese verb meaning “catch” or “arrest”.
“-” – This is a punctuation mark used as a separator.
“何人” – This is a Japanese term meaning “how many people”.
“いる” – This is a Japanese verb meaning “to be” or “to exist”, often used when referring to people.
“?” – This is a question mark, indicating that the sentence is a question.
But I’m trying to use JavaScript tools instead. One is native, the other is a package.
The native one gives me:
[“捕ま”, “えろ”, “何人”, “いる”]
So it seems to have split the first word in two.
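For reference, a native word split in JavaScript can be sketched like this, assuming the native tokenizer in question is `Intl.Segmenter` with `granularity: "word"` (that is a guess, it could be something else). The helper name `segmentWords` is made up for illustration.

```javascript
// Assumption: the "native" tokenizer is Intl.Segmenter (available in
// modern browsers and recent Node.js versions).
// segmentWords is a hypothetical helper name, not part of any library.
function segmentWords(text, locale = "ja") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text))
    // isWordLike is false for punctuation and whitespace segments
    .filter((part) => part.isWordLike)
    .map((part) => part.segment);
}

const sentence = "捕まえろ-何人いる?";
console.log(segmentWords(sentence));
```

The exact split depends on the word-break dictionary shipped with the runtime, so the output may differ between browsers and Node.js versions.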
And the [https://github.com/WaniKani/WanaKana](https://github.com/WaniKani/WanaKana) package gives this:
[“捕”, “まえろ”, “何人”, “いる”]
So it split the first word right after its first character, then returned the remaining three pieces.
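That result is consistent with cutting wherever the character type changes (kanji to hiragana and so on), which is how WanaKana's `tokenize` appears to behave. Here is a minimal sketch of that idea. It is not WanaKana's actual code, and `scriptOf` and `splitByScript` are hypothetical names.

```javascript
// Hypothetical helper: classify a single character by its Unicode script.
function scriptOf(ch) {
  if (/\p{Script=Han}/u.test(ch)) return "kanji";
  if (/\p{Script=Hiragana}/u.test(ch)) return "hiragana";
  if (/\p{Script=Katakana}/u.test(ch)) return "katakana";
  return "other"; // punctuation, Latin letters, digits, ...
}

// Hypothetical helper: start a new token whenever the script changes.
function splitByScript(text) {
  const tokens = [];
  let current = "";
  let currentType = null;
  for (const ch of text) {
    const type = scriptOf(ch);
    if (current !== "" && type !== currentType) {
      tokens.push(current);
      current = "";
    }
    current += ch;
    currentType = type;
  }
  if (current !== "") tokens.push(current);
  return tokens;
}

const tokens = splitByScript("捕まえろ-何人いる?")
  // drop punctuation so only Japanese runs remain
  .filter((token) => scriptOf(token[0]) !== "other");
console.log(tokens); // ["捕", "まえろ", "何人", "いる"]
```

Since 捕 is a kanji and まえろ is hiragana, a split like this can never keep 捕まえろ together, whatever the word boundaries actually are.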
I’m honestly a bit lost. I think the only one that got it right is ChatGPT, and I don’t need to tell you why that’s bad news: it’s an external service that you need to pay for, etc.
Anyway, thanks in advance guys!
by Jaedong9