Japanese word frequency list dataset

I am looking for a parsable Japanese vocabulary database (ideally JSON, or even CSV) with word frequencies. I am using this database, which is quite good: [https://clrd.ninjal.ac.jp/bccwj/en/freq-list.html](https://clrd.ninjal.ac.jp/bccwj/en/freq-list.html), but I need another one built from a different corpus to compare the results. Ideally the frequency should be a number and not just common/rare etc. I don't need translations, just the word list, the frequency, and details on the word type (kun/on/jukugo or whatever else helps me categorise the word). I need a large dataset (more than 100K words). Any good sources to share? (I know about JMdict, but its frequency data is not good and I do not need translations.)

2 comments
  1. It’s quite individual. Personally, I split vocabulary into 4-5 categories:

    * Common everywhere. Typically grammar words and basic vocabulary such as pronouns. This list isn’t very big, around 1-3k words.
    * Specific to individual areas or genres. Slang is a simple example, but each area typically has at least several thousand words that are much more frequent there or do not appear in other areas at all. For example, we do not see elves and orcs outside of fantasy settings. Each genre has many related nouns, but also verbs: some actions are typical of everyday life, while others happen only in specific situations.
    * Rare everywhere, but known to all natives. I’ve seen multiple such words. For example, 親知らず (wisdom tooth) is much better known than more specific medical terms, but in a lot of dictionaries it ranks around 60-80k by frequency. That rank is several times the size of an average native speaker’s vocabulary.
    * Rare common words. Some words appear in many areas but are just rarer variants of other words. Adjectives are an example: pretty -> beautiful -> gorgeous. Each variant often carries a slight nuance, but it could still appear in many situations. The frequencies of these words are often about the same as those of common words from specific genres.
    * Rare everywhere. Words that never appear in common situations and rarely appear even in specific areas or genres.

    Thus, after the 1-3k most common words, vocabulary generally becomes quite specific. It’s very hard to judge an individual’s vocabulary, because there are literally hundreds of thousands of different words and each person’s experience covers them like a random blot. Because many situations are shared by many natives, there is a big overlap in known words, no matter how frequent those words are. Something that happens only once or twice, but in literally everybody’s life, produces a word that is very rare in text yet known to everybody. And an experience that is very common for one individual might not even be known to another.

    Thus frequency dictionaries usually vary depending on the corpus they are built from.
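To make that difference between corpora visible, here is a minimal sketch of how two frequency lists could be compared. It assumes each list is a tab-separated file with a header row and a raw count per word. The column names `lemma` and `frequency`, the function names, and the tiny sample counts are all made up for illustration, so adjust them to whatever your actual files contain. Counts are normalised to occurrences per million so corpora of different sizes are comparable, and then the log2 ratio shows which words are over-represented in one corpus.

```python
import csv
import io
import math


def load_frequencies(tsv_text, word_col="lemma", count_col="frequency"):
    """Read a tab-separated frequency list into {word: raw count}.

    The column names are assumptions; real lists may use other headers.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {}
    for row in reader:
        word = row[word_col]
        # Sum duplicates, e.g. one lemma listed under several parts of speech.
        counts[word] = counts.get(word, 0) + int(row[count_col])
    return counts


def per_million(counts):
    """Convert raw counts to occurrences per million tokens."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def log_ratio(freq_a, freq_b, floor=0.5):
    """log2(frequency in A / frequency in B) for every word in either list.

    Words missing from one list get a small floor value instead of zero,
    so the ratio stays finite. The floor of 0.5 per million is arbitrary.
    """
    words = set(freq_a) | set(freq_b)
    return {
        word: math.log2(freq_a.get(word, floor) / freq_b.get(word, floor))
        for word in words
    }


# Invented sample data, not taken from any real corpus.
general_tsv = "lemma\tfrequency\nする\t9000\n美しい\t900\n親知らず\t90\nエルフ\t10\n"
fantasy_tsv = "lemma\tfrequency\nする\t8000\n美しい\t1000\nエルフ\t1000\n"

general = per_million(load_frequencies(general_tsv))
fantasy = per_million(load_frequencies(fantasy_tsv))

# Positive values: more typical of the fantasy list; negative: of the general list.
ratios = log_ratio(fantasy, general)
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
for word, value in ranked:
    print(f"{word}\t{value:+.2f}")
```

With real lists, words near zero are the "common everywhere" group, while the extremes are the genre-specific ones. Words that are rare in text but known to everybody will still look rare in both lists, so this kind of comparison cannot detect them.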
