Corpus of Everyday Japanese Conversation: a spreadsheet of 30,000 words ranked by frequency, compiled from recordings of people’s everyday spoken conversations. It can be sorted by type of speech (chit-chat, meeting, etc.), location (home, school, etc.), gender, or age.

Here’s info about the corpus in English:
https://www.ninjal.ac.jp/english/research/cr-project/project-3/institute/spoken-language/

A snippet from that page:
“Our project will develop a large-scale corpus of Japanese everyday conversation in a balanced manner. Since informants record their conversations in everyday situations by themselves, naturally occurring conversations can be collected. To build an empirical foundation for the corpus design, we conducted a survey of ordinary conversational behavior of about 250 adults.”

Here’s the page where you can download the spreadsheet (in a zip file). It’s the second link:
https://www2.ninjal.ac.jp/conversation/cejc/cejc-wc.html

File name of the Excel vocab frequency list, once you extract the zip: 3_cejc_frequencylist_suw_token.xlsx

Some machine-translated snippets from the explanatory PDF about the word list (the first download link on the page):

“The Corpus of Everyday Japanese Conversation (CEJC) is a vocabulary and word count table based on 200 hours of recorded data (approximately from April 2016 to 2020).
◆ Attributes of Conversation

Formality

Form:

➢ 「雑談」 “Chit-chat” is a conversation in which the purpose and topic of the conversation are not predetermined.

➢「用談・相談」 “Discussion” is a conversation in which the purpose of the conversation is fixed to some extent but the time and place are not specified.

➢「会議・会合」 “Meeting” is a conversation in which the time and place are fixed.

➢「授業・レッスン」 “Lesson” is a conversation during a class or a lesson.
Location:


The place where the conversation takes place is 「自宅」 “home”, 「職場」 “work”, 「学校」 “school”, 「公共商業施設」 “public commercial facilities”, 「交通機関」 “transportation”, or 「室内」 “indoor” or 「屋外」 “outdoor”.

Attributes of the speaker

Gender and age.

➢ “Age” in five-year age increments.
7. Notes on Use

(1) The data can be used freely for research and educational purposes free of charge. No application is required.

(2) Redistribution is not permitted. Consult with us for commercial use (use for profit).”

1 comment
  1. This is great, thank you. Finally a word list based on real life frequency rather than culled from anime or whatever.

    For non-spreadsheet users, there’s also a tab separated file in the zip that will be much easier to parse programmatically.
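If you want to go the programmatic route, here is a minimal sketch of reading a tab-separated word list and pulling out the most frequent entries. It uses only Python's standard `csv` module. The column names `lemma` and `frequency` and the sample rows are made up for illustration; the real header names, the file name of the tab-separated file, and its encoding are not given here, so check the explanatory PDF and the file itself before relying on any of this.

```python
import csv
import io


def top_words(tsv_file, count_column, n=10):
    """Return the n rows with the highest numeric value in count_column.

    tsv_file is any open text file (or file-like object) whose first
    line is a tab-separated header. count_column is the header name of
    the column to rank by; the name is supplied by the caller because
    the real column names are not known here.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    rows = []
    for row in reader:
        try:
            row[count_column] = float(row[count_column])
        except (KeyError, TypeError, ValueError):
            # Skip rows where the column is missing or not a number.
            continue
        rows.append(row)
    # Highest count first.
    rows.sort(key=lambda r: r[count_column], reverse=True)
    return rows[:n]


# Tiny made-up sample standing in for the real file (hypothetical columns).
SAMPLE = "lemma\tfrequency\nうん\t300\nそう\t120\nはい\t200\n"

if __name__ == "__main__":
    for row in top_words(io.StringIO(SAMPLE), "frequency", n=2):
        print(row["lemma"], row["frequency"])
```

For the real file you would pass an opened file instead of the sample, for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever you extracted the zip. The UTF-8 encoding is an assumption, so adjust it if the text comes out garbled.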
