Here’s info about the corpus in English:
https://www.ninjal.ac.jp/english/research/cr-project/project-3/institute/spoken-language/
A snippet from that page:
“Our project will develop a large-scale corpus of Japanese everyday conversation in a balanced manner. Since informants record their conversations in everyday situations by themselves, naturally occurring conversations can be collected. To build an empirical foundation for the corpus design, we conducted a survey of ordinary conversational behavior of about 250 adults.”
Here’s the page where you can download the spreadsheet (in a zip file); it’s the second link:
https://www2.ninjal.ac.jp/conversation/cejc/cejc-wc.html
File name of the Excel Vocab Frequency List, once you extract the zip: 3_cejc_frequencylist_suw_token.xlsx
Some machine-translated snippets from the explanatory PDF about the word list (the first download link on the page):
“This is a vocabulary and word count table for the Corpus of Everyday Japanese Conversation (CEJC), based on 200 hours of recorded data (approximately from April 2016 to 2020).
◆ Attributes of Conversation
Formality
Form:
➢ 「雑談」 “Chit-chat” is a conversation in which the purpose and topic of the conversation are not predetermined.
➢ 「用談・相談」 “Discussion” is a conversation in which the purpose of the conversation is fixed to some extent but the time and place are not specified.
➢ 「会議・会合」 “Meeting” is a conversation in which the time and place are fixed.
➢ 「授業・レッスン」 “Lesson” is a conversation during a class or a lesson.
Location:
➢ The place where the conversation takes place is 「自宅」 “home”, 「職場」 “work”, 「学校」 “school”, 「公共商業施設」 “public commercial facilities”, 「交通機関」 “transportation”, or 「室内」 “indoor” or 「屋外」 “outdoor”.
Attributes of the speaker
Gender and age.
➢ “Age” in five-year age increments.
7. Notes on Use
(1) The data can be used free of charge for research and educational purposes. No application is required.
(2) Redistribution is not permitted. Consult with us for commercial use (use for profit).”
1 comment
This is great, thank you. Finally a word list based on real life frequency rather than culled from anime or whatever.
For non-spreadsheet users, there’s also a tab-separated file in the zip that will be much easier to parse programmatically.
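In case it helps anyone, here’s a minimal Python sketch of reading a tab-separated file like that one with the standard csv module. I haven’t checked the actual file name, encoding, or column headers, so the path and the column names in the usage comment are placeholders (assumptions), and the sketch assumes the frequency column holds plain integers.

```python
import csv


def load_frequency_rows(tsv_path, encoding="utf-8"):
    """Read a tab-separated file with a header row into a list of dicts.

    The encoding is an assumption; change it if the file fails to decode.
    """
    with open(tsv_path, newline="", encoding=encoding) as f:
        # QUOTE_NONE treats quote characters as ordinary text,
        # which is safer for word lists that may contain them.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def top_words(rows, word_col, freq_col, n=10):
    """Return the n (word, frequency) pairs with the highest frequency."""
    pairs = [(row[word_col], int(row[freq_col])) for row in rows]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:n]


# Hypothetical usage; replace the path and column names with the real ones:
# rows = load_frequency_rows("path/to/extracted_file.tsv")
# print(rows[0].keys())  # inspect the real column headers first
# for word, freq in top_words(rows, "word_column", "frequency_column"):
#     print(word, freq)
```

Printing the keys of the first row is a quick way to see what the columns are actually called before picking the word and frequency columns.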