Can we trust Tatoeba sentences? My attempt at a rating system.

**TL;DR** If you are using any of the free dictionary apps out there, chances are, the sample sentences all come from the same source: [Tatoeba Project](https://tatoeba.org/). Unfortunately, as you might already know, there’s a catch. Since anyone can freely contribute to Tatoeba, the quality of the sentences and translations can vary widely according to their author’s fluency.

That’s why I’ve been working on a **5-star rating system** to help identify how trustworthy a sentence is. The rating is inferred from factors such as the author’s language skills and the difficulty of the sentence’s vocabulary. You can see it in action in my [app](https://app.kanjiverse.com/) or read the original [blog post](https://blog.kanjiverse.com/can-we-trust-tatoeba-sentences/) with screenshots.

Now let’s delve deeper into the issue with Tatoeba and its Japanese corpus in particular.

### Who is using Tatoeba sentences?

I am not going to single out any one app: I have tried a dozen of the most popular, and they all use Tatoeba as their source. If you are not paying $20\~40 for a *brand* dictionary, it is almost certainly using the *trifecta*: [KANJIDIC](http://www.edrdg.org/wiki/index.php/KANJIDIC_Project) for the kanji, [JMdict](http://www.edrdg.org/jmdict/edict.html) for the definitions, and Tatoeba for the sentences. Kanjiverse is no exception… for now 😉

### How are Tatoeba sentences compiled?

Tatoeba is a collection of sentences and translations contributed by volunteers under free-to-reuse licenses (CC BY 2.0 FR or CC0 1.0). The corpus was built, and is constantly updated, as follows:

* Anyone is free to submit new sentences or provide translations to existing sentences.
* Each contributor has a profile page where they can list the languages they know and how fluent they are on a scale from 0 to 5.
* Members can review, tag, and comment on any sentence.

### What are the caveats of the Tatoeba corpus?

While Tatoeba can be a great source of free learning material, there are many pitfalls one should be aware of:

* Anyone can submit a sentence even if they are not a native speaker of that language; they might have made up the sentence themselves, or copied it from an unreliable source, without knowing whether it sounds natural to a native speaker.
* Anyone can translate a sentence even if their skills are limited in the source language, the target language, or both.
* Translations can be indirect – translations of other translations – increasing the odds of drifting further away from the original meaning.
* Contributors self-assess their language skills and might overestimate their abilities, or not specify their level at all.
* Most of the Japanese sentences come from the [Tanaka Corpus](http://www.edrdg.org/wiki/index.php/Tanaka_Corpus), and owing to the way it was compiled – by students translating textbook sentences – it contains a large number of unnatural sentences and unreliable translations; see [this article](https://www.manythings.org/corpus/warning.html) for more details about the issue.

### Can we assess the trustworthiness of a sentence?

Although most apps do not surface any of this information, the [Tatoeba website](https://tatoeba.org/) does provide us with some ways to alleviate the shortcomings listed above:

* Sentences have a link to their author’s profile where we can see if they are native (5 stars) or fluent (4 stars) in that language.
* Indirect translations are marked as such.
* Anonymously published sentences can be *adopted* by a contributor who can proofread them and confirm if they are natural-sounding.
* Sentences can be marked with an “OK” tag to indicate that they have been reviewed.
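These trust signals can also be checked programmatically against Tatoeba’s downloadable exports. Below is a minimal sketch – the tab-separated layout and column order follow the exports listed on Tatoeba’s downloads page, while the `native_sentences` name and `min_skill` parameter are my own – of keeping only sentences whose author self-reports native-level (5/5) skill:

```python
import csv
from typing import Iterator, Tuple

def native_sentences(sentences_csv: str, user_langs_csv: str,
                     lang: str = "jpn", min_skill: int = 5) -> Iterator[Tuple[str, str, str]]:
    """Yield (id, text, username) for sentences whose author
    self-reports at least `min_skill` (5 = native) in `lang`."""
    # user_languages.csv columns: lang, skill_level, username, details
    skilled = set()
    with open(user_langs_csv, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row[0] == lang and row[1].isdigit() and int(row[1]) >= min_skill:
                skilled.add(row[2])
    # sentences_detailed.csv columns: id, lang, text, username, date_added, date_modified
    with open(sentences_csv, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row[1] == lang and row[3] in skilled:
                yield row[0], row[2], row[3]
```

The same pattern extends to the “OK” tag by filtering against `tags.csv` as well.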

### What is the current state of the corpus?

Here are some statistics I have collected from a recent snapshot (August 6, 2022):

* Number of Japanese sentences: 227,532
* Tanaka Corpus sentences: 148,983 (65% of all sentences!)
* Sentences with non-anonymous author: 122,152
* Sentences with self-assessed user level: 114,456
* Sentences whose author is fluent or native: 108,816
* Sentences whose author is native: 105,396
* Sentences marked as OK: 1,666
* All English translations: 250,252
* Direct English translations: 163,902
* Translations with non-anonymous author: 62,166
* Translator is fluent in English: 52,176
* Translator is at least intermediate in English and Japanese: 10,246
* Translator is at least advanced in English and Japanese: 8,349
* Translator is fluent in English and Japanese: 4,177

Out of 250,252 pairs of Japanese/English sentences, **only 3,705 (1.6%)** have both author and translator fluent in Japanese and English. 🤯
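For the curious, numbers like these can be reproduced by joining the export files. Here is a rough sketch – assuming the same tab-separated exports as above, with `links.csv` holding `(sentence_id, translation_id)` pairs; the `count_fluent_pairs` name and the skill threshold are mine – of counting direct Japanese→English pairs where both author and translator are fluent:

```python
import csv

def count_fluent_pairs(sentences_csv: str, links_csv: str, user_langs_csv: str,
                       src: str = "jpn", tgt: str = "eng", min_skill: int = 4) -> int:
    """Count direct src→tgt pairs where the author is fluent in src and the
    translator is fluent in both. Skills: 0–5, with 4 = fluent, 5 = native."""
    skills = {}  # (username, lang) -> self-assessed skill
    with open(user_langs_csv, encoding="utf-8") as f:
        for lang, skill, user, *_ in csv.reader(f, delimiter="\t"):
            if skill.isdigit():
                skills[(user, lang)] = int(skill)
    meta = {}  # sentence id -> (lang, username)
    with open(sentences_csv, encoding="utf-8") as f:
        for sid, lang, _text, user, *_ in csv.reader(f, delimiter="\t"):
            meta[sid] = (lang, user)
    count = 0
    with open(links_csv, encoding="utf-8") as f:
        for a, b in csv.reader(f, delimiter="\t"):
            if a not in meta or b not in meta:
                continue
            (la, ua), (lb, ub) = meta[a], meta[b]
            if (la == src and lb == tgt
                    and skills.get((ua, src), 0) >= min_skill
                    and skills.get((ub, src), 0) >= min_skill
                    and skills.get((ub, tgt), 0) >= min_skill):
                count += 1
    return count
```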

### How is Kanjiverse’s trustworthiness score calculated?

Each sentence and translation is assigned a 5-star rating based on the following criteria:

* The self-reported **fluency of the author**.
* The self-reported **fluency of the translator** in both the source and target languages.
* The **difficulty of the sentence** – inferred from the rarity of its vocabulary – so that translating an easy sentence does not require the translator to be fluent.
* Other factors, such as whether the sentence has been reviewed, whether it is original or a translation of another sentence, and whether the translation is direct or indirect.
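Kanjiverse’s exact formula is not spelled out here, so treat the following as an illustrative sketch only – the weights, thresholds, and the `star_rating` name are all invented – of how criteria like these could be combined:

```python
def star_rating(author_skill: int, translator_src_skill: int,
                translator_tgt_skill: int, difficulty: float,
                reviewed: bool, direct: bool) -> int:
    """Illustrative 5-star score. Skills on Tatoeba's 0-5 scale;
    difficulty in [0, 1] (0 = only common words, 1 = very rare words)."""
    # Weakest link: a pair is only as trustworthy as its least fluent side.
    weakest = min(author_skill, translator_src_skill, translator_tgt_skill)
    # Easy sentences demand less fluency, so scale the required skill.
    required = 3 + 2 * difficulty           # 3 for easy, up to 5 for hard
    score = 5 - max(0.0, required - weakest)
    score += 0.5 if reviewed else 0.0       # e.g. carries the "OK" tag
    score -= 0.0 if direct else 0.5         # indirect translations drift
    return max(1, min(5, round(score)))
```

The “weakest link” design reflects the idea that a sentence/translation pair is only as trustworthy as its least fluent contributor.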

### Want to see it in action?

All sentences in Kanjiverse are tagged with this 5-star rating so you can instantly assess the quality of a sentence like [this one](https://app.kanjiverse.com/tatoeba/10760549).

Further details, such as the author’s level, can be enabled in [Tatoeba Sentence Page Settings](https://app.kanjiverse.com/account/settings/tatoeba).

The selection of sample sentences displayed under the definition of a Japanese word like [this one](https://app.kanjiverse.com/kotoba/%E6%98%A0%E7%94%BB%3A%E3%81%88%E3%81%84%E3%81%8C) can be customized in [Tatoeba Sentences Card Settings](https://app.kanjiverse.com/account/settings/tatoebas) where you can filter them out by difficulty, rating, and author’s fluency.

### One last disclaimer…

Please be reminded that Kanjiverse is **still in beta**, and so is this rating system: it might overestimate or underestimate the quality of a sentence. Ratings were not reviewed individually by a human; they were attributed by the algorithm based solely on the information available.

I hope I did not offend any of the Tatoeba contributors with unwarranted bad ratings \^\^; If you think your sentence has been unfairly given a low rating, please contact me so I can rectify it. I also encourage contributors who have not specified their language levels on their Tatoeba profile to do so 😉

Let me know what you think: do you use the Tatoeba sentences in your dictionary? Do you find this rating helpful? Or do you prefer to avoid them altogether and do your sentence mining directly from trusted sources?

2 comments
  1. I advise learners to use monolingual corpora generated from native media. I personally use https://kotobank.jp/ in addition to my personal subs2srs sentence bank. Trust in the correctness of the sentences is to me of utmost importance which is why I avoid tatoeba AND ESPECIALLY context.reverso.

  2. Can’t really comment on the quality of the Tatoeba sentences because I don’t really use them, but if anyone is looking for other free sources there are plenty out there.

    * I mostly use the MacOS dictionaries (J-J, J-E) which have example sentences in both dictionaries.
    * Renshuu sentences are free and come mostly from games. There are also user-submitted sentences, but you will know which ones are user-submitted. By the way, there is **already a rating system in Renshuu** for the helpfulness of sentences.
    * Wanikani has example sentences for all of its vocabulary, free to access. Just google “wanikani + {word}”. Like [this](https://www.wanikani.com/vocabulary/%E9%81%A9%E5%BD%93).
    * You could also grab text dumps yourself from your games and books, or search for words in them in online databases like [this](https://trailsinthedatabase.com/) and [this](https://fedatamine.com/ja/).
    * Sentence-specific sites built from native material, such as [this](https://sentencesearch.neocities.org/), [this](https://massif.la/), and [this](https://youglish.com/japanese).

    In my opinion, the quality of a sentence is affected by familiarity, context, ease, brevity, and many other factors that aren’t easily evaluated. I’ll take a sentence from a beginner who paraphrased it from a popular anime scene over a native speaker’s blessed sentence any day.
