Algorithms for computing readability/difficulty of Japanese text

Hi,

I’m writing a web application where one of the key features I need to implement is computing a difficulty score for a text. I’m struggling to find much information about it, but here’s what I’ve found so far:

1. cb’s Japanese Text Analysis Tool – [https://sourceforge.net/projects/japanesetextana/files/JapaneseTextAnalysisTool\_v5.1/](https://sourceforge.net/projects/japanesetextana/files/JapaneseTextAnalysisTool_v5.1/). The problem here is that it’s Windows-based, and I need a Linux-based program. I read the readme to see how it computes the readability score; it mentions the Hayashi Score and OBI-2, but both links are broken, and googling doesn’t help much either.
2. [jreadability.net](https://jreadability.net) \- This one has a paper, [“Readability measurement of Japanese texts based on levelled corpora”](https://researchmap.jp/jhlee/published_papers/21426109/attachment_file.pdf), that describes its algorithm. It uses linear regression on JLPT study materials to arrive at the following formula:

X = (mean sentence length × -0.056) + (proportion of kango × -0.126) + (proportion of wago × -0.042) + (proportion of verbs among all words × -0.145) + (proportion of auxiliary verbs × -0.044) + 11.724


* It’s a promising approach since it mentioned: “Firstly, the readability formula we constructed is intended especially for learners of Japanese as a foreign language, whereas many existing formulas such as those by Shibasaki and Hara (2010) and Sato (2011) are intended for native readers of Japanese.”
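The formula itself is trivial to code once you have the per-text statistics from a morphological analyser such as MeCab. A minimal sketch — note I’m assuming the proportions are percentages (0–100); check the paper’s worked examples to confirm the expected scale:

```python
def jreadability(mean_sentence_length, pct_kango, pct_wago,
                 pct_verbs, pct_aux_verbs):
    """Linear readability formula from the jreadability paper.

    The five inputs are per-text statistics (e.g. from MeCab output).
    Assumption: the four proportions are percentages (0-100), not
    fractions -- verify against the paper before relying on the scale.
    """
    return (mean_sentence_length * -0.056
            + pct_kango * -0.126
            + pct_wago * -0.042
            + pct_verbs * -0.145
            + pct_aux_verbs * -0.044
            + 11.724)
```

Higher X means easier text under this model, since every coefficient is negative and the intercept is the ceiling.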


3. Satoshi Sato’s approach – [http://www.lrec-conf.org/proceedings/lrec2014/pdf/633\_Paper.pdf](http://www.lrec-conf.org/proceedings/lrec2014/pdf/633_Paper.pdf). This seems to use some kind of probabilistic model. I tried reading the paper, but I’m not smart enough to follow the methodology. It appears to be the same as the OBI-2 mentioned in cb’s Japanese Text Analysis Tool; however, that link is broken.


I think I might go with the jreadability approach, plus a combination of grammar points used and weighted frequency scoring.
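For the grammar-points component, one rough option is matching the text against a levelled inventory of grammar patterns. A toy sketch — the pattern list, the levels, and the averaging are all made up for illustration; a real inventory would come from a grammar-point database:

```python
import re

# Hypothetical mini-inventory of grammar patterns with JLPT-style
# levels (5 = easiest). Purely illustrative entries.
GRAMMAR_POINTS = [
    (re.compile("なければならない"), 4),  # "must do"
    (re.compile("ことができる"), 4),      # "can do"
    (re.compile("ばかりでなく"), 2),      # "not only ..."
]

def grammar_level(text):
    """Average level of grammar points detected; None if none found."""
    hits = [level for pattern, level in GRAMMAR_POINTS if pattern.search(text)]
    return sum(hits) / len(hits) if hits else None
```

Plain substring/regex matching will miss conjugated or interrupted forms, so matching on the analyser’s token sequence would be more robust, but this shows the shape of the idea.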

Would love to hear your thoughts if you know of better approaches. Thank you!


  1. When I googled around a while ago for English-text solutions, there didn’t seem to be a definitive one, so I assumed any rough difficulty-measurement solution would have some annoying outliers and probably be comparably fine. You’d have to implement all of them and run them on the same text to even get started in the comparison game.

    Assuming there’s no definitive way to grade text in general, I’d probably prefer a solution that gave the option to view each criterion separately (i.e. sentence length compared to other texts + resulting difficulty grade), so the user would be able to decide whether to trust the overall difficulty score.

    I also like the idea of grading texts or presenting stats based on the words / grammar you already know, but that’s a little extra

  2. Seems like a decent-ish approach, but maybe I’m misunderstanding something: is there no frequency analysis on the words and kanji themselves?

    There almost certainly must be frequency-analysis data online, but even if there isn’t, it’s not impossibly difficult to write the code to do such a thing.

    If I were to write the algorithm, I would certainly want to add the following behaviors:

    * All words/kanji are weighted by their rank in a frequency list of all words/kanji.
    * The more uncommon words found in a sentence, the higher the output.
    * The more uncommon kanji found in a sentence, the higher the output.
    * The more alternate/uncommon readings for kanji, the higher the output. (fundamentally hard to do; stretch goal)

    Where a higher output means a harder difficulty.

    Difficulty is somewhat subjective, as someone may be better at grammar but worse at kanji, or vice versa. So the ability to adjust the weights dynamically would probably be a valuable feature.

    Finally I think the algorithm proposed perhaps approximates the “complexity” of a sentence rather than the difficulty. This is perhaps a pedantic difference, as complexity will typically approximate difficulty; however I think the difference is a potentially meaningful one.
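The adjustable-weight frequency idea above can be sketched as follows. Everything here is an assumption for illustration: the log-of-rank weighting, the default rank for unknown items, and the per-token normalisation are all arbitrary choices to be tuned against real data:

```python
import math

def is_kanji(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def difficulty(tokens, word_rank, kanji_rank,
               w_word=1.0, w_kanji=1.0, default_rank=50000):
    """Toy frequency-weighted difficulty score (higher = harder).

    tokens     : words from a tokenizer (e.g. MeCab)
    word_rank  : word -> frequency rank (1 = most common)
    kanji_rank : kanji character -> frequency rank
    Rarer items (higher rank) cost more via log(rank); unknown items
    fall back to default_rank. w_word / w_kanji let a user rebalance
    vocabulary vs. kanji difficulty, as suggested above.
    """
    word_cost = sum(math.log(word_rank.get(t, default_rank)) for t in tokens)
    kanji_chars = [ch for t in tokens for ch in t if is_kanji(ch)]
    kanji_cost = sum(math.log(kanji_rank.get(ch, default_rank))
                     for ch in kanji_chars)
    n = max(len(tokens), 1)
    return (w_word * word_cost + w_kanji * kanji_cost) / n
```

Setting `w_kanji=0` gives a pure vocabulary-rarity score, which is one way to expose the per-criterion breakdown the first commenter asked for.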
