Introduction
While working on the Classical Chinese primer, I have been curious about one question: how many Chinese Characters (漢字, 한자) should one memorize before embarking on learning Classical Chinese (漢文, 한문)? The Korean Hanja Proficiency Exam (漢字能力檢定試驗, 한자능력검정시험) specifies that people should learn at least up to the first rank (一級, 1급), or 3,500 characters, to read Classical Chinese “without difficulty.” This is a somewhat subjective judgment, and depends on how willing a reader is to look up characters that he does not know while reading Classical Chinese texts. I like to conceptualize this in more mathematical terms:
$$ p(x) \cdot C(x) \le T $$
where p(x) is the probability that the reader does not know character x, C(x) is the cost (e.g., time and effort) the reader is willing to spend on looking up each character x, and T is the threshold at which the reader will “give up” on finding all the characters. The equation as a whole states that a reader will be willing to look up a character as long as the cost and probability of doing so do not exceed the threshold. C(x) and especially T are highly subjective, and depend on the individual reader. p(x), on the other hand, is not. I was interested in seeing what p(x) looked like, and how I could interpret it.
Methodology
I had some downtime over Easter, and decided to code a very short script to determine this. The pseudo-code is very simple, and is as follows:
- Load file with Classical Chinese source text.
- Remove all the punctuation, spaces, new lines, et cetera.
- Count the number of times each character occurs in the source text.
- Output data.
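The steps above can be sketched in Python roughly as follows. This is not the original script, only a minimal illustration of the same idea. The function names `load_text` and `count_characters` are hypothetical, and the sketch assumes that keeping only characters in a Unicode "letter" category is an acceptable way to drop punctuation, spaces, and new lines (any Latin or Hangul letters in the file would also be counted and would need separate filtering).

```python
import unicodedata
from collections import Counter


def load_text(path: str) -> str:
    """Read the source text; the UTF-8 encoding is an assumption."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def count_characters(text: str) -> Counter:
    """Count each character, skipping punctuation, whitespace, and symbols."""
    counts = Counter()
    for ch in text:
        # Unicode general categories starting with "L" are letters;
        # CJK ideographs fall under "Lo" (Letter, other).
        if unicodedata.category(ch).startswith("L"):
            counts[ch] += 1
    return counts


# Small in-line sample (opening of the Analects) instead of a real file.
sample = "學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?"
counts = count_characters(sample)

print("total characters:", sum(counts.values()))
print("distinct characters:", len(counts))
for char, n in counts.most_common(3):
    print(char, n)
```

For a full text, `count_characters(load_text(...))` would replace the in-line sample, and `counts.most_common()` gives the characters sorted from most to least frequent.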
The Classical Chinese source text chosen was the Analects of Confucius, Annotated by Zhu Xi (論語集註, 논어집주). I believed that this was very representative of Classical Chinese texts, as many people learning Classical Chinese read at least the unannotated Analects.
Data & Analysis
The total number of characters in the Analects is 80,964. There are 2,373 different characters. Sorting from the most frequent to the least, the top 20 most frequent characters are:
This result should not surprise anyone, as most of these characters serve common grammatical functions and thus are likely to appear quite frequently. For instance, from experience, 也 occurs at the end of sentences very often. It is no surprise that Table 1 reflects this experience. One curious result was how quickly the frequency dropped: from 3,756 with 之(지) to 644 with 矣(의). This is further illustrated in the table below:
Table 2 shows a few characters in order of their frequency in the text. By the 150th most common character, the frequency has dropped into the two digits. By the 850th most common character, the frequency has dropped to just one digit. By the 1,900th most common character, the frequency has dropped to 1.
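A rank-to-frequency lookup of this kind could be produced along the following lines. This is only a sketch: the helper name `frequency_at_ranks` is hypothetical, and the counts are a toy example, not figures from the text.

```python
from collections import Counter


def frequency_at_ranks(counts: Counter, ranks: list[int]) -> dict[int, int]:
    """Return the frequency of the character at each 1-based rank.

    Ranks beyond the number of distinct characters are skipped.
    """
    ordered = sorted(counts.values(), reverse=True)
    return {r: ordered[r - 1] for r in ranks if 1 <= r <= len(ordered)}


# Toy counts for illustration only (not data from the Analects).
toy_counts = Counter({"a": 4, "b": 3, "c": 2, "d": 1})
table = frequency_at_ranks(toy_counts, [1, 3, 4, 10])
for rank, freq in table.items():
    print(rank, freq)
```

With real counts, passing ranks such as 150, 850, and 1900 would show where the frequency falls into two digits, one digit, and finally 1.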
In Table 3, characters that appear fewer than 1,000 times account for 74.8% of all character occurrences in the text, those that appear fewer than 250 times account for 50%, and those that appear fewer than 50 times account for 22.2%.
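The share of the text made up of less frequent characters can be computed from the per-character counts. The sketch below assumes "fewer than N times" is a strict comparison; `share_below` is a hypothetical helper name, and the counts are toy values rather than the actual data.

```python
from collections import Counter


def share_below(counts: Counter, threshold: int) -> float:
    """Fraction of all character occurrences that come from characters
    appearing fewer than `threshold` times in the text."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    rare = sum(n for n in counts.values() if n < threshold)
    return rare / total


# Toy counts for illustration only: 10 character occurrences in total.
toy_counts = Counter({"a": 4, "b": 3, "c": 2, "d": 1})
for limit in (2, 3, 4, 5):
    print(limit, share_below(toy_counts, limit))
```

In the toy example, characters occurring fewer than 3 times ("c" and "d") contribute 3 of the 10 occurrences, so their share is 0.3.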
Conclusion
First, a caveat. This quick, informal analysis does have some weaknesses, particularly in the choice of sample source text. Some characters that are considered “easy” in resources for learning Chinese Characters surprisingly showed up very few times in the Analects. For instance, 雨(우) (“rain”) occurred only three times in the text. For future analysis, I would like to add other texts.
As for the relationship between learning Classical Chinese and memorizing Chinese Characters, the data suggest that readers should have an expansive knowledge of Chinese Characters. Most notably, although characters that appear 10 times or fewer account for 6% of the text, those that appear 100 times or fewer account for 33% of it. In general, the less frequently a character occurs, the more “difficult” it is considered. I would presume that most readers who have not yet memorized the less frequent characters would not want to be flipping through their dictionaries one-third of the time while reading through the Analects, as this would exceed the threshold cost they are willing to allow. These data, although not perfect, may give an idea of where this threshold may be.
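To make the dictionary-flipping estimate concrete, one could compute how often a reader would meet an unknown character under a simplifying assumption: that the reader knows exactly the k most frequent characters of the text and no others. That assumption, the helper name `lookup_rate`, and the toy counts below are illustrative only and are not part of the analysis above.

```python
from collections import Counter


def lookup_rate(counts: Counter, known_top_k: int) -> float:
    """Fraction of character occurrences the reader would have to look up,
    assuming only the `known_top_k` most frequent characters are known."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    ordered = sorted(counts.values(), reverse=True)
    unknown = sum(ordered[known_top_k:])
    return unknown / total


# Toy counts for illustration only: 10 character occurrences in total.
toy_counts = Counter({"a": 4, "b": 3, "c": 2, "d": 1})
for k in range(len(toy_counts) + 1):
    print(k, lookup_rate(toy_counts, k))
```

Run against real counts, the smallest k for which the rate falls below a reader's personal tolerance would be a rough guide to how many characters to memorize first.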