Introduction

As I have been working on the Classical Chinese primer, one question I have been curious about is how many Chinese Characters (漢字, 한자) one should memorize before embarking on learning Classical Chinese (漢文, 한문). The Korean Hanja Proficiency Exam (漢字能力檢定試驗, 한자능력검정시험) specifies that people should learn at least up to the first rank (一級, 1급), or 3,500 characters, to read Classical Chinese “without difficulty.” This is a somewhat subjective judgment, and depends on how willing a reader is to look up characters that he does not know while reading Classical Chinese texts. I like to conceptualize this in more mathematical terms:

where p(x) is the probability that the reader does not know character x, C(x) is the cost (e.g., time and effort) the reader is willing to spend on looking up each character x, and T is the threshold at which the reader will “give up” on finding all the characters. The equation as a whole states that a reader will be willing to look up a character as long as the cost and probability of doing so do not exceed the threshold. C(x) and especially T are highly subjective, and depend on the individual reader. p(x), on the other hand, is not. I was interested in seeing what p(x) looked like, and how I could interpret it.
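One way to write this relation that is consistent with the definitions above (an assumed form, since only the symbols p(x), C(x), and T are defined here) is:

```latex
p(x) \cdot C(x) \le T
```

Under this reading, the reader keeps looking up character x while the expected cost of doing so, the probability of not knowing it multiplied by the cost of a lookup, stays at or below the threshold T.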

Methodology

I had some downtime over Easter, and decided to code a very short script to determine this. The pseudo-code is very simple, and is as follows:

Load the file with the Classical Chinese source text.
Remove all the punctuation, spaces, new lines, et cetera.
Count the number of times each character occurs in the source text.
Output the data.
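The steps above could be sketched in Python roughly as follows. This is not the original script, only a minimal illustration: the function name `count_characters` and the sample string are assumptions, and the filtering rule (dropping every character whose Unicode category is punctuation, symbol, separator, or control) is one possible way of "removing punctuation, spaces, new lines, et cetera."

```python
import unicodedata
from collections import Counter


def count_characters(text: str) -> Counter:
    """Count each character, skipping punctuation, symbols, whitespace and control characters."""
    kept = [
        ch for ch in text
        # Unicode major categories: P = punctuation, S = symbol, Z = separator, C = control/format
        if unicodedata.category(ch)[0] not in {"P", "S", "Z", "C"}
    ]
    return Counter(kept)


# Hypothetical sample; a full run would instead read the whole source text from a file.
sample = "子曰：「學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？」\n"
counts = count_characters(sample)

# Output the characters from most to least frequent.
for char, freq in counts.most_common():
    print(char, freq)
```

Using `Counter.most_common()` gives the sorted frequency list directly, which is the form that the tables in the next section rely on.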

The Classical Chinese source text chosen was Analects of Confucius, Annotated by Zhu Xi (論語集註, 논어집주). I believed that this was very representative of Classical Chinese texts, as many people learning Classical Chinese at the very least read Analects unannotated.

Data & Analysis

The total number of characters in the Analects is 80,964. There are 2,373 different characters. Sorting from the most frequent to the least, the top 20 most frequent characters are:

Table 1 – Top 20 Most Frequent Characters in Analects

This result should not surprise anyone, as most of these characters serve common grammatical functions and thus are likely to appear quite frequently. For instance, from experience, 也 occurs at the end of sentences very often. It is no surprise that Table 1 reflects this experience. One curious result was how quickly the frequency dropped: from 3,756 with 之(지) to 644 with 矣(의). This is further illustrated in the table below:

Table 2 – Assorted Characters

Table 2 shows a few characters in order of their frequency in the text. By the 150th most common character, the frequency has dropped into the two digits. By the 850th most common character, the frequency has dropped to just one digit. By the 1,900th most common character, the frequency has dropped to 1.
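A lookup of this kind, finding the rank at which the frequency first falls below a given number, might be sketched as below. The helper `first_rank_below` and the toy counts are hypothetical and are not figures from the Analects.

```python
from collections import Counter


def first_rank_below(counts: Counter, limit: int):
    """Return the 1-based rank of the first character whose count is below `limit`, or None."""
    ranked = counts.most_common()  # sorted from most to least frequent
    for rank, (_, freq) in enumerate(ranked, start=1):
        if freq < limit:
            return rank
    return None


# Hypothetical toy counts, for illustration only.
toy = Counter({"之": 120, "也": 95, "子": 40, "曰": 9, "雨": 1})
print(first_rank_below(toy, 100))  # rank where counts drop to two digits
print(first_rank_below(toy, 10))   # rank where counts drop to one digit
```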

Table 3 – Frequency & Percentage

In Table 3, characters that appear fewer than 1,000 times each in the text together account for 74.8% of it, those that appear fewer than 250 times account for 50%, and those that appear fewer than 50 times account for 22.2%.
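Percentages of this kind can be computed from the frequency counts by summing the occurrences of every character below a cutoff and dividing by the total. The sketch below assumes that interpretation; `share_below` and the toy counts are hypothetical names and values, not the data behind Table 3.

```python
from collections import Counter


def share_below(counts: Counter, limit: int) -> float:
    """Fraction of all character occurrences that come from characters seen fewer than `limit` times."""
    total = sum(counts.values())
    rare = sum(freq for freq in counts.values() if freq < limit)
    return rare / total if total else 0.0


# Hypothetical toy counts, for illustration only.
toy = Counter({"之": 50, "也": 30, "子": 15, "曰": 4, "雨": 1})
for limit in (50, 10):
    print(limit, round(share_below(toy, limit) * 100, 1))
```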

Conclusion

First, a caveat. This quick, informal analysis does have some weaknesses, particularly with the sample source text chosen. Some characters that are considered “easy” in resources for learning Chinese Characters surprisingly showed up as occurring very few times in Analects. For instance, 雨(우) (“rain”) occurred only three times in the text. For future analysis, I would like to add other texts.

As for the relationship between learning Classical Chinese and memorizing Chinese Characters, the data suggest that readers should have an expansive knowledge of Chinese Characters. Most notably, although characters that appear 10 times or fewer account for only 6% of the text, those that appear 100 times or fewer account for 33% of it. In general, the less frequently a character occurs, the more likely it is to be considered “difficult.” I would presume that most readers who have not yet memorized the less frequent characters would not want to be flipping through their dictionaries one-third of the time while reading through Analects, as this would exceed the threshold cost they are willing to allow. These data, although not perfect, may give an idea of where this threshold may be.
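To make the "one-third of the time" figure concrete, one could estimate how often a reader would need the dictionary under a simplifying assumption: the reader knows exactly the N most frequent characters and nothing else. The sketch below uses that assumption; `coverage_of_top` and the toy counts are hypothetical and do not come from the Analects data.

```python
from collections import Counter


def coverage_of_top(counts: Counter, n: int) -> float:
    """Fraction of the text covered by the n most frequent characters."""
    total = sum(counts.values())
    known = sum(freq for _, freq in counts.most_common(n))
    return known / total if total else 0.0


# Hypothetical toy counts, for illustration only.
toy = Counter({"之": 50, "也": 30, "子": 15, "曰": 4, "雨": 1})
covered = coverage_of_top(toy, 2)
lookup_rate = 1 - covered  # share of the text that would need a dictionary lookup
print(round(covered, 2), round(lookup_rate, 2))
```

Real readers do not learn characters in strict frequency order, so a figure obtained this way is at best a rough lower bound on the lookup rate.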

kuiwon.wordpress.com

Copyright Notice

This work by Kuiwon is licensed under
a Creative Commons
Attribution-NonCommercial 3.0 Unported License.

A Quick, Informal Statistical Analysis on Analects of Confucius
