Good Old Chinese Word Frequency

On a recent trip to Taiwan I picked up a copy of James Erwin Dew’s 6000 Chinese Words: A Vocabulary Frequency Handbook for Chinese Language Teachers and Students. 杜老師, as we knew the author, is a former director of the IUP Chinese language program in Beijing, where I studied for a year. (The program is now also known as IUB and was originally in Taiwan; Sayaka is currently studying at the successor to IUP in Taiwan, now called ICLP.) He still came to the center fairly frequently while I was there, and I occasionally chatted with him about technology and language learning. He also designed the Easytone pinyin font, which I host for him, and provided me with some files that helped me make my Pinyin to Unicode Converter.

I just read through the introduction to the book’s wonderful collection of reference charts and lists of word and character frequency. In comparing a mainland Chinese frequency dictionary, the BLI (现代汉语频率词典), with data from studies by Academia Sinica in Taiwan, he notes a few terms that differ markedly in frequency ranking (p. 20). The word 同志 (comrade) is the 86th most frequent term in the mainland China study, while it has a ranking of 6,619 in the Taiwanese data. The mainland rankings for 戰鬥 (simplified: 战斗), meaning ‘fight or combat’, and 錯誤 (错误), meaning ‘error or mistake’, were also very different from those in the Taiwanese data.

The most amusing difference, however, was that in the mainland Chinese frequency data the word 敵人 (敌人), or ‘enemy’, was ranked the 168th most frequent, while it was nowhere to be found in the first 5000 terms of the Academia Sinica materials. This would have made a great propaganda poster at the 2/28 Hand-in-Hand rally in Taiwan that I went to see, during which many were protesting China’s ‘aggressive’ and ‘belligerent’ behavior towards Taiwan.

I should note, however, that Du laoshi does mention that the data is somewhat old, so these rankings have probably changed over the years. The BLI dictionary was published in 1986.
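For anyone who wants to poke at this kind of comparison digitally, here is a minimal Python sketch. The names (MAINLAND_RANK, TAIWAN_RANK, rank_gap) are my own inventions for illustration, not anything from the book, and the dictionaries hold only the two words whose rankings are quoted above. A real version would presumably load the full lists from a file.

```python
from typing import Optional, Tuple

# Only the rankings quoted in this post. None means the word was not
# found in the portion of the list that was checked (the first 5000 terms).
MAINLAND_RANK = {"同志": 86, "敵人": 168}
TAIWAN_RANK = {"同志": 6619, "敵人": None}


def rank_gap(word: str) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Return (mainland rank, Taiwan rank, difference) for a word.

    The difference is None when either list has no rank for the word.
    """
    mainland = MAINLAND_RANK.get(word)
    taiwan = TAIWAN_RANK.get(word)
    if mainland is None or taiwan is None:
        return mainland, taiwan, None
    return mainland, taiwan, taiwan - mainland


for word in MAINLAND_RANK:
    print(word, rank_gap(word))
```

Returning None for the gap, rather than guessing a number, keeps a word like 敵人 from looking as though it has a known rank in the Taiwanese data when all we know is that it falls outside the first 5000 terms.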

On a separate note, I am pondering (together with my 20 other projects yet to get off the ground) the idea of making a Chinese equivalent to my Jii-chan Kanji flashcard review site using a portion of the word frequency data in this book. Are there any volunteers who could help me input some data, or who already have a digital version of something similar? I don’t think lists like this frequency data can be copyrighted, and indeed the book makes no reference to getting permission from BLI to reprint their data.