Script for Creating a Chinese Vocab List

The Problem: Let's say you have a file containing a list of Chinese words or single Chinese characters. There are a lot of them. You want a fast, easy way of getting the pinyin and English definitions for that list of words or characters, in a format that can be easily imported into a flashcard program so you can practice them.

Today I faced this kind of problem. There are lots of “annotator” websites online that make use of the free CEDICT Chinese dictionary, but I have yet to find one that outputs a simple, nicely formatted (with all the […] and /…/ markup removed) tab-delimited vocab list.

I have recently been frustrated by the fact that I often come across Chinese characters that I haven’t learned, or, more often, characters that I only know how to pronounce in Japanese or Korean. I am also frustrated that I have forgotten the tones for a lot of characters I knew well many years ago when I studied Chinese formally.

Over the summer I want to review or learn the 3500 most frequently used Chinese characters, particularly their pronunciation, so that I can improve my tones and more quickly look up compounds I don’t know.1

I found a few frequency lists online (see here and here for example) and I stripped out the data I didn’t need to create a list with nothing but one character on each line.2 Although it is an older list based on a huge set of Usenet postings from ’93-’94 you can download an already converted list of 3500 characters here.3

Since I’m not in the mood to look up 3500 characters one by one, I spent a few hours this evening using this problem as an excuse to write my second script in the Ruby programming language.

On the remote chance that others using Mac OS X find it useful, you can download the result of my tinkering here:

Cedict Vocabulary List Generator 1.1

This download includes the 2007.8 version of CEDICT, the latest I could find here.4

How this script works:

1. After unzipping the download, launch the “Convert.app” AppleScript application. It will ask you to identify the file you want to annotate. It expects a plain text file (not a Word or rich text file) in Unicode (UTF-8) format with either simplified or traditional Chinese characters or word compounds, one per line.

2. This application will then pass the information to the convert.rb Ruby script, which searches for the words in the CEDICT dictionary in the same folder and formats the information it finds (the hanzi, pinyin, and English definition), merging multiple hits for the same character or word into a single entry with numbered definitions. It does not currently add the alternate form of the hanzi (it won’t add the simplified version to a traditional entry or vice versa).
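For readers curious how such a lookup-and-format step might look, here is a minimal Ruby sketch of the idea. This is not the actual convert.rb; the helper names (`load_cedict`, `format_entry`) are my own, and it only assumes the standard CEDICT line format, `TRADITIONAL SIMPLIFIED [pinyin] /def/def/`:

```ruby
# Parse CEDICT lines of the form: 傳統 传统 [chuan2 tong3] /tradition/
CEDICT_LINE = /\A(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+\/(.+)\/\s*\z/

# Build a hash mapping each hanzi form (traditional and simplified)
# to an array of [pinyin, definitions] entries.
def load_cedict(path)
  dict = Hash.new { |h, k| h[k] = [] }
  File.foreach(path, encoding: "UTF-8") do |line|
    next if line.start_with?("#")      # skip comment lines
    next unless line =~ CEDICT_LINE
    trad, simp, pinyin, defs = $1, $2, $3, $4.split("/")
    entry = [pinyin, defs]
    dict[trad] << entry
    dict[simp] << entry unless simp == trad
  end
  dict
end

# Emit one tab-delimited line: hanzi, pinyin, numbered definitions.
def format_entry(word, hits)
  pinyins = hits.map { |pinyin, _| pinyin }.uniq.join("; ")
  defs = hits.flat_map { |_, ds| ds }
  defs = defs.each_with_index.map { |d, i| "#{i + 1}. #{d}" } if defs.size > 1
  [word, pinyins, defs.join(" ")].join("\t")
end
```

With an input line like `你好 你好 [ni3 hao3] /hello/hi/`, `format_entry` would produce `你好<TAB>ni3 hao3<TAB>1. hello 2. hi`.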

3. It will then produce a new file with the word “converted” added to its name. It creates tab-delimited files by default, but you can change this by editing the option at the top of the convert.rb file in a text editor.

4. Though this version of the script doesn’t do this yet, you may want to run the resulting text through the Pinyin Tone dashboard widget or a similar online tool such as the one here or here. That will remove the syllable-final tone numbers and add the appropriate tone marks. I am having a bit of trouble converting the JavaScript that my widget and this site use into Ruby, so if anyone is interested in working on this let me know!
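For anyone tempted to take up that challenge, a possible Ruby approach to the numbered-pinyin conversion might look like the sketch below. The function names are my own, and it assumes the standard placement rule (mark ‘a’ or ‘e’ if present, else the ‘o’ in “ou”, otherwise the last vowel), plus CEDICT’s “u:” convention for ü:

```ruby
# Tone 1-4 diacritics for each vowel; tone 5 (neutral) gets no mark.
TONE_MARKS = {
  "a" => %w[ā á ǎ à], "e" => %w[ē é ě è], "i" => %w[ī í ǐ ì],
  "o" => %w[ō ó ǒ ò], "u" => %w[ū ú ǔ ù], "ü" => %w[ǖ ǘ ǚ ǜ]
}

def add_tone_mark(syllable)
  return syllable unless syllable =~ /\A([a-zü]+)([1-5])\z/
  letters, tone = $1, $2.to_i
  return letters if tone == 5          # neutral tone: just drop the number
  # Mark 'a' or 'e' if present, else the 'o' in "ou", else the last vowel.
  target =
    if letters.include?("a") then "a"
    elsif letters.include?("e") then "e"
    elsif letters.include?("ou") then "o"
    else letters.scan(/[aeiouü]/).last
    end
  return letters unless target
  letters.sub(target, TONE_MARKS[target][tone - 1])
end

def convert_pinyin(text)
  text.gsub("u:", "ü").gsub(/[a-zü]+[1-5]/i) { |syl| add_tone_mark(syl.downcase) }
end

puts convert_pinyin("zhong1 wen2")  # → zhōng wén
puts convert_pinyin("nu:3 hai2")    # → nǚ hái
```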

If the script doesn’t work: make sure you save your text file as UTF-8 before you convert. The script also has trouble when it is placed at a path containing lots of spaces. Try putting the script folder on your Desktop.

Note: If you don’t have Mac OS X but can run Ruby scripts on your operating system, you may be able to run my script convert.rb from the command line. It takes this format:

convert.rb /path/to/file.txt /path/to/cedict.u8

UPDATE 1.1: The script now replaces “u:” with “ü” (CEDICT uses u:).

  1. The top 3000 make up some 98-99% of usage when their cumulative frequency is considered.
  2. A few of the frequency lists I have seen include CEDICT dictionary data, but not in a very clean format.
  3. I notice a high frequency of phonetic hanzi used to express emotion in the postings, along with some other characters one doesn’t come across as often in more formal texts; I actually don’t mind.
  4. If you find a newer version (in UTF-8), put it in the same directory as my script and name it cedict.u8.

5 thoughts on “Script for Creating a Chinese Vocab List”

  1. I don’t know about Chinese with CEDICT specifically, but I once wrote a similar script for Japanese with JEDict. I found that although automatically generating a list sounded nice, in reality it was worth the effort to edit the definitions first according to my preferences. It might be different with Chinese grammar (with Japanese I had to do some editing to actually be able to understand words grammatically), but if CEDICT’s definitions are a jumble of single-word translations or phrases like JEDict’s, there is something to be said for manually pruning them down.

  2. jhr: You are right; there is exactly the same problem with CEDICT. In my case, I’m just reviewing these 3500 for their pronunciation (there are perhaps a thousand for which I have forgotten or never learned the Chinese pronunciation or tone), so the definitions are just an extra for reference.

    I don’t recommend, as many do, trying to learn words for the first time with a full set of definitions from CEDICT or JEDICT – getting them from a vocab book or textbook is a better idea since it chooses the most commonly used and important definitions.

  3. Nathan: Thanks for the suggestion! I’m not sure it has enough characters to bring me to the set of characters I’m trying to review.

  4. I’m looking for a similar program, but the opposite direction – converting a list of english words into chinese with the pinyin etc.

    Does anyone know of this program? Thanks!

    Alex
    xelalw@gmail.com

Comments are closed.