The Problem: Say you have a file containing a list of Chinese words or single Chinese characters. There are a lot of them. You want a fast, easy way to get the pinyin and English definitions for that list of words or characters, in a format that can be easily imported into a flashcard program so you can practice them.
Today I faced exactly this kind of problem. There are lots of “annotator” websites online that make use of the free CEDICT Chinese dictionary, but I have yet to find one that outputs a simple, nicely formatted (with all the […] and /…/ markup removed) tab-delimited vocab list.
I have recently been frustrated by the fact that I often come across Chinese characters I haven’t learned or, more often, characters I only know how to pronounce in Japanese or Korean. I am also frustrated that I have forgotten the tones of many characters I knew well years ago, when I studied Chinese formally.
Over the summer I want to review or learn the 3500 most frequently used Chinese characters, particularly their pronunciation, so that I can improve my tones and more quickly look up compounds I don’t know.1
I found a few frequency lists online (see here and here, for example) and stripped out the data I didn’t need, leaving a list with nothing but one character on each line.2 Although it is an older list, based on a huge set of Usenet postings from 1993–94, you can download an already converted list of 3500 characters here.3
Since I’m not in the mood to look up 3500 characters one by one, I spent a few hours this evening using this problem as an excuse to write my second script in the Ruby programming language.
On the remote chance that others using Mac OS X find it useful, you can download the result of my tinkering here:
How this script works:
1. After unzipping the download, launch the “Convert.app” AppleScript application. It will ask you to identify the file you want to annotate. It is looking for a plain text file (not a Word or rich-text file) in Unicode (UTF-8) format containing simplified or traditional Chinese characters or word compounds, one on each line.
2. This application then passes the file to the convert.rb Ruby script, which searches for each word in the CEDICT dictionary in the same folder4 and formats what it finds (the hanzi, pinyin, and English definition), combining multiple hits for the same character/word into a single entry with the definitions numbered. It does not currently add the alternate form of the hanzi (it won’t add the simplified version to a traditional entry, or vice versa).
3. It then produces a new file with the word “converted” added to its name. It creates tab-delimited files by default, but you can change this option at the top of the convert.rb file in a text editor.
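For the curious, the core lookup-and-format step can be sketched in Ruby roughly like this. This is a simplified illustration with made-up names, not the actual convert.rb; it assumes CEDICT’s usual line format of traditional, simplified, [pinyin], /definitions/:

```ruby
# A simplified sketch, not the actual convert.rb.
# Assumes CEDICT lines look like:
#   中國 中国 [Zhong1 guo2] /China/Middle Kingdom/
ENTRY = /^(\S+) (\S+) \[([^\]]+)\] \/(.+)\/$/

# Build a hash mapping each hanzi form to its list of [pinyin, definitions] pairs.
def load_cedict(path)
  dict = Hash.new { |h, k| h[k] = [] }
  File.foreach(path, encoding: "UTF-8") do |line|
    next unless line =~ ENTRY
    trad, simp, pinyin, defs = $1, $2, $3, $4
    entry = [pinyin, defs.split("/").join("; ")]
    dict[trad] << entry
    dict[simp] << entry unless simp == trad
  end
  dict
end

# Produce one tab-delimited line per word; multiple hits are numbered
# within the same entry.
def format_entry(word, hits)
  return nil if hits.empty?
  if hits.size == 1
    pinyin, defs = hits.first
    "#{word}\t#{pinyin}\t#{defs}"
  else
    pinyins  = hits.map { |p, _| p }.join(" / ")
    numbered = hits.each_with_index.map { |(_, d), i| "(#{i + 1}) #{d}" }.join(" ")
    "#{word}\t#{pinyins}\t#{numbered}"
  end
end
```

The real script does more (output-format options, the “converted” filename, and so on), but the hash-lookup-then-format shape is the essence of it.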
If the script doesn’t work: make sure you save your text file as UTF-8 before you convert. The script also has trouble when it is placed at a path containing lots of spaces; try putting the script folder on your Desktop.
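My guess (unconfirmed) is that the spaces problem comes from unquoted paths in the AppleScript-to-shell handoff. Ruby’s standard Shellwords module shows the kind of escaping that would avoid it:

```ruby
require "shellwords"

# Escaping a path before interpolating it into a shell command keeps
# spaces from splitting it into multiple arguments.
path = "/Users/me/Chinese Stuff/hanzi.txt"
safe = Shellwords.escape(path)
cmd  = "ruby convert.rb #{safe} cedict.u8"
```

Until I fix that, the Desktop workaround above is the safer bet.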
Note: If you don’t have Mac OS X but can run Ruby scripts on your operating system, you may be able to run my script convert.rb directly from the command line. It takes this form:
convert.rb /path/to/file.txt /path/to/cedict.u8
UPDATE 1.1: The script now replaces “u:” with “ü” (CEDICT uses u:).
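Since CEDICT never writes ü directly in its pinyin, this fix is a one-line substitution; in Ruby it amounts to:

```ruby
# CEDICT writes ü as "u:", e.g. "nu:3" for 女.
pinyin = "nu:3 hai2"
puts pinyin.gsub("u:", "ü")   # prints "nü3 hai2"
```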
- The top 3000 characters account for some 98–99% of usage when their cumulative frequency is considered. [↩]
- A few of the frequency lists I have seen include CEDICT dictionary data, but not in a very clean format. [↩]
- I notice a high frequency of phonetic hanzi used to express emotion in the postings, along with some other characters one doesn’t come across as often in more formal texts. I actually don’t mind. [↩]
- If you find a newer version (in UTF-8), put it in the same directory as my script and name it cedict.u8. [↩]