Scrivener for Dissertation Chunk Drafting

A favorite procrastination technique of dissertation writers is to waste time searching for the perfect tool that will make writing the dissertation more efficient. I indulged in this sinful habit two nights ago and revisited the software “Scrivener” for OS X. I’m impressed, and I encourage my fellow PhD students to give it a closer look as a possible environment in which to compose and collect chunks of writing for ‘da diss.’

When I last looked at the software, it struck me as nothing special: a somewhat bizarre combination of clip-collection software like Yojimbo1 and the writer’s application CopyWrite.2 However, after being prodded by a friend to give the software another look, I now believe Scrivener has some features worth considering by students or scholars writing longer research papers or a dissertation.

Obviously, a simple word processor and citation management software may be best for most dissertations or other academic projects. It may be justly argued that I have hopelessly over-organized my life in the digital medium, with hundreds of note files in OmniOutliner, all my tasks and snippets of ideas stored in OmniFocus, mind maps for my writing ideas in NovaMind, serial numbers, short reference files, and screenshots of webpages stored in Yojimbo, a diary written in MacJournal, thousands of pictures, PDFs, and documents tagged and organized with Leap, various personal data tracked in a Bento database, and flashcards for the various languages I have studied reviewed daily in Anki. And so on. My name is Konrad, and I am an organizational software addict. The irony of this is only truly appreciated by friends of mine who know how disorganized I am.

So why is an application like Scrivener useful for dissertation or research paper writing? Read on for more detail, but if you want the quick and dirty version, download the trial and especially consider the following: 1) the corkboard for organizing writing chunks; 2) “edit scrivenings” for immediately reassembling several writing chunks; 3) the multi-level hierarchical organization of writing chunks, with the possibility of a separate synopsis, notes, and tags for each; 4) the “snapshot” feature for document versioning; 5) internal links between documents, or between documents and note snippets (not a full personal wiki like the excellent VoodooPad, but close); 6) a “Research” dumping ground for various file formats that can serve as a kind of mini-Yojimbo/Evernote; 7) a status and label feature for writing chunks; and 8) two-pane viewing for simultaneously editing two chunks or combining a writing chunk with the corkboard view.

Scrivener is actually hard to appreciate at first because it is an unusual hybrid: users of the applications mentioned above will recognize many features it has in common with other programs out there.

Most importantly, at its simplest, Scrivener is a kind of basic word processor with a beautiful full-screen mode that allows you to write without distraction. The full-screen mode is highly customizable and a delight to work in.

At the next level, being explicitly designed for writers of larger structured works, it provides an environment in which you may divide and hierarchically organize chunks of writing, as one might in outlining software. The “binder” on the left of the screen divides everything into a “Draft” and a “Research” section. The former contains only text chunks which, at the conclusion of the drafting process, may be “compiled” and exported to a word processor. The latter is a place where you may drop snippets and various files, such as images and PDFs, to refer to as you write.

The Binder and Documents – One nice aspect of the files in the binder is that they may be nested multiple levels deep, as in any good outlining software. One’s “Draft” may be made up of folders (chapters, for example) which contain sections that themselves contain subsections, and so on.

Each section can be viewed as its own document, alone or sharing the screen horizontally or vertically with another panel in which you may display another document or, as we shall see, an outline or corkboard. These individual documents have a nice word count at the bottom and, via a small target icon, an easy way to establish a word count target for each section. Each document may also have its own title, a “synopsis,” “Document Notes,” reference links to other documents, tags (keywords), colored labels, and a customizable status (draft, complete, etc.).

The Outline and the Corkboard – In addition to viewing any document in the binder directly in one of the viewing panes, as one might in a program like Yojimbo, there are two other views. The Outline view displays a list of documents along with their synopses, labels, and status. I have yet to find this very useful, given the appearance of the resulting outline. The corkboard, on the other hand, is one of Scrivener’s best features. Once it has been stripped, via the preferences, of its silly-looking pins, blue lines, and corkboard background (this cheesy look was one of the things that immediately turned me off from Scrivener the first time I downloaded it, but it can be easily removed), this view allows you to see a collection of writing chunks (their titles and synopses) as cards across the screen. Somehow, I find this view much more useful than an outline view. I can order and reorder these large notecards, with their synopses displayed in one pane, while I read or edit the content of the chunk in question in a second pane below it. The visual juxtaposition feels closer to a mind map view and thus stimulates the thinking process in fresh ways (one could always dream of an ultimate application that seamlessly combined the powers of NovaMind, OmniOutliner, OmniFocus, Zotero/Sente, and a writing application like Scrivener, or at least allowed smooth drag-and-drop relations between elements of these various apps, but we ask too much).

Edit Scrivenings – This is a brilliant feature that allows you to experiment with putting different chunks together, or to edit them together as a whole. When you have written several separate chunks of text, displayed by their titles in the “Draft” section of your binder to the left, you may select several chunks from the list, arbitrarily or consecutively, and press “Edit Scrivenings.” This temporarily combines the texts in a single pane so you can see them together, and allows you to edit each of them directly (they are visually distinguished by a slight variation in background color; note that you may not edit across two chunks, only within each chunk separately).

Versioning – One of the features I loved about CopyWrite was that you could work on a chunk of writing and then, at any time, easily save a “version” of it. You could then edit the document at will and easily return to any previously saved version, displayed in a list at the right, not entirely unlike version control software. The Scrivener equivalent is Snapshot. You can create a snapshot of any chunk of writing and restore it at a later time.

Three Simple Suggestions for the Developer:

1. Sometimes I get stuck in a view and find myself a bit lost, trying to get back to the body of text for a document. This usually happens when I click on a text chunk in the binder and find myself with an empty outline view. The trick is to “deselect” the outline view in the toolbar (or press Command-1 again). It would be better if there were an explicit “Text View,” which feels more natural than getting back to the text by deselecting the outline view.

2. The snapshot feature works great, but I don’t think it belongs only in a separate window at the universal level of the application. It should be displayed, as it is in CopyWrite, at the level of the document or writing chunk. In the “inspector” we can choose between “notes,” “references,” and “keywords” panels – why not add a “snapshots” panel here so that we can immediately see what previous snapshots exist for any given document?

3. Allow a view of the corkboard with only the titles of the writing chunks displayed (and not the synopses as well), and with a “free” mode allowing full and free movement of the cards or, even better, rudimentary mind mapping features.

One Power Feature Suggestion

Implementing the following would, I believe, instantly quadruple the value of the application for dissertation writers:

Currently, when you create a new “link” in a text document or chunk for the first time, a new folder appears called “notes,” which seems to be something separate and distinct from the normal writing chunk documents in the draft.

This is where my theory of medium-level organization for dissertation writing could be perfectly applied, if Scrivener strove to expand this “notes” feature a little more.

Here is how this could be done, and you will see this follows from the ideas laid out in the third of my series of postings on the topic:

1. Make the “notes” a much more robust, feature-packed section of the Scrivener binder, separate and distinct from the writing chunks in the “Draft” section. Allow the user to very easily create hundreds, if not thousands, of small notecards which may each be tagged using Scrivener’s keyword feature. Allow them to be attached to a “source” (separate from their tags or keywords) such that all cards can potentially belong to a source, and notes deeper in a hierarchy can inherit the “source” of cards higher up the chain. As with “Draft” documents, allow multiple levels of hierarchy and folders for further organization. Allow the inheritance of tags by notecards at lower levels (see the sketch after this list).

2. Allow each of these notes to be linked to the writing chunks where the writer wants to deploy them (this can already be done).

3. Allow the notes to have a status – or, more simply, a check mark to indicate when the idea or content they contain has been incorporated into the main writing.

4. Allow the notecards to be viewed in the “corkboard” mode, or ideally assembled in a more visually complex form (i.e. mind maps).

5. Allow easy creation of “smart outlines” (See my post for an explanation of this)

6. Allow easy access to a list of “sources” – ideally connected in some relational way to external citation management software.
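To make the tag and source inheritance in point 1 concrete, here is a toy sketch in Ruby (the language I use for scripts elsewhere on this site) of a note hierarchy in which both flow down from parent to child. The class and field names are my own illustration, not anything in Scrivener:

    # Toy model of hierarchical notes: tags and sources are inherited
    # from ancestor notes unless overridden lower down.
    class Note
      attr_reader :title, :children

      def initialize(title, tags: [], source: nil, parent: nil)
        @title, @own_tags, @own_source, @parent = title, tags, source, parent
        @children = []
        parent.children << self if parent
      end

      # A note's effective tags: its ancestors' tags plus its own.
      def tags
        (@parent ? @parent.tags : []) | @own_tags
      end

      # A note's effective source: its own, or the nearest ancestor's.
      def source
        @own_source || (@parent && @parent.source)
      end
    end

    root  = Note.new("Ch. 3 notes", tags: ["Taisho"], source: "Hara Kei diary")
    child = Note.new("March 1st aftermath", tags: ["Korea"], parent: root)
    p child.tags    # => ["Taisho", "Korea"]
    p child.source  # => "Hara Kei diary"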

One Difficult Challenge:

This is a great environment in which to bang out quick chunks of writing for the dissertation, but despite the fact that there is a simple inline footnote feature, many dissertation writers will want to do their footnoting as they write that first draft. If they use a citation manager such as Zotero, Sente, or Endnote, this means they will want to do even their drafts directly in an application which can interface with these tools (Word for Endnote; Word or OpenOffice for Zotero; Word, Apple Pages, and Mellel for native support with Sente). Scrivener will never be able to draw those writers in sufficiently.

For the rest of us who don’t mind revisiting this process after getting a good draft going, you can draft a chapter in Scrivener, making simple notes to yourself with the Scrivener inline footnote feature, and then add the real citations with your favorite citation software after you “compile” the draft into a word processor document of the desired format.

  1. Evernote, Together, DEVONthink, and other examples abound, but Yojimbo is my favorite.
  2. Journler, MacJournal, StoryMill, and WriteRoom are also tools for writers I have looked at, each with various strengths.

Legal MP3 Downloads in China via Google

I listened to a great podcast recently of a talk sponsored by Columbia University’s SIPA in which Kai-Fu Lee discussed Google’s many different efforts to compete in the China market (find their China site easily at g.cn). One of the things Lee mentioned was the initial difficulty of competing with the MP3 downloads available, often illegally, through Baidu, its now scandal-plagued competitor.

Google entered into some kind of licensing agreement with Chinese music distributors and now provides downloads of a great deal of Chinese music even more easily than its competitors do. The service, however, is only provided to users with a Chinese IP address, to avoid cannibalizing the music industry’s income outside of the mainland, where, unlike in China, illegal music downloads account for somewhat less than 100% of the available market.

I have to say, having now used this Google China service, I’m very impressed. The old Napster days are back with a vengeance, at least for Chinese music – but this time it is actually legal!

Here is a step by step demonstration of how one gets the MP3 of a song in China through Google:

1. Search for the song’s name. Google Suggest, unlike in the US, is on by default in China because, as Kai-Fu Lee says, “Typing Chinese is hard.”

[screenshot: search.jpg]

2. If Google recognizes the search term as a song it knows, you will get, above the other web links for the given search, album art and a series of special music links, including direct links to listen (试听) or download the MP3 (下载), a link for the artist, etc.

[screenshot: result.jpg]

3. Click on the download link, and a pop-up window appears, showing the size of the file, its format (MP3), and a big green download button. There is also a banner advertisement, which I presume generates some of the revenue for the music industry.

[screenshot: download.jpg]

Click the download button, and you will soon have a 192kbps MP3 of 许巍’s song 难忘的一天, complete with lyrics. Unfortunately, the metadata is not encoded in Unicode, so it doesn’t show up correctly in iTunes, but it is easy enough to copy and paste this info from the Google download window.

The ease of this process really impressed me. Google China and the Chinese music industry are way ahead of the game here. I don’t know if they have found this distribution mechanism to be profitable, but from the user’s perspective, it is really hard to beat.

Endnote Takes A Shot at Zotero

The cold war between Endnote, the bibliographic software owned by Thomson Reuters that has long had a virtual monopoly on the academic market, and Zotero, the open source alternative created by the incredibly resourceful and innovative Center for History and New Media at George Mason University, has finally broken out into open conflict.

Endnote clearly saw its grip on the academic market coming to a swift end as a new generation of graduate students embraces the free and powerful Firefox-based alternative that has rapidly caught up to its rival in features. It responded with a huge gamble and an ancient weapon: the lawsuit. It has sued George Mason University for being in violation of its site license for Endnote. GMU has paid for a site license for the Endnote software, much like other universities (I can confirm, for example, that Columbia’s and Harvard’s internal university software sites also provide its download for their university communities), and the CHNM at GMU is listed as the creator of Zotero in the software’s about information. The Endnote site license is said to have explicitly forbidden the license holder from engaging in the “reverse engineering, de-compiling, translation, modification, distribution, broadcasting, dissemination, or creation of derivative works from the [EndNote] Software.”

Let’s look a bit closer at the players and the issues.

What is Endnote?

Endnote is a piece of software which allows researchers in any field to compile a list of bibliographic entries – mostly lists of books or articles they have come across for use in their publications.

At its core, the software is simply a database client for research sources. However, it eventually developed three killer features that created a reluctant customer base out of virtually the entire academic world:

1) Z39.50 – In Endnote, the user doesn’t have to type in all their sources by hand. If, for example, they want to include a book found in the Library of Congress, or in any one of thousands of libraries whose online databases support something called the Z39.50 protocol, they can use Endnote to directly import the information in question. Endnote ships with dozens of “.enz” connection files which allow it to connect to most of the important libraries in the United States and search their holdings for the source required. Endnote will then add the bibliographical information directly into the user’s own database. If you can’t find your library in the default list of connections, very often the Z39.50 .enz file can be downloaded directly from your favorite library’s homepage, usually hidden somewhere deep in the geekier sections of the website. The .enz files simply contain connection information, openly available through various library websites, that has been put into a special format readable by Endnote. Interestingly for this lawsuit, I don’t know of any case in which Endnote has sued libraries for distributing these .enz files (which is a violation of the license), even though they are, like .ens files (see below), a “component part” of the software.
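To give a sense of how little magic is involved, the information such a connection file boils down to is a handful of openly published values. Here is the gist as a hypothetical Ruby hash; the field names are my own illustration, not the actual .enz layout:

    # What a Z39.50 connection amounts to: a few values any library
    # publishes openly. (Illustrative names, not Endnote's .enz format.)
    loc_connection = {
      host:          "z3950.loc.gov",  # Library of Congress's public Z39.50 server
      port:          7090,
      database:      "Voyager",
      record_syntax: "USMARC"          # the format in which records come back
    }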

2) Styles – Endnote provides the ability to convert one’s source entries into any bibliographical style, so that your footnotes, endnotes, and bibliographies can be easily formatted according to the many different styles required by various journals and publishers. These style requirements are openly available to anyone who consults the website of the given publication. In addition to providing the ability to create your own output style, Endnote has simply taken these publicly available style formats, many based on well-known formats like the Chicago citation style (see the citation style instructions for the American Historical Review, for example, here), reduced them to their most basic components, and created an “.ens” file which saves the formatting requirements in a digital format. If you have Endnote installed, you can see the huge list of style files available in the Styles sub-folder of your Endnote folder:

[screenshot: ens files.gif]

If you open any of these files in a text editor you will get mostly gibberish, as the information is stored in a format readable only (until recently) by the Endnote software. However, if you open Endnote’s style manager and inspect, for example, the style for the American Historical Review, under Bibliography templates you will see some of the kind of information stored by the .ens file. For example, under the book template you will see something like this:

Author. Title|. Translated by Translator|. Edited by Series Editor|. Edition ed|. Number of Volumes vols|. Vol. Volume|, Series Title|. City|: Publisher|, Year|.

Each of those words corresponds to a variable, a kind of empty box into which Endnote will drop your bibliographical information, in accordance with what you have entered into the database for your sources. It is important to understand, for the purposes of this first battle of the E vs. Z war, that the styles themselves are not proprietary; rather, Endnote’s lawyers are arguing that the way the company has translated these styles into a digital format – that is, the “.ens” file – is protected by the Endnote license.
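My rough reading of how such a template behaves, sketched in Ruby purely as an illustration (this is not Endnote’s actual parser): the “|” characters separate segments, and a segment whose fields are all empty is dropped, which is why “Translated by Translator” vanishes for a book with no translator.

    # Illustrative only: fill a simplified template, dropping any
    # "|"-separated segment whose fields are all empty.
    FIELDS   = %w[Author Title Translator City Publisher Year]
    TEMPLATE = "Author. Title|. Translated by Translator|. City|: Publisher|, Year|."

    def format_entry(template, record)
      template.split("|").map { |segment|
        used = FIELDS.select { |f| segment.include?(f) }
        next nil if !used.empty? && used.all? { |f| record[f].to_s.empty? }
        used.reduce(segment) { |s, f| s.gsub(f, record[f].to_s) }
      }.compact.join
    end

    book = { "Author" => "Edward Hallett Carr", "Title" => "What Is History",
             "City" => "London", "Publisher" => "Macmillan", "Year" => "1961" }
    puts format_entry(TEMPLATE, book)
    # => Edward Hallett Carr. What Is History. London: Macmillan, 1961.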

3) Word Integration – The final killer feature of Endnote is that the software can take your list of formatted footnotes, endnotes, or bibliography entries and directly interface with the most popular word processor out there: Microsoft Word. If scholars are writing a paper in Word, they can prepare an Endnote document with all the sources they need for the publication, and directly in Word they can assign certain sources to certain footnotes or to the bibliography using a Word plugin provided by Endnote. They can then, with a few clicks, format all of those footnotes, endnotes, and the bibliography in the style appropriate for whatever publication they are submitting the paper to.

For thousands of scholars this ability has saved hundreds of hours they might otherwise spend typing up their references and making sure they conform to the requirements of their publisher.

However, as a side note, this hasn’t been all good. I can share from my own experience and the experience of my friends some of the most problematic issues:

a) Garbage in – garbage out: The library databases that most users of Endnote interface with don’t always have perfect information. Sometimes information is in the wrong place, lacks capitals where it needs them, or contains a lot of superfluous information that one doesn’t want to include in every footnote. Users must often spend a lot of time cleaning up imported information before having Endnote (or Zotero, for that matter) do its magic. This is a problem of data integrity, not the fault of the software.

b) Endnote sucks. We used it because, until the rise of alternatives like RefWorks and Zotero, that is all there was. I’m sorry, but from the earliest version I started using years ago to the most recent, Endnote seems to have thrived in an environment of safety and a lack of competition. For many years Endnote could not deal with any sources in non-Roman scripts, mangling the Chinese, Japanese, and Korean sources I need. To this day, I have encoding issues with Endnote that make it a pain to use. Endnote has a user interface that seems to have been designed by programmers who have never written a paper in their lives, let alone studied user interface techniques. It is ugly, clunky, and unintuitive at every step. Finally, Endnote has long had serious stability and performance issues when it interfaces with Word. Though I haven’t personally had any major disasters, only minor hiccups caught early in the process, during my tech support days at Columbia University’s Faculty Desktop Support I had to deal with many panicking professors who showed me book or article manuscripts in Word with completely mangled footnotes. “All my references suddenly disappeared!” and “No matter what I click in Endnote, nothing converts or changes in my Word file anymore!” were two of the most common complaints I heard. Sometimes the tenuous connection between Endnote and Word just seems to break down, with disastrous consequences.

c) Endnote only works with Microsoft Word, at least as far as I know in the versions I have used. This created a vicious circle within academia. At FDS I watched more and more professors who loved their ancient alternatives to Word like WordPerfect and Nota Bene (I had never heard of it until I saw its grip on Classics and English departments), or who stubbornly resisted Microsoft’s power by using OpenOffice or Apple’s AppleWorks, have to switch to Word, not only because .doc was the dominant format but sometimes because they watched with envy as others used the power of Endnote for large-scale pieces.

The Rise of Zotero

Zotero will go down as one of the great open source legends. Unlike many other wonderful pieces of open source software, I believe Zotero is poised to completely topple its commercial rival, Endnote, and to do so in record time. Zotero has, and will continue to have, other powerful competitors who eschew the browser-based approach or embed a browser into their software, but the rule of Endnote is nearing its end. I have played with Zotero since its buggy early beta days and watched it grow into the powerful alternative to Endnote that it is today. Developed by and for the browser generation, it took a radically different starting point: Endnote users started their bibliography creation process within the Endnote software, typing up sources or using Z39.50 connections to add them to their bibliography. Zotero users start on the net, because hey, guess what, we all do.

Zotero assumes we find the majority of our sources while, for example, using a library’s search engine, browsing a list of books on Amazon.com, reading an article at JSTOR or another academic database, or reading a blog entry. Zotero has gradually added a huge list of “site translators” which scrape a web page and extract the useful bibliographical information from it. There are plugins to add metadata readable by Zotero to popular blog engines like WordPress. Whether it is a library book entry or a bookstore listing, Zotero can instantly add information from hundreds of websites and databases available online when you simply click an icon in the address bar. You can also instantly add bibliographic entries from any static web page, and save offline snapshots of these websites from the time you accessed them for future reference. All of this meant that Zotero very quickly overtook Endnote’s killer feature #1. It was an instant feature smack-down.

Because the project is free and open source, it quickly gained a huge following even when it lacked some of Endnote’s power. Those without access to a university site license were loath to dish out the ridiculous $300 for Endnote ($210 for an academic license) or face its steep learning curve, and were willing to accept cheaper alternatives like Bookends (Mac, $100, $70 for students) or the increasingly powerful Sente (Mac, $130, $90 for students). Zotero, of course, is completely free. Plugins and site translators for Zotero have spread fast as a result. It also offered powerful tagging capabilities and easy organization of sources into folders, way ahead of the incredibly limited organizational possibilities of Endnote’s file-based bibliography system. The only major weakness in Zotero’s general approach is that it is wedded to the Firefox browser, so researchers may have to do their source hunting in something other than their favorite internet browser.

I think the most powerful attack on Endnote’s market came, however, when Zotero added support for Word, OpenOffice, and NeoOffice integration. Although I think the results were somewhat mixed in the early stages (I haven’t tried the newest release), this will eventually eliminate the advantage of Endnote’s killer feature #3.

All that remained before Endnote became an expensive 175MB waste of space on one’s hard drive was for Zotero to catch up with Endnote’s killer feature #2. Now Zotero’s 1.5 Sync Preview, which is available for download as a beta, includes (though this has been temporarily disabled, perhaps because of the lawsuit) the ability to export Zotero database entries using Endnote .ens style files. I’m not 100% sure how this works on a technical basis, since I haven’t played with a functioning version that includes the feature, but the text of the Thomson Reuters lawsuit against GMU claims that Zotero now also provides a way for .ens files to be converted into the .csl style files that Zotero uses. I have seen comments on blogs claiming that the new version of Zotero never provided this ability directly, but merely provides a way to output bibliographic data via existing .ens files should the user be in possession of such Endnote files. Either way, the developers of Zotero must have engaged in some kind of reverse engineering of the gibberish we otherwise see in the .ens files (which is where the lawsuit claims there is a license violation) in order to understand how Endnote has digitally represented the publicly available output styles. Zotero is therefore now in possession of the ability, for example, to convert Zotero database data into readable bibliographical entries through these .ens files, or, if it wanted to, to save such style formatting data into .csl files, were that feature ever included.

The War Was Over Before It Began

I think we have to await the official Zotero announcement regarding the lawsuit to help us determine the accuracy of the technical claims being made by Thomson Reuters. An entirely separate question, which has received the attention of various technology-oriented law bloggers, is the strength of the legal attack itself and its separate and bizarre claim that GMU is responsible for a misuse of Endnote’s trademark.

What isn’t in dispute, however, is the fact that Endnote should be very, very scared. Whatever features are included in 1.5 or later versions, the developers of Zotero have clearly made sense of the .ens files, and suddenly the thousands of output styles provided by Endnote might potentially become importable, exportable, or, more likely, simply accessible and readable by the Zotero software. Once these publicly available style formats become digitally understood by Zotero’s database, by whatever means, Endnote loses its last and final advantage over Zotero. This will, in my mind, undoubtedly be followed by the slow death of Endnote, already begun, as new users see no advantage to using a flawed, aging piece of software with a huge price tag.

The outcome of this lawsuit, even if it goes in favor of Endnote, cannot really do much to stop this trend. Zotero isn’t going to disappear. Even if GMU were to take the radical step of completely shutting down its support for Zotero development, which I find extremely unlikely, the user base is already huge. Other programmers will pick up where GMU’s team began, with the code already in their hands. The reverse engineering of the .ens format, if it has been done successfully, can probably be explained in the space of a few paragraphs or represented by a few pages of code, perhaps encapsulated as a plugin distributed separately from the Zotero software itself. Knowledge of a file format’s structure, once in the wild, can’t be put back in the proverbial bottle – a reality faced by dozens of software applications in the past, and something we have seen with everything from Microsoft’s .doc to various proprietary image, sound, and movie file formats. Once the .ens output style files, which are all under 50k in size, can be interpreted, it is a simple matter, though of dubious legality, for scholars and students to email each other the dozen or so .ens files of the journals or institutions most important for their field, either in the original format or, if the feature is eventually made available, converted into .csl files.

I believe that, whatever the outcome of the lawsuit, Endnote’s owner has shot itself in the foot. Users like myself do not like to be locked into one solution, and when we see a free and open source alternative under attack, it is an easy matter for all of us to jump in and identify the “good guys” and the “bad guys,” to paraphrase one recent politician. Endnote is in an unenviable position. It saw Zotero’s latest move as the final straw in the assault on the Endnote user base and decided the legal move was its last chance to halt the bleeding by protecting one of the most important components of its legacy code: the .ens output styles. Strategically, they have made the wrong move, and I think all of us who agree should make our voices heard. It would have been far better for Endnote’s developers to at least attempt to out-innovate Zotero, something very hard to do when your opponent’s supporting developers include the wider community of open source programmers along with solid university and foundation funding. Instead they have given Zotero a brilliant publicity moment.

Update: The official response by Zotero and GMU about the case. Nature magazine editorial on the issue.

Further Reading

Text of the Lawsuit (PDF)

Chronicle of Higher Education Wired Campus article on the Lawsuit
Outline on Disruptive Library Technology Jester
More Extracts and Discussion at Disruptive Library Technology Jester
Crooked Timber entry by Henry on the Lawsuit
James Grimmelmann Legal Commentary
More Legal Comments at Discourse.net
Mike Madison at Madisonian Offers a Legal Take
Mention and Comments at Slashdot

The Open Source CSL Format

Script for Creating a Chinese Vocab List

The Problem: Let us say you have a list of Chinese words or single Chinese characters in a file. There are a lot of them. You want some easy and fast way of getting the pinyin and English definitions of that list of words or characters, and you want this in a format that can be easily imported into a flashcard program so you can practice them.

Today I faced this kind of problem. There are lots of “annotator” websites online that make use of the free CEDICT Chinese dictionary, but I have yet to find one which outputs a simple, nicely formatted, tab-delimited vocab list (with all the […] and /…/ markup removed).

I have recently been frustrated by the fact that I often come across Chinese characters that I haven’t learned or, more often, characters that I only know how to pronounce in Japanese or Korean. I am also frustrated that I have forgotten the tones for a lot of characters I knew well many years ago when I studied Chinese formally.

Over the summer I want to review or learn the 3500 most frequently used Chinese characters, particularly their pronunciation, so that I can improve my tones and more quickly lookup compounds I don’t know.1

I found a few frequency lists online (see here and here, for example) and stripped out the data I didn’t need to create a list with nothing but one character on each line.2 Although it is an older list, based on a huge set of Usenet postings from ’93-’94, you can download an already converted list of 3500 characters here.3

Since I’m not in the mood to look up 3500 characters one by one, I spent a few hours this evening using this problem as an excuse to write my second script in the Ruby programming language.

In the remote possibility that others find it useful who are using Mac OS X, you can download the result of my tinkering here:

Cedict Vocabulary List Generator 1.1

This download includes the 2007.8 version of CEDICT, the latest I could find here.4

How this script works:

1. After unzipping the download, boot up the “Convert.app” AppleScript application. It will ask you to identify the file you want to annotate. It is looking for a plain text file (not a Word or rich text file) in Unicode (UTF-8) format with either simplified or traditional Chinese characters or word compounds, one on each line.

2. This application will then send this information to the convert.rb Ruby script, which will search for the words in the CEDICT dictionary in the same folder and format the information it finds (the hanzi, pinyin, and English definition), putting multiple hits for the same character/word within the same entry with the definitions numbered. It does not currently add the alternate form of the hanzi (it won’t add the simplified version to the traditional or vice versa).
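For the curious, the core of this lookup step looks roughly like the following. This is a simplified sketch of the idea in Ruby, not the actual contents of convert.rb; the line format it parses (traditional, simplified, [pinyin], /definitions/) is CEDICT’s documented format:

    # Parse CEDICT lines such as:
    #   傳統 传统 [chuan2 tong3] /tradition/traditional/
    # into a lookup table keyed by both traditional and simplified forms.
    CEDICT_LINE = /^(\S+) (\S+) \[([^\]]+)\] \/(.+)\/\s*$/

    def load_cedict(path)
      entries = Hash.new { |h, k| h[k] = [] }
      File.foreach(path, encoding: "UTF-8") do |line|
        m = CEDICT_LINE.match(line) or next   # skip comments and blanks
        trad, simp, pinyin, defs = m.captures
        [trad, simp].each { |hanzi| entries[hanzi] << [pinyin, defs.split("/")] }
      end
      entries
    end

    # One tab-delimited line per word: hanzi, pinyin, numbered definitions.
    def format_line(hanzi, hits)
      pinyin = hits.map { |p, _| p }.uniq.join("; ")
      defs   = hits.flat_map { |_, d| d }
      defs   = defs.each_with_index.map { |d, i| "#{i + 1}. #{d}" } if defs.size > 1
      [hanzi, pinyin, defs.join(" ")].join("\t")
    end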

3. It will then produce a new file with the word “converted” added to its name. It creates tab-delimited files by default, but you can change this option at the top of the convert.rb file in a text editor.

4. Though this version of the script doesn’t do this yet, you may want to run the resulting text through the Pinyin Tone dashboard widget or a similar online tool such as the one here or here. That will get rid of the syllable-final tone numbers and add the appropriate tone marks. I am having a bit of trouble converting the JavaScript that my widget and this site use into Ruby, so if anyone is interested in working on this, let me know!
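For anyone tempted to take this up, here is one rough way to do the number-to-mark conversion in Ruby, following the standard placement rules (the mark goes on ‘a’ or ‘e’ if present, on the ‘o’ of “ou,” and otherwise on the last vowel); consider it a starting point, not the finished feature:

    # Convert numbered pinyin syllables (e.g. "zhong1") to tone marks.
    MARKS = {
      "a" => %w[ā á ǎ à], "e" => %w[ē é ě è], "i" => %w[ī í ǐ ì],
      "o" => %w[ō ó ǒ ò], "u" => %w[ū ú ǔ ù], "ü" => %w[ǖ ǘ ǚ ǜ]
    }

    def add_tone_mark(syllable)
      return syllable unless syllable =~ /^([a-zü]+)([1-5])$/
      letters, tone = $1, $2.to_i
      return letters if tone == 5          # neutral tone: no mark
      target = if letters.include?("a") then "a"
               elsif letters.include?("e") then "e"
               elsif letters.include?("ou") then "o"
               else letters.scan(/[iouü]/).last
               end
      target ? letters.sub(target, MARKS[target][tone - 1]) : letters
    end

    puts %w[zhong1 guo2 nü3 hao3 ma5].map { |s| add_tone_mark(s) }.join(" ")
    # => zhōng guó nǚ hǎo ma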

If the script doesn’t work: make sure you are saving your text file as UTF-8 before you convert. I am also having trouble when my script is placed somewhere on a hard disk where the path has lots of spaces. Try putting the script folder on your Desktop.

Note: If you don’t have Mac OS X but can run Ruby scripts on your operating system, you may be able to run my script convert.rb from the command line. It takes this format:

convert.rb /path/to/file.txt /path/to/cedict.u8

UPDATE 1.1: The script now replaces “u:” with “ü” (CEDICT uses u:).

  1. The top 3000 characters make up some 98-99% of the total when their cumulative frequency is considered.
  2. A few of the frequency lists I have seen have CEDICT dictionary data included, but not in a very clean format.
  3. I notice that there is a high frequency of phonetic hanzi for expressing emotion in the postings, along with some other characters one doesn’t come across as often in more formal texts, but I actually don’t mind.
  4. If you find a newer version (in UTF-8), put it in the same directory as my script and name it cedict.u8.

Fool’s Flashcard Review

A long time ago, in the last millennium, I designed a flashcard application for Mac OS that implemented something I called interval study (known elsewhere as spaced repetition or the Leitner method). I sold, and later gave away, the software at a website I created for my software tinkering called the Fool’s Workshop. I used the software every day for my own Chinese language study and acquired a few fans before abandoning development when OS X came out. I also listed, and in some cases reviewed, other flashcard applications for the Macintosh that I found online, and I was surprised to find that this page is still riding high in the Google rankings for a number of different search terms.
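For readers who haven’t come across the idea, the heart of interval study fits in a few lines of Ruby. The box count and intervals below are my own illustration, not the actual numbers Flashcard Wizard used:

    # Leitner-style scheduling: a correct answer promotes a card to a box
    # with a longer interval; a miss sends it back to the first box.
    require "date"

    INTERVALS = [1, 3, 7, 14, 30]  # days until the next review, per box

    Card = Struct.new(:front, :back, :box, :due) do
      def review!(correct, today = Date.today)
        self.box = correct ? [box + 1, INTERVALS.size - 1].min : 0
        self.due = today + INTERVALS[box]
      end
    end

    card = Card.new("难忘", "nan2 wang4: unforgettable", 0, Date.today)
    card.review!(true)    # promoted to box 1, due again in 3 days
    card.review!(false)   # missed: back to box 0, due tomorrow
    puts "box #{card.box}, due #{card.due}"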

I currently use iFlash for my vocabulary review. I’m particularly partial to iFlash because its developer was one of two who implemented interval study in a way that is almost identical to my old Flashcard Wizard application. I am always interested in the development of similar kinds of software going on around the web, and, like an old-timer telling war stories on his porch when he wasn’t really ever much of a soldier to start with, I again feel like sharing my thoughts on some of these applications.

To this end, I have created a new weblog over at the old Fool’s Workshop website:

Fool’s Flashcard Review

Here I will occasionally post reviews of flashcard software, to begin with mostly for Mac OS X, and I will especially focus on those applications which attempt to implement some kind of interval study. My goal is to give language learners a resource to compare what is out there and, even more importantly, to reach some of the developers who are working on this kind of software and convince them that these applications need certain basic features to be useful to those of us using them to learn and maintain languages, especially when we are away from the native language environment.

Watching US Online Media Outside the US

I logged on to see if I could watch part of the debate in Texas between Clinton and Obama. The debate, I believe, was partly sponsored by CNN. I tried to view the live feed on CNN but was given a message that is all too familiar to those of us outside of the United States.

[screenshot: cnn.gif]

Various online media providers sniff out your location from your IP address and block your access to online media. This is how Netflix prevents me from watching movies online through my membership when I am overseas, how various programs available through the websites of US television channels cannot be viewed outside the US, how the BBC blocks access to its regular programs, usually accessible online, for visitors outside the UK, and how CNN blocks live streaming of the US presidential primary debate in Texas.

Thanks to the technological art of IP location sniffing, traditional and new media have found another powerful way of rebuilding national borders online. I guess I will have to wait until someone uploads clips of the debate onto YouTube and try to view them before they get taken down for violating the copyright on this US presidential debate, held by CNN and others.

In Korea, the media have taken a different approach: just ask everyone for their citizen registration number. Since I am here on an A-3 US government visa, I cannot even get a foreigner registration number in Korea. That means that, while living in Korea on a one-year visa, in addition to not being able to reserve train tickets or use the vast majority of the thousands of online retailers and websites, I can’t view any (that I know of) of the Korean television media streamed or archived online.

In short, in Korea I cannot use the internet to see Korean online commercial media and I cannot use it to see major sources of online media in the United States. Fortunately, there is a reason I have never used my TV since beginning my current fellowship (and it isn’t the fact that I recently discovered that the TV in the furnished apartment may never have been working in the first place): this helps reduce the distractions to my studies to that last minor source: the rest of the internet.

UPDATE: CNN blocks the video feed to everyone outside the US but an audio feed is available here. HT dailykos.com.

Yep and my PDF Jungle

I finally declared war on my PDF organizing system. I am struggling to manage some 2400 or so PDF files on my computer. This huge number of files consists of downloaded or scanned journal articles, newspaper articles, historical documents, PhD dissertations, books, and various personal documents that I got sick of dragging around the world in fat disorganized folders.

This week I tried to find a dissertation, part of which I had read, that discussed the relationship between liberalism in Japanese domestic politics under the premiership of Hara Kei and the aftermath of the March 1st movement in colonial Korea.

Did I file this PDF away in the Academic Papers/Korea/ folder, the Academic Papers/Japan/ folder, the Documents/To Read/ folder, the Docs/Dissertation/MUST READ/ folder, or was it still stranded in the downloads folder? Apparently none of these, and I still haven’t been able to find the damn file. My folder system is an embarrassing mess.

However, in the 21st century, where tagging rules, why should my folder system matter? Why can’t I tag the stupid files and be done with it? Each PDF file could have a dozen tags that I could easily search through later. The file I described above, for example, could be tagged as “Academic Papers, Korea, Japan, colonial period, Taisho, liberalism, political, Hara Kei, March 1st Movement, dissertations.” So, in order to do this, I broke down and paid for the PDF indexing software Yep (Mac only; I’m sure there is something similar for those unfortunate Windows users out there).

I feel better now and have already made serious progress. With tag clouds, smart folders, and an iTunes-like interface in Yep, I’m hoping I will gradually overcome my jumbled mess of PDF files and therefore be able to write my own dissertation in no time. Or not, but at least I feel less like I’m hunting for a document in a bombed-out archive. I hope that future versions of the software will allow the option of including not only PDFs but also images, since scanning documents to PDF is much slower than taking pictures of them. I have thousands of crisp, high-contrast black and white photos of historical documents and newspaper articles that I would love to be able to tag in the same way without having to do it in a separate piece of software.

There are half a dozen other programs out there like Yojimbo, the DEVON products, etc. which also allow you to store PDF files in a database alongside various other kinds of data, such as regular text, images, and so on. However, what I don’t like about them is that 1) they are often slow to import PDFs, and 2) they usually import the PDFs into the program’s own database, swelling the size of the DB and slowing down its overall performance. I am not really interested in having over 30,000 pages of PDFs inside one program’s database; I would rather keep them scattered in various places on my hard drive, merely indexed by a program like Yep.

Japanese Dictionaries on Leopard

After my family acquired a “family pack” license, I installed the Mac OS X Leopard operating system on my machine last night, and the installation went smoothly. I did a clean install on a new, larger hard drive and migrated over my user files using the Migration Assistant. Things seem to be going fine, with a few free updates here and there, except for my older Macromedia apps (Dreamweaver 8 and Fireworks MX), which I won’t be able to afford upgrading.

I suppose that some of the smaller things about Leopard will either grow on me or annoy me with time. The only thing I have gotten really excited about so far, however, is the improved “Dictionary” application. It now includes Japanese dictionaries: 大辞泉, プログレッシブ英和・和英中辞典, and 類語例解辞典. They are not shown by default (in the English version of the OS; I assume they are on by default in Japanese-language installations) and you need to activate them in the Dictionary application’s preferences.

I usually use my portable electronic dictionary for J-J, J-E, and E-J (plus C-J, J-C, and a kanji dictionary) and can always look words up in various places online. Asahi.com’s dictionary site has 大辞林 (国語辞典), エクシード英和, and エクシード和英. Yahoo Japan’s dictionary site has both 大辞泉 and 大辞林, as well as the same プログレッシブ dictionaries Apple has licensed. However, it is wonderful to have all this accessible offline, right on my Mac.

Some snapshots:

大辞泉:

[screenshot: J-J-2]

プログレッシブ英和・和英中辞典:

[screenshot: E-J-E]

類語例解辞典:

[screenshot: Thes]

I wish they had included the front and back matter for these dictionaries, as they did for the English language dictionary in the new version of the application, with all its interesting reference information.

I also really hope some day that the China and Korea markets will become important enough to Apple that they will consider licensing dictionaries in those languages.

Of Knols, Trolls, and Goblins

Google recently announced its new Knol Project. Quite a number of news articles and many more blog postings have appeared to comment on the launch of the new project.

I’m rather puzzled by a lot of the concerns shown by some whose writing on similar issues I usually admire. Further down in this posting, I will respond to some of the critiques made by Crooked Timber‘s John Quiggin in his posting “Knols, wikis and reality” and by if:book‘s Ben Vershbow in his “rough notes on knols.”

This new Google project in some ways reminds me of that other competitor of Wikipedia that rarely seems to get mentioned, Everything2. Like the new Knol project, articles at Everything2 are written by single authors and can be rated by community members. There are even Google ads. Like the Knol project, there can be many articles on a given topic, which vary widely in content, length, quality, and often offer completely different kinds of material on similar topics.

It also reminds me of a software project I started designing a few years ago but never got around to writing (funny how a PhD program can get in the way of one’s amateur programming projects). My plan was to create a history knowledge-base of contributed articles, all under a Creative Commons or other similar license, which were rated by the community of readers and which competed directly with other contributed articles on similar topics. The number of points any reader could give was a function of their own “value” in the community, as judged by the aggregate point value of their own contributions (in the form of articles and comments). This was not to be pure democracy but a tyranny of meritocracy – a huge difference from Wikipedia, but similar in some ways to Everything2. In my own system, the currently “winning” article would be the most prominently listed or displayed article on a topic but might always be replaced by a new, better article. The most important feature was this: since all future writers on a topic were free to copy/steal any amount of any previous article, new articles could, like Wikipedia articles are supposed to, be small incremental improvements on any previous article. However, unlike Wikipedia but like Everything2, I also wanted to design the system so it encouraged “new narratives” and completely fresh approaches to old topics.
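The merit-weighting I had in mind is simple enough to sketch in a few lines of Ruby; the particular numbers and the logarithmic scaling here are purely illustrative:

    # A reader's vote counts for more as the community rates their own
    # contributions more highly (logarithmically, to cap runaway power).
    def vote_weight(contributor_score)
      1 + Math.log10([contributor_score, 1].max)
    end

    # votes: pairs of [voter's own contribution score, points given]
    def article_score(votes)
      votes.sum { |score, points| vote_weight(score) * points }
    end

    puts article_score([[1000, 5], [10, 5], [0, 5]])
    # => 35.0 (the veteran's 5 points carry four times a newcomer's weight)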

By contrast, in Wikipedia, if you decide to completely rewrite a popular and controversial entry on the Nanjing Massacre, which you certainly have the power to do and which I have been tempted to do, the chances are your efforts will be completely wasted as your newly written article is reverted to whatever chaotic and inconsistent mess prevailed before your arrival. Thus, hidden in the long list of revisions on any popular Wikipedia article might lurk alternative narratives that can still be viewed, but only if they are looked for by patient visitors to the site.

Wikipedia is at its core an Enlightenment project.

Its god, NPOV (Neutral Point of View), the very core of its being, is a myth. The policy requires “that where multiple or conflicting perspectives exist within a topic each should be presented fairly” and that views be presented “without bias.” NPOV is a useful myth, and not one that we should spend too much time mocking (especially those of us aspiring to professionalism in academic life), but we should always be conscious of its limits. I think every 6th-grade student of the future should be given an exercise in which they discover how any controversial Wikipedia article one might pick, no matter how well written, not only completely violates NPOV but can never hope to achieve anything remotely close to it. NPOV is impossible. The greatest theoretical challenge of the post-Enlightenment world is: how do we deal with that?

I think that we must have a strong competitor to Wikipedia based on the fundamental idea that we need competing narratives: we need them juxtaposed, we need them competing with each other, and we need the ability to monitor their changes and popularity across time so that we don’t completely become slaves of the present. This doesn’t mean we have to abandon the incremental approach and the amazing power of building upon the work of others, but we should also allow easy access to competing approaches to a problem in a single tidy, convenient, and familiar interface. Despite some key innovations, projects like Everything2 have failed to challenge Wikipedia. My own abandoned ideas for a project were half-baked, and I have no time to spend in the kitchen.

So what about Google Knol? All we have seen of what it might become is in this single screenshot. It is surely a little early to judge.

John Quiggin of the wonderful group weblog Crooked Timber has looked at the sample article from the screenshot and is not happy with its author-centered approach:

As regards simple factual statements anyone is likely to care about, I’d rather go with Wikipedia than with an individually written article, even one by an expert. Wikipedia will usually have a citation, and, if there are conflicting claims, report them. With an individual author, it’s much harder to tell if a given statistic is generally agreed to be accurate and representative of the situation.

I find this really hard to understand. A friend of mine, now a professor in his field, used to help edit dozens of articles related to pre-modern Chinese history before he abandoned the effort in exhaustion. I really want to like Wikipedia – there is a kind of “storm the Bastille” excitement in its democratic vision. Yet, in the end, having read through dozens of Wikipedia talk pages where my friend battled desperately against irrational and, unfortunately, completely ignorant voices, I see that quite often it is completely mistaken “simple factual statements” of the kind Quiggin is speaking of, including those which get a citation, that get inserted by contributors who have little or no access to good materials, no training in judging their sources, and no knowledge of context. The sad reality is that for many topics, the rational, knowledgeable, and, in many simple cases, accurate contributions get drowned out in talk pages by voices that are either more numerous or have more idle time to dedicate to the “edit wars” that can result. I really can’t understand why a mass-edited Wikipedia article with citations should win by default over an article written by an expert. Will either have a monopoly on good research? Certainly not. Will the latter always use the best data or come to the correct conclusion? Of course not. But an author-based approach does not have any inherent weaknesses that outweigh the similar inherent weaknesses of the average Wikipedia article.

if:book is one of my favorite weblogs that discusses the future of reading, writing, narration, and the technologies that go with them. Ben Vershbow has posted some of his notes on the Knol project.

Vershbow has a lot of concerns, beginning with the term “knol” itself, which he says is “possibly the worst Internet neologism in recent memory.” I am actually quite fond of it; it reminds me of words like “node” and other single-syllable terms familiar to programmers that are used to represent single atomic units of something. It can hardly qualify as the worst, a position which I believe is still safely held by the word “blog.”

Vershbow points out some of the features of the Knol project which I think are commendable and which resemble some of the best ideas out there: 1) anyone can write, 2) multiple knols can compete on a single topic, 3) readers can rate the articles, and 4) there is a “Darwinian writers’ market where the fittest knols rise to the top.”

This sounds a lot like what I had imagined for a CMS, but I think the key would be a license that allows any future or competing writers to use any or all of the previous knols to build better articles.

One of Vershbow’s main concerns, which he shares with Anil Dash of Six Apart, is that Google is suffering from a kind of lack of “theory of mind”: an inability to understand the contradiction between what it is, a large for-profit corporation whose profits are intricately connected to the kind of content its searches produce, and its altruistic dreams.

While I share with Vershbow and other Google critics a whole host of complaints about Google projects such as Google Books, which I have on occasion discussed at some length here at Muninn, I am a bit surprised at critiques like this, which seem to attack Google’s new projects almost on principle. He also has deep worries for a future in which knol articles might come to displace untainted non-Google articles in the search results.

It is not so much that I disagree with Vershbow’s deep suspicions of Google, or his pessimism about the role of mammoths like Google in being both a host of content (YouTube, Google Books, Knol) and the most popular manager and ranker of metadata about such content; I’m sure I could be persuaded by good arguments on those points.

What troubles me, rather, is the complete lack of confidence in the contributors of content: in the authors, experts, and web users of the future. I think Google’s hegemony is limited and requires our continued complicity. The Knol project doesn’t lock content in, as far as I understand it, especially if users can choose their own licenses.

Finally, Vershbow, like Quiggin, has doubts about the author-centric nature of the project.

The basic unit of authorial action in Wikipedia is the edit. Edits by multiple contributors are combined, through a complicated consensus process, into a single amalgamated product. On Google’s encyclopedia the basic unit is the knol. For each knol (god, it’s hard to keep writing that word) there is a one to one correspondence with an individual, identifiable voice. There may be multiple competing knols, and by extension competing voices (you have this on Wikipedia too, but it’s relegated to the discussion pages).

Vershbow intelligently withholds final judgment on whether this author-based approach, similar to Larry Sanger’s Citizendium, will work out, but raises many doubts:

I wonder… whether this system will really produce quality. Whether there are enough checks and balances. Whether the community rating mechanisms will be meaningful and confidence-inspiring. Whether self-appointed experts will seem authoritative in this context or shabby, second-rate and opportunistic. Whether this will have the feeling of an enlightened knowledge project or of sleezy intellectual link farming (or something perfectly useful in between).

I think he is right to have such doubts, but could we not raise a whole host of similar questions about Wikipedia, the tool which even its most hostile detractors around me now use on a daily basis? Ultimately, Vershbow is inclined to trust Wikipedia, which “wears its flaws on its sleeve” and works for a “higher aim.” Google’s project, after all, is born in sin, tainted as it is by its capitalist origins.

My own feeling is that as long as the content is not locked in, signed away to Google, we shouldn’t conflate the sinner with the products of her collaborating contributors. This is a great time to test a (at least in some ways) new model for knowledge sharing.

I still believe this new approach would stand the best chance of improving on existing alternatives if it were more dictatorial in one respect: all contributions should be released under some license which requires a minimum level of permission for sharing, so that future competing writers of knols can either provide fresh competing articles or, for some or all sections, quickly and easily lift and modify chunks of earlier knols, perhaps with due attribution accessible somewhere from the knol’s page. That would allow it to combine the best of Wikipedia’s collaborative approach with the benefits of author-based control.

Fixing Garbled Tags for Korean and Chinese Songs in iTunes

The song name, artist, and album tags in many music files (whether acquired legally or otherwise) from Chinese and Korean sources are completely garbled in iTunes on a Macintosh. I assume this is because iTunes assumes the text is in one encoding (Unicode or MacRoman?) when it was in fact encoded in another (often EUC-KR for Korean, Big5 for Taiwanese files, and GB for files with simplified Chinese characters). I used to frequently have this problem with Japanese music files, but for some reason (perhaps because Unicode is more popular in Japan?) it has gradually become less common.
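The underlying problem, and the fix, can be sketched in a few lines of Ruby. This is only an illustration of the mechanism: I am guessing at Latin-1 as the mistakenly assumed encoding (iTunes may well assume MacRoman instead), and real ID3 tags add version-specific headers on top of the raw text:

    # Bytes that are really EUC-KR get decoded as if they were Latin-1,
    # producing mojibake; undoing the mistaken decoding recovers the text.
    original = "난 너를 사랑해"                    # what the tagger wrote
    bytes    = original.encode("EUC-KR")          # the bytes stored in the file
    garbled  = bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
    fixed    = garbled.encode("ISO-8859-1").force_encoding("EUC-KR").encode("UTF-8")

    puts garbled            # gibberish of the kind iTunes displays
    puts fixed == original  # => true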

Fixing these tags can be a pain, and some of the older tools, such as the once awesome “MP3 Rage” and “ID3 Editor,” often make things worse due to their inconsistent handling of two-byte non-Roman scripts.

An Apple support page, however, recently pointed me to a great shareware application ($12) called ID3Mod2, which looks like it is made by the same people who made the incredible Chinese input method QIM that I talked about in an earlier posting (I don’t know this developer personally, so it is not as if I’m trying to find good things to say about their work). You can use the software freely for a number of days, during which I was able to go through and fix all of the garbled tags in music files I have collected in China, Korea, and Japan over the last decade. Amazing – I might now actually learn the names of some of the songs I have been listening to for so long, and someday even gather the courage to request them on a future karaoke adventure.