Tech


China and Language and Tech and Workshop | 22 Jun 2008 02:24 pm

The Problem: Let us say you have a list of Chinese words or single Chinese characters in a file. There are a lot of them. You want some easy and fast way of getting the pinyin and English definitions of that list of words or single characters and you want this in a format that can be easily imported into a flashcard program so you can practice these words.

Today I faced this kind of problem. There are lots of “annotator” websites online that make use of the free CEDICT Chinese dictionary, but I have yet to find one that outputs a simple, nicely formatted (with all the [...] and /…/ stuff removed), tab-delimited vocab list.

I have recently been frustrated by the fact that I often come across Chinese characters that I haven’t learned or, more often, characters that I only know how to pronounce in Japanese or Korean. I am also frustrated that I have forgotten the tones for a lot of characters I knew well many years ago when I studied Chinese formally.

Over the summer I want to review or learn the 3500 most frequently used Chinese characters, particularly their pronunciation, so that I can improve my tones and more quickly look up compounds I don’t know.1

I found a few frequency lists online (see here and here for example) and I stripped out the data I didn’t need to create a list with nothing but one character on each line.2 Although it is an older list based on a huge set of Usenet postings from ‘93-’94 you can download an already converted list of 3500 characters here.3
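The stripping step, and the cumulative-frequency figure mentioned in the first footnote, can be sketched roughly as follows. This is only an illustration: the "character, tab, count" input format and the sample counts are assumptions made up for the example, and the real frequency lists are laid out differently.

```ruby
# Made-up sample in an ASSUMED "character<TAB>count" format; real lists differ.
SAMPLE_FREQUENCIES = "的\t50\n是\t30\n不\t15\n龘\t5\n"

# Turn the raw text into [character, count] pairs, skipping malformed lines.
def parse_frequency_list(text)
  pairs = []
  text.each_line do |line|
    character, count = line.strip.split("\t")
    next if character.nil? || count.nil?
    pairs << [character, count.to_i]
  end
  pairs
end

# The `limit` most frequent characters, one per array element.
def top_characters(pairs, limit)
  pairs.sort_by { |_character, count| -count }.first(limit).map(&:first)
end

# Share of all occurrences accounted for by the `limit` most frequent characters.
def coverage(pairs, limit)
  counts = pairs.map { |_character, count| count }.sort.reverse
  counts.first(limit).inject(0, :+).to_f / counts.inject(0, :+)
end

pairs = parse_frequency_list(SAMPLE_FREQUENCIES)
puts top_characters(pairs, 3).join("\n") # one character per line
puts coverage(pairs, 3)                  # fraction of the sample covered by the top 3
```

Writing the output of `top_characters` to a file gives the kind of one-character-per-line list described above.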

Since I’m not in the mood to look up 3500 characters one by one, I spent a few hours this evening using this problem as an excuse to write my second script in the Ruby programming language.

On the remote chance that other Mac OS X users find it useful, you can download the result of my tinkering here:

Cedict Vocabulary List Generator 1.1

This download includes the 2007.8 version of CEDICT, the latest I could find here.4

How this script works:

1. After unzipping the download, launch the “Convert.app” AppleScript application. It will ask you to identify the file you want to annotate. It is looking for a plain text file (not a Word or rich text file) in Unicode (UTF-8) format with either simplified or traditional Chinese characters or word compounds, one on each line.

2. This application will then send this information to the convert.rb Ruby script, which will search for the words in the CEDICT dictionary in the same folder and format the information it finds (the hanzi, pinyin, and English definition). Multiple hits for the same character/word are put within the same entry with the definitions numbered. It does not currently add the alternate form of the hanzi (it won’t add the simplified version to traditional or vice versa).

3. It will then produce a new file with the word “converted” added to its name. It creates tab-delimited files by default, but you can change this by editing the option at the top of the convert.rb file in a text editor.

4. Though this version of the script doesn’t do this yet, you may want to run the resulting text through the Pinyin Tone dashboard widget or a similar online tool such as the one here or here. That will get rid of the syllable-final tone numbers and add the appropriate tone marks. I am having a bit of trouble converting the JavaScript that my widget and this site use into Ruby, so if anyone is interested in working on this let me know!
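The lookup and formatting in steps 2 and 3 can be sketched roughly as below. This is not the actual convert.rb: the function names are hypothetical, the sample entries are made up, and the sketch assumes each CEDICT line has the shape `TRADITIONAL SIMPLIFIED [pin1 yin1] /definition 1/definition 2/`.

```ruby
# Hypothetical sketch, NOT the real convert.rb.
# ASSUMED entry shape: TRADITIONAL SIMPLIFIED [pin1 yin1] /def 1/def 2/
ENTRY_PATTERN = %r{\A(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\z}

# Made-up sample lines standing in for the contents of cedict.u8.
SAMPLE_CEDICT = [
  "# comment lines start with a hash mark",
  "中國 中国 [Zhong1 guo2] /China/Middle Kingdom/",
  "長 长 [chang2] /long/length/",
  "長 长 [zhang3] /chief/to grow/",
  "女 女 [nu:3] /female/woman/"
]

# Build a lookup table keyed by both the traditional and simplified forms.
def parse_cedict(lines)
  index = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    next if line.start_with?("#")
    match = ENTRY_PATTERN.match(line.strip)
    next unless match
    traditional, simplified, pinyin, definitions = match.captures
    entry = { pinyin: pinyin.gsub("u:", "ü"),
              english: definitions.split("/").join("; ") }
    [traditional, simplified].uniq.each { |form| index[form] << entry }
  end
  index
end

# One output line per word: hanzi, pinyin, English, joined by the delimiter.
# Several hits for the same word are merged, with the definitions numbered.
def format_word(word, index, delimiter = "\t")
  entries = index.fetch(word, [])
  return nil if entries.empty?
  if entries.length == 1
    pinyin = entries[0][:pinyin]
    english = entries[0][:english]
  else
    pinyin = entries.map { |e| e[:pinyin] }.uniq.join(", ")
    english = entries.each_with_index
                     .map { |e, i| "(#{i + 1}) #{e[:english]}" }.join(" ")
  end
  [word, pinyin, english].join(delimiter)
end

index = parse_cedict(SAMPLE_CEDICT)
%w[中国 长 女].each { |word| puts format_word(word, index) }
```

Keeping the delimiter as a parameter mirrors the idea of a single option at the top of the script; words that are not in the dictionary simply return `nil` here and would need their own handling.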

If the script doesn’t work: make sure you are saving your text file as UTF-8 before you convert. I am also having trouble when my script is placed somewhere on a hard disk where the path has lots of spaces. Try putting the script folder on your Desktop.

Note: If you don’t have Mac OS X but can run Ruby scripts on your operating system, you may be able to run my script convert.rb from the command line. It takes this format:

convert.rb /path/to/file.txt /path/to/cedict.u8

UPDATE 1.1: The script now replaces “u:” with “ü” (CEDICT uses u:).
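For anyone who wants to try the tone mark conversion in Ruby, here is a minimal sketch. It is not a port of the widget's JavaScript; it assumes the common placement rule (an "a" or "e" takes the mark, otherwise the "o" in "ou", otherwise the last vowel), treats tone 5 as neutral, and the function names are hypothetical.

```ruby
# Tone-marked forms for tones 1-4 of each vowel that can carry a mark.
TONE_MARKS = {
  "a" => %w[ā á ǎ à], "e" => %w[ē é ě è], "i" => %w[ī í ǐ ì],
  "o" => %w[ō ó ǒ ò], "u" => %w[ū ú ǔ ù], "ü" => %w[ǖ ǘ ǚ ǜ],
  "A" => %w[Ā Á Ǎ À], "E" => %w[Ē É Ě È], "O" => %w[Ō Ó Ǒ Ò]
}.freeze

# Convert one numbered syllable such as "guo2" or "nu:3".
def mark_syllable(syllable)
  match = /\A([A-Za-zü:]+)([1-5])\z/.match(syllable)
  return syllable unless match       # leave anything unexpected untouched
  letters = match[1].gsub("u:", "ü") # CEDICT writes ü as "u:"
  tone = match[2].to_i
  return letters if tone == 5        # neutral tone: just drop the number

  lower = letters.downcase
  # ASSUMED rule: a/e first, then the o of "ou", else the last vowel.
  position = lower.index(/[ae]/) || lower.index("ou") || lower.rindex(/[aeiouü]/)
  return letters if position.nil?    # no vowel at all, e.g. a bare "r"
  marks = TONE_MARKS[letters[position]]
  return syllable if marks.nil?      # a capital this sketch does not cover
  letters[0...position] + marks[tone - 1] + letters[(position + 1)..-1]
end

# Convert a space-separated pinyin string, syllable by syllable.
def add_tone_marks(pinyin)
  pinyin.split(" ").map { |syllable| mark_syllable(syllable) }.join(" ")
end

puts add_tone_marks("Zhong1 guo2")
puts add_tone_marks("nu:3 ren2")
```

The sketch only handles syllables separated by spaces, which is how CEDICT writes its pinyin; run-together pinyin would first need to be split into syllables.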

  1. The top 3000 make up some 98-99% when their cumulative frequency is considered.
  2. A few of the frequency lists I have seen have Cedict dictionary data included but not in a very clean format.
  3. I notice that there is a high frequency of phonetic hanzi for expressing emotion in the postings, as well as some other characters one doesn’t come across as often in more formal texts. I actually don’t mind.
  4. If you find a newer version (in UTF-8), put it in the same directory as my script and name it cedict.u8.
Links and Tech and Thoughts | 28 Apr 2008 09:46 am

A long time ago, in the last millennium, I designed a flashcard application for Mac OS that implemented something I called interval study (known elsewhere as spaced repetition or the Leitner method). I sold and later gave away the software at a website I created for my software tinkering called the Fool’s Workshop. I used the software every day for my own Chinese language study and I acquired a few fans before I abandoned development of the software when OS X came out. I also listed some of the other applications for Macintosh that I found online and reviewed some of them on the website and was surprised to find that this page is still riding high in the Google rankings for a number of different search terms.

I currently use iFlash for my vocabulary review. I’m particularly partial to iFlash because its developer was one of two who implemented interval study in a way that is almost identical to my old Flashcard Wizard application. I am always interested in the development going on around the web of similar kinds of software, and like an old timer telling war stories on his porch when he wasn’t really ever much of a soldier to start with, I again feel like sharing my thoughts on some of these applications.

To this end, I have created a new weblog over at the old Fool’s Workshop website:

Fool’s Flashcard Review

Here I will occasionally post reviews of flashcard software, to begin with mostly for Mac OS X, and I will especially focus on those applications which attempt to implement some kind of interval study. My goal is to give language learners a resource to compare what is out there but, even more importantly, to reach some of the developers who are working on this kind of software and convince them that these applications need to have certain basic features to be useful to those of us using their software to learn and maintain the languages we have studied, especially when we are away from the native language environment.

Open Access and Tech and Thoughts | 21 Feb 2008 05:02 am

I logged on to see if I could watch part of the debate in Texas between Clinton and Obama. The debate, I believe, was partly sponsored by CNN. I tried to view the live feed on CNN but was given a message that is all too familiar to those of us outside of the United States.

[Image: cnn.gif]

Various online media providers sniff out your location from your IP address and block your access to online media. This is how Netflix prevents me from watching movies online through my membership when overseas, how the programs now offered online through the websites of various US television channels cannot be viewed outside the US, how the BBC blocks visitors outside the UK from the regular programs it otherwise makes accessible online, and how CNN blocks live streaming of the US presidential primary debate in Texas.
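In its simplest form, this kind of blocking amounts to a range lookup. The sketch below is only an illustration of the idea: real services rely on large commercial geolocation databases, whereas the ranges here are reserved documentation addresses with made-up country assignments, and the function names are hypothetical.

```ruby
require "ipaddr"

# Made-up country assignments for reserved documentation ranges (illustration only).
COUNTRY_RANGES = {
  "US" => [IPAddr.new("192.0.2.0/24")],
  "KR" => [IPAddr.new("198.51.100.0/24")]
}.freeze

# Return the country code whose range contains the address, or nil if unknown.
def country_for(ip_string)
  address = IPAddr.new(ip_string)
  COUNTRY_RANGES.each do |country, ranges|
    return country if ranges.any? { |range| range.include?(address) }
  end
  nil
end

# A stream is served only to visitors whose address maps to an allowed country.
def may_stream?(ip_string, allowed_countries = ["US"])
  allowed_countries.include?(country_for(ip_string))
end

puts may_stream?("192.0.2.10")   # address inside the range labeled "US"
puts may_stream?("198.51.100.7") # address inside the range labeled "KR"
```

Note that an address the table does not know about is refused as well, which is one plausible policy rather than a description of what any particular site does.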

Thanks to the technological art of IP location sniffing, traditional and new media have found another powerful way of rebuilding national borders online. I guess I will have to wait until someone uploads clips of the debate onto YouTube and try to view them before they get taken down for violating the copyright on this US presidential debate held by CNN and others.

In Korea, the media have taken a different approach: Just ask everyone for their citizen registration number. Since I am here on an A-3 US government visa, I cannot even get a foreigner registration number in Korea. That means, when I am living in Korea on a one year visa, in addition to not being able to reserve train tickets and use the vast majority of the thousands of online retailers and websites, I can’t view any (that I know of) of the Korean television media streamed or archived online.

In short, in Korea I cannot use the internet to see Korean online commercial media and I cannot use it to see major sources of online media in the United States. Fortunately, there is a reason I have never used my TV since beginning my current fellowship (and it isn’t the fact that I recently discovered that the TV in the furnished apartment may never have been working in the first place): this helps reduce the distractions to my studies to that last minor source: the rest of the internet.

UPDATE: CNN blocks the video feed to everyone outside the US but an audio feed is available here. HT dailykos.com.

Tech | 20 Feb 2008 04:07 am

I finally declared war on my PDF organizing system. I am struggling to manage some 2400 or so PDF files on my computer. This huge number of files consists of downloaded or scanned journal articles, newspaper articles, historical documents, PhD dissertations, books, and various personal documents that I got sick of dragging around the world in fat disorganized folders.

This week I tried to find a dissertation, part of which I had read, that talked about the relationship between liberalism in Japanese domestic politics under the premiership of Hara Kei and the aftermath of the March 1st movement in colonial Korea.

Did I file this PDF away in the Academic Papers/Korea/ folder, the Academic Papers/Japan/ folder, the Documents/To Read/ folder, the Docs/Dissertation/MUST READ/ folder, or was it still stranded in the downloads folder? Apparently none of these and I still haven’t been able to find the damn file. My folder system is an embarrassing mess.

However, in the 21st century, where tagging rules, why should my folder system matter? Why can’t I tag the stupid files and be done with it? Each PDF file could have a dozen tags that I could easily search through later. The file I described above, for example, could be tagged as “Academic Papers, Korea, Japan, colonial period, Taisho, liberalism, political, Hara Kei, March 1st Movement, dissertations”. Well, in order to do this I broke down and paid for the PDF indexing software Yep (Mac only; I’m sure there is something similar out there for those unfortunate Windows users).
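The appeal of tags over folders is easy to see in a small sketch: a file can sit under any number of tags, and a search is just an intersection. This says nothing about how Yep works internally; the class name and file paths are hypothetical.

```ruby
require "set"

# A tiny in-memory tag index: each tag maps to the set of files carrying it.
class TagIndex
  def initialize
    @files_by_tag = Hash.new { |hash, tag| hash[tag] = Set.new }
  end

  # Attach any number of tags to a file path.
  def tag(path, *tags)
    tags.each { |t| @files_by_tag[normalize(t)] << path }
    self
  end

  # Files that carry ALL of the given tags, in sorted order.
  def find(*tags)
    return [] if tags.empty?
    sets = tags.map { |t| @files_by_tag.fetch(normalize(t), Set.new) }
    sets.reduce(:&).to_a.sort
  end

  private

  # Treat "Korea" and " korea " as the same tag.
  def normalize(tag)
    tag.strip.downcase
  end
end

index = TagIndex.new
index.tag("/pdfs/hara-kei-dissertation.pdf", "Korea", "Japan", "Taisho", "dissertations")
index.tag("/pdfs/march-first-article.pdf", "Korea", "colonial period")
puts index.find("korea", "japan")
```

With an index like this, the dissertation would turn up under "Korea" and under "Japan" alike, no matter which folder it was stranded in.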

I feel better now and have already made serious progress. With tag clouds, smart folders, and an iTunes-like interface in Yep, I’m hoping I will gradually overcome my jumbled mess of PDF files and therefore be able to write my own dissertation in no time. Or not, but at least I feel less like I’m hunting for a document in a bombed out archive. I hope that future versions of the software will allow the option of including not only PDFs but also images, since scanning documents and saving them as PDFs is much slower than taking pictures of documents. I have thousands of crisp high contrast black and white photos of historical documents and newspaper articles that I would love to be able to tag in the same way without having to do it in a separate piece of software.

There are half a dozen other programs out there, like Yojimbo, the DEVON products, etc., which also allow you to store PDF files in a database of various kinds of data that might also include regular text, images, and so on. However, what I don’t like about them is that 1) they are often slow to import the PDFs and 2) they usually import the PDFs into the program’s database, thus swelling the size of the DB and slowing down its overall performance. I am not really interested in having over 30,000 pages of PDFs all inside the DB of a program and would like to keep them scattered in various places on my hard drive, only to be indexed by a program like Yep.

Language and Tech | 25 Dec 2007 11:18 am

After my family acquired a “family pack” license, I installed the Mac OS X Leopard operating system on my machine last night and everything went smoothly with the installation. I did a clean install on a new larger hard drive and migrated over my user files using the migration assistant. Things seem to be going fine with a few free updates here and there except for my older Macromedia apps (Dreamweaver 8 and Fireworks MX) which I won’t be able to afford upgrading.

I suppose that some of the smaller things about Leopard will either grow on me or annoy me with time. The only thing I have gotten really excited about so far, however, is the improved “Dictionary” application. It now has four Japanese dictionaries: 大辞泉, プログレッシブ英和・和英中辞典 (an English-Japanese and a Japanese-English dictionary), and 類語例解辞典. They are not shown by default in the English version of the OS (I assume they are on by default on Japanese language installations), and you need to activate them in the Dictionary application’s preferences.

I usually use my portable electronic dictionary for J-J, J-E, and E-J (plus C-J, J-C, and Kanji dictionary) and can always look words up in various places online. Asahi.com’s dictionary site has 大辞林(国語辞典), エクシード英和, and エクシード和英. Yahoo Japan’s Dictionary site has both 大辞泉 and 大辞林 as well as the same プログレッシブ dictionaries Apple has licensed. However, it is wonderful to have all this accessible offline right on my Mac.

Some snapshots, click image for larger version:

大辞泉:

[Image: J-J-2]

プログレッシブ英和・和英中辞典:

[Image: E-J-E]

類語例解辞典:

[Image: Thes]

I wish they had included the front and back matter for these dictionaries, as they did for the English language dictionary in the new version of the application, with all its interesting reference information.

I also really hope some day that the China and Korea markets will become important enough to Apple that they will consider licensing dictionaries in those languages.

Reading and Tech and Thoughts | 21 Dec 2007 08:25 pm

Google recently announced its new Knol Project. Quite a number of news articles and many more blog postings have appeared to comment on the launch of the new project.

I’m rather puzzled by a lot of the concerns shown by some whose writing on similar issues I usually admire. Further down in this posting, I will respond to some of the critiques made by Crooked Timber’s John Quiggin in his posting Knols, wikis and reality and by if:book’s Ben Vershbow in his rough notes on knols.

This new Google project in some ways reminds me of that other competitor of Wikipedia that rarely seems to get mentioned, Everything2. Like the new Knol project, articles at Everything2 are written by single authors and can be rated by community members. There are even Google ads. Like the Knol project, there can be many articles on a given topic, which vary widely in content, length, quality, and often offer completely different kinds of material on similar topics.

It also reminds me of a software project I started designing a few years ago but never got around to writing up (funny how a PhD program can get in the way of one’s amateur programming projects). My plan was to create a history knowledge-base which contained contributed articles, all under a Creative Commons or other similar license, which were rated by the community of readers and which competed directly with other contributed articles on similar topics. The number of points any reader could give was a function of their own “value” in the community as judged by the aggregate point value of their own contributions (in the form of articles and comments). This was not to be pure democracy but a tyranny of meritocracy - a huge difference from Wikipedia but similar in some ways to Everything2. In my own system, the currently “winning” article would be the most prominently listed or displayed article on a topic but might always be replaced with a new, better article. The most important feature was this: since all future writers on a topic were free to copy/steal any amount of any previous article, new articles could, like Wikipedia articles are supposed to, be small incremental improvements of any previous article. However, unlike Wikipedia but like Everything2, I also wanted to design the system so it encouraged “new narratives” and completely fresh approaches to old topics.
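A rough sketch of such a weighted rating scheme might look like the following. The design above never specifies the function that turns a member's standing into voting power, so the logarithmic weight used here is purely an assumption, as are the class and method names; comments are left out of the standing calculation for brevity.

```ruby
# Hypothetical sketch of a reputation-weighted rating system.
class KnowledgeBase
  Article = Struct.new(:title, :topic, :author, :points)

  def initialize
    @articles = []
  end

  # Add a new article that competes with others on the same topic.
  def submit(title, topic, author)
    article = Article.new(title, topic, author, 0.0)
    @articles << article
    article
  end

  # Aggregate points earned by everything a member has contributed.
  def standing(member)
    @articles.select { |a| a.author == member }
             .inject(0.0) { |total, a| total + a.points }
  end

  # ASSUMED weighting: everyone gets one point, plus a bonus growing with standing.
  def vote_weight(member)
    1.0 + Math.log2(1.0 + standing(member))
  end

  # A reader's rating adds their current vote weight to the article.
  def rate(reader, article)
    article.points += vote_weight(reader)
  end

  # The currently "winning" article on a topic: the one with the most points.
  def winner(topic)
    @articles.select { |a| a.topic == topic }.max_by(&:points)
  end
end

kb = KnowledgeBase.new
first = kb.submit("First narrative", "Nanjing Massacre", "alice")
second = kb.submit("Fresh narrative", "Nanjing Massacre", "bob")
kb.rate("carol", first)  # carol has no contributions yet, so her vote is the baseline
kb.rate("alice", second) # alice's vote counts for more because her article has points
puts kb.winner("Nanjing Massacre").title
```

A logarithm keeps established members from swamping newcomers entirely, but any increasing function of standing would fit the description equally well.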

By contrast, in Wikipedia if you decide to completely rewrite a popular and controversial entry on the Nanjing Massacre, which you certainly have the power to do and I have been tempted to do, the chances are your efforts will be completely wasted as your newly written article is completely reverted to whatever chaotic and inconsistent mess prevailed before your arrival. Thus, hidden in the long list of revisions on any popular Wikipedia article might lurk alternative narratives that can still be viewed, but only if they are looked for by patient visitors to the site.

Wikipedia is at its core an Enlightenment project.

Its god, NPOV (Neutral Point of View), the very core of its being, is a myth. The policy requires “that where multiple or conflicting perspectives exist within a topic each should be presented fairly” and that views be presented “without bias.” NPOV is a useful myth, and not one that we should spend too much time mocking (especially those of us aspiring to professionalism in academic life), but we should always be conscious of its limits. I think every 6th grade elementary school student of the future should be given an exercise wherein they are given the opportunity to discover how any controversial Wikipedia article one might pick, no matter how well written, not only completely violates NPOV but can never hope to achieve anything remotely close to NPOV. NPOV is impossible. The greatest theoretical challenge of the post-Enlightenment world is, “How do we deal with that?”

I think that we must have a strong competitor to Wikipedia which is based on the fundamental idea that we need competing narratives, we need them juxtaposed, we need them competing with each other, and we need the ability to monitor their changes and popularity across time so that we don’t completely become slaves of the present. This doesn’t mean we have to completely abandon the incremental approach and the amazing power of building upon the work of others, but we should also allow for easy access to competing approaches to a problem in a single tidy, convenient, and familiar interface. Despite some key innovations, projects like Everything2 have failed to challenge Wikipedia. My own abandoned ideas for a project were half-baked and I have no time to spend in the kitchen.

So what about Google Knol? All we have seen of what it might become is in this single screenshot. It is surely a little early to judge.

John Quiggin of the wonderful group weblog Crooked Timber has looked at the sample article from the screenshot and is not happy with its author centered approach:

As regards simple factual statements anyone is likely to care about, I’d rather go with Wikipedia than with an individually written article, even one by an expert. Wikipedia will usually have a citation, and, if there are conflicting claims, report them. With an individual author, it’s much harder to tell if a given statistic is generally agreed to be accurate and representative of the situation.

I find this really hard to understand. A friend of mine, now a professor in his field, used to help edit dozens of articles related to pre-modern Chinese history before he abandoned the effort in exhaustion. I really want to like Wikipedia - there is a “storm the Bastille” kind of excitement in its democratic vision. Yet, in the end, having read through dozens of Wikipedia talk pages where my friend battled desperately against irrational and, unfortunately, completely ignorant voices, I see that quite often it is completely mistaken “simple factual statements” of the kind Quiggin is speaking of, including those which get a citation, that get inserted by contributors who have little or no access to good materials, no training in judging their sources, and no knowledge of context. The sad reality is that for many topics, the rational, knowledgeable, and in many simple cases the accurate contributions get drowned out in talk pages by voices that are either more numerous or which have more idle time to dedicate to the “edit wars” that can result. I really can’t understand why a mass-edited Wikipedia article with citations will win by default over an article written by an expert. Will either have a monopoly on good research? Certainly not. Will the latter always use the best data or come to the correct conclusion? Of course not. But an author-based approach does not have any inherent weaknesses that outweigh similar inherent weaknesses of the average Wikipedia article.

if:book is one of my favorite weblogs that discusses the future of reading, writing, narration, and the technologies that go with them. Ben Vershbow has posted some of his notes on the Knol project.

Vershbow has a lot of concerns, beginning with the term “knol” which he says is “possibly the worst Internet neologism in recent memory.” I am actually quite fond of it; it reminds me of similar words like “node” and other single syllable words familiar to programmers that are used to represent single atomic units of something. It can hardly qualify as the worst, a position which I believe is still safely held by the word “blog.”

Vershbow points out some of the features of the knol project which I think are commendable and which resemble some of the best ideas out there: 1) Anyone can write 2) Multiple knols can compete on a single topic 3) Readers can rate the articles 4) a “Darwinian writers’ market where the fittest knols rise to the top.”

This sounds a lot like what I had imagined for a CMS but I think the key would be a license that would allow any future or competing writers to use any or all of previous knols to build better articles.

One of Vershbow’s main concerns, which he shares with Anil Dash of Six Apart, is that Google is suffering from a kind of lack of “theory of mind” - an inability to understand the contradiction between what it is: a large profit-run corporation whose profits are intricately connected to the kind of content its searches produce, and its altruistic dreams.

While I share with Vershbow and other Google critics a whole host of complaints about Google projects such as Google Books, which I have on occasion gone into at some length here at Muninn, I am a bit surprised at critiques like this which seem to attack Google’s new projects almost on principle. He also has deep worries for the future when knol articles might come to displace untainted non-Google articles in the search results.

It is not so much that I disagree with Vershbow’s deep suspicions about Google or his pessimism about the role of mammoths like Google in being both a host of content (YouTube, Google Books, Knol) and the most popular manager and ranker of metadata about such content, since I’m sure I can be persuaded with good arguments.

It is the complete lack of confidence in the contributors of content, in the authors, experts, and web users of the future. I think Google’s hegemony is limited and requires our continued complicity. The knol project doesn’t lock content in, as far as I understand it, especially if users can choose their own licenses.

Finally, Vershbow, like Quiggin, has doubts about the author-centric nature of the project.

The basic unit of authorial action in Wikipedia is the edit. Edits by multiple contributors are combined, through a complicated consensus process, into a single amalgamated product. On Google’s encyclopedia the basic unit is the knol. For each knol (god, it’s hard to keep writing that word) there is a one to one correspondence with an individual, identifiable voice. There may be multiple competing knols, and by extension competing voices (you have this on Wikipedia too, but it’s relegated to the discussion pages).

Vershbow intelligently withholds final judgment on whether this author based approach, similar to Larry Sanger’s Citizendium, will work out but raises many doubts:

I wonder… whether this system will really produce quality. Whether there are enough checks and balances. Whether the community rating mechanisms will be meaningful and confidence-inspiring. Whether self-appointed experts will seem authoritative in this context or shabby, second-rate and opportunistic. Whether this will have the feeling of an enlightened knowledge project or of sleezy intellectual link farming (or something perfectly useful in between).

I think he is right to have such doubts, but could we not raise a whole host of similar questions about Wikipedia, the tool which now even its most hostile detractors around me use on a daily basis? Ultimately, Vershbow is inclined to trust Wikipedia, which “wears its flaws on its sleeve” and works for a “higher aim.” Google’s project, after all, is born in sin, tainted as it is by its capitalist origins.

My own feeling is that as long as the content is not locked in, signed away to Google, we shouldn’t conflate the sinner with the products of her collaborating contributors. This is a great time to test a (at least in some ways) new model for knowledge sharing.

I still believe this new approach would stand the best chance of making an improvement over existing alternatives if it were more dictatorial in one respect: that all contributions should be released with some license which requires a minimum level of permission for sharing - so that future competing writers of knols can either provide fresh competing articles, or, in some or all sections, quickly and easily lift and modify chunks of earlier knols, perhaps with due attribution accessible somewhere from the Knol’s page. That would allow it to combine the best of Wikipedia’s collaborative approach with the benefits of author-based control.

Tech and Workshop | 20 Nov 2007 09:45 am

The song name, artist, and album tags in many music files (whether they are acquired legally or otherwise) from Chinese and Korean sources are completely garbled in iTunes on a Macintosh. I assume this is because iTunes assumes that the text is in one encoding (Unicode or MacRoman?) when it was in fact encoded in another (often EUC-KR for Korean, Big5 for Taiwanese files, GB for files with simplified Chinese characters). I used to frequently get this problem with Japanese music files but for some reason (perhaps because Unicode is more popular in Japan?) this has gradually become less of a problem.

Fixing these tags can be a pain and some of the older tools such as the once awesome “MP3 Rage” and “ID3 Editor” often make things worse due to their inconsistent handling of 2-byte non-Roman languages.

An Apple Support page, however, recently pointed me to a great shareware application ($12) called ID3Mod2 which looks like it is made by the same people that made the incredible Chinese input method QIM that I talked about in an earlier posting (I don’t know this developer personally so it is not as if I’m trying to find good things to say about their work). You can freely use the software for a number of days, during which I was able to go through and fix all of the garbled tags in music files I have collected in China, Korea, and Japan over the last decade. Amazing - I might now actually learn the names of some of the songs I have been listening to for so long and someday even gather the courage to request them on a future karaoke adventure.

China and Language and Tech | 12 Aug 2007 12:31 am

Apple’s Macintosh operating system and the Chinese language have a long history. Many years ago, when I was an undergraduate college student, well before the advent of Mac OS X and the rise of Unicode, I was already happily inputting Chinese on my Mac and delighted in amazing my friends with the Apple Chinese voice recognition software I had gotten soon after its release in 1996. Meanwhile, PC users I knew across campus and the world were drowning in the technical challenges of mysterious programs such as Twinbridge and its earlier and more obscure competitors. I know from my own experience as a former tech support geek at Columbia University that the legacies of these issues continue to haunt Chinese language departments around the US.

With Windows XP, however, Microsoft finally started getting their act together and created a typically clunky but still relatively easy method (with about a dozen clicks + the use of your OS CD) for adding Chinese input to a non-Chinese OS. Since then I have felt that the Mac Chinese input options lagged behind, especially in the convenience of inputting traditional characters (繁體/繁体). The “Hanin” input method was something of an improvement, but with tens of millions of customers in China using pirated copies of Windows XP and only a handful using the more expensive Macintosh solutions for their computing, it is not surprising that Apple has lost its innovation edge in the area of Chinese input.

Well, I have apparently been somewhat out of the loop since mid-2006. Today I took a few minutes to skim through a year or two of the postings on the Google Group “Chinese Mac.” Thanks to this I was able to learn about a fantastic new piece of software for the Mac:

QIM Input Method ($20)

You can read a bit more about the software on the internet’s premier resource for (English language) information about inputting Chinese on the Mac.

I would recommend that anyone who inputs Chinese frequently on the Mac try out QIM, which is fantastic. I dished out the $20 within 10 minutes of confirming that the software works in all the basic work applications I frequently use Chinese in (OmniOutliner, Microsoft Word, Wenlin, Apple Mail, iFlash). QIM produces characters in real time as you type, has amazing shortcut options, and optionally defaults all output to traditional characters.

Tech | 17 May 2007 08:11 pm

[Image: DSCF0771.JPG]

During the past year or so, and especially in the last few weeks, I have come to owe a great deal of thanks to a machine I call a PDF scanner, since I don’t know what it is normally called.

The scanner looks like a photocopy machine with a computer screen attached to it. Like a regular photocopy machine you can use the glass or the feeder on top to copy documents and books at the same speed as you might expect from such a machine. However, instead of charging you money and outputting these copies on regular paper, the result of the free scan is displayed as thumbnails on the screen to the right. When you are happy with the resulting scans, you may save them together as a PDF (or as separate image files) and have the file sent to a USB drive or to a server of your choice via FTP.

The machine can be set to a number of resolutions (200 dpi and up) and scans in black and white, grayscale, or color. You may also indicate the paper size of the scanned image. If you are using the feeder tray, you may scan either single or double-sided documents. The model I have used on campus does not have shrink or enlargement features available and lacks some of the other advanced features we are used to dealing with on a regular photocopy machine. However, if you are scanning English language documents, there is one wonderful extra feature: Putting a check next to “Hidden Text Layer” will direct the machine to OCR the scanned pages of text and make the PDF documents searchable. The accuracy is far from perfect, but more than good enough to make those usually dead images great for keyword searching.

This machine, and in one case a different variation of it, can now be found at several library locations throughout the Harvard campus. Competition for its use is heavy in some libraries, especially those where visiting researchers are desperate to copy materials before they return home and want to avoid the costs of large amounts of photocopying and the weight of carrying these copies back.

The advantages of this machine are huge:

1. Use of the machine is completely free (at least on our campus). This has probably saved me hundreds of dollars in the past year and a half or so.
2. Except when scanning poor quality documents or large amounts of double-sided documents using the feeder tray, there are far fewer jams and other problems than arise when using a photocopy machine.
3. There is no wasting of paper or ink. No paper also means no lugging around heavy photocopies.
4. The scans are very fast: the machine surpasses the speed of all but the most expensive personal scanners and is much faster than most document feeding trays I have seen.
5. The scanner’s glass is much larger than all but the most expensive personal scanners and can thus easily handle very large books.
6. The OCR text recognition provides no opportunity for correcting mistakes but is transparently built into the scanning process. You never actually see it happen. It adds only a short time to the final saving of the document as it is transferred to the USB drive. This dramatically reduces the time the OCR process would take if you were to do it after scanning documents on a personal scanner with something like OmniPage Pro or using Adobe Acrobat Professional or other tools.
7. Easy OCR means searchable PDFs which means faster research through your own scanned materials.

Potential general complaints from the perspective of librarians and researchers:

1. The product is a scan - which you view on a screen. This is less fun to read than on paper and less convenient to annotate and scribble on.
2. Free and fast copying means that violating copyrights in the library is now free and fast too. Since the products are PDF files, rather than a single hard copy, it is easier than ever to distribute these PDFs in ways that violate copyrights.

What have I found this useful for?

1. I digitized the entire Sino-Japanese studies journal, which is now hosted online. I have been wanting to do this project with Josh Fogel for a long time and only with the introduction of these PDF scanners around campus has it become something manageable with a limited budget of time.

2. I have boxes upon boxes of photocopies that I have made throughout the years. Dragging them around is a pain. The PDF scanner has allowed me to eliminate several boxes of paper (I simply haven’t had the time to go through them all, and I want to keep some highlighted materials and materials that don’t scan well). These documents are now all on my computer, and backed up on other media.

3. I often take handouts from presentations, various mail and personal documents, and scan them up quickly using the document feeder.

4. Any books I might need to have as reference in the field but which I don’t want to bring with me in my baggage, I simply scan up before I go. It takes me about 30 minutes to scan a 300 page book, or about ten pages per minute. It takes another 2 minutes to save the book if you choose black and white at 200 dpi. This means that many of my favorite history books in my field are not only on my computer, but those in English are easily searchable, thanks to the OCR feature included on the machine. I can then leave the original book in storage while I travel around in East Asia. When you are sitting in an archive or on a train in the middle of nowhere, without any internet connection or access to Google books and other search engines - there is nothing like being able to search through a lot of locally stored data on one’s own machine.
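
For anyone planning a scanning session, the timing in the last item can be turned into a rough back-of-the-envelope estimate. This is only a sketch: the function name and its parameters are made up for illustration, the defaults are simply my own figures from above (about ten pages per minute, plus about two minutes to save at black and white, 200 dpi), and it assumes the save time stays roughly fixed, which may not hold for much longer books or higher resolutions.

```python
def estimate_scan_minutes(pages, pages_per_minute=10.0, save_minutes=2.0):
    """Rough total time, in minutes, to scan and save one book.

    Hypothetical helper: the defaults are the figures quoted in the post,
    not measured specifications of the machine.
    """
    if pages < 0 or pages_per_minute <= 0:
        raise ValueError("pages must be >= 0 and pages_per_minute must be > 0")
    # Scanning time grows with page count; saving is assumed to be a flat cost.
    return pages / pages_per_minute + save_minutes


if __name__ == "__main__":
    # A 300 page book: 30 minutes of scanning plus 2 minutes of saving.
    print(estimate_scan_minutes(300))
```

Color or higher-resolution scans will take longer to save, so the `save_minutes` value would need to be adjusted by hand for those settings.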

Wish List for the Future

1. As more and more people around the Harvard campus discover the power of these machines to reduce paper and produce OCRed PDF files of everything, our personal papers included, I have watched competition for their use explode - especially for the PDF scanner in Harvard-Yenching library. I hope that the librarians come to see that the advantages outweigh the disadvantages and add more machines to the collection. I would also love to see PDF scanners in libraries and especially archives around the world. The National Archives, for example, is perfectly happy to have me click away with my personal camera at thousands and thousands of pages of articles but still charges considerable photocopying fees. If the archives had a PDF scanner (perhaps the alternative kind found in Harvard’s Widener library Philip Reading room which is face-up rather than face-down and thus less damaging to books) they could seriously cut down on machine maintenance fees while providing an incredibly valuable service to researchers.

Obviously the question of copyright needs to be addressed - but the solution is not to cripple the gains from technology advances that improve on existing tools that perform the same essential task: the paper-based photocopier, the slower personal scanner, and the camera, all of which we have had for years.

2. I would love to see these machines support OCR in many more languages.

3. It would be nice for there to be some kind of semi-automated “submission” or “registration” system for scanned materials so that eventually you can reduce the physical burden on the scanned materials in libraries and archives. If certain pages, articles, or archival documents have been scanned before, and are found in the system, then you could simply retrieve this previously scanned document and thereby contribute to the preservation of the original by not subjecting it to further copying.

4. I would like these machines to offer more options than their current software provides, such as enlarge/shrink options, crop features, auto-crop features, more media size options, much better color scans of glossy photographs, etc.

Honorable Mention

Another similar machine that I also owe a lot to recently is the Microfilm PDF scanner. A number of my recent postings at Frog in a Well and contributions to the Frog in a Well Library refer to documents that I found on microfilms. The documents I have been uploading are PDFs directly created by the PDF scanning software on the computers attached to the microfilm reading machines that I use in the Government Documents section in the basement of Harvard’s Lamont library. It works very much like the microfilm printers we have seen in libraries for years but this time the product is a PDF rather than paper copies. Like the regular PDF scanner above, all these scans are free and allow me to easily share my findings with others.

Tech and Workshop 18 Jan 2007 02:16 pm

I got a cheap used Griffin AirClick for USB to control my older laptop Macintosh by remote control. Another remote I like better (KeyPOINT) has been acting up so I got the Griffin as a replacement. The downside with Griffin is that it has fewer buttons, no mouse control, and a limited set of applications that it works with. One of the applications that I want to use the remote with is the best flashcard program on the Macintosh, iFlash. I use this almost every day to practice Korean vocab and other languages. Since this is not one of the supported applications, this afternoon I hacked the AirClick.app program that comes with the remote to add support for iFlash. You may download my modified version of the AirClick application here.

For those who wish to add support for their own program I briefly outline how I did the hack below:
(more…)



Creative Commons License