Last week I paid a visit to a wonderful archive in a medium-sized city in Shandong province, China. There I looked up various documents from the 1940s for my dissertation research that are a bit more local in scope than those I have been looking at in the Shandong Provincial Archives here in Jinan.
The archivists were incredibly friendly, and warned me in advance that they didn’t think they would have too much from the period I was looking at. After I provided the letters of introduction that are required at most archives in China, and with the way paved for me by a phone call from a contact I made in Jinan, I was allowed to search for documents using their digital database. They even gave me a free lunch from their cafeteria on the first day and a free copy of a book they had published, containing documents from the wartime period, that I was interested in getting.
Unlike the provincial archives, this archive found their collection manageable enough to scan and store digital copies of all the files and make them available for viewing by visitors in place of the originals. Unfortunately, I was not given the option of looking at the originals instead. Also unlike the provincial archives, the online search of their database seems to return results from a much larger proportion of the materials that can be found by running the same search on their internal database. They did not allow me to save any of the digital TIF image collections of individual documents onto a USB drive, but I was allowed to print documents and, after their contents were checked over by the archivist, to make off with these environmentally less friendly non-digital printouts.
Unfortunately, almost everything that could have been done wrong with this digitization program and its presentation to the visitor was. So let me list the issues as a warning to other archives, especially smaller ones, that might consider going the digital route. I have listed them from the least worrisome to the most serious:
1) Environment: The computer designated for viewing documents had a cheap monitor with little screen brightness (even when set to full) which faced a window where sunlight beamed into the room (even after I convinced them to partially lower the shades), providing a horrible viewing experience and harm to the eyes. An uncomfortable mini-mouse, a horrible chair, and a table with almost no spare room for visitors to put a notebook or their laptop made it a nightmare to spend any length of time looking at documents.
2) Software: The custom-built database software had an advanced query system which is useful for advanced users and archivists but requires multiple stages to search. Although I quickly got used to it, I think it would confuse users not used to such systems. Also, when it shows images of archive files, a lot of vertical screen space is wasted on software options and interface components, which leads to a great deal of scrolling at any zoom level that makes reading possible.
3) Page Numbers: At the archive in question I requested a lot of documents which were essentially local versions of other documents that I had seen before from other districts. Having seen many originals of this kind, I know most of them are on small A5-ish sized sheets of very thin paper that are held together with string. Despite the age of these documents, I have, surprisingly, never run into paging issues at the provincial archives, mostly because I’m seeing them still strung together. By contrast, pages were all over the place in these documents in their digital form. While it is possible they were already unstrung and out of order when the contractors got the documents, I suspect that they got mixed up through negligence when the originals were unstrung in order to be scanned.
4) Indexing: This is a very serious problem I found with all but two of the 70 or so documents I looked up during the two days I was at the archive. Before coming to the archive, I used the online database to make a list of file names and file numbers for documents I was interested in. I brought these to the archive and looked up the same numbers in the internal database. Each file number, unfortunately, corresponds to a packet of multiple files ranging, at least judging by what I saw, from 15 to 50 or so in number. I could then easily locate the appropriate document by its file name and open the images directly in the system. To my horror, in all but two of the cases, the documents in the file images did not correspond to the file name. For each document I would have to hunt through the other dozen or several dozen documents in the same general area to find the images for the file I was looking for. Sometimes I was never able to locate the file, suggesting that those images are probably found in other file groups, if at all. Now, what am I supposed to do as a historian when I cite the documents I did find? I’ll record the correct file numbers, found in the database, but any other historian wishing to confirm the information I am citing will look them up and find a completely different document unless the archivists have gone in and fixed all the indexing issues throughout their scanned collection.
I asked two of the archivists about this issue and I essentially got a, “That is funny. Well, just hunt through the rest of them and find your document. It’s probably like that for this whole collection. We paid a contractor to have it done and didn’t have the resources to check all their work.”
5) Quality: The documents I’m looking at are Communist public security bureau reports and Communist party internal reports. Some of them are handwritten or are characters carved onto a special surface that allows a sort of reproduction process frequently used in the 1940s (any printing history buffs know what this ancient photocopying method is called?). In either case, they are very difficult to read, faded with time, on surfaces that are themselves often in poor condition, and most importantly, written in tiny sizes. If you are going to digitize these kinds of documents, then, you need to digitize them at a much higher quality. As I mentioned in my posting on triage in the archives, I have sometimes had to completely skip some of the more hopelessly unreadable documents or those for which the pages per hour drops to a rate that makes the investment of time not worth it. I would say that this happens with perhaps 1 in 10 of the documents I look at here.
Now, take these same kinds of documents and scan them. If you scan them well, at high resolution and with color, then you can actually make those difficult to read but important sections more readable thanks to the power of zooming in on parts of the image. However, that is not what happened here.
The contractors here decided to take these extremely difficult to read originals and scan them in black and white (not even in greyscale!). Now I know the evidence seems to suggest that if you are going to run a massive scale OCR program on historical newspapers, for example, then black and white is not significantly worse than greyscale. However, OCR is not even worth trying on documents this hard, unless there are some major breakthroughs in artificial intelligence. If, however, you are trying to use human eyes to read difficult to read handwritten or carved Chinese characters on poorly preserved media, you need to preserve as much of the quality of the originals as possible. The cost-benefit analysis done here resulted, for many documents, in completely unreadable digital copies.
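To make the point about black and white versus greyscale concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the pixel values, the names `PAPER`, `FAINT_INK`, `DARK_INK`, and the threshold of 128 are all invented for the example, and I have no idea what settings the contractors actually used.

```python
# Hypothetical 8-bit greyscale samples along one scan line
# (0 = black, 255 = white). These values are invented for
# illustration; they are not measured from any real document.
PAPER, FAINT_INK, DARK_INK = 200, 165, 60
scan_line = [PAPER, PAPER, FAINT_INK, FAINT_INK, PAPER,
             DARK_INK, DARK_INK, PAPER]


def to_bilevel(pixels, threshold=128):
    """Reduce greyscale pixels to pure black (0) or pure white (255).

    The threshold of 128 is an assumed midpoint, not a known scanner setting.
    """
    return [0 if p < threshold else 255 for p in pixels]


bilevel = to_bilevel(scan_line)

print("greyscale:", scan_line)
print("bilevel:  ", bilevel)
```

In the greyscale row the faded stroke is still distinguishable from the paper (165 against 200), so a reader could zoom in or raise the contrast to pick it out. After thresholding, the faded stroke lands on the same side of the cutoff as the paper and becomes pure white, while only the dark stroke survives. Once the file is saved that way, no amount of zooming brings the faint stroke back.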
This really left me depressed. In the case of the completely botched indexing described in number four above, an archivist or the hired contractor can go back and meticulously re-index the documents so that they point to the correct images. Since some of the documents have visible page numbers, messed up page numbers might also be fixed in those cases. However, I suspect it is harder to go back and explain to the budget committee, “Ya, our contractor blew the scanning job and made thousands of once barely readable documents in our collection now completely unreadable to visitors. Can we pay to do the scanning all over again?”
I came back to Jinan yesterday morning and felt incredibly happy to go back to reading similar documents in my own hands. Digitization can do amazing things for improving access and preservation. When the Japanese national library set about digitizing all Meiji and now Taisho period publications, I found myself complaining mostly about the slower speed at which I could browse or skim through the books. I didn’t find that readability itself suffered too much during the process. In the case of these far more difficult to read wartime Communist documents, however, which are only gradually opening up to researchers and historians, sloppy digitization actually reduces rather than increases access.