When Archive Digitization Goes Wrong

Last week I paid a visit to a wonderful archive in a medium-sized city in Shandong province, China. There I looked up various documents from the 1940s for my dissertation research, documents that are a bit more local in scope than those I have been looking at in the Shandong Provincial Archives here in Jinan.

The archivists were incredibly friendly, and warned me in advance that they didn’t think they would have much from the period I was looking at. After I provided the letters of introduction that are required at most archives in China, and with the way paved for me by a phone call from a contact I made in Jinan, I was allowed to search for documents using their digital database. They even gave me a free lunch from their cafeteria on the first day and a free copy of a book they had published, containing documents from the wartime period, that I was interested in getting.

Unlike the provincial archives, this archive found their collection manageable enough to scan all the files, store digital copies, and make these available for viewing by visitors in place of the originals. Unfortunately, I was not given the option of looking at the originals instead. Also unlike the provincial archives, the online search of their database seems to return results from a much larger proportion of the materials that are found by running the same search on their internal database.1 They did not allow me to save any of the digital TIF image collections of individual documents onto a USB drive2 but I was allowed to print documents and, after their contents were checked over by the archivist3, to make off with these environmentally less friendly non-digital printouts.

Unfortunately, almost everything that could have gone wrong with this digitization program and its presentation to the visitor did. So let me list the issues as a warning to other archives, especially smaller ones, that might consider going the digital route. I have listed them from the least worrisome to the most serious:

1) Environment: The computer designated for viewing documents had a cheap monitor with little screen brightness (even when set to full) which faced a window where sunlight beamed into the room (even after I convinced them to partially lower the shades), providing a horrible viewing experience and harm to the eyes. An uncomfortable mini-mouse, a horrible chair, and a table with almost no spare room for visitors to put a notebook or their laptop made spending any length of time looking at documents a nightmare.

2) Software: The custom-built database software had an advanced query system, which is useful for advanced users and archivists but requires multiple stages to search. Although I quickly got used to it, I think it would confuse users not used to such systems. Also, when it shows images of archive files, a lot of vertical screen space is wasted on software options and interface components, which leads to a great deal of scrolling at any zoom level that makes reading possible.

3) Page Numbers: At the archive in question I requested a lot of documents that were essentially local versions of other documents that I had seen before from other districts. Having seen many originals of this kind, I know most of them are small, roughly A5-sized sheets of very thin paper that are held together with string. Despite the age of these documents, I have surprisingly never run into paging issues at the provincial archives, mostly because I’m seeing them still strung together. By contrast, pages were all over the place in these documents in their digital form. While it is possible they were already unstrung and out of order when the contractors got the documents, I suspect that they got mixed up through negligence when the originals were unstrung in order to be scanned.

4) Indexing: This is a very serious problem I found with all but two of the 70 or so documents I looked up during the two days I was at the archive. Before coming to the archive, I used the online database to make a list of file names and file numbers for documents I was interested in. I brought these to the archive and looked up the same numbers in the internal database. Each file number, unfortunately, corresponds to a packet of multiple files, ranging, at least judging by what I saw, from 15 to 50 or so in number. I could then easily locate the appropriate document by its file name and open the images directly in the system. To my horror, in all but two of the cases, the documents in the file images did not correspond to the file name. For each document I would have to hunt through the other dozen or several dozen documents in the same general area to find the images for the file I was looking for. Sometimes I was never able to locate the file, suggesting that those images are probably found in other file groups, if at all. Now, what am I supposed to do as a historian when I cite the documents I did find? I’ll record the correct file numbers, found in the database, but any other historian wishing to confirm the information I am citing will look them up and find a completely different document unless the archivists have gone in and fixed all the indexing issues throughout their scanned collection.

I asked two of the archivists about this issue and I essentially got a, “That is funny. Well, just hunt through the rest of them and find your document. It’s probably like that for this whole collection. We paid a contractor to have it done and didn’t have the resources to check all their work.”

5) Quality: The documents I’m looking at are Communist public security bureau reports and Communist party internal reports. Some of them are handwritten, and others consist of characters carved onto a special surface that allows a sort of reproduction process frequently used in the 1940s (any printing history buffs know what this ancient photocopying method is called?). In either case, they are very difficult to read: faded with time, on surfaces that are themselves often in poor condition, and, most importantly, written in tiny sizes. If you are going to digitize these kinds of documents, then, you need to digitize them at a much higher quality. As I mentioned in my posting on triage in the archives, I have sometimes had to completely skip some of the more hopelessly unreadable documents, or those for which the pages per hour drops to a rate that makes the investment of time not worth it. I would say that this happens with perhaps one in ten of the documents I look at here.

Now, take these same kinds of documents and scan them. If you scan them well, at high resolution and with color, then you can actually make those difficult to read but important sections more readable thanks to the power of zooming in on parts of the image. However, that is not what happened here.

The contractors here decided to take these extremely difficult-to-read originals and scan them in black and white (not even in greyscale!). Now I know the evidence seems to suggest that if you are going to run a massive-scale OCR program on historical newspapers, for example, then black and white is not significantly worse than greyscale. However, OCR is not even worth trying on these hard documents, unless there are some major breakthroughs in artificial intelligence. If, however, you are trying to use human eyes to read difficult-to-read handwritten or carved Chinese characters on poorly preserved media, you need to preserve as much of the quality of the originals as possible. The cost-benefit analysis done in this case resulted, for many documents, in completely unreadable digital copies.
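
To make the point about black and white versus greyscale a little more concrete, here is a minimal sketch in Python. The pixel values are entirely invented for illustration (they are not measurements from any real scan), and the function names `to_bitonal` and `stretch_contrast` are my own hypothetical helpers, not part of any scanning software. The idea is only to show that once a faded stroke has been reduced to pure black or white, no later adjustment can bring it back, whereas a greyscale copy still lets the reader separate ink from paper and stains.

```python
# Invented 8-bit greyscale values (0 = black, 255 = white) along one row of a page.
PAPER = 200       # aged, slightly darkened paper
FADED_INK = 155   # a faded stroke, only a little darker than the paper
STAIN = 140       # a stain on the page, darker than the faded ink

row = [PAPER, PAPER, FADED_INK, FADED_INK, PAPER, STAIN, STAIN, STAIN, PAPER]


def to_bitonal(pixels, threshold=128):
    """Reduce greyscale pixels to pure black (0) or pure white (255)."""
    return [0 if p < threshold else 255 for p in pixels]


def stretch_contrast(pixels):
    """Linearly stretch values so the darkest pixel becomes 0 and the lightest 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


# With a mid-range threshold, the faded stroke and the stain both turn white:
# the whole row becomes blank paper.
print(to_bitonal(row))

# With a higher threshold, stroke and stain both turn black and can no longer
# be told apart.
print(to_bitonal(row, threshold=160))

# The greyscale row still holds three distinct levels, so stretching the
# contrast (or zooming in) keeps ink, stain, and paper distinguishable.
print(stretch_contrast(row))
```

Whichever threshold the contractor picks for a bitonal scan, one of these two failures is baked into the file for good, which is roughly what I was looking at on that dim monitor.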

This really left me depressed. In the case of the completely botched indexing described in number four above, an archivist or the hired contractor can go back and meticulously re-index the documents so that they point to the correct images. Since some of the documents have visible page numbers, messed up page numbers might also be fixed in those cases. However, I suspect it is harder to go back and explain to the budget committee, “Ya, our contractor blew the scanning job and made thousands of once barely readable documents in our collection now completely unreadable to visitors. Can we pay to do the scanning all over again?”

I came back to Jinan yesterday morning and felt incredibly happy to go back to reading similar documents in my own hands.4 Digitization can do amazing things for improving access and preservation. When the Japanese national library set about digitizing all Meiji and now Taisho period publications, I found myself complaining mostly about the slower speed at which I could browse or skim through the books. I didn’t find that readability itself suffered too much during the process. In the case of these far more difficult-to-read wartime Communist documents, however, which are only gradually opening up to researchers and historians, sloppy digitization actually reduces rather than increases access.

  1. When I asked one of the archivists at the provincial archives why they did not provide full online access to the database, rather than a very small sampler of the full internal database, so that visitors could come prepared with a list of documents to request, I got a bewildered and serious look: “Do you want to put me out of a job?” This answer only makes sense if you realize that one of the primary duties of two of the archivists is to sit at the database search engine and help first-time visitors search for documents. Given the fact that many of the visitors, especially the older ones, are completely computer illiterate, however, I still believe their services would continue to be required to help elderly comrades who come to search for their records.
  2. though, as was the case with the Korean national archive, it would have been simple enough for a less scrupulous person to do this, given the access to the “Save As…” option in the file menu and the apparent lack of any security on the machine I was given access to. In fact, in the case of the Korean national archive at Daejeon, web browser access was restricted but I was able to confirm that, at least as of 2008, the DOS command line still gave me FTP access to my server, where I could have uploaded hundreds of pages of Korean archive documents they were requiring me to wastefully print and pay for, had I been inclined to disregard their rules.
  3. A bizarre and surely unnecessary step, since the documents were already screened for classified information when they were added to the database. I could easily note down in my notes anything I read in the documents before printing them, so not letting me keep the printouts hardly serves to prevent sensitive or privacy-violating information from leaking out. If privacy issues are primary, there should be a system, like the one at the Korean national archive, which charges the visitor to process accessed documents and redact the names of people mentioned. At the Pusan branch of the Korean National Archive I paid about $50 and waited three days to get access to some old police logs. It took that much time because they had to go through and erase the names and provide me copies. However, I’m still grateful I got access at all. Although this is an important issue that deserves consideration, I generally feel that the privacy laws of Korea and Japan are far too strict and that they seriously inhibit historical work on the period from the 19th century through the one I’m working on in the mid-20th century.
  4. Note to super friendly archivists: if you encourage a visiting PhD student to eat while looking at the documents by suddenly (and generously) giving him a handful of juicy baby tomatoes, you might end up with a bit of tomato juice on one of the pages of part two of the 1946 treason elimination report from the Donghai public security bureau of the Jiaodong district.