I finally declared war on my PDF organizing system. I am struggling to manage some 2,400 PDF files on my computer. This huge pile consists of downloaded or scanned journal articles, newspaper articles, historical documents, PhD dissertations, books, and various personal documents that I got sick of dragging around the world in fat, disorganized folders.
This week I tried to find a dissertation, part of which I had read, that discussed the relationship between liberalism in Japanese domestic politics under the premiership of Hara Kei and the aftermath of the March 1st movement in colonial Korea.
Did I file this PDF away in the Academic Papers/Korea/ folder, the Academic Papers/Japan/ folder, the Documents/To Read/ folder, the Docs/Dissertation/MUST READ/ folder, or was it still stranded in the downloads folder? Apparently none of these and I still haven’t been able to find the damn file. My folder system is an embarrassing mess.
However, in the 21st century, where tagging rules, why should my folder system matter? Why can’t I just tag the stupid files and be done with it? Each PDF file could have a dozen tags that I could easily search through later. The file I described above, for example, could be tagged as “Academic Papers, Korea, Japan, colonial period, Taisho, liberalism, political, Hara Kei, March 1st Movement, dissertations”. Well, in order to do this I broke down and paid for the PDF indexing software Yep (Mac only; I’m sure there is something similar for those unfortunate Windows users out there).
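To make the idea of tagging concrete, here is a minimal Python sketch of a tag index: each tag points to a set of files, and a search for several tags returns only the files that carry all of them. This is purely an illustration and not how Yep works internally; the class name `TagIndex` and the file name in the usage note are made up.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory tag index: tag -> set of file paths."""

    def __init__(self):
        self._files_by_tag = defaultdict(set)

    @staticmethod
    def _normalize(tag):
        # Treat "Korea", "korea" and " korea " as the same tag.
        return tag.strip().lower()

    def tag(self, path, *tags):
        """Attach any number of tags to one file path."""
        for t in tags:
            self._files_by_tag[self._normalize(t)].add(path)

    def find(self, *tags):
        """Return the files that carry ALL of the given tags."""
        if not tags:
            return set()
        sets = [self._files_by_tag.get(self._normalize(t), set()) for t in tags]
        return set.intersection(*sets)
```

With this, something like `index.tag("dissertation.pdf", "Korea", "Japan", "Hara Kei", "dissertations")` followed by `index.find("korea", "hara kei")` would turn up the file no matter which folder it is sitting in.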
I feel better now and have already made serious progress. With tag clouds, smart folders, and an iTunes-like interface in Yep, I’m hoping I will gradually overcome my jumbled mess of PDF files and therefore be able to write my own dissertation in no time. Or not, but at least I feel less like I’m hunting for a document in a bombed-out archive. I hope that future versions of the software will allow the option of including not only .pdfs but also images, since scanning documents and saving them as PDFs is much slower than taking pictures of them. I have thousands of crisp, high-contrast black and white photos of historical documents and newspaper articles that I would love to be able to tag in the same way without having to do it in a separate piece of software.
There are half a dozen other programs out there, like Yojimbo and the DEVON products, which also let you store PDF files in a database alongside other kinds of data such as regular text and images. However, what I don’t like about them is that 1) they are often slow to import the PDFs and 2) they usually import the PDFs into the program’s own database, swelling the size of the DB and slowing down its overall performance. I am not really interested in having over 30,000 pages of PDFs all inside the DB of a program and would rather keep them scattered in various places on my hard drive, only to be indexed by a program like Yep.
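The difference between importing files and merely indexing them can be sketched in a few lines of Python. The function below walks some folders and records where each PDF lives, along with its size and modification time, without copying anything. The function name `index_pdfs` and the folder names in the usage note are assumptions for illustration; this is not Yep’s actual mechanism.

```python
from pathlib import Path


def index_pdfs(roots):
    """Record the location of every PDF under the given folders.

    Nothing is copied or moved: the index only stores paths plus a
    little file metadata, so it stays small however large the PDFs are.
    """
    index = {}
    for root in roots:
        for path in Path(root).expanduser().rglob("*"):
            # Match .pdf and .PDF alike, and skip directories.
            if path.is_file() and path.suffix.lower() == ".pdf":
                info = path.stat()
                index[str(path.resolve())] = {
                    "size": info.st_size,
                    "modified": info.st_mtime,
                }
    return index
```

Calling something like `index_pdfs(["~/Documents", "~/Downloads"])` would return one entry per PDF found, while the files themselves stay wherever they already are.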
Why don’t you use Acrobat to OCR the PDFs and index them with Google Desktop (or something similar for the Mac) as well? The OCR-ed text is saved “behind” the document: you still view the scanned image, but you can search the text. Of course, it’s not 100% accurate, but it is usually enough to make a document findable with search.
That is certainly an option. I’m not a big fan of Google Desktop, and Spotlight works well enough. The tag clouds in Yep work great for finding things quickly by tags.
OCRing the texts works well enough with Acrobat. I did that with all the journal articles I posted on http://ChinaJapan.org/ so that Google could search the contents of the PDFs. As you say, the accuracy was hit and miss but still enough to make searching the documents generally useful.
Doing that for all the files is a bit time-consuming. Acrobat OCR, even on the fastest Windows machines in the statistics lab at Harvard, was slow enough that I did not do it with every downloaded file. Newer online journals usually have an OCR layer already, or even selectable text.
Windows users do indeed have an equivalent (and more) solution: it is called 42Tags.
The original files are kept under the ‘42Tags’ folder, which makes them very easy to back up. You can always access the files with the regular Explorer as well.
One advantage over the ‘Yep’ solution is the introduction of the ‘package’ notion: you can add a few files (pdf, tif, doc, other), tag them as a package, and later you will find them together.
(e.g. ‘My CV’ – cv.doc, recommendation1.jpg, etc..)
You can see a video demonstration at our site http://www.42tags.com/video.htm
PS – I have a question, Munnin, regarding Yep: it is probably keeping links to your files, but what if you move files and folders around? How will Yep treat that?
Thx for passing on that info.
I’m not sure how Yep stores the info exactly, but I just switched to “Leap”, which is made by the same company. Unlike Yep, Leap does the same thing for all files rather than just PDFs, and they released their final 1.0 version this week.
Leap has something similar to your “packages” in the form of two features: “File groups” and bookmarks, which are smart groups that save a particular combination of tags and other characteristics. This is looking great so far but I continue to send them comments.
If Leap is keeping metadata about files in its DB, they are perhaps using Mac aliases, since I can easily move files anywhere and the metadata remains unaffected. However, there is one problem: in the current release, renaming files outside the application seems to at least sometimes cause the loss of tags. I have contacted them about this problem.
In Leap there is also the option of saving all the tags in the “Spotlight Comments” of each file, a special field attached to every file in the OS, which makes it possible to use the OS-wide Spotlight search (like Google Desktop for Windows) to find these tagged files. These comments are preserved after both moving and renaming files.
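On the question of tags getting lost when files are moved or renamed, one general way around it is to attach tags to a fingerprint of the file’s contents rather than to its path. The Python sketch below does that with a SHA-256 hash. To be clear, this is a different technique from aliases or Spotlight Comments, and I have no idea whether Leap or 42Tags do anything like it; the names `content_key`, `tag_file` and `tags_for` are made up for the example.

```python
import hashlib


def content_key(path, chunk_size=1 << 20):
    """Return a SHA-256 fingerprint of the file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large PDFs are never loaded whole into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tag_file(store, path, tags):
    """Attach tags to the file's contents, not to its name or location."""
    store.setdefault(content_key(path), set()).update(tags)


def tags_for(store, path):
    """Look up tags for a file wherever it now lives and whatever it is called."""
    return store.get(content_key(path), set())
```

Here `store` is just a plain dict. The trade-off is that editing a file changes its fingerprint, so its tags would no longer be found, and two byte-for-byte identical copies of a file would share the same tags.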