Germanna History Notes

Adobe Systems set a fierce task for themselves with the PDF format. They wanted to take almost any kind of input and produce both visible copy for humans and machine-readable copy that computers can search. When I produced the Beyond Germanna CD in PDF format, I had some of the issues in print only (those produced on that IBM Selectric typewriter) and some of them in computer files. The only thing that I could do with the old typed copy was to scan it. Scanning produces an image, and this image is what the PDF displays on the monitor. How is the search made? The image is run through optical character recognition (OCR), which produces a set of words. These words are what is searched.

One problem that arises is that the optical character recognition they are using does not recognize umlauts, the three German vowels with two dots over them (ä, ö, ü). They sometimes come through the OCR process as other characters, such as ff for ü. Since it is this OCR output which is searched, a search of the PDF cannot find words containing German umlauts. For example, no searching can be done on Häger. One can search for Haeger or for Hager, two spellings that often occur in the same articles with Häger.
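The workaround of searching on Haeger or Hager can be sketched in a few lines of Python. This is only an illustration, not anything the Adobe software does: the function names `spelling_variants` and `find_name` are made up for this example, and it assumes the only substitutions of interest are the two plain spellings mentioned above (the vowel followed by e, and the bare vowel).

```python
import unicodedata
from itertools import product

# Each umlaut maps to its two common plain-letter spellings.
UMLAUT_FALLBACKS = {
    "ä": ("ae", "a"), "ö": ("oe", "o"), "ü": ("ue", "u"),
    "Ä": ("Ae", "A"), "Ö": ("Oe", "O"), "Ü": ("Ue", "U"),
}

def spelling_variants(name):
    """Return every umlaut-free spelling of name (hypothetical helper)."""
    # Normalize so that "a" plus a combining diaeresis becomes a single "ä".
    name = unicodedata.normalize("NFC", name)
    choices = [UMLAUT_FALLBACKS.get(ch, (ch,)) for ch in name]
    return ["".join(parts) for parts in product(*choices)]

def find_name(name, ocr_text):
    """Return the variants of name that occur in the OCR text."""
    return [v for v in spelling_variants(name) if v in ocr_text]

print(spelling_variants("Häger"))
```

For Häger this yields the two searchable spellings, Haeger and Hager, each of which can then be typed into the PDF search box in turn.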

When you think about it, this process must keep two versions of every scanned page. The visible page is the scanned page that has been touched up a bit. Behind that, and not visible, is the OCR output used for searching. The visible image is touched up by making the characters look more like the standard fonts. If you are using a font that is not standard, then that font can be embedded in the document and used to improve the looks of the scanned image. The scanned image does show the umlauts.
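The two-layer idea can be pictured with a small Python model. It is a sketch for illustration only and says nothing about how a PDF file is actually laid out inside; the class `ScannedPage`, its fields, and the sample page are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class ScannedPage:
    number: int
    image: bytes       # visible layer: the touched-up scan the reader sees
    hidden_text: str   # invisible layer: the OCR output used for searching

def search(pages, term):
    """Return the numbers of pages whose hidden OCR text contains term."""
    return [p.number for p in pages if term in p.hidden_text]

# Hypothetical page: the image bytes are a placeholder and the OCR text is invented.
pages = [ScannedPage(1, b"<scan>", "The Hager family settled at Germanna")]

print(search(pages, "Hager"))
print(search(pages, "Häger"))
```

The search never looks at the image, so a name that is plainly visible with its umlaut on the screen can still fail to turn up if the hidden text spells it differently.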

Some of my older work was done with WordPerfect 5, 6, 9, and 11. I no longer even have the programs for some of these. If I put the data for even WordPerfect 9 into the WordPerfect 11 program, it does not come out the same as it did originally. The net result is that words end up on different pages, which is undesirable. I tried to re-edit the material to reproduce the original, but that was very time-consuming. In the end, I scanned 917 pages of material.

If you have written some material recently, say in Word for Windows, you can convert this automatically to the PDF format. In this case there is only one layer. The visible layer on the screen can also be used for searching. This whole process produces the best result and is easy to do. On my CD, the last three pages (918, 919, and 920) are produced this way. With these three pages there is a direct conversion without scanning. All of the other pages were scanned.

Adobe Systems produces Acrobat Standard 6, the program used to make the PDF files. (It also produces the Photoshop programs.) The experience of my wife and me is that they are masters of not telling the customer anything about how to do something. Their instructions usually leave one hanging. Very frustrating.
(19 Mar 04)

We gratefully acknowledge the work of John Blankenbaker who published over 2,500 Germanna History Notes via the Germanna-L@rootsweb.com email list from 1997 to 2008. We are equally thankful to George Durman (Sgt. George) for hosting the list and republishing the notes via rootsweb.com.

John Blankenbaker's Germanna History Notes

Note 1852