Digitizing Text, Part 1: OCR

by JL Beeken on 6-09-2007

So, how’s it going with digitizing all that paper? Or have you found everything you need on the Internet?

I have lots of family history on paper, and for two reasons: two cousins who snail-mail it to me. One travels a lot and likes to print everything she finds; the other doesn’t have a computer and has done 9 years of research with nothing but stamps, envelopes and a local library.

Since I can’t stand the thought of all this effort rotting away in boxes (or, god forbid, meeting its death by something catastrophic), I’ve become The Grand Digitizer. Besides the backup function, I like to get some of this information into my database, too. Obviously. So, faced with mounds of shredded trees, I took the plunge …

Digitizing generally begins with a scanner (although I’m wondering lately if some keen digital photography might do just as well on some things. Not my preference, but some people may have mastered this). And unless I’m missing something (and that’s possible), there are only two ways to save scanned text: either as text or as graphics.

Old and fuzzy text can be typed into your computer by hand if there’s not too much of it, or page after page if you really like typing. If not, it will need to be scanned and saved as graphics. Newer and clearer text can be OCR’d.

What is OCR? OCR stands for Optical Character Recognition. At its best, it will recognize text in an image and make it possible to go directly from there to text in a text document. How well this works depends on the quality of your text. If you have a scanner you may have received a mini-version of an OCR program along with it. If not, you can try a free one called SimpleOCR.

My mini-version will not save text formatting, but if it’s in your budget you can find an OCR program that says it will: OmniPage (from Nuance, formerly Scansoft), whose basic version is $150 with no trial and whose Pro version is $500 with a free trial, or Abbyy FineReader for $400, among others. If you have Microsoft Office this article may interest you.

This example may not be exactly your situation, but it shows the basics. In this case I’ve opened an image directly from my hard-drive. This is a screen-shot taken from an online newspaper.

[Screenshot: Abbyy FineReader]

I click the Read button and get this result. The blue highlights represent letters it’s not sure about.

[Screenshot: Abbyy FineReader]

Or I can scan an image in directly by clicking the Scan&Read button. Because this is old newsprint, the program does not recognize all the text clearly and I will have to fix that myself. The clearer the text the better the result. It’s important to compare the text on your graphic with the OCR results for any discrepancies. Through experimentation you’ll discover what OCR is good for and what’s not worth it.
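If you don’t mind a little scripting, one way to experiment is to re-type a short sample by hand, OCR the same sample, and let the computer point out where the two disagree. Here is a minimal sketch in Python using the standard difflib module. The function name and the sample sentences are made up for illustration; they are not output from FineReader or any other program.

```python
import difflib


def find_discrepancies(ocr_text, corrected_text):
    """Return (ocr_words, corrected_words) pairs where the two texts differ."""
    ocr_words = ocr_text.split()
    good_words = corrected_text.split()
    # Compare word by word; autojunk=False keeps long texts from being
    # treated as if their common words were noise.
    matcher = difflib.SequenceMatcher(None, ocr_words, good_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2]))
            )
    return differences


if __name__ == "__main__":
    # Hypothetical sample: typical confusions such as "rn" read in place of "m".
    ocr = "Mr. Jno. Srnith died at his horne on Tuesday"
    fixed = "Mr. Jno. Smith died at his home on Tuesday"
    for wrong, right in find_discrepancies(ocr, fixed):
        print(f"OCR read {wrong!r}, should be {right!r}")
```

On the made-up sample it flags the two misread words. If a sample from your own pile comes back with a long list, that is a fair hint the rest of the pile is better saved as graphics or typed by hand.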

There are two things that can be done now. Either click the Save button and save it as a .txt or .rtf document, or drag and drop the text to somewhere else. If I’m working with a multi-page document I prefer drag and drop.

Drag and drop is done like this:

Open a new text document (that can be done by right-clicking on your desktop and choosing from the “New” menu) and minimize it to the taskbar. Then drag your cursor across the text above to highlight it and drag it down to the taskbar, holding it over the minimized text program until the window pops open and then drop it in. When I OCR the next page I can drop that in under the last one, and keep on going until I’m done. This trick can be used between most programs when you want to move content from one place to another.
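If you would rather save each page as its own .txt file and join them afterwards, the same “drop each page in under the last one” idea can be scripted. This is only a sketch, assuming the pages were saved as plain text in UTF-8 (an older OCR program may well use a different encoding); the function name and the file names in the note below are hypothetical.

```python
from pathlib import Path


def combine_pages(page_files, output_file):
    """Join the text of each page file, in the order given, into one document."""
    # Assumption: every page is a plain-text file encoded as UTF-8.
    pages = [Path(name).read_text(encoding="utf-8").strip() for name in page_files]
    # Put one blank line between pages, like dropping each page under the last.
    combined = "\n\n".join(pages) + "\n"
    Path(output_file).write_text(combined, encoding="utf-8")
    return combined
```

Calling combine_pages(["page1.txt", "page2.txt"], "letter.txt") would write both pages, in that order, into letter.txt.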

If you have something like this (below), re-typing it by hand would be a nightmare extraordinaire, so I would recommend scanning it and saving each page as a graphics file. This is not a chart from anyone’s database. It was put together on a typewriter decades before “computer” was even a word. An OCR program will probably recognize most of the letters without any trouble, but unless you have a fancier version than I do, you will lose all the formatting in the process. Reformatting page after page of something like this would not be worth the trouble.


The downside of a graphics file is that you can’t search the contents, and you can’t copy and paste a section of it. Nevertheless, I just don’t know of a better solution. The upside of saving it as a graphics file is that if there’s also a photograph amidst your text, the whole page is still just one graphics file. If you OCR text and want to include any photos on the page, you either have to scan the photo separately and put it there, or hope a paid program has the capacity to maintain your original layout completely.

If you have text of this quality (below), an OCR program will only recognize some of the letters, and you will end up re-typing most of it anyway. So, in this case, either scan and save it as a graphics file, or type it, or both. If you already have it as a graphics file, you can use Transcript to type it out.

[Screenshot: Transcript]

OCR has been improving in its recognition abilities by leaps and bounds (they say), so a newer version may do considerably better than my 2001 mini-program. I tried a newer one a couple of years ago and was still not impressed. Maybe it’s time to try again.

When it comes to scanning and saving multiple graphics as single documents there are better and worse ways of going about it. Stay tuned for Digitizing Text, Part 2.

1 comment

Harry 11-11-2010 at 11:55 PM

It wasn’t until recently that I found it is possible to do character recognition online, for example with the site Free OCR. Today this site helped me convert an image file (such as a TIF file) to readable text. The result is not bad.
