Improve OCR Accuracy
This article describes the factors that affect OCR accuracy.
OCR depends on a good, clear document. If the letters are too bold and blur together, the OCR engine will have a hard time telling them apart. Conversely, if the letters are too faint and have "open" sections (gaps in the strokes), the engine will struggle as well. This is quite common with faxed documents.
Another common problem is extra speckles or "noise" on the scan, which can confuse the engine. Skewed text can also throw it off, since the OCR engine expects text that is relatively horizontal. You will also want to avoid decorative fonts, since these can be hard to recognize.
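To make the idea of removing speckles concrete, here is a minimal Python sketch. It is not how FileCenter or any scanner cleans up a scan. The function name `despeckle` and the image representation (a list of rows, where 1 is a black pixel and 0 is a white pixel) are assumptions made for illustration. The function simply erases any black pixel that has no black neighbors.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated black pixels removed.

    `image` is a list of rows; each pixel is 1 (black) or 0 (white).
    This representation is an assumption made for the example.
    A black pixel counts as a speckle when none of its eight
    neighbors is black.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_black_neighbor = False
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the pixel itself
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1:
                        has_black_neighbor = True
            if not has_black_neighbor:
                cleaned[y][x] = 0  # isolated dot: treat it as noise
    return cleaned
```

Real cleanup tools are more careful than this, since a rule this simple could also erase very small punctuation on a low-resolution scan.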
The best image for OCR is black and white, scanned at 200-300 dpi. Ideally, it uses standard typefaces like Times or Arial, is free of background noise, and contains as few images as possible.
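Resolution is just pixels divided by inches, so it is easy to check whether a scan falls in the 200-300 dpi range. The Python sketch below assumes you know the image width in pixels and the paper width in inches. The function names are hypothetical and are not part of FileCenter or FileConvert.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Return the scan resolution in dots per inch along the page width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def in_recommended_range(dpi, low=200, high=300):
    """Return True when dpi falls within the 200-300 dpi range given above."""
    return low <= dpi <= high
```

For example, a letter-size page (8.5 inches wide) that comes out 2550 pixels across works out to 300 dpi, which is in range. The same page at 1275 pixels across is only 150 dpi, which is below it.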
Some scanning problems can be cleaned up automatically. For example, many scanners will automatically deskew scans, especially sheet-fed scanners where sheets are sometimes pulled through crooked. Some scanners also have automatic exposure options, which can reduce background noise and make sure that text has the right "weight".
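The "weight" of text comes down to where the line is drawn between black and white when a grayscale scan is converted. As a rough illustration (not the method any particular scanner uses), the Python sketch below assumes grayscale values from 0 (black) to 255 (white) and uses hypothetical function names. Raising the threshold turns more of the soft edge pixels black, so strokes get bolder; lowering it makes them thinner.

```python
def binarize(gray, threshold=128):
    """Convert a grayscale image to black and white.

    `gray` is a list of rows of values from 0 (black) to 255 (white);
    this layout is an assumption made for the example. Pixels darker
    than `threshold` become 1 (black); all others become 0 (white).
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def black_pixel_count(binary):
    """Count black pixels, a crude stand-in for how heavy the text looks."""
    return sum(sum(row) for row in binary)


# One pen stroke seen in cross-section: dark in the middle, soft at the edges.
stroke = [[250, 180, 90, 20, 90, 180, 250]]

thin = black_pixel_count(binarize(stroke, threshold=64))    # only the dark core
normal = black_pixel_count(binarize(stroke, threshold=128)) # core plus inner edge
bold = black_pixel_count(binarize(stroke, threshold=200))   # nearly the whole edge
```

With a threshold that is too high, neighboring letters start to touch; with one that is too low, strokes break apart into the "open" sections described earlier.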
Another factor is the OCR engine itself. FileCenter Professional and FileConvert both include the "Advanced" OCR engine, which has excellent accuracy. To select it:
- Go to FileCenter's Scan dialog, select the OCR tab, and change the OCR Engine to Advanced; then click the Save button right above to make this the default
- Go to FileCenter's Settings, select OCR on the left, and change the OCR Engine to Advanced
- In FileConvert (if you use it), edit your job, select the Advanced tab, and change the OCR Engine to Advanced