What is OCR?
Good-quality Optical Character Recognition (OCR) is an effective tool for extracting text from a digitized image in a machine-readable, editable format. The process, from scanning a paper document to getting the text out of the scanned image, involves the following steps.
1. Use a compatible, good-quality document scanner to scan the paper document into a suitable electronic format, such as JPEG, TIFF, PNG, or PDF.
2. Preprocess the scanned document for skew correction, noise removal, and some mathematical transformations.
3. Perform Line Segmentation from the processed document, i.e., segregate each line from the scanned document.
4. Perform Word Segmentation, i.e., segregate the words of each line from the previous step.
5. Perform Character Separation, i.e., separate the characters of the segregated words from the previous step.
6. Perform Character Recognition, i.e., recognize each of the separated characters by examining its pattern. Once a pattern is recognized, the corresponding text character is obtained. Pattern recognition may use different methods, such as statistical analysis, neural networks, or structural matching.
7. Assemble the results: recognized characters form words, words form lines, and lines form the entire document, which yields its textual version.
In short, this is the main process and objective of Optical Character Recognition (OCR).
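The segmentation steps (3 to 5) can be sketched with projection profiles, a common technique that is swapped in here for illustration and is not necessarily what any particular OCR product uses. The sketch assumes a clean, deskewed, binary image stored as a list of rows, where 1 means ink and 0 means background. The function names and the tiny sample page are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background
Span = Tuple[int, int]   # (start, end) with end exclusive


def find_runs(profile: List[int]) -> List[Span]:
    """Return the index ranges where the profile is non-zero."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # the run ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(profile)))  # run reaches the end of the profile
    return runs


def segment_lines(image: Image) -> List[Span]:
    """Line segmentation: row ranges that contain any ink."""
    row_profile = [sum(row) for row in image]
    return find_runs(row_profile)


def segment_glyphs(image: Image, top: int, bottom: int) -> List[Span]:
    """Character separation: column ranges with ink inside rows [top, bottom)."""
    width = len(image[0])
    col_profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return find_runs(col_profile)


def group_into_words(glyphs: List[Span], word_gap: int) -> List[List[Span]]:
    """Word segmentation: start a new word when the gap is at least word_gap columns."""
    words: List[List[Span]] = []
    for glyph in glyphs:
        if words and glyph[0] - words[-1][-1][1] < word_gap:
            words[-1].append(glyph)
        else:
            words.append([glyph])
    return words


# Hypothetical 6x7 page with two text lines.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

lines = segment_lines(page)
glyphs_per_line = [segment_glyphs(page, top, bottom) for top, bottom in lines]
words_first_line = group_into_words(glyphs_per_line[0], word_gap=2)
```

On real scans this simple approach breaks down when characters touch or lines overlap, which is one reason preprocessing matters so much.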
There are different OCR products, some commercial and some open source, such as the widely used “Tesseract”, which is distributed as part of product offerings from various companies. SARANGSoft offers a version of Tesseract as a free download from its website. It adds OCR functionality to digipaper (http://sarangsoft.com/product/digipaper).
Different Types of OCR
OCR is available for different languages, including English. Considerable research has already been done and is still ongoing. Sometimes a document (such as a form) may contain multiple languages. In that case OCR becomes more complex: an OCR engine for one language, such as English, is not sufficient, and the output must also contain text in different languages / scripts. This is known as Multiscript OCR, as opposed to Single Script OCR in the simpler case of only one language / script.
It is also important to keep in mind that the textual content may be in printed form, or in handwritten form, or both. Very few OCR tools are effective in recognizing handwritten text, and most OCR tools focus on and handle only printed documents.
Tesseract supports recognition of text in multiple languages. However, our experience is mostly with the English language. From what we have noticed, Tesseract doesn’t handle a mixture of languages too well.
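For readers who want to try multi-language recognition, here is a minimal sketch that calls the Tesseract command-line tool from Python. It assumes the `tesseract` executable and the relevant language data files are installed. The helper names and the file name in the usage note are hypothetical. The `-l` option with codes joined by `+`, and `stdout` as the output target, are standard Tesseract usage.

```python
import subprocess
from typing import List, Sequence


def build_tesseract_command(image_path: str, languages: Sequence[str]) -> List[str]:
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract accepts several languages joined with '+', e.g. "eng+fra".
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_ocr(image_path: str, languages: Sequence[str]) -> str:
    """Run Tesseract and return its text output (requires Tesseract to be installed)."""
    command = build_tesseract_command(image_path, languages)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

A call such as `run_ocr("scanned_form.png", ["eng", "fra"])` would then return the recognized text. As noted above, results on mixed-language documents may be weaker than on single-language ones.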
No OCR tool can guarantee 100% accurate output, for several reasons:
- A document might contain clean text, or it can be a bit fuzzy / unclear / noisy, especially if it’s an old document whose contents have faded. Unclear / noisy documents should be run through some preprocessing, such as color and brightness adjustment, noise removal, and filtering, to improve their clarity. The cleaner the document, the more accurate the OCR result.
- Documents in grid format (i.e., in rows and columns) may also cause problems in accurately recognizing the text. At times some of the (printed) text crosses the grid cell boundaries, making it more prone to recognition errors.
- The scanning of the paper document is an important factor in the recognition process. If the document is shrunk or skewed significantly, preprocessing may not always restore it to its actual form. This also often leads to inaccurate OCR results.
- Recognizing handwriting is always hard, especially if it’s cursive, in which case special processing (including neural network techniques) may be needed to recognize the characters from the flowing nature of the writing. Bad handwriting is hard to recognize (both for humans and machines).
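As an illustration of the noise removal mentioned in the first point, below is a minimal sketch of a 3x3 median filter, one common way to remove isolated "salt and pepper" specks. It assumes a grayscale image stored as a list of rows with values from 0 (black) to 255 (white). The function name and the sample image are made up for this example, and real tools typically rely on an image-processing library instead.

```python
from typing import List


def median_filter_3x3(image: List[List[int]]) -> List[List[int]]:
    """Replace each pixel with the median of its 3x3 neighbourhood."""
    height, width = len(image), len(image[0])
    result = []
    for r in range(height):
        new_row = []
        for c in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            neighbours = sorted(
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            )
            new_row.append(neighbours[len(neighbours) // 2])
        result.append(new_row)
    return result


# Hypothetical 5x5 light-grey patch with a single dark speck in the middle.
noisy = [[200] * 5 for _ in range(5)]
noisy[2][2] = 0

cleaned = median_filter_3x3(noisy)  # the isolated speck is replaced by 200
```

The trade-off is that a median filter can also erase very thin strokes, so the filter size has to suit the scan resolution.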
In the business world, OCR can be used for long-term digital storage of important papers, such as invoices, financial statements, reports, legal documents, applications, and permits, especially when the volume of such papers grows large. In some organizations, dedicated staff members manually enter the details from those paper documents into digital formats (e.g., spreadsheets) to collect the data and make it available for processing. Another reason for digitizing paper documents is safekeeping and organization for long-term storage. If the digitized documents are properly “tagged”, an added benefit is the ease and speed of retrieving them when needed.
However, manual data entry and manual tagging are error-prone, time-consuming, and often inconsistent between individuals. Also, any error in manually entered data can cause major problems for the organization. OCR can be a great help in reducing the manual steps in the overall digitization process. The text in the scanned document images can be extracted using OCR and then put into appropriate places, such as database fields, rows and columns of spreadsheets, or “tags” for the documents concerned.
Please note the word “reduce” above: many organizations think that by using OCR they can get rid of manual intervention completely, which is too much to expect. As no OCR is 100% accurate, OCR output needs to be manually verified as appropriate for the case. That means that instead of “entering” every data item manually, there is a need to “check” the correctness of the OCR-generated data. In many cases it’s not safe to blindly trust the OCR output.
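One possible way to organize the manual check is to use the confidence scores that some OCR engines (Tesseract among them) report alongside the recognized text. The sketch below is only an illustration: the `OcrField` type, the 0 to 100 confidence scale, the threshold of 90, and the sample values are all assumptions, not output from a real document.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrField:
    name: str
    text: str
    confidence: float  # assumed scale: 0 (no confidence) to 100 (certain)


def split_for_review(
    fields: List[OcrField], threshold: float = 90.0
) -> Tuple[List[OcrField], List[OcrField]]:
    """Separate fields that look reliable from those a person should check first."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Made-up sample values for illustration only.
fields = [
    OcrField("invoice_number", "INV-10482", 97.5),
    OcrField("total_amount", "1,2S0.00", 61.0),  # 'S' misread in place of '5'
    OcrField("vendor", "Acme Supplies", 93.2),
]

accepted, needs_review = split_for_review(fields)
```

A high confidence score is not proof of correctness, so for critical data the "accepted" fields may still deserve a spot check.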
In the banking sector, OCR may be used to recognize the check number, account number, bank name, routing number, etc. In the education sector, OCR may help with form processing. In any domain, OCR can be used as part of the digitization process to extract text after scanning, and that text can serve as (secondary) tags to identify the documents later.
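Some fields carry a built-in check that can catch OCR misreads automatically. As a sketch, assuming the routing numbers are US ABA routing numbers (nine digits whose weighted sum with weights 3, 7, 1 repeating is a multiple of 10), a recognized value can be validated like this. The function name is hypothetical.

```python
def is_valid_routing_number(candidate: str) -> bool:
    """Check the ABA checksum of a 9-digit routing number read by OCR."""
    # Reject anything that is not exactly nine ASCII digits,
    # e.g. when a digit was misread as a letter.
    if len(candidate) != 9 or not (candidate.isascii() and candidate.isdigit()):
        return False
    weights = [3, 7, 1, 3, 7, 1, 3, 7, 1]
    total = sum(int(ch) * w for ch, w in zip(candidate, weights))
    return total % 10 == 0


ok = is_valid_routing_number("021000021")          # checksum holds
misread_digit = is_valid_routing_number("021000027")  # last digit read wrongly
misread_letter = is_valid_routing_number("02100002l")  # '1' read as letter 'l'
```

A passing checksum does not guarantee that the number was read correctly, but a failing one is a clear signal that the field needs manual review.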
In short, OCR reduces manual data entry, thereby saving time and reducing / avoiding errors to a good extent. At the same time, keep in mind that OCR output should be carefully checked before it is used for any critical purpose.