- Title
- AUTONOMOUS REPAIR OF OPTICAL CHARACTER RECOGNITION DATA THROUGH SIMPLE VOTING AND MULTI-DIMENSIONAL INDEXING TECHNIQUES.
- Creator
-
Sprague, Christopher, Weeks, Arthur, University of Central Florida
- Abstract / Description
-
The three major optical character recognition (OCR) engines in use today (ExperVision, Scansoft OCR, and Abbyy OCR) are all capable of recognizing text with near-perfect accuracy. The remaining errors, however, have proven very difficult to identify within a single engine. Recent research has shown that the errors of the three engines have very little correlation with one another, so using the engines in conjunction may increase the accuracy of the final result. This document discusses the implementation and results of a simple voting system designed to prove this hypothesis and show a statistical improvement in overall accuracy. Additional aspects of implementing an improved OCR scheme, such as aligning the output of multiple engines and recognizing application-specific solutions, are also addressed in this research. Although voting systems are currently in use by many major OCR engine developers, this research focuses on the addition of a collaborative system that can exploit the strengths of multiple engines while also addressing the immediate need for practical industry applications such as litigation and forms processing. Doculex™, a major developer and leader in the document imaging industry, provided the funding for this research.
- Date Issued
- 2005
- Identifier
- CFE0000380, ucf:46337
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000380
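As an illustration of the simple voting idea in the abstract above, the following Python sketch takes a per-character majority vote over three OCR outputs. It assumes the outputs have already been aligned to equal length. The function name `vote`, the sample strings, and the first-engine tie-break are illustrative assumptions, not details taken from the thesis.

```python
from collections import Counter


def vote(aligned_outputs):
    """Majority-vote each character position across aligned OCR outputs."""
    if len({len(s) for s in aligned_outputs}) != 1:
        raise ValueError("outputs must be aligned to the same length")
    result = []
    for chars in zip(*aligned_outputs):
        winner, count = Counter(chars).most_common(1)[0]
        # Assumption: with no strict majority, fall back to the first engine.
        result.append(winner if count > len(chars) // 2 else chars[0])
    return "".join(result)


# Hypothetical outputs from three engines, each with at most one error.
outputs = [
    "The quick brown f0x",
    "The quick brown fox",
    "Tne quick brown fox",
]
print(vote(outputs))
```

Because each sample error occurs in a different engine at a different position, the other two engines outvote it. This is the situation the abstract describes when it says the engines' errors are largely uncorrelated.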
- Title
- OPTICAL CHARACTER RECOGNITION: A STATISTICAL MODEL OF MULTI-ENGINE OPTICAL CHARACTER RECOGNITION SYSTEMS.
- Creator
-
McDonald, Mercedes Terre, M Richie, Samuel, University of Central Florida
- Abstract / Description
-
This thesis is a benchmark of three commercial optical character recognition (OCR) engines. The purpose of the benchmark is to characterize the performance of the OCR engines, with emphasis on the correlation of errors between engines. The benchmark is used to evaluate whether a multi-OCR system employing a voting scheme would increase overall recognition accuracy. This is desirable because current OCR systems are still unable to recognize characters with 100% accuracy. The existing error rates of OCR engines pose a major problem for applications where a single error can affect significant outcomes, such as legal applications. The results obtained from this benchmark are the primary determining factor in the decision to implement a voting scheme. The experiment showed a very high accuracy rate for each of the commercial OCR engines: the average accuracy rate for each engine was near 99.5%, based on a document of fewer than 6,000 words. While these error rates are very low, the goal in legal applications is 100% accuracy. Based on the work in this thesis, it has been determined that a simple voting scheme will help to improve the accuracy rate.
- Date Issued
- 2004
- Identifier
- CFE0000123, ucf:46188
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000123
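To see why low error correlation matters for the voting scheme discussed in the abstract above, consider a back-of-the-envelope calculation. It assumes, as an idealization that the abstract does not claim, that the three engines err fully independently and that each has the same per-item error rate of 0.5% (taken from the "near 99.5%" accuracy figure). A two-of-three vote can only go wrong when at least two engines err on the same item.

```python
def prob_at_least_two_err(p):
    """Probability that at least two of three independent engines err.

    Assumes every engine has the same error rate p on each item.
    """
    exactly_two = 3 * p**2 * (1 - p)  # three ways to pick the erring pair
    all_three = p**3
    return exactly_two + all_three


p = 0.005  # assumed per-engine error rate, from the ~99.5% accuracy figure
print(f"single engine error rate:  {p:.4%}")
print(f"at least two of three err: {prob_at_least_two_err(p):.4%}")
```

Under these assumptions the chance of a majority being wrong is roughly two orders of magnitude smaller than a single engine's error rate. Real engines' errors are not perfectly independent, so this is an optimistic bound rather than a result from the thesis.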
- Title
- TEXT-IMAGE RESTORATION AND TEXT ALIGNMENT FOR MULTI-ENGINE OPTICAL CHARACTER RECOGNITION SYSTEMS.
- Creator
-
Kozlovski, Nikolai, Weeks, Arthur, University of Central Florida
- Abstract / Description
-
Previous research showed that combining the results of three different optical character recognition (OCR) engines (ExperVision® OCR, Scansoft OCR, and Abbyy® OCR) using voting algorithms yields a higher accuracy rate than any of the engines achieves individually. While a voting algorithm has been realized, several aspects of automating the process and improving the accuracy rate needed further research. This thesis focuses on morphological image preprocessing and morphological restoration of text images before they are passed to the OCR engines. The method is similar to the one used in the restoration of partial fingerprints. A series of morphological dilation and erosion filters with various mask shapes and sizes was applied to text of different font sizes and types with various kinds of noise added. These images were then processed by the OCR engines, and based on the results, successful combinations of text, noise, and filters were chosen. The thesis also deals with the problem of text alignment. Each OCR engine has its own way of dealing with noise and corrupted characters; as a result, the output texts of the OCR engines differ in length and number of words. This, in turn, makes it impossible to use spaces as delimiters to separate words for processing by the voting part of the system. Text alignment determines, using various techniques, which words are extra, which single words should be two or more words, which words are missing in one document compared to the other, and so on. The alignment algorithm consists of a series of shifts of the two texts to determine which parts are similar and which are not. Since errors made by OCR engines are due to visual misrecognition, a technique was developed that, in addition to simple character comparison (equal or not), allows characters to be compared based on how they look.
- Date Issued
- 2006
- Identifier
- CFE0001060, ucf:46799
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001060
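The alignment and look-alike comparison problems described in the abstract above can be illustrated with a small Python sketch. It is not the thesis's shift-based algorithm: it substitutes the standard library's `difflib.SequenceMatcher` for the alignment step, and the `LOOKALIKES` groups are hypothetical examples of visually similar characters, since the abstract does not specify the similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical look-alike groups, chosen for illustration only.
LOOKALIKES = [set("0Oo"), set("1lI"), set("5S"), set("8B")]


def looks_alike(a, b):
    """True if two characters are equal or share a look-alike group."""
    return a == b or any(a in group and b in group for group in LOOKALIKES)


def visually_equal(word_a, word_b):
    """True if the words have equal length and each character pair looks alike."""
    return len(word_a) == len(word_b) and all(
        looks_alike(x, y) for x, y in zip(word_a, word_b)
    )


def align_words(text_a, text_b):
    """Return (tag, words_a, words_b) segments from a word-level diff."""
    a, b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Hypothetical outputs: the second engine split one word and misread a letter.
for tag, words_a, words_b in align_words(
    "the quick brown fox", "the qu ick brown f0x"
):
    print(tag, words_a, words_b)
```

The diff groups "quick" against "qu ick", which shows why splitting on spaces alone cannot pair words between engines. A check such as `visually_equal("fox", "f0x")` then lets a voter treat a look-alike substitution as a near match rather than as an unrelated word.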