Dimensionality Reduction and Feature Selection Methods for Script Identification on Document Images

Bruce Poon, Rahman Saami, M. Ashraful Amin, Hong Yan

Abstract


This research explores the effects of dimensionality reduction and feature selection on script identification from images of printed documents. The k-adjacent segment feature is well suited to this task because of its ability to capture visual patterns. We used principal component analysis to reduce the feature matrix to a more manageable size that can be trained on easily, and experimented with varying combinations of dimensions of the super feature set. A modular neural network approach was used to classify seven languages: Arabic, Chinese, English, Japanese, Tamil, Thai, and Korean.
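As a rough illustration of the reduction step, the sketch below projects a feature matrix onto its leading principal components using NumPy. It is not the authors' implementation: the function name `pca_reduce`, the matrix shape (200 samples by 50 features), the choice of 10 retained components, and the random data standing in for k-adjacent segment features are all assumptions made for the example.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project the rows of `features` onto the top principal axes.

    Hypothetical helper for illustration only; the paper does not
    specify how its PCA step was implemented.
    """
    # Center each feature column so the axes describe variance, not offset.
    mean = features.mean(axis=0)
    centered = features - mean

    # SVD of the centered data: rows of `vt` are the principal axes,
    # ordered from largest to smallest singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]

    # Fraction of total variance captured by each retained axis.
    variance_ratio = singular_values**2 / np.sum(singular_values**2)

    reduced = centered @ components.T
    return reduced, components, variance_ratio[:n_components]


# Random stand-in for a feature matrix (assumed shape: 200 samples x 50 dims).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 50))

reduced, components, variance_ratio = pca_reduce(features, n_components=10)
print(reduced.shape)          # rows are samples, columns are retained components
print(variance_ratio.sum())   # share of variance kept by the reduced matrix
```

The reduced matrix can then be passed to a classifier in place of the full feature set. Keeping different subsets of its columns is one simple way to experiment with varying combinations of dimensions.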

Keywords


Feature Reduction; Feature Selection; Neural Networks; Principal Component Analysis; Script Identification



Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.

IT in Industry (2012 - ) http://www.it-in-industry.com ISSN (Online): 2203-1731; ISSN (Print): 2204-0595