Document Zone Content Classification and Its Performance Evaluation

Yalin Wang, Ihsin T. Phillips, and Robert M. Haralick


Abstract

This paper describes an algorithm for determining the content type of a given zone within a document image. We take a statistical approach and represent each zone by a 25-dimensional feature vector. An optimized decision tree classifier assigns each zone to one of nine zone content classes. We also propose a performance evaluation protocol. The training and testing datasets together comprise 24,177 zones from the University of Washington English Document Image Database III. The algorithm achieves an accuracy of 98.45% with a mean false alarm rate of 0.50%.
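
To make the two reported figures concrete, the following is a minimal sketch of how an accuracy and a mean false alarm rate could be computed from a confusion matrix. The abstract does not give the paper's exact definitions, so the ones used here are assumptions: accuracy is the fraction of zones assigned to their true class, and the false alarm rate of a class is the fraction of zones not belonging to that class that were nevertheless assigned to it, averaged over classes. The function name and the three-class toy matrix are hypothetical; the paper uses nine classes.

```python
from typing import List, Tuple


def accuracy_and_mean_false_alarm(confusion: List[List[int]]) -> Tuple[float, float]:
    """Compute accuracy and mean per-class false alarm rate.

    confusion[i][j] is the number of zones whose true class is i and
    whose assigned class is j. The definitions are assumptions made for
    illustration, not necessarily those of the paper's protocol.
    The matrix is assumed to be square and to contain at least one zone.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    accuracy = correct / total

    rates = []
    for j in range(n):
        # Zones assigned to class j whose true class is something else.
        false_alarms = sum(confusion[i][j] for i in range(n) if i != j)
        # All zones whose true class is not j.
        negatives = total - sum(confusion[j])
        if negatives > 0:
            rates.append(false_alarms / negatives)

    mean_far = sum(rates) / len(rates) if rates else 0.0
    return accuracy, mean_far


# Hypothetical three-class confusion matrix with 10 zones per true class.
toy = [
    [8, 1, 1],
    [0, 9, 1],
    [1, 0, 9],
]
acc, far = accuracy_and_mean_false_alarm(toy)
print(f"accuracy = {acc:.4f}, mean false alarm rate = {far:.4f}")
```

Averaging the per-class rates gives each class equal weight regardless of how many zones it contains; a zone-weighted average would be an equally plausible reading of "mean false alarm rate".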
