Document Zone Content Classification and Its Performance Evaluation

Yalin Wang, Ihsin T. Phillips, and Robert M. Haralick


This paper describes an algorithm for determining the content type of a given zone within a document image. We take a statistical approach and represent each zone with a 25-dimensional feature vector. An optimized decision tree classifier assigns each zone to one of nine zone content classes. We also propose a performance evaluation protocol. The training and testing datasets comprise a total of 24,177 zones from the University of Washington English Document Image Database III. The algorithm achieves an accuracy of 98.45% with a mean false alarm rate of 0.50%.
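
The two reported figures can be illustrated with a minimal sketch that computes accuracy and a mean false alarm rate from true and predicted zone labels. The abstract does not define the false alarm rate, so the definition used here is an assumption: for each class, the number of zones wrongly assigned to that class divided by the number of zones that do not belong to it, averaged over the classes. The function name `evaluate` and the integer label encoding are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def evaluate(
    true_labels: Sequence[int],
    predicted_labels: Sequence[int],
    num_classes: int,
) -> Tuple[float, float]:
    """Return (accuracy, mean false alarm rate) for a multi-class result.

    Assumed definition (not taken from the paper): the false alarm rate of
    class c is the count of zones wrongly labeled c divided by the count of
    zones whose true class is not c.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    total = len(true_labels)
    if total == 0:
        raise ValueError("at least one zone is required")

    # confusion[t][p] counts zones of true class t that were predicted as p.
    confusion: List[List[int]] = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        confusion[t][p] += 1

    # Accuracy is the fraction of zones on the diagonal.
    correct = sum(confusion[c][c] for c in range(num_classes))
    accuracy = correct / total

    rates: List[float] = []
    for c in range(num_classes):
        negatives = total - sum(confusion[c])  # zones not truly of class c
        false_alarms = sum(
            confusion[t][c] for t in range(num_classes) if t != c
        )
        if negatives > 0:  # skip classes with no negatives to avoid 0/0
            rates.append(false_alarms / negatives)

    mean_false_alarm = sum(rates) / len(rates) if rates else 0.0
    return accuracy, mean_false_alarm
```

For the nine zone content classes described above, `num_classes` would be 9 and each label an integer from 0 to 8.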
