Monday, October 20, 2008

Structure in On-line Documents

Authors: Anil K. Jain, Anoop M. Namboodiri, and Jayashree Subrahmonia

Comments:

Summary:
This paper presents an approach for separating text from non-text (shape) strokes in on-line documents.
A linear decision boundary over two stroke features, stroke length and stroke curvature, is used to classify each stroke as text or non-text.
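
To make the classification step concrete, here is a minimal sketch of a linear decision boundary over the two features. The weights and bias are hypothetical values picked for illustration, not the ones learned in the paper. The curvature measure (sum of absolute turning angles) and the assumption that text strokes tend to be shorter and more curved than non-text strokes are also mine.

```python
import math

def stroke_length(points):
    """Total path length of a stroke given as a list of (x, y) samples."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def stroke_curvature(points):
    """Sum of absolute turning angles (radians) along the stroke."""
    total = 0.0
    for a, b, c in zip(points, points[1:], points[2:]):
        h1 = math.atan2(b[1] - a[1], b[0] - a[0])
        h2 = math.atan2(c[1] - b[1], c[0] - b[0])
        # Wrap the heading change into [-pi, pi) before taking its magnitude.
        turn = (h2 - h1 + math.pi) % (2 * math.pi) - math.pi
        total += abs(turn)
    return total

# Hypothetical parameters: in practice these would be learned from labeled strokes.
W_LENGTH, W_CURVATURE, BIAS = -0.01, 1.0, -0.5

def is_text(points):
    """Classify a stroke as text when it falls on the positive side of the line."""
    score = (W_LENGTH * stroke_length(points)
             + W_CURVATURE * stroke_curvature(points)
             + BIAS)
    return score > 0

straight = [(0, 0), (100, 0), (200, 0)]            # long, no curvature
zigzag = [(0, 0), (5, 5), (10, 0), (15, 5), (20, 0)]  # short, sharp turns
print(is_text(straight), is_text(zigzag))
```

With these made-up weights, the long straight stroke lands on the non-text side and the short zigzag on the text side.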

Grouping non-text strokes: A minimum spanning tree is used to group the non-text strokes. Each stroke is a node, and the edge weight between two strokes is the shortest distance between them. Inconsistent edges are removed from the spanning tree, so the strokes that remain connected form a region. A maximum inter-region distance of 20 and an intra-region distance of 200 are imposed on the regions.
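
The grouping step can be sketched roughly as follows. This is only an illustration: the rule for what counts as an "inconsistent" edge here (an edge longer than a hypothetical `factor` times the mean edge weight in the tree) is my simplification, not necessarily the criterion used in the paper, and the distance limits of 20 and 200 are not modeled.

```python
import math

def stroke_distance(s, t):
    """Shortest distance between any sample point of stroke s and stroke t."""
    return min(math.dist(p, q) for p in s for q in t)

def minimum_spanning_tree(strokes):
    """Prim's algorithm on the complete graph of strokes; returns (w, i, j) edges."""
    n = len(strokes)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min(
            (stroke_distance(strokes[i], strokes[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append((w, i, j))
        in_tree.add(j)
    return edges

def group_strokes(strokes, factor=2.0):
    """Cut 'inconsistent' MST edges and return the remaining connected groups."""
    if not strokes:
        return []
    edges = minimum_spanning_tree(strokes)
    if not edges:
        return [[0]]
    mean_w = sum(w for w, _, _ in edges) / len(edges)
    # Assumed rule: an edge is inconsistent if it is much longer than average.
    kept = [(i, j) for w, i, j in edges if w <= factor * mean_w]

    parent = list(range(len(strokes)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in kept:
        parent[find(i)] = find(j)

    groups = {}
    for k in range(len(strokes)):
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())

strokes = [
    [(0, 0), (1, 0)], [(2, 0), (3, 0)], [(4, 0), (5, 0)],   # one cluster
    [(100, 0), (101, 0)], [(102, 0), (103, 0)],             # a distant cluster
]
print(group_strokes(strokes))
```

The single long edge that bridges the two clusters is cut, so the strokes fall into two groups.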

Tables - Ruled and unruled

Ruled tables - Ruled tables can be found using the Hough transform (r, theta). The horizontal and vertical lines of a ruled table produce peaks at theta = 0 and theta = 90. A region is treated as a ruled table if it contains at least 5 lines.
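
As a rough sketch of the ruled-table test, the code below builds a Hough accumulator with the usual parameterization r = x cos(theta) + y sin(theta), where theta = 0 corresponds to vertical lines and theta = 90 to horizontal lines. The vote threshold `min_votes` and the bin sizes are assumptions of mine. Only the "at least 5 lines" condition comes from the summary above.

```python
import math
from collections import Counter

def hough_accumulator(points, theta_step_deg=1, r_step=1.0):
    """Vote in (r, theta) space; theta is in degrees, r is quantized by r_step."""
    acc = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            t = math.radians(theta_deg)
            r = x * math.cos(t) + y * math.sin(t)
            acc[(round(r / r_step), theta_deg)] += 1
    return acc

def axis_aligned_lines(points, min_votes):
    """Return (theta, r_bin) for peaks at theta = 0 or 90 with enough votes."""
    acc = hough_accumulator(points)
    return sorted(
        (theta, r) for (r, theta), votes in acc.items()
        if theta in (0, 90) and votes >= min_votes
    )

def looks_like_ruled_table(points, min_votes, min_lines=5):
    """Apply the 'at least 5 lines' condition to the detected peaks."""
    return len(axis_aligned_lines(points, min_votes)) >= min_lines

# A small synthetic grid: 3 horizontal and 3 vertical rulings.
grid = set()
for y in (0, 10, 20):
    grid.update((x, y) for x in range(31))
for x in (0, 15, 30):
    grid.update((x, y) for y in range(21))

print(axis_aligned_lines(grid, min_votes=15))
print(looks_like_ruled_table(grid, min_votes=15))
```

On this synthetic grid the accumulator has six strong peaks, three at theta = 0 and three at theta = 90, so the region passes the five-line condition. A single diagonal stroke produces no such peaks.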

Unruled tables - Projecting the strokes onto one axis gives a peak for each line, and projecting onto the other axis gives peaks separated by valleys (the gaps between the columns).
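
The projection idea can be illustrated with a simple histogram of ink along each axis. This is a minimal sketch under my own assumptions: the bin size, the definition of a peak as a run of non-empty bins, and the function names are all hypothetical, and the paper may detect peaks and valleys differently.

```python
def projection_profile(points, axis, bin_size=1.0):
    """Histogram of point coordinates along one axis (0 = x, 1 = y)."""
    if not points:
        return []
    values = [p[axis] for p in points]
    lo = min(values)
    n_bins = int((max(values) - lo) // bin_size) + 1
    profile = [0] * n_bins
    for v in values:
        profile[int((v - lo) // bin_size)] += 1
    return profile

def count_peaks(profile, min_gap=1):
    """Count runs of non-empty bins separated by at least min_gap empty bins."""
    peaks = 0
    gap = min_gap  # lets the first non-empty bin start a peak
    for count in profile:
        if count > 0:
            if gap >= min_gap:
                peaks += 1
            gap = 0
        else:
            gap += 1
    return peaks

# Synthetic ink for an unruled table with 3 rows and 2 columns.
ink = [(x, y)
       for y in (0, 10, 20)
       for x in list(range(0, 5)) + list(range(20, 25))]

rows = count_peaks(projection_profile(ink, axis=1))
cols = count_peaks(projection_profile(ink, axis=0))
print(rows, cols)
```

Projecting onto the y axis yields one peak per row, and projecting onto the x axis yields one peak per column, with an empty valley where the column gap is.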

Discussion:
The features used to classify tables are interesting. I did not understand how the spanning tree and clustering help with classification. Maybe some of the referenced papers would help in understanding this.
