A new framework for recognition of heavily degraded characters in historical typewritten documents based on semi-supervised clustering

S. Pletschacher; J. Hu; A. Antonacopoulos

doi:10.1109/ICDAR.2009.267

ICDAR 2009

Conference paper

10 Dec 2009

Abstract

This paper presents a new semi-supervised clustering framework for the recognition of heavily degraded characters in historical typewritten documents, where off-the-shelf OCR typically fails. The constraints are generated using typographical (collection-independent) domain knowledge and are used to guide both sample (glyph set) partitioning and metric learning. Experimental results using simple features provide encouraging evidence that this approach can lead to significantly improved clustering results compared to simple K-Means clustering, as well as to clustering using a state-of-the-art OCR engine. © 2009 IEEE.
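
As a rough illustration of how pairwise constraints can guide sample partitioning, the following is a minimal sketch of a generic constrained K-Means (in the style of COP-KMeans), not the authors' framework. The function names, the must-link/cannot-link representation, and the toy 2-D "glyph features" are all assumptions made for this example; the metric-learning step mentioned in the abstract is not covered.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def violates(i, cluster, labels, must_link, cannot_link):
    """True if assigning sample i to `cluster` breaks a pairwise constraint."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-link partner that is already assigned must share the cluster.
        if other is not None and labels[other] not in (None, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-link partner must not already sit in this cluster.
        if other is not None and labels[other] == cluster:
            return True
    return False

def constrained_kmeans(points, k, must_link, cannot_link, iters=20, seed=0):
    """Hypothetical helper: K-Means whose assignment step respects constraints."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iters):
        new_labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest; take the first feasible one.
            order = sorted(range(k), key=lambda c: sq_dist(p, centers[c]))
            for c in order:
                if not violates(i, c, new_labels, must_link, cannot_link):
                    new_labels[i] = c
                    break
            if new_labels[i] is None:
                raise ValueError("no feasible assignment for sample %d" % i)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data (assumed): two tight groups plus one ambiguous sample at index 4.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (2.0, 2.0)]
must_link = [(2, 4)]    # samples 2 and 4 are known to be the same character
cannot_link = [(0, 2)]  # samples 0 and 2 are known to be different characters
labels = constrained_kmeans(points, 2, must_link, cannot_link)
```

In this toy setup the ambiguous sample at index 4 lies closer to the first group, so unconstrained K-Means would be expected to place it there; the must-link constraint forces it into the same cluster as sample 2 instead.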
