Article

Building an efficient OCR system for historical documents with little training data

NEURAL COMPUTING & APPLICATIONS (2020)

Journal

NEURAL COMPUTING & APPLICATIONS

Volume 32, Issue 23, Pages 17209-17227

Publisher

SPRINGER LONDON LTD

DOI: 10.1007/s00521-020-04910-x

Keywords

CNN; FCN; Historical documents; LSTM; Neural network; OCR; Porta fontium; Synthetic data

Categories

Computer Science, Artificial Intelligence

Funding

ERDF Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom) [CZ.02.1.01/0.0/0.0/17_048/0007267]
Cross-border Cooperation Program Czech Republic - Free State of Bavaria ETS Objective 2014-2020 [211]

Abstract

As the number of digitized historical documents has increased rapidly over the last few decades, efficient methods of information retrieval and knowledge extraction are needed to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. Current OCR methods are often not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents. Therefore, this paper introduces a set of methods that allows OCR to be performed on historical document images using only a small amount of real, manually annotated training data. The complete OCR system presented here covers two main tasks: page layout analysis (text block and line segmentation) and OCR itself. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. Both approaches are state of the art in their respective fields. We have created a novel real dataset for OCR from the Porta fontium portal. This corpus is freely available for research, and all proposed methods are evaluated on these data. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim to determine the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than the scores of several state-of-the-art systems. To sum up, this paper shows a way to create an efficient OCR system for historical documents that needs only a small amount of annotated training data.

HDPA: historical document processing and analysis framework

Ladislav Lenc, Jiri Martinek, Pavel Kral, Anguelos Nicolao, Vincent Christlein

Summary: This paper describes a complex and flexible web framework for historical document manipulation and analysis with a focus on OCR. The framework contains eight modules to facilitate three main tasks. Experimental evaluation shows that the system is efficient and can save human labor.

EVOLVING SYSTEMS (2021)