4.6 Article

Building an efficient OCR system for historical documents with little training data

Journal

NEURAL COMPUTING & APPLICATIONS
Volume 32, Issue 23, Pages 17209-17227

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s00521-020-04910-x

Keywords

CNN; FCN; Historical documents; LSTM; Neural network; OCR; Porta fontium; Synthetic data

Funding

  1. ERDF Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom) [CZ.02.1.01/0.0/0.0/17_048/0007267]
  2. Cross-border Cooperation Program Czech Republic - Free State of Bavaria ETS Objective 2014-2020 [211]

As the number of digitized historical documents has increased rapidly over the last few decades, efficient methods of information retrieval and knowledge extraction are needed to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. Current OCR methods are often not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents. Therefore, this paper introduces a set of methods for performing OCR on historical document images using only a small amount of real, manually annotated training data. The presented complete OCR system covers two main tasks: page layout analysis, including text block and line segmentation, and OCR. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. Both approaches are state of the art in their respective fields. We have created a novel real dataset for OCR from the Porta fontium portal. This corpus is freely available for research, and all proposed methods are evaluated on these data. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim to determine the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than those of several state-of-the-art systems. In summary, this paper shows how to create an efficient OCR system for historical documents that needs only a small amount of annotated training data.

Recommended

Correction Computer Science, Artificial Intelligence

Building an efficient OCR system for historical documents with little training data (vol 32, pg 17209, 2020)

Jiri Martinek, Ladislav Lenc, Pavel Kral

Summary: With the author(s) choosing Open Choice, the copyright of the article was changed on December 3, 2020 to [The Authors] [2020], and the article was immediately distributed under the terms of copyright.

NEURAL COMPUTING & APPLICATIONS (2022)

Article Computer Science, Interdisciplinary Applications

Are Papers Asking Questions Cited More Frequently in Computer Science?

Dalibor Fiala, Pavel Kral, Martin Dostal

Summary: The study tested the hypothesis that computer science papers with questions in their titles are cited more frequently. The analysis of data from almost two million computer science papers showed that papers with questions receive an average of 20% more citations than other papers, which is statistically significant.

COMPUTERS (2021)

Article Computer Science, Artificial Intelligence

Well-calibrated confidence measures for multi-label text classification with a large number of labels

Lysimachos Maltoudoglou, Andreas Paisios, Ladislav Lenc, Jiri Martinek, Pavel Kral, Harris Papadopoulos

Summary: In this study, the researchers present a novel approach to improve the computational efficiency of Label Powerset Inductive Conformal Prediction in multi-label text classification. Experimental results show that contextualised-based classifiers outperform non-contextualised ones and achieve state-of-the-art performance across all datasets.

PATTERN RECOGNITION (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Text Line Segmentation in Historical Newspapers

Ladislav Lenc, Jiri Martinek, Pavel Kral

Summary: This paper presents a novel approach to page segmentation into text lines, which is used as input for a line-based OCR system. The approach decomposes the problem into text-block and text-line segmentation and employs algorithms based on fully convolutional neural networks. The proposed method is evaluated on standard corpora and a new dataset created from freely available data.

ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, ICAISC 2022, PT II (2023)

Proceedings Paper Acoustics

Weak supervision for Question Type Detection with large language models

Jiri Martinek, Christophe Cerisara, Pavel Kral, Ladislav Lenc, Josef Baloun

Summary: Large pre-trained language models have achieved impressive results in zero-shot learning, but it is still challenging to design effective prompts for certain tasks like dialogue act recognition. We propose an alternative approach that replaces manual prompts with simple rules, which are more intuitive and easier to design. Our experiments on question type recognition demonstrate that this approach can achieve competitive performance, and we analyze its limitations.

INTERSPEECH 2022 (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Historical Map Toponym Extraction for Efficient Information Retrieval

Ladislav Lenc, Jiri Martinek, Josef Baloun, Martin Prantl, Pavel Kral

Summary: The paper introduces a method for the detection, classification, and recognition of toponyms in hand-drawn historical cadastral maps. The detected and recognized toponyms are used as keywords for intelligent and efficient searching in historical map collections. The paper proposes a novel approach for toponym classification based on the KAZE descriptor and evaluates several state-of-the-art methods for text and object detection on the toponym detection task. Additionally, the paper presents the results of toponym text recognition using the popular Tesseract engine.

DOCUMENT ANALYSIS SYSTEMS, DAS 2022 (2022)

Proceedings Paper Computer Science, Information Systems

Dialogue Act Recognition Using Visual Information

Jiri Martinek, Pavel Kral, Ladislav Lenc

Summary: This paper focuses on dialogue act recognition from printed documents and introduces a novel deep model for visual DA recognition. The study shows that visual information does not impact DA recognition on high-quality images, but significantly improves the score on low-quality images with erroneous OCR. This is the first attempt to focus on DA recognition from visual data.

DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT II (2021)

Proceedings Paper Computer Science, Artificial Intelligence

ICDAR 2021 Competition on Historical Map Segmentation

Joseph Chazalon, Edwin Carlinet, Yizi Chen, Julien Perret, Bertrand Dumenieu, Clement Mallet, Thierry Geraud, Vincent Nguyen, Nam Nguyen, Josef Baloun, Ladislav Lenc, Pavel Kral

Summary: This paper presents the final results of the ICDAR 2021 MapSeg competition, which focuses on historical map segmentation of a series of historical atlases of Paris, France. The winning teams used different network structures and methods for each task. The research outcomes have a positive impact on the development of historical map segmentation technology.

DOCUMENT ANALYSIS AND RECOGNITION, ICDAR 2021, PT IV (2021)

Proceedings Paper Computer Science, Artificial Intelligence

ChronSeg: Novel Dataset for Segmentation of Handwritten Historical Chronicles

Josef Baloun, Pavel Kral, Ladislav Lenc

Summary: This research focuses on the segmentation of historical handwritten documents, specifically chronicles, using a fully convolutional neural network approach. A new dataset was created, consisting of 58 images with precise annotations for text, image, and graphic regions at a pixel level. Multiple experiments were conducted to identify the best method configuration, including a novel data augmentation method.

ICAART: PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE - VOL 2 (2021)
