Article

Statistical learning for OCR error correction

INFORMATION PROCESSING & MANAGEMENT (2018)

Journal

INFORMATION PROCESSING & MANAGEMENT

Volume 54, Issue 6, Pages 874-887

Publisher

ELSEVIER SCI LTD

DOI: 10.1016/j.ipm.2018.06.001

Keywords

OCR post-processing; OCR error; Error correction; Statistical learning

Categories

Computer Science, Information Systems; Information Science & Library Science

Funding

Social Sciences and Humanities Research Council of Canada (SSHRC) [RGPDD 451330, RGPIN 130082, RGPIN 06183]

Abstract

Modern OCR engines incorporate some form of error correction, typically based on dictionaries. However, there are still residual errors that decrease performance of natural language processing algorithms applied to OCR text. In this paper, we present a statistical learning model for post processing OCR errors, either in a fully automatic manner or followed by minimal user interaction to further reduce error rate. Our model employs web-scale corpora and integrates a rich set of linguistic features. Through an interdependent learning pipeline, our model produces and continuously refines the error detection and suggestion of candidate corrections. Evaluated on a historical biology book with complex error patterns, our model outperforms various baseline methods in the automatic mode and shows an even greater advantage when involving minimal user interaction. Quantitative analysis of each computational step further suggests that our proposed model is well-suited for handling volatile and complex OCR error patterns, which are beyond the capabilities of error correction incorporated in OCR engines.

Recommended

Article Computer Science, Theory & Methods

Survey of Post-OCR Processing Approaches

Thi Tuyet Hai Nguyen, Adam Jatowt, Mickael Coustaty, Antoine Doucet

Summary: The article highlights the importance of improving the quality of OCR results, as historical materials often perform poorly in OCR processing and require post-correction. It defines the post-OCR processing problem, describes its typical pipeline, and reviews the latest post-OCR processing methods, along with discussing evaluation metrics, accessible datasets, language resources, and toolkits. Additionally, the work identifies the current trend and outlines research directions in this field.

ACM COMPUTING SURVEYS (2021)

Article Computer Science, Artificial Intelligence

Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, Graham Neubig

Summary: The paper introduces a semi-supervised learning method to improve OCR system performance by utilizing raw images and self-training, and introduces a lexically aware decoding method. Results show that self-training and lexically aware decoding are essential for achieving consistent improvements.

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2021)

Article Computer Science, Information Systems

OCR post-correction for detecting adversarial text images

Niddal H. Imam, Vassilios G. Vassilakis, Dimitris Kolovos

Summary: This paper presents an OCR-based system for detecting images with embedded text shared on social networks and proposes an OCR post-correction algorithm to improve the system's robustness. Experimental results demonstrate the effectiveness of the algorithm in detecting and correcting adversarial text images, leading to improved performance of the OCR system.

JOURNAL OF INFORMATION SECURITY AND APPLICATIONS (2022)

Article Computer Science, Artificial Intelligence

Towards improving speech recognition model with post-processing spell correction using BERT

M. C. Shunmuga Priya, D. Karthika Renuka, L. Ashok Kumar

Summary: Speech recognition is widely used but still faces the challenge of spelling errors. This research proposes a BERT-based spell correction module to enhance ASR system performance. Experimental results demonstrate the efficacy of this module in detecting and correcting spelling errors.

JOURNAL OF INTELLIGENT & FUZZY SYSTEMS (2022)

Review Chemistry, Multidisciplinary

Analysis of Recent Deep Learning Techniques for Arabic Handwritten-Text OCR and Post-OCR Correction

Rayyan Najam, Safiullah Faizullah

Summary: Arabic handwritten-text recognition uses OCR and text-correction techniques for accurate text extraction from images. Deep learning has been widely used in OCR, but recent deep-learning techniques for Arabic handwritten OCR and text correction have not been adequately studied or analyzed. This analysis fills this gap by uncovering recent developments and limitations, providing valuable insights for researchers, practitioners, and interested readers. The study finds that CNN-LSTM-CTC is the most suitable architecture for OCR, and DL models improve accuracy in OCR text correction. The study highlights the potential for applying text-embedding models to correct OCR results in Arabic OCR and emphasizes the need for high-quality datasets and future research in this area.

APPLIED SCIENCES-BASEL (2023)

Article Chemistry, Multidisciplinary

Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction

Wei Gou, Zheng Chen

Summary: Chinese Spelling Error Correction is a hot topic in natural language processing, with many solutions from rule-based to deep learning methods. Although SpellGCN has achieved the best results, it produces many false error correction results in practical tasks. The proposed post-processing method aims to improve performance by filtering out these false results.

APPLIED SCIENCES-BASEL (2021)

Article Computer Science, Artificial Intelligence

Neural OCR Post-Hoc Correction of Historical Corpora

Lijun Lyu, Maria Koutraki, Martin Krickl, Besnik Fetahu

Summary: Optical character recognition is crucial for accessing historical collections, but it faces challenges such as orthographic variations and language evolution leading to transcription errors. A neural network approach is proposed to correct OCR errors, which significantly reduces the word error rate.

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2021)

Article Engineering, Civil

A novel attention-based LSTM cell post-processor coupled with bayesian optimization for streamflow prediction

Babak Alizadeh, Alireza Ghaderi Bafti, Hamid Kamangir, Yu Zhang, Daniel B. Wright, Kristie J. Franz

Summary: The study introduces a novel deep learning model SAINA-LSTM, which improves streamflow forecasting performance by incorporating attention mechanism into LSTM cells. SAINA-LSTM outperforms other models in various climatological basins and for 1- to 7-day ahead forecasts in different flow ranges.

JOURNAL OF HYDROLOGY (2021)

Article Mathematics

Identification and Correction of Grammatical Errors in Ukrainian Texts Based on Machine Learning Technology

Vasyl Lytvyn, Petro Pukach, Victoria Vysotska, Myroslava Vovk, Nataliia Kholodna

Summary: A machine learning model has been developed to correct errors in Ukrainian texts. The neural network has the ability to correct simple sentences in Ukrainian, but a complete system requires the use of spell-checking dictionaries and rule checking. A pre-trained BERT neural network was used to save computing resources and showed satisfactory results in correcting grammatical and stylistic errors. Among the pre-trained models, the mT5 model performed the best according to BLEU and METEOR metrics.

MATHEMATICS (2023)

Article Chemistry, Multidisciplinary

Closed-Loop Error-Correction Learning Accelerates Experimental Discovery of Thermoelectric Materials

Hitarth Choubisa, Md Azimul Haque, Tong Zhu, Lewei Zeng, Maral Vafaie, Derya Baran, Edward H. Sargent

Summary: The exploration of thermoelectric materials is challenging due to the large materials space and the complexity of synthesis. By incorporating historical data and using error-correction learning, this study discovers a previously unexplored family of thermoelectric materials and finds an optimized material with significantly improved power factor. It is observed that a closed-loop experimentation strategy reduces the required number of experiments by up to 3 times compared to high-throughput searches powered by state-of-the-art machine-learning models.

ADVANCED MATERIALS (2023)

Article Physics, Multidisciplinary

Learning Logical Pauli Noise in Quantum Error Correction

Thomas Wagner, Hermann Kampermann, Dagmar Bruss, Martin Kliesch

Summary: The characterization of quantum devices is important but costly. This study focuses on the characterization of quantum computers in the context of stabilizer quantum error correction. It is shown that the logical error channel induced by Pauli noise can be estimated from syndrome data under minimal conditions for different types of codes.

PHYSICAL REVIEW LETTERS (2023)

Article Engineering, Electrical & Electronic

An OCR Post-Correction Approach Using Deep Learning for Processing Medical Reports

Srinidhi Karthikeyan, Alba G. Seco de Herrera, Faiyaz Doctor, Asim Mirza

Summary: The COVID-19 pandemic has placed a significant burden on the global healthcare sector, driving digital transformation efforts for improved efficiency. The generation of medical data has increased dramatically, with much of it being unstructured and stored as part of patients' medical reports. Optical Character Recognition (OCR) is used to digitize this unstructured data, but OCR engines often struggle with accurately transcribing scanned or handwritten documents. The proposed method utilizes a deep neural network pre-training technique called RoBERTa to predict and fill in the gaps in non-transcribable sections of the documents. Evaluation on domain-specific datasets, including real medical documents, demonstrates a significantly reduced word error rate and showcases the effectiveness of this approach.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Article Computer Science, Artificial Intelligence

A Comprehensive Survey of Grammatical Error Correction

Yu Wang, Yuelin Wang, Kai Dang, Jie Liu, Zhuo Liu

Summary: This study provides a comprehensive review of the literature in the field of grammatical error correction (GEC), covering task definition, basic approaches, performance boosting techniques, data augmentation methods, and evaluation results. Emphasis is placed on approaches related to machine translation, with an analysis of error types and system advancements for a clear view of progress in GEC. Future research directions in GEC are also discussed.

ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY (2021)

Article Chemistry, Multidisciplinary

Automatic Correction of Indonesian Grammatical Errors Based on Transformer

Ahmad Musyafa, Ying Gao, Aiman Solyman, Chaojie Wu, Siraj Khan

Summary: This paper proposes an automatic model for Indonesian grammar correction based on the Transformer architecture, addressing the lack of research on the GEC task for low-resource languages (especially Indonesian). It also builds a large corpus of the Indonesian language for evaluating future Indonesian GEC tasks. Experimental results demonstrate significant and satisfactory performance of the Transformer-based automatic error correction model.

APPLIED SCIENCES-BASEL (2022)

Article Neurosciences

Dynamics of nonlinguistic statistical learning: From neural entrainment to the emergence of explicit knowledge

Julia Moser, Laura Batterink, Yiwen Li Hegner, Franziska Schleger, Christoph Braun, Ken A. Paller, Hubert Preissl

Summary: Humans are highly sensitive to patterns in the environment and use statistical learning for cognition. This study examined the neural mechanisms of statistical learning using an auditory nonlinguistic paradigm. Neural entrainment reflects implicit learning of patterns, while the emergence of explicit knowledge varies across individuals depending on factors such as attention and exposure time.

NEUROIMAGE (2021)

Article Computer Science, Information Systems

The social-technological ways to develop digital entrepreneurship: Targeting value creation and value capture

Sang-Bing Tsai, Xusen Cheng, Yanwu Yang, Jason Xiong, Alex Zarifis

Summary: This article gives a structured summary of the methods proposed and evidenced for developing digital entrepreneurship from a socio-technical perspective. The technology itself and the process of utilization should be carefully considered. From a social perspective, fulfilling the needs of customers in social interaction and nurturing characteristics and social skills for the digital work environment are crucial.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

NSEP: Early fake news detection via news semantic environment perception

Xiaochang Fang, Hongchen Wu, Jing Jing, Yihong Meng, Bing Yu, Hongzhu Yu, Huaxiang Zhang

Summary: This study proposes a novel fake news detection framework, utilizing news semantic environment perception (NSEP) to identify fake news content. The framework consists of steps such as dividing the semantic environment into macro and micro levels, applying graph convolutional networks, and utilizing multihead attention. Empirical experiments show that the NSEP framework achieves high accuracy in detecting Chinese fake news, outperforming other baseline methods and highlighting the importance of both micro and macro semantic environments in early detection of fake news.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

A scalable and flexible basket analysis system for big transaction data in Spark

Xudong Sun, Alladoumbaye Ngueilbaye, Kaijing Luo, Yongda Cai, Dingming Wu, Joshua Zhexue Huang

Summary: This paper proposes a scalable distributed frequent itemset mining (ScaDistFIM) algorithm to address the data scalability and flexibility issues in basket analysis in the big data era. Experiment results demonstrate that the ScaDistFIM algorithm is more efficient compared to the Spark FP-Growth algorithm.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

A T5-based interpretable reading comprehension model with more accurate evidence training

Boxu Guan, Xinhua Zhu, Shangbo Yuan

Summary: This paper aims to improve the interpretability of machine reading comprehension models by utilizing the pre-trained T5 model for evidence inference. The authors propose an interpretable reading comprehension model based on T5, which is trained on a more accurate evidence corpus and can infer precise interpretations for answers. Experimental results show that the model outperforms the baseline BERT model on the SQuAD1.1 task.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

STMAP: A novel semantic text matching model augmented with embedding perturbations

Yanhao Wang, Baohua Zhang, Weikang Liu, Jiahao Cai, Huaping Zhang

Summary: In this study, we propose a data augmentation-based semantic text matching model called STMAP. By using Gaussian noise and noise mask signal for data augmentation, as well as employing an adaptive optimization network for training target optimization, our model achieves good performance in few-shot learning and semantic deviation problems.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

An efficient loss function and deep learning approach for ranking stock returns in the absence of prior knowledge

Jiahao Yang, Shuo Feng, Wenkai Zhang, Ming Zhang, Jun Zhou, Pengyuan Zhang

Summary: To pursue profit from stock markets, researchers utilize deep learning methods to forecast asset price movements. However, there are two issues in current research, the discrepancy between forecasting results and profits, and heavy reliance on prior knowledge. To address these issues, researchers propose a novel optimization objective and modeling method, and conduct experiments to validate their approach.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Revealing the technology development of natural language processing: A Scientific entity-centric perspective

Heng Zhang, Chengzhi Zhang, Yuzhuo Wang

Summary: This study provides an accurate analysis of technology development in the field of Natural Language Processing (NLP) from an entity-centric perspective. The findings indicate an increase in the average number of entities per paper, with pre-trained language models becoming mainstream and the impact of Wikipedia dataset and BLEU metric continuing to rise. There has been a surge in popularity for new high-impact technologies in recent years, with researchers accepting them at an unprecedented speed.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Citation prediction by leveraging transformers and natural language heuristics

Davide Buscaldi, Danilo Dessi, Enrico Motta, Marco Murgia, Francesco Osborne, Diego Reforgiato Recupero

Summary: In scientific papers, citing other articles is a common practice to support claims and provide evidence. This paper proposes two automatic methods using Transformer models to address citation placement, and achieves significant improvements in experiments.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Data-driven analysis of digital entrepreneurship in medical supply resilience confronting the COVID-19 epidemic

Baozhuang Niu, Lingfeng Wang, Xinhu Yu, Beibei Feng

Summary: This paper examines whether the incumbent brand should adopt digital technology to forecast demand and adjust order decisions in the face of soaring demand for medical supply caused by frequent outbreaks of regional COVID-19 epidemic. The study finds that digital transformation can lead to a triple-win situation among the incumbent brand, social welfare, and consumer surplus, as well as bring benefits to the manufacturer. Furthermore, the research provides insights for firms' digital entrepreneurship decisions through theoretical optimization and data processing/policy simulation.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Multi-level knowledge-driven feature representation and triplet loss optimization network for image-text retrieval

Xueyang Qin, Lishang Li, Fei Hao, Meiling Ge, Guangyao Pang

Summary: Image-text retrieval is important in connecting vision and language. This paper proposes a method that utilizes prior knowledge to enhance feature representations and optimize network training for better retrieval results.

INFORMATION PROCESSING & MANAGEMENT (2024)

Review Computer Science, Information Systems

A co-attention based multi-modal fusion network for review helpfulness prediction

Gang Ren, Lei Diao, Fanjia Guo, Taeho Hong

Summary: This paper proposes a novel approach for predicting the helpfulness of reviews by utilizing both textual and image features. The proposed method considers the correlation between features through self-attention and co-attention mechanisms, and fuses multi-modal features for prediction. Experimental results demonstrate the superior performance of the proposed method compared to benchmark methods.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Retrieval Contrastive Learning for Aspect-Level Sentiment Classification

Zhongquan Jian, Jiajian Li, Qingqiang Wu, Junfeng Yao

Summary: Aspect-Level Sentiment Classification (ALSC) is a crucial challenge in Natural Language Processing (NLP). Most existing methods fail to consider the correlations between different instances, leading to a lack of global viewpoint. To address this issue, we propose a Retrieval Contrastive Learning (RCL) framework that extracts intrinsic knowledge across instances for improved instance representation. Experimental results demonstrate that training ALSC models with RCL leads to substantial performance improvements.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

A hierarchical convolutional model for biomedical relation extraction

Ying Hu, Yanping Chen, Ruizhang Huang, Yongbin Qin, Qinghua Zheng

Summary: Biomedical relation extraction aims to extract the interactive relations between biomedical entities in a sentence. This study proposes a hierarchical convolutional model to address the semantic overlapping and data imbalance problems. The model encodes both local contextual features and global semantic dependencies, enhancing the discriminability of the neural network for biomedical relation extraction.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Topic Audiolization: A Model for Rumor Detection Inspired by Lie Detection Technology

Zhou Yang, Yucai Pang, Xuehong Li, Qian Li, Shihong Wei, Rong Wang, Yunpeng Xiao

Summary: This study proposes a rumor detection model based on topic audiolization, which transforms the topic space into audio-like signals. Experimental results show that the model achieves significant performance improvements in rumor identification.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

User-oriented metrics for search engine deterministic sort orders

Alistair Moffat

Summary: This paper proposes the buying power metric for assessing the quality of product rankings on e-commerce sites. It discusses the relationship between the buying power metric and user reactions, and introduces an alternative product ranking effectiveness metric.

INFORMATION PROCESSING & MANAGEMENT (2024)
