Article

Statistical learning for OCR error correction

Journal

INFORMATION PROCESSING & MANAGEMENT
Volume 54, Issue 6, Pages 874-887

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2018.06.001

Keywords

OCR post-processing; OCR error; Error correction; Statistical learning

Funding

  1. Social Sciences and Humanities Research Council of Canada (SSHRC) [RGPDD 451330, RGPIN 130082, RGPIN 06183]

Modern OCR engines incorporate some form of error correction, typically based on dictionaries. However, residual errors remain and degrade the performance of natural language processing algorithms applied to OCR text. In this paper, we present a statistical learning model for post-processing OCR errors, either fully automatically or with minimal user interaction to further reduce the error rate. Our model employs web-scale corpora and integrates a rich set of linguistic features. Through an interdependent learning pipeline, our model produces and continuously refines both error detection and the suggestion of candidate corrections. Evaluated on a historical biology book with complex error patterns, our model outperforms various baseline methods in the automatic mode and shows an even greater advantage when minimal user interaction is involved. Quantitative analysis of each computational step further suggests that our proposed model is well suited to handling volatile and complex OCR error patterns, which are beyond the capabilities of the error correction built into OCR engines.
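To make the detect-then-suggest idea concrete, the following is a minimal sketch in Python. It is not the authors' model: it swaps in a plain lexicon lookup for error detection and Levenshtein distance plus word frequency for candidate ranking, whereas the paper describes a learned model with web-scale corpora and richer linguistic features. The function names, the toy lexicon, and the `max_dist` threshold are all assumptions made for illustration.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def detect_errors(tokens, lexicon):
    """Return indices of tokens missing from the lexicon (assumed detection rule)."""
    return [i for i, tok in enumerate(tokens) if tok.lower() not in lexicon]


def suggest(token, lexicon, max_dist=2):
    """Rank lexicon words within max_dist edits: closest first, then most frequent."""
    scored = []
    for word, freq in lexicon.items():
        dist = edit_distance(token.lower(), word)
        if dist <= max_dist:
            scored.append((dist, -freq, word))
    return [word for _, _, word in sorted(scored)]


# Hypothetical toy lexicon with made-up frequencies, for illustration only.
LEXICON = Counter({"the": 100, "species": 50, "specimen": 20, "spices": 5})

tokens = ["the", "specics"]
for idx in detect_errors(tokens, LEXICON):
    print(tokens[idx], "->", suggest(tokens[idx], LEXICON))
```

A learned approach would replace the fixed sort key in `suggest` with a model that scores each candidate from several features, and a user-in-the-loop mode would present the top few candidates for confirmation instead of applying the first one.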

Recommended

Article Computer Science, Information Systems

The social-technological ways to develop digital entrepreneurship: Targeting value creation and value capture

Sang-Bing Tsai, Xusen Cheng, Yanwu Yang, Jason Xiong, Alex Zarifis

Summary: This article systematically summarizes the methods proposed and evidenced for developing digital entrepreneurship from a socio-technical perspective. The technology itself and the process of its utilization should be carefully considered. From a social perspective, fulfilling customers' needs in social interaction and nurturing the characteristics and social skills required for the digital work environment are crucial.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

NSEP: Early fake news detection via news semantic environment perception

Xiaochang Fang, Hongchen Wu, Jing Jing, Yihong Meng, Bing Yu, Hongzhu Yu, Huaxiang Zhang

Summary: This study proposes a novel fake news detection framework, utilizing news semantic environment perception (NSEP) to identify fake news content. The framework consists of steps such as dividing the semantic environment into macro and micro levels, applying graph convolutional networks, and utilizing multihead attention. Empirical experiments show that the NSEP framework achieves high accuracy in detecting Chinese fake news, outperforming other baseline methods and highlighting the importance of both micro and macro semantic environments in early detection of fake news.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

A scalable and flexible basket analysis system for big transaction data in Spark

Xudong Sun, Alladoumbaye Ngueilbaye, Kaijing Luo, Yongda Cai, Dingming Wu, Joshua Zhexue Huang

Summary: This paper proposes a scalable distributed frequent itemset mining (ScaDistFIM) algorithm to address the data scalability and flexibility issues in basket analysis in the big data era. Experimental results demonstrate that the ScaDistFIM algorithm is more efficient than the Spark FP-Growth algorithm.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

A T5-based interpretable reading comprehension model with more accurate evidence training

Boxu Guan, Xinhua Zhu, Shangbo Yuan

Summary: This paper aims to improve the interpretability of machine reading comprehension models by utilizing the pre-trained T5 model for evidence inference. The authors propose an interpretable reading comprehension model based on T5, which is trained on a more accurate evidence corpus and can infer precise interpretations for answers. Experimental results show that the model outperforms the baseline BERT model on the SQuAD1.1 task.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

STMAP: A novel semantic text matching model augmented with embedding perturbations

Yanhao Wang, Baohua Zhang, Weikang Liu, Jiahao Cai, Huaping Zhang

Summary: In this study, we propose a data augmentation-based semantic text matching model called STMAP. By using Gaussian noise and noise mask signal for data augmentation, as well as employing an adaptive optimization network for training target optimization, our model achieves good performance in few-shot learning and semantic deviation problems.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

An efficient loss function and deep learning approach for ranking stock returns in the absence of prior knowledge

Jiahao Yang, Shuo Feng, Wenkai Zhang, Ming Zhang, Jun Zhou, Pengyuan Zhang

Summary: To pursue profit from stock markets, researchers utilize deep learning methods to forecast asset price movements. However, current research has two issues: the discrepancy between forecasting results and profits, and a heavy reliance on prior knowledge. To address these issues, the authors propose a novel optimization objective and modeling method, and conduct experiments to validate their approach.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Revealing the technology development of natural language processing: A Scientific entity-centric perspective

Heng Zhang, Chengzhi Zhang, Yuzhuo Wang

Summary: This study provides an accurate analysis of technology development in the field of Natural Language Processing (NLP) from an entity-centric perspective. The findings indicate an increase in the average number of entities per paper, with pre-trained language models becoming mainstream and the impact of Wikipedia dataset and BLEU metric continuing to rise. There has been a surge in popularity for new high-impact technologies in recent years, with researchers accepting them at an unprecedented speed.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Citation prediction by leveraging transformers and natural language heuristics

Davide Buscaldi, Danilo Dessi, Enrico Motta, Marco Murgia, Francesco Osborne, Diego Reforgiato Recupero

Summary: In scientific papers, citing other articles is a common practice to support claims and provide evidence. This paper proposes two automatic methods using Transformer models to address citation placement, and achieves significant improvements in experiments.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Data-driven analysis of digital entrepreneurship in medical supply resilience confronting the COVID-19 epidemic

Baozhuang Niu, Lingfeng Wang, Xinhu Yu, Beibei Feng

Summary: This paper examines whether the incumbent brand should adopt digital technology to forecast demand and adjust order decisions in the face of soaring demand for medical supplies caused by frequent regional COVID-19 outbreaks. The study finds that digital transformation can lead to a triple-win situation among the incumbent brand, social welfare, and consumer surplus, as well as bring benefits to the manufacturer. Furthermore, the research provides insights for firms' digital entrepreneurship decisions through theoretical optimization and data processing/policy simulation.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Multi-level knowledge-driven feature representation and triplet loss optimization network for image-text retrieval

Xueyang Qin, Lishang Li, Fei Hao, Meiling Ge, Guangyao Pang

Summary: Image-text retrieval is important in connecting vision and language. This paper proposes a method that utilizes prior knowledge to enhance feature representations and optimize network training for better retrieval results.

INFORMATION PROCESSING & MANAGEMENT (2024)

Review Computer Science, Information Systems

A co-attention based multi-modal fusion network for review helpfulness prediction

Gang Ren, Lei Diao, Fanjia Guo, Taeho Hong

Summary: This paper proposes a novel approach for predicting the helpfulness of reviews by utilizing both textual and image features. The proposed method considers the correlation between features through self-attention and co-attention mechanisms, and fuses multi-modal features for prediction. Experimental results demonstrate the superior performance of the proposed method compared to benchmark methods.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Retrieval Contrastive Learning for Aspect-Level Sentiment Classification

Zhongquan Jian, Jiajian Li, Qingqiang Wu, Junfeng Yao

Summary: Aspect-Level Sentiment Classification (ALSC) is a crucial challenge in Natural Language Processing (NLP). Most existing methods fail to consider the correlations between different instances, leading to a lack of global viewpoint. To address this issue, we propose a Retrieval Contrastive Learning (RCL) framework that extracts intrinsic knowledge across instances for improved instance representation. Experimental results demonstrate that training ALSC models with RCL leads to substantial performance improvements.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

A hierarchical convolutional model for biomedical relation extraction

Ying Hu, Yanping Chen, Ruizhang Huang, Yongbin Qin, Qinghua Zheng

Summary: Biomedical relation extraction aims to extract the interactive relations between biomedical entities in a sentence. This study proposes a hierarchical convolutional model to address the semantic overlapping and data imbalance problems. The model encodes both local contextual features and global semantic dependencies, enhancing the discriminability of the neural network for biomedical relation extraction.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

Topic Audiolization: A Model for Rumor Detection Inspired by Lie Detection Technology

Zhou Yang, Yucai Pang, Xuehong Li, Qian Li, Shihong Wei, Rong Wang, Yunpeng Xiao

Summary: This study proposes a rumor detection model based on topic audiolization, which transforms the topic space into audio-like signals. Experimental results show that the model achieves significant performance improvements in rumor identification.

INFORMATION PROCESSING & MANAGEMENT (2024)

Article Computer Science, Information Systems

User-oriented metrics for search engine deterministic sort orders

Alistair Moffat

Summary: This paper proposes the buying power metric for assessing the quality of product rankings on e-commerce sites. It discusses the relationship between the buying power metric and user reactions, and introduces an alternative product ranking effectiveness metric.

INFORMATION PROCESSING & MANAGEMENT (2024)