Article

Evaluating Various Tokenizers for Arabic Text Classification

Journal

NEURAL PROCESSING LETTERS
Volume 55, Issue 3, Pages 2911-2933

Publisher

SPRINGER
DOI: 10.1007/s11063-022-10990-8

Keywords

Text Tokenization; Arabic NLP; Text Classification; Sentiment Analysis; Poem-meter Classification


This paper introduces three new tokenization algorithms for Arabic and compares them to three popular tokenizers. The experiments show that no single tokenization technique is the best choice overall and that performance depends on factors such as dataset size, task type, and morphological richness.
The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is inefficient in terms of vocabulary size. In the literature, many tokenization algorithms have emerged to tackle this problem by creating subwords, which in turn limits the vocabulary size of a given text corpus. Most tokenization techniques are language-agnostic, i.e., they do not incorporate the linguistic features of a given language. Moreover, such techniques are difficult to evaluate in practice. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other popular tokenizers using unsupervised evaluations. In addition, we compare all six tokenizers by evaluating them on three supervised classification tasks: sentiment analysis, news classification, and poem-meter classification, using six publicly available datasets. Our experiments show that none of the tokenization techniques is the best choice overall and that the performance of a given tokenization algorithm depends on many factors, including the size of the dataset, the nature of the task, and the morphological richness of the dataset. However, some tokenization techniques perform better than others across various text classification tasks.
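
To make the subword idea concrete, below is a minimal sketch of byte-pair encoding (BPE), one family of language-agnostic subword tokenizers of the kind the paper compares against. The toy corpus, merge count, and function name are invented for illustration; this is not the paper's Arabic-specific method.

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Each word starts as a sequence of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Frequent pairs such as ('l', 'o') and ('lo', 'w') are merged first,
# so common stems become single tokens and the vocabulary stays small.
print(train_bpe(["low", "low", "lower", "lowest", "newer", "newest"], 4))
```

Because such merges are driven purely by frequency, they need not align with true morpheme boundaries in a morphologically rich language like Arabic, which is the gap the abstract points to when motivating language-aware tokenizers.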
