Article

Evaluating Various Tokenizers for Arabic Text Classification

Journal

NEURAL PROCESSING LETTERS
Volume 55, Issue 3, Pages 2911-2933

Publisher

SPRINGER
DOI: 10.1007/s11063-022-10990-8

Keywords

Text Tokenization; Arabic NLP; Text Classification; Sentiment Analysis; Poem-meter Classification


This paper introduces three new tokenization algorithms for Arabic and compares them to three popular tokenizers. The experiments show that no single tokenization technique is the best choice overall and that performance depends on factors such as dataset size, task type, and morphological richness.
The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is inefficient in terms of vocabulary size. In the literature, many tokenization algorithms have emerged to tackle this problem by creating subwords, which in turn limits the vocabulary size of a given text corpus. Most tokenization techniques are language-agnostic, i.e., they do not incorporate the linguistic features of a given language. Moreover, such techniques are difficult to evaluate in practice. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other popular tokenizers using unsupervised evaluations. In addition, we compare all six tokenizers by evaluating them on three supervised classification tasks: sentiment analysis, news classification, and poem-meter classification, using six publicly available datasets. Our experiments show that none of the tokenization techniques is the best choice overall and that the performance of a given tokenization algorithm depends on many factors, including the size of the dataset, the nature of the task, and the morphological richness of the dataset. However, some tokenization techniques perform better than others across various text classification tasks.
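
To make the subword idea concrete, below is a minimal sketch of byte-pair encoding (BPE), one family of language-agnostic subword tokenizers of the kind the paper compares against. The toy corpus, merge count, and function name are invented for illustration; this is not the paper's Arabic-specific method.

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Each word starts as a sequence of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Frequent pairs such as ('l', 'o') and ('lo', 'w') are merged first,
# so common stems become single tokens and the vocabulary stays small.
print(train_bpe(["low", "low", "lower", "lowest", "newer", "newest"], 4))
```

Because such merges are driven purely by frequency, they need not align with true morpheme boundaries in a morphologically rich language like Arabic, which is the gap the abstract points to when motivating language-aware tokenizers.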
