Article

A survey of multilingual human-tagged short message datasets for sentiment analysis tasks

Journal

SOFT COMPUTING
Volume 22, Issue 24, Pages 8227-8242

Publisher

SPRINGER
DOI: 10.1007/s00500-017-2766-5

Keywords

Sentiment analysis; Dataset; Corpus; Short messages; Multilingual; Twitter; Human-tagged

Funding

  1. Coordination of Improvement of Higher Education, CAPES-Brazil [BEX 2230/15-1]
  2. Andalusian Excellence Projects [P10-SEJ-6768]
  3. Spanish National Project [TIN2013-40658-P]

Today, electronic word-of-mouth (eWOM) statements expressed on blogs, social media or shopping platforms are very frequent and enable customers to share their points of view about the products or services they have acquired. These eWOM statements can be used by industry to improve its products and services, and by customers to make better purchase decisions. Sentiment analysis (SA) techniques can be used to extract and analyze these eWOM statements. Research on SA has advanced considerably in recent years, and its applications in business management have grown exponentially. Automatic techniques (such as machine learning, deep learning and statistical approaches) have been used for this purpose. However, training a machine to process or analyze sentiment is a hard task, mainly because of the complexity of natural language, and the task is even more complicated in multilingual environments. Training datasets, one of the key resources for achieving better results, are still in short supply. They are the reservoir of information used to teach and refine automatic techniques; hence, the higher the quality of the training datasets, the better the predictive power of sentiment analysis. English datasets are relatively easy to find in the literature, whereas datasets in other languages are very scarce. This paper therefore compiles and describes 25 datasets of short messages (statements expressed on social media and shopping platforms) in seven different languages, for the most part from Twitter. To ensure quality, all the resources are human-tagged, and they are currently available to the scientific community. A new English sentiment dataset extracted from Twitter has also been compiled, with each message evaluated subjectively.
The survey thus aims to provide essential, quality information for future research on automatic sentiment analysis in monolingual or multilingual scenarios.
