Using semi-structured data for assessing research paper similarity

INFORMATION SCIENCES (2013)

Journal

INFORMATION SCIENCES

Volume 221, Issue -, Pages 245-261

Publisher

ELSEVIER SCIENCE INC

DOI: 10.1016/j.ins.2012.09.044

Keywords

Document similarity; Semi-structured document; Language modeling; Latent Dirichlet Allocation

Abstract

The task of assessing the similarity of research papers is of interest in a variety of application contexts. It is a challenging task, however, as the full text of the papers is often not available, and similarity needs to be determined based on the papers' abstracts and on additional features such as their authors, keywords, and the journals in which they were published. Our work explores several methods to exploit this information, first by using methods based on the vector space model and then by adapting language modeling techniques to this end. In the first case, in addition to a number of standard approaches, we experiment with the use of a form of explicit semantic analysis. In the second case, the basic strategy we pursue is to augment the information contained in the abstract by interpolating the corresponding language model with language models for the authors, keywords, and journal of the paper. This strategy is then extended by revealing the latent topic structure of the collection using an adaptation of Latent Dirichlet Allocation, in which the keywords that were provided by the authors are used to guide the process. Experimental analysis shows that a well-considered use of these techniques significantly improves the results of the standard vector space model approach. © 2012 Elsevier Inc. All rights reserved.
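The interpolation strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the estimation details, so everything concrete here is an assumption: the function names (`unigram_lm`, `interpolate`, `kl_divergence`, `paper_model`), the mixture weights, the use of KL divergence as the comparison measure, and the simplification of building the keyword model from a paper's own keywords only. It is not the authors' implementation.

```python
from collections import Counter
import math


def unigram_lm(tokens):
    """Maximum-likelihood unigram model: word -> probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(models, weights):
    """Linear mixture of unigram models; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    mixed = Counter()
    for model, weight in zip(models, weights):
        for word, prob in model.items():
            mixed[word] += weight * prob
    return dict(mixed)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the support of p; q is floored at eps to avoid log(0)."""
    return sum(
        prob * math.log(prob / max(q.get(word, 0.0), eps))
        for word, prob in p.items()
    )


def paper_model(abstract, keywords, abstract_weight=0.7):
    """Hypothetical paper model: abstract LM mixed with a keyword LM."""
    return interpolate(
        [unigram_lm(abstract.split()), unigram_lm(keywords)],
        [abstract_weight, 1.0 - abstract_weight],
    )


# Toy data, invented purely for illustration.
paper_a = paper_model("topic models for document similarity", ["lda", "similarity"])
paper_b = paper_model("language models for document retrieval", ["lda", "retrieval"])
paper_c = paper_model("protein folding in yeast cells", ["biology"])

# A lower divergence means the two paper models are more alike.
div_ab = kl_divergence(paper_a, paper_b)
div_ac = kl_divergence(paper_a, paper_c)
```

In this toy setup `div_ab` comes out smaller than `div_ac`, because papers A and B share both abstract terms and a keyword. Author and journal models could be added to the same mixture as further entries in `models` and `weights`.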
