A Hidden Topic-Based Framework toward Building Applications with Short Web Documents

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2011)

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Volume 23, Issue 7, Pages 961-976

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TKDE.2010.27

Keywords

Web mining; hidden topic analysis; sparse data; classification; matching; ranking; contextual advertising

Funding

Japan Society for Promotion of Science (JSPS) [P06366]

Abstract

This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework addresses two main challenges posed by such documents: 1) data sparseness and 2) synonyms/homonyms. The former leads to a lack of shared words and contexts among documents, while the latter are major linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea of the framework is that common hidden topics discovered from large external data sets (universal data sets), when added to short documents, make them less sparse and more topic-oriented. Hidden topics from universal data sets also help handle unseen data better. The proposed framework can be applied to different natural languages and data domains. We evaluated the framework in two experiments on two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets, and we achieved significant results.
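
The core idea in the abstract, enriching a short document with hidden topics so that it shares more features with related documents, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `TOPIC_WORD` table is a hypothetical stand-in for a topic model estimated on a universal data set, the scoring in `topic_distribution` is a crude substitute for proper topic inference, and the names `augment`, `overlap`, and the `threshold` value are invented here.

```python
from collections import Counter

# Hypothetical topic-word probabilities, standing in for a topic model
# estimated beforehand on a large external ("universal") data set.
TOPIC_WORD = {
    "topic_sports": {"match": 0.4, "team": 0.3, "goal": 0.2, "league": 0.1},
    "topic_finance": {"bank": 0.4, "stock": 0.3, "market": 0.2, "goal": 0.1},
}


def topic_distribution(tokens, topic_word, smoothing=1e-3):
    """Crude stand-in for topic inference: sum word probabilities per topic,
    add a small smoothing constant, and normalize to a distribution."""
    scores = {
        topic: sum(word_probs.get(tok, 0.0) for tok in tokens) + smoothing
        for topic, word_probs in topic_word.items()
    }
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


def augment(tokens, topic_word, threshold=0.3):
    """Append the name of every sufficiently probable topic as an extra token."""
    dist = topic_distribution(tokens, topic_word)
    extra = [topic for topic, p in sorted(dist.items()) if p >= threshold]
    return tokens + extra


def overlap(a, b):
    """Number of tokens two documents share (multiset intersection size)."""
    return sum((Counter(a) & Counter(b)).values())


# Two short snippets about the same subject that share no words.
snippet_a = ["team", "league"]
snippet_b = ["match", "goal"]

aug_a = augment(snippet_a, TOPIC_WORD)
aug_b = augment(snippet_b, TOPIC_WORD)

print(overlap(snippet_a, snippet_b))  # shared tokens before augmentation
print(overlap(aug_a, aug_b))          # shared tokens after augmentation
```

In this toy setting the two snippets have no words in common, yet both receive the same topic token after augmentation, so a downstream classifier or matcher has a shared feature to work with. A real system would replace the hand-written table and scoring with a trained topic model and its inference procedure.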
