Article

Novel approaches to crawling important pages early

KNOWLEDGE AND INFORMATION SYSTEMS (2012)

Journal

KNOWLEDGE AND INFORMATION SYSTEMS

Volume 33, Issue 3, Pages 707-734

Publisher

SPRINGER LONDON LTD

DOI: 10.1007/s10115-012-0535-4

Keywords

Web crawler; Crawl ordering; PageRank; Fractional PageRank

Categories

Computer Science, Artificial Intelligence; Computer Science, Information Systems

Funding

National Research Foundation of Korea (NRF)
Ministry of Education, Science and Technology [2012M3C4A7033344, 2011-0010325]
National Research Foundation of Korea [2011-0010325] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)

Abstract

Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric. To score URLs, the proposed algorithms use various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment on publicly available data sets to examine the effect of each feature on crawl ordering and to evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by up to a factor of three while improving cumulative PageRank by 5%.
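
The idea of biasing crawl order toward important pages using only the partial link structure seen so far can be illustrated with a small sketch. This is not the paper's FPR-title-host algorithm, nor RankMass; it is a simplified, hypothetical example in which each crawled page passes a damped share of its own score to the uncrawled pages it links to, and the frontier is a priority queue keyed on that accumulated score. The names `crawl_order`, `graph`, and `damping`, and the in-memory toy graph standing in for real page fetches, are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def crawl_order(graph, seeds, damping=0.85):
    """Return URLs in the order a score-driven crawler would fetch them.

    graph:   dict mapping a URL to the list of URLs it links to
             (a toy stand-in for downloading and parsing a page).
    seeds:   starting URLs; they share an initial score of 1.0.
    damping: fraction of a page's score passed on to its out-links.
    """
    seeds = list(seeds)
    score = defaultdict(float)
    for s in seeds:
        score[s] += 1.0 / len(seeds)

    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier = [(-score[s], s) for s in seeds]
    heapq.heapify(frontier)

    crawled = []
    done = set()
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in done:
            continue  # stale queue entry for a page already crawled
        done.add(url)
        crawled.append(url)

        out_links = graph.get(url, [])
        for target in out_links:
            if target in done:
                continue
            # Split the damped score evenly over the page's out-links.
            score[target] += damping * score[url] / len(out_links)
            heapq.heappush(frontier, (-score[target], target))
    return crawled


# Hypothetical toy graph: "q" is linked from both "b" and "c", "p" only from "b".
toy_graph = {"a": ["b", "c"], "b": ["p", "q"], "c": ["q"]}
print(crawl_order(toy_graph, ["a"]))  # ['a', 'b', 'c', 'q', 'p']
```

In this toy run, a breadth-first crawler would fetch "p" before "q" because "p" is discovered first, whereas the score-driven frontier fetches "q" first because it has accumulated score from two in-links. The features the abstract mentions beyond link structure (inter-host links, page titles, topic relevance) are not modeled here.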
