4.5 Article

Novel approaches to crawling important pages early

Journal

KNOWLEDGE AND INFORMATION SYSTEMS
Volume 33, Issue 3, Pages 707-734

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s10115-012-0535-4

Keywords

Web crawler; Crawl ordering; PageRank; Fractional PageRank

Funding

  1. National Research Foundation of Korea (NRF)
  2. Ministry of Education, Science and Technology [2012M3C4A7033344, 2011-0010325]
  3. National Research Foundation of Korea [2011-0010325] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)

Abstract

Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric. To score URLs, the proposed algorithms use several features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment on publicly available data sets to examine the effect of each feature on crawl ordering and to evaluate the performance of the algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by up to a factor of three while improving effectiveness by 5% in cumulative PageRank.
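The general idea of biasing crawl order toward important pages can be illustrated with a minimal sketch. This is not the paper's Fractional PageRank or FPR-title-host algorithm. It is a generic greedy crawler that, under simplifying assumptions, recomputes PageRank on the partial link graph seen so far and fetches the highest-scoring uncrawled URL next. The names `partial_pagerank`, `crawl_order`, and `fetch_links`, and the parameter values, are hypothetical and chosen only for illustration.

```python
def partial_pagerank(links, damping=0.85, iterations=20):
    """Power-iteration PageRank over the link graph discovered so far.

    `links` maps each crawled URL to the list of URLs it points to.
    URLs that were discovered but not yet crawled have no entry and are
    treated as dangling nodes whose rank is spread evenly over all nodes.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, ())
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: distribute its rank uniformly.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


def crawl_order(seed, fetch_links, budget):
    """Greedy crawl: always fetch the frontier URL with the highest
    PageRank estimate on the partial graph (illustrative assumption)."""
    links = {}          # crawled URL -> list of out-links
    crawled = []        # URLs in the order they were fetched
    frontier = {seed}   # discovered but not yet crawled
    while frontier and len(crawled) < budget:
        scores = partial_pagerank(links)
        # sorted() makes tie-breaking deterministic (lexicographic).
        url = max(sorted(frontier), key=lambda u: scores.get(u, 0.0))
        frontier.remove(url)
        out_links = list(fetch_links(url))
        links[url] = out_links
        crawled.append(url)
        for v in out_links:
            if v not in links:
                frontier.add(v)
    return crawled
```

For example, with a toy graph in which `s` links to `x` and `y`, and `x` links to `y` and `z`, calling `crawl_order("s", lambda u: graph.get(u, []), 4)` fetches `y` before `z`, because `y` has more in-links from already crawled pages. Recomputing PageRank at every step is expensive; reducing this kind of overhead is the sort of cost the abstract refers to when it compares running time against RankMass.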
