Journal
KNOWLEDGE AND INFORMATION SYSTEMS
Volume 33, Issue 3, Pages 707-734
Publisher
SPRINGER LONDON LTD
DOI: 10.1007/s10115-012-0535-4
Keywords
Web crawler; Crawl ordering; PageRank; Fractional PageRank
Funding
- National Research Foundation of Korea (NRF)
- Ministry of Education, Science and Technology [2012M3C4A7033344, 2011-0010325]
- National Research Foundation of Korea [2011-0010325] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)
Abstract
Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric. To score URLs, the proposed algorithms use various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment on publicly available data sets to examine the effect of each feature on crawl ordering and to evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by up to a factor of three while improving effectiveness by 5% in cumulative PageRank.
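To make the idea of importance-biased crawl ordering concrete, here is a minimal Python sketch. It is not the paper's FPR-title-host algorithm or RankMass; it is a simplified, hypothetical scheme in which each crawled page passes a damped share of its own score to its uncrawled outlinks, and the frontier URL with the highest accumulated score is fetched next. The function name `crawl_order`, the in-memory `toy_graph`, and the damping value 0.85 are all assumptions made for illustration; a real crawler would fetch and parse pages instead of reading a dictionary.

```python
def crawl_order(graph, seeds, damping=0.85):
    """Return the order in which pages are crawled.

    graph   : dict mapping a URL to the list of URLs it links to
              (hypothetical stand-in for fetching and parsing a page)
    seeds   : list of starting URLs, which share an initial score of 1.0
    damping : fraction of a page's score passed on to its outlinks
              (assumed value, borrowed from the usual PageRank setting)
    """
    # Frontier maps each discovered-but-uncrawled URL to its current score.
    frontier = {url: 1.0 / len(seeds) for url in seeds}
    crawled = []
    seen = set()

    while frontier:
        # Pick the highest-scoring URL; break ties by URL for determinism.
        url = min(frontier, key=lambda u: (-frontier[u], u))
        score = frontier.pop(url)
        crawled.append(url)
        seen.add(url)

        outlinks = graph.get(url, [])
        if not outlinks:
            continue

        # Split the damped score evenly over all outlinks, but only
        # uncrawled targets accumulate it in the frontier.
        share = damping * score / len(outlinks)
        for target in outlinks:
            if target not in seen:
                frontier[target] = frontier.get(target, 0.0) + share

    return crawled


# Toy link structure: "hub" is linked from two pages, "leaf" from one.
toy_graph = {
    "seed": ["a", "b"],
    "a": ["hub"],
    "b": ["hub", "leaf"],
    "hub": ["seed"],
    "leaf": [],
}

order = crawl_order(toy_graph, ["seed"])
print(order)  # ['seed', 'a', 'b', 'hub', 'leaf']
```

In this toy run, `hub` is crawled before `leaf` because it accumulates score from both `a` and `b`, which is the kind of bias toward well-linked pages that the abstract describes. The scheme uses only the partial link structure seen so far; features such as inter-host links, page titles, or topic relevance are not modeled here.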