Resumable Mode — WebCollector Tutorial

by briefcopy · Published September 2, 2016 · Updated December 11, 2016

What is resumable mode?

Resumable mode lets you resume a crawl that has stopped, whether it finished normally or was interrupted unexpectedly. In other words, the crawler picks up from the history data generated by the previously stopped crawl.

Resumable mode is disabled by default. Unless you enable it, the history data, which records which URLs have been fetched successfully and which have not been fetched yet, is deleted at the beginning of the Crawler.start(int round) method, the method used to start the crawler. The restarted crawler therefore ignores the history generated by the previous crawl and fetches webpages that have already been downloaded.

For example, a BreadthCrawler instance stores its history data in a specified folder. In non-resumable mode, that folder is deleted every time you call the Crawler.start(int round) method. As soon as the folder is deleted, a new folder containing empty history data is created in its place to provide history management for the crawler. The BreadthCrawler instance then injects the seeds into the empty history data and starts the iterative crawling process. The history data created by that crawl is cleared the next time the Crawler.start(int round) method is invoked. As a result, the BreadthCrawler instance starts a completely new crawling task every time you call the Crawler.start(int round) method.
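
The folder handling described above can be illustrated with a small toy model. This is only a sketch built on the Java standard library, not WebCollector's actual implementation: the class HistoryFolderSketch, the history.txt file name, and the simplified start(String) signature are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy model of the behaviour described above. NOT WebCollector's implementation.
final class HistoryFolderSketch {
    private final Path crawlPath;      // folder that holds the history data
    private boolean resumable = false; // disabled by default, as in the tutorial

    HistoryFolderSketch(Path crawlPath) {
        this.crawlPath = crawlPath;
    }

    void setResumable(boolean resumable) {
        this.resumable = resumable;
    }

    // Stand-in for start(int round): wipes the folder unless resumable,
    // then records one "fetched" URL in a hypothetical history.txt file.
    void start(String fetchedUrl) {
        try {
            if (!resumable && Files.exists(crawlPath)) {
                deleteRecursively(crawlPath);
            }
            Files.createDirectories(crawlPath);
            Files.write(historyFile(), Collections.singletonList(fetchedUrl),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // URLs recorded so far; empty if no history exists yet.
    List<String> history() {
        try {
            return Files.exists(historyFile())
                    ? Files.readAllLines(historyFile())
                    : Collections.<String>emptyList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path historyFile() {
        return crawlPath.resolve("history.txt"); // hypothetical file name
    }

    // Deletes children before parents by visiting paths in reverse order.
    private static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            List<Path> paths = walk.sorted(Comparator.reverseOrder())
                    .collect(Collectors.toList());
            for (Path p : paths) {
                Files.delete(p);
            }
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"),
                "history-sketch-" + System.nanoTime());
        HistoryFolderSketch crawler = new HistoryFolderSketch(dir);
        crawler.start("http://example.com/a");
        crawler.start("http://example.com/b"); // non-resumable: /a is forgotten
        System.out.println(crawler.history());
        crawler.setResumable(true);
        crawler.start("http://example.com/c"); // resumable: /b is kept
        System.out.println(crawler.history());
    }
}
```

In this model, the second non-resumable start leaves only the most recent URL in the history, while the start made after setResumable(true) appends to what is already there.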

How to enable resumable mode?

To enable resumable mode, call crawler.setResumable(true) before you start the crawling task:


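Crawler crawler;
...
// Enable resumable mode before starting, so the existing history data is kept.
crawler.setResumable(true);
// xxx is a placeholder for the round argument of start(int round).
crawler.start(xxx);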

Notice

There are a few things to mention about resumable mode:

If you invoke the Crawler.start(int round) method in non-resumable mode, all your history data will be deleted. Make sure your crawler is always in resumable mode if you don’t want to lose your history data.
Resumable mode is not applicable to RamCrawler.
Make sure your crawler uses the same crawl path as the previous crawling task.
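
One way to guard against the first and third pitfalls is to inspect the crawl path yourself before starting. The sketch below is a hypothetical helper, not part of WebCollector: the class CrawlPathGuard, its Outcome values, and the assumption that a non-empty folder means history data is present are all illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical pre-flight check; not part of the WebCollector API.
final class CrawlPathGuard {

    enum Outcome {
        FRESH_START,         // no data at the path, non-resumable: nothing to lose
        NOTHING_TO_RESUME,   // resumable, but the path is empty (wrong crawl path?)
        WILL_DELETE_HISTORY, // non-resumable start would wipe existing data
        WILL_RESUME          // resumable and data exists: the crawl can continue
    }

    // Assumption: a non-empty directory means an earlier crawl left data there.
    static boolean hasHistory(Path crawlPath) {
        if (!Files.isDirectory(crawlPath)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(crawlPath)) {
            return entries.findAny().isPresent();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Describes what starting a crawl with these settings would do.
    static Outcome check(Path crawlPath, boolean resumable) {
        boolean history = hasHistory(crawlPath);
        if (resumable) {
            return history ? Outcome.WILL_RESUME : Outcome.NOTHING_TO_RESUME;
        }
        return history ? Outcome.WILL_DELETE_HISTORY : Outcome.FRESH_START;
    }

    public static void main(String[] args) {
        // "crawl" is a placeholder; use the crawl path of your own crawler.
        Path crawlPath = Paths.get(args.length > 0 ? args[0] : "crawl");
        System.out.println("non-resumable start: " + check(crawlPath, false));
        System.out.println("resumable start:     " + check(crawlPath, true));
    }
}
```

A caller could refuse to start when the result is WILL_DELETE_HISTORY, and log a warning on NOTHING_TO_RESUME, since that result often means the crawl path differs from the one used by the previous task.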
