Getting Started with WebCollector 2.x, an Open-Source Java Crawler Framework: Basic Concepts

WebCollector is a Java crawler framework (kernel) that requires no configuration and is easy to build on. It provides a streamlined API with which a powerful crawler can be written in only a few lines of code. WebCollector-Hadoop is the Hadoop edition of WebCollector and supports distributed crawling.

WebCollector is currently maintained on GitHub: https://github.com/CrawlScript/WebCollector

1. How WebCollector Differs from Traditional Web Crawlers

Traditional web crawlers tend to download entire sites, with the goal of mirroring a site's content locally; their smallest unit of data is a single web page or file. WebCollector, by contrast, can perform targeted collection through configurable crawl policies and can extract structured information from the pages it visits.

2. How WebCollector Differs from HttpClient and Jsoup

WebCollector is a crawler framework, HttpClient is an HTTP request component, and Jsoup is an HTML parser (with built-in HTTP request support).

Some programmers collect data by calling HttpClient and Jsoup iteratively or recursively in a single thread. This can get the job done, but it has two significant problems:

1) A single thread is slow; a multithreaded crawler is far faster than a single-threaded one.

2) You have to write your own task maintenance machinery, including URL deduplication and resumable crawling (that is, recovering from unexpected interruptions).

The WebCollector framework ships with multithreading and URL maintenance built in, so when writing a crawler you do not have to think about thread pools, URL deduplication, or resumable crawling.
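For contrast, below is a minimal sketch of the hand-rolled, single-threaded approach described above. The seed URL, the link filter, and the 100-page cap are placeholders chosen only for illustration; the point is that the frontier queue, URL deduplication, error handling, and resumption all have to be written and maintained by hand, which is exactly what the examples in section 6 delegate to the framework.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class NaiveJsoupCrawler {

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>();   // URLs waiting to be fetched
        Set<String> visited = new HashSet<>();         // hand-written URL deduplication
        frontier.add("https://blog.github.com/");      // placeholder seed

        while (!frontier.isEmpty() && visited.size() < 100) {   // arbitrary page limit
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                               // already fetched, skip
            }
            try {
                Document doc = Jsoup.connect(url).get();   // Jsoup issues the HTTP request itself
                System.out.println(doc.title() + "  " + url);
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (next.startsWith("https://blog.github.com/")) {   // placeholder link filter
                        frontier.add(next);
                    }
                }
            } catch (Exception e) {
                // naive error handling: no retry, no resumable state, failures are simply skipped
                System.err.println("failed: " + url);
            }
        }
        // single-threaded, and nothing is persisted: an interruption loses all progress
    }
}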

3. What Scale WebCollector Can Handle

WebCollector currently comes in a single-machine edition and a Hadoop edition (WebCollector-Hadoop). The single-machine edition can handle URLs on the order of tens of millions, which is enough for most focused data-collection tasks. WebCollector-Hadoop scales beyond the single-machine edition; how far depends on the size of the cluster.

4. How WebCollector Traverses Pages

WebCollector uses a rough breadth-first traversal, but this traversal has nothing to do with the topological tree structure of the site, and you do not need to care about the traversal order.

When a crawler visits a page, it discovers new URLs on that page and continues crawling from them. WebCollector offers two mechanisms for discovering new URLs: automatic parsing and manual parsing. For the details of both mechanisms, read the code comments in the examples later in this article.

5. New Features in WebCollector 2.x

  • 1) Custom traversal policies, which make it possible to handle more complex crawling tasks such as pagination and AJAX.
  • 2) Extra information (MetaData) can be attached to every URL; with it you can implement many complex tasks such as tracking crawl depth, anchor text, and referring pages, passing POST parameters, and doing incremental updates (see the sketch after this list).
  • 3) A plugin mechanism; WebCollector ships with two built-in plugin sets.
  • 4) A built-in in-memory plugin (RamCrawler) that depends on neither the file system nor a database, suited to one-off crawls such as querying search engines in real time.
  • 5) A built-in plugin based on Berkeley DB (BreadthCrawler), suited to long-running, large-scale tasks; it supports resumable crawling, so a crash or shutdown does not cause data loss.
  • 6) Selenium integration for extracting content generated by JavaScript.
  • 7) HTTP requests are easy to customize, and random switching among multiple proxies is built in; simulated login can be implemented through custom HTTP requests.
  • 8) slf4j is used as the logging facade, so it can be wired to a variety of logging implementations.
  • 9) The new WebCollector-Hadoop supports distributed crawling.
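To make item 2 concrete, here is a brief sketch of attaching MetaData to URLs and reading it back. This is only an illustrative sketch: the keys "depth" and "refer" are arbitrary names invented for this example, and the chained meta(...) calls are assumed to behave like the .type(...) calls shown in section 6; check the exact method signatures against your WebCollector 2.x version.

import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

public class MetaDataSketch extends BreadthCrawler {

    public MetaDataSketch(String crawlPath) {
        // autoParse is false: new URLs are added manually in visit()
        super(crawlPath, false);
        // attach MetaData to the seed; "depth" is an arbitrary key chosen for this sketch,
        // and meta(key, value) is assumed to chain like the type(...) setter in section 6
        addSeedAndReturn("https://blog.github.com/").meta("depth", "0");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        // read back the MetaData that was attached when this URL was added
        int depth = Integer.parseInt(page.meta("depth"));
        System.out.println("depth=" + depth + "  refer=" + page.meta("refer") + "  url=" + page.url());

        if (depth < 2) {
            for (String link : page.links("a[href]")) {
                // every newly discovered URL carries its own MetaData:
                // its crawl depth and the page it was found on (its "refer" page)
                next.add(new CrawlDatum(link)
                        .meta("depth", String.valueOf(depth + 1))
                        .meta("refer", page.url()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        MetaDataSketch crawler = new MetaDataSketch("crawl_meta_sketch");
        crawler.start(3);
    }
}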

6. Example: Crawling a News Site with WebCollector

Automatic URL discovery

AutoNewsCrawler.java:


import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

/**
 * Crawling news from the GitHub blog
 *
 * @author hu
 */
public class AutoNewsCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
     *                  links which match the regex rules from each page
     */
    public AutoNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start pages*/
        this.addSeed("https://blog.github.com/");
        for (int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl);
        }

        /*fetch urls like "https://blog.github.com/2018-07-13-graphql-for-octokit/" */
        this.addRegex("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}-[^/]+/");
        /*do not fetch jpg|png|gif*/
        //this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch urls containing "#"*/
        //this.addRegex("-.*#.*");

        setThreads(50);
        getConf().setTopN(100);

        //enable resumable mode
        //setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        /*if the page is a news page*/
        if (page.matchUrl("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}-[^/]+/")) {
            /*extract the title and content of the news item by css selector*/
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);

            /*If you want to add urls to crawl, add them to next*/
            /*WebCollector automatically filters links that have been fetched before*/
            /*If autoParse is true and a link you add to next does not match the regex rules, that link will also be filtered.*/
            //next.add("http://xxxxxx.com");
        }
    }

    public static void main(String[] args) throws Exception {
        AutoNewsCrawler crawler = new AutoNewsCrawler("crawl", true);
        /*start the crawl with a depth of 4*/
        crawler.start(4);
    }

}

Manual URL discovery

ManualNewsCrawler.java:


import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

/**
 * Crawling news from the GitHub blog
 *
 * @author hu
 */
public class ManualNewsCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
     *                  links which match the regex rules from each page
     */
    public ManualNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        // add 5 start pages and set their type to "list"
        // "list" is not a reserved word, you can use another string instead
        this.addSeedAndReturn("https://blog.github.com/").type("list");
        for (int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl, "list");
        }

        setThreads(50);
        getConf().setTopN(100);

        //enable resumable mode
        //setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();

        if (page.matchType("list")) {
            /*if the type is "list"*/
            /*detect content pages by css selector and mark their type as "content"*/
            next.add(page.links("h1.lh-condensed>a")).type("content");
        } else if (page.matchType("content")) {
            /*if the type is "content"*/
            /*extract the title and content of the news item by css selector*/
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");

            //read title_prefix and content_length_limit from the configuration
            title = getConf().getString("title_prefix") + title;
            content = content.substring(0, getConf().getInteger("content_length_limit"));

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
        }
    }

    public static void main(String[] args) throws Exception {
        ManualNewsCrawler crawler = new ManualNewsCrawler("crawl", false);

        crawler.getConf().setExecuteInterval(5000);

        crawler.getConf().set("title_prefix", "PREFIX_");
        crawler.getConf().set("content_length_limit", 20);

        /*start the crawl with a depth of 4*/
        crawler.start(4);
    }

}
