Configuring Nutch, the Java Distributed Crawler: http.content.limit

by briefcopy · Published May 27, 2016 · Updated December 11, 2016

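For most Nutch users, http.content.limit is a parameter that has to be changed (by Nutch convention, overrides go in conf/nutch-site.xml rather than in nutch-default.xml itself). The project's configuration file conf/nutch-default.xml defines the default for http.content.limit as follows: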

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

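The http.content.limit parameter sets the upper bound on the size of a single page, in bytes. The default is 65536 bytes, which is 64 KB; if a page is larger than that, Nutch downloads only its first 64 KB. For a crawler that feeds a search engine, leaving the size of individual pages (resources) unbounded can waste a lot of resources. For example, suppose Nutch discovers a new URL while crawling a site, and that URL points to a 10 GB video that the application has no use for. With http.content.limit set to 65536, Nutch downloads only the first 65536 bytes of the video; without the limit, the crawler would waste 10 GB of bandwidth plus the time needed to download it.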
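The truncation rule described above can be sketched in a few lines of plain Java. This is only a model of the semantics stated in the property description (a value >= 0 truncates, a negative value does not); it is not Nutch's own code, and the class name `ContentLimitSketch` and method `readWithLimit` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: models the documented semantics only, not Nutch's implementation.
class ContentLimitSketch {

    /**
     * Reads from the stream and keeps at most `limit` bytes when limit >= 0.
     * A negative limit means "no truncation": the stream is read to the end.
     */
    static byte[] readWithLimit(InputStream in, int limit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (limit < 0 || out.size() < limit) {
                // Never ask for more bytes than the limit still allows.
                int want = (limit < 0) ? buf.length
                                       : Math.min(buf.length, limit - out.size());
                int n = in.read(buf, 0, want);
                if (n == -1) {
                    break; // end of stream reached before the limit
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = new byte[100_000]; // stand-in for a 100,000-byte response body
        System.out.println(readWithLimit(new ByteArrayInputStream(body), 65536).length); // 65536
        System.out.println(readWithLimit(new ByteArrayInputStream(body), -1).length);    // 100000
        System.out.println(readWithLimit(new ByteArrayInputStream(body), 0).length);     // 0
    }
}
```

In this model the reader stops pulling bytes as soon as the limit is reached, which is why a capped fetch of a very large resource costs only the first 65536 bytes. A limit of 0 keeps nothing at all.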

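Many Nutch users in China use it for precise, structured data collection. Extraction rules based on regular expressions or the DOM tree can pull the right fields out of a page only when the complete page is available, so these users need to configure Nutch to ignore the limit. Setting http.content.limit to a negative number is enough (it must be negative, not 0, because the truncation rule applies to every value >= 0). For example, we can change it to the following: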

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
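To illustrate why extraction rules need the complete page, here is a minimal sketch in which the target field sits past the 65536-byte mark. Everything in it is made up for the demonstration: the class `TruncatedExtractionSketch`, the synthetic page, and the `price` field are assumptions, and `applyLimit` merely mimics the rule from the property description rather than calling Nutch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: the page layout and the "price" field are invented for illustration.
class TruncatedExtractionSketch {

    // Extraction rule: capture the text inside <span class="price">...</span>.
    private static final Pattern PRICE =
            Pattern.compile("<span class=\"price\">([^<]+)</span>");

    /** Builds a synthetic ASCII page whose target field appears only after 70,000+ bytes. */
    static byte[] buildPage() {
        StringBuilder sb = new StringBuilder("<html><body>");
        while (sb.length() < 70_000) {
            sb.append("<p>filler paragraph</p>");
        }
        sb.append("<span class=\"price\">42.00</span></body></html>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Mimics the documented rule: a limit >= 0 truncates, a negative limit does not. */
    static byte[] applyLimit(byte[] body, int limit) {
        return (limit >= 0 && body.length > limit) ? Arrays.copyOf(body, limit) : body;
    }

    /** Returns the extracted price, or null when the rule matches nothing. */
    static String extractPrice(byte[] body) {
        Matcher m = PRICE.matcher(new String(body, StandardCharsets.UTF_8));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = buildPage();
        System.out.println(extractPrice(applyLimit(page, 65536))); // null
        System.out.println(extractPrice(applyLimit(page, -1)));    // 42.00
    }
}
```

With the default limit the rule silently returns nothing, because the field was cut off before the extractor ever saw it; with a negative limit the same rule finds the value.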

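One thing to keep in mind: this parameter governs everything downloaded over the HTTP protocol. That covers not only web pages but also Flash, video, and other files fetched over HTTP.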

Tags: Java open-source crawler, Nutch, distributed crawler
