【工具】Web Scraper 网页爬取全国4000家种子企业信息

1. 简介

Web Scraper是Chrome/Firefox浏览器插件，跨平台使用。

优点：使用简单，无需编程，鼠标点点就可；轻量快速爬取。

缺点：小数据量；不能爬图片；不能中止；整体较慢（网速影响可能不稳定）；爬取结果乱序。

2. 基础

chrome应用商店安装插件需要科学上网，但凡用过chrome插件的都知道安装，这里不讲。

官方教程：https://www.webscraper.io/documentation

中文教程基础用法可看下面这个：

Web Scraper 使用教程（一）- 安装

Web Scraper 使用教程（二）- 基本用法之安装、配置、运行

Web Scraper 使用教程（三）- 基本用法（常用选择器类型）

Web Scraper 使用教程（四）- 进阶用法（同一个页面爬取多个类型内容）

Web Scraper 使用教程（五）- 进阶用法（爬取向下滚动加载页面）

Web Scraper 使用教程（六）- 进阶用法（网址有规律变化进行翻页）

Web Scraper 使用教程（七）- 进阶用法（点击「翻页器」进行翻页）

Web Scraper 使用教程（八）- 进阶用法（点击「更多」进行翻页）

Web Scraper 使用教程（九）- 进阶用法（动态加载进行翻页）

Web Scraper 使用教程（十）- 爬取二级页面的内容

3. 实战

需求

例如，我想要爬取第一种业网中所有种子企业信息(http://www.a-seed.cn/index.php?m=content&c=search&catid=32&dosubmit=1)。

首页内容：

其中一个种子企业的信息：

现在要把每个种子企业的全部信息（二级页面）爬取下来。

操作

查看翻页有连续规律。
ctrl+shift+i或者F12进入开发者模式。
对一级页面进行选择

在二级页面下分别设置选择器

爬取结果（4000家种企信息，按2s一家，也要将近2个半小时才能爬取完，所以不适合大数据）

最后分享下此次爬取设置的sitemap，你可直接导入试试。

{"_id":"company_allinformation","startUrl":["http://www.a-seed.cn/index.php?info%5Btitle%5D=&info%5Bscjypz%5D=&info%5Bdiyu%5D=&info%5Bcatid%5D=32&zhucezb_start=&zhucezb_end=&m=content&c=search&a=init&catid=32&dosubmit=1&page=[1-200]"],"selectors":[{"id":"links","parentSelectors":["_root"],"type":"SelectorLink","selector":"tr:nth-of-type(n+2) a","multiple":true,"delay":0},{"id":"name","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('企业名称') td","multiple":false,"delay":0,"regex":""},{"id":"province","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('地域') td","multiple":false,"delay":0,"regex":""},{"id":"money","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('注册资本') td","multiple":false,"delay":0,"regex":""},{"id":"Main_business","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('主营业务') td","multiple":false,"delay":0,"regex":""},{"id":"variety","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('生产经营品种') td","multiple":false,"delay":0,"regex":""},{"id":"company_type","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('公司类型') td","multiple":false,"delay":0,"regex":""},{"id":"date","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('营业期限') td","multiple":false,"delay":0,"regex":""},{"id":"RegistrationNumber","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('注册号') td","multiple":false,"delay":0,"regex":""},{"id":"registration_authority","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('登记机关') td","multiple":false,"delay":0,"regex":""},{"id":"addr","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('公司地址') td","multiple":false,"delay":0,"regex":""},{"id":"website","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('公司网址') td","multiple":false,"delay":0,"regex":""},{"id":"email","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('E-MAIL') td","multiple":false,"delay":0,"regex":""},{"id":"code","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('邮编') td","multiple":false,"delay":0,"regex":""},{"id":"tel","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('电话') td","multiple":false,"delay":0,"regex":""},{"id":"name_people","parentSelectors":["links"],"type":"SelectorText","selector":"tr:contains('法定代表人') td","multiple":false,"delay":0,"regex":""}]}