
Crawling Dynamic Pages with Scrapy

Dynamic pages build their content with JavaScript that manipulates the HTML after it loads, so the data you see in the browser may not appear in the raw response.

Such pages can be crawled by rendering them with the Splash rendering engine.

1. Installing Splash

Splash is installed via Docker.

Note that if you are on Windows Home edition, the installation steps are different; you can refer to the post "Scrapy-Splash的安装(windows篇)" (a guide to installing Scrapy-Splash on Windows).

Since I am on the Education edition, installing Docker Desktop directly is enough:

Docker Desktop for Mac and Windows | Docker


Online Docker tutorials are a bit scattered; I found this one quite good:

Docker 教程 | 菜鸟教程 (runoob.com)

After installing Docker Desktop and rebooting, you may be prompted to update WSL 2; just follow the instructions on the official page:

在 Windows 10 上安装 WSL | Microsoft Docs

Once it is installed, run the following in PowerShell:

```bash
docker run -d -p 8050:8050 scrapinghub/splash
```

Once the container is running, opening http://localhost:8050 in a browser should bring up the Splash welcome page (pulling the image for the first time can take quite a while).
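If you would rather check from code than from a browser, here is a minimal sketch using the requests library against Splash's _ping health-check endpoint, assuming the default port 8050 from the command above:

```python
import requests

# Splash exposes a simple health check at /_ping; once the container is
# ready it answers with JSON like {"status": "ok", ...}
resp = requests.get("http://localhost:8050/_ping")
print(resp.status_code, resp.text)
```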

This post is also worth a look:

Splash 简介与安装 - 孔雀东南飞 - 博客园 (cnblogs.com)

Next, install scrapy-splash, the Python library that drives Splash from Scrapy:

```bash
pip install scrapy-splash
```

The official Splash documentation:

Splash - A javascript rendering service — Splash 3.5 documentation

Service endpoints

render.html renders a JavaScript page and returns the resulting HTML; its url parameter is the address of the page to render.
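As a sketch of how this endpoint can be called outside Scrapy (the wait value here is an assumption; tune it to your page):

```python
import requests

# Ask Splash to render a JavaScript page and return the final HTML.
# 'url' is the page to render; 'wait' gives its scripts time to finish.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://quotes.toscrape.com/js/", "wait": 0.5},
)
print(resp.text[:300])  # rendered HTML, not the bare JS skeleton
```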

execute runs a user-defined rendering script written in Lua, which can execute JavaScript inside the page. It is called with a POST request whose lua_source parameter carries the script.
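For example, a minimal call might look like this (the Lua script itself is an illustrative sketch):

```python
import requests

# A user-defined rendering script: load the page, wait for its
# JavaScript, evaluate an expression in the page, return the results.
lua_source = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title = title, html = splash:html()}
end
"""

resp = requests.post(
    "http://localhost:8050/execute",
    json={"lua_source": lua_source, "url": "http://quotes.toscrape.com/js/"},
)
print(resp.json()["title"])
```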

2. Hands-on

With scrapy-splash installed, configure settings.py:

```python
BOT_NAME = 'splash_example'

SPIDER_MODULES = ['splash_example.spiders']
NEWSPIDER_MODULE = 'splash_example.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5

SPIDER_MIDDLEWARES = {
    # 'splash_example.middlewares.SplashExampleSpiderMiddleware': 543,
    # De-duplicates Splash arguments so identical payloads are not re-sent
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    # 'splash_example.middlewares.SplashExampleDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# A dupe filter that takes Splash arguments into account
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Address of the Splash instance started with Docker
SPLASH_URL = "http://localhost:8050"
```
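These class paths and priority values match the scrapy-splash README (which is where the 723/725/810 numbers come from); if you also use Scrapy's HTTP cache, the README additionally recommends HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'.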

Then just issue requests with SplashRequest() instead of plain scrapy.Request().

Its main parameters are url, args, and cache_args.
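Before the full spider below, here is a minimal sketch of how the three parameters fit together; the spider name and the Lua script are illustrative assumptions, not part of the original project:

```python
import scrapy
import scrapy_splash

# A minimal Lua rendering script: load the page and return its final HTML.
LUA = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(0.5)
    return splash:html()
end
"""


class ArgsDemoSpider(scrapy.Spider):
    name = 'args_demo'  # hypothetical spider name, for illustration only

    def start_requests(self):
        # url: the page to render; args: forwarded to the Splash endpoint;
        # cache_args: send the bulky lua_source once, then reference it by
        # hash on later requests instead of re-sending the whole script.
        yield scrapy_splash.SplashRequest(
            'http://quotes.toscrape.com/js/',
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': LUA},
            cache_args=['lua_source'],
        )

    def parse(self, response):
        # response.text holds the HTML returned by the Lua script
        self.logger.info(response.url)
```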

```python
import scrapy
import scrapy_splash


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        # Render every start URL through Splash instead of fetching it directly
        for url in self.start_urls:
            yield scrapy_splash.SplashRequest(url)

    def parse(self, response):
        for sel in response.xpath("//div[@class='quote']"):
            quote = sel.xpath("./span[1]/text()").extract_first()
            author = sel.xpath("string(./span[2])").extract_first()
            yield {
                'quote': quote,
                'author': author,
            }
        # Follow the "Next" link, again rendering it through Splash
        href = response.xpath("//li[@class='next']/a/@href").extract_first()
        if href:
            url = response.urljoin(href)
            yield scrapy_splash.SplashRequest(url)
```

Of course, the Splash container in Docker has to be running while the spider crawls.
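With the container up, start the crawl from the project directory with `scrapy crawl quotes`; adding `-o quotes.json` dumps the yielded items into a JSON file.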
