Scrapy: Crawling Dynamic Web Pages
On a dynamic web page, JavaScript builds or modifies the HTML at runtime, so the content you want is often not in the raw response Scrapy downloads.
Such pages can be crawled with the Splash rendering engine.
1. Install Splash
Install it with Docker.
Note that if you are on Windows Home Edition, the installation procedure is different; for that case, see Scrapy-Splash的安装(windows篇).
Since I am on the Education edition, installing Docker Desktop directly is enough:
Docker Desktop for Mac and Windows | Docker

Online tutorials are a bit of a mess; the page above worked well for me.
After installing Docker Desktop and restarting, you may be prompted to update WSL 2; just update it from the official page:
在 Windows 10 上安装 WSL | Microsoft Docs
Once it is installed, run the following in PowerShell:
docker run -d -p 8050:8050 scrapinghub/splash
Visiting http://localhost:8050 in a browser should then bring up the Splash page (the first image pull may take quite a while).
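To check from code that the container is actually serving requests, here is a minimal sketch against Splash's /_ping health endpoint (assuming the default port mapping from the command above):

import requests

# /_ping is Splash's health-check endpoint; it responds with an "ok" status when the service is up
resp = requests.get('http://localhost:8050/_ping')
print(resp.status_code, resp.text)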
This blog post is also worth a look:
Splash 简介与安装 - 孔雀东南飞 - 博客园 (cnblogs.com)
Then install scrapy-splash, the Python library that drives Splash from Scrapy:
pip install scrapy-splash
The Splash documentation:
Splash - A javascript rendering service — Splash 3.5 documentation
Service endpoints
render.html renders the page, executing its JavaScript, and returns the resulting HTML; the url parameter is the address to render.
execute runs a user-defined rendering script written in Lua, which can execute JavaScript inside the page; it is called with a POST request whose lua_source field carries the user-defined script.
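To make the two endpoints concrete, here is a minimal sketch that calls them directly with the requests library; the wait value and the Lua script are illustrative assumptions, not requirements:

import requests

SPLASH = 'http://localhost:8050'

# render.html: a GET with the target page in the url parameter returns the rendered HTML
html = requests.get(SPLASH + '/render.html',
                    params={'url': 'http://quotes.toscrape.com/js/'}).text

# execute: POST a user-defined Lua script in the lua_source field;
# extra JSON fields (url here) are exposed to the script as args
script = '''
function main(splash, args)
    splash:go(args.url)
    splash:wait(0.5)
    return splash:html()
end
'''
html2 = requests.post(SPLASH + '/execute',
                      json={'lua_source': script,
                            'url': 'http://quotes.toscrape.com/js/'}).text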
Hands-on
With scrapy-splash installed, configure settings.py (the middleware priority values are the ones recommended in the scrapy-splash README):
import scrapy.downloadermiddlewares.httpcompression
import scrapy_splash
BOT_NAME = 'splash_example'
SPIDER_MODULES = ['splash_example.spiders']
NEWSPIDER_MODULE = 'splash_example.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5
SPIDER_MIDDLEWARES = {
#    'splash_example.middlewares.SplashExampleSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
#    'splash_example.middlewares.SplashExampleDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
SPLASH_URL = "http://localhost:8050"
Requests are then made with scrapy_splash.SplashRequest(); its main parameters are url, args, and cache_args:
import scrapy
import scrapy_splash
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']
    def start_requests(self):
        # Send each start URL through Splash rather than fetching it directly
        for url in self.start_urls:
            yield scrapy_splash.SplashRequest(url)
    def parse(self, response):
        # Each quote on the rendered page sits in a div.quote block
        for sel in response.xpath("//div[@class='quote']"):
            quote = sel.xpath("./span[1]/text()").extract_first()
            author = sel.xpath("string(./span[2])").extract_first()
            yield {
                'quote': quote,
                'author': author
            }
        # Follow the "Next" link, again rendering it through Splash
        href = response.xpath("//li[@class='next']/a/@href").extract_first()
        if href:
            url = response.urljoin(href)
            yield scrapy_splash.SplashRequest(url)
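The spider above never actually passes args or cache_args, so here is a hedged sketch of what that looks like, combined with the execute endpoint from earlier; the spider variant, the wait time, and the Lua script are illustrative assumptions:

import scrapy
import scrapy_splash

lua_script = '''
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    return splash:html()
end
'''

class QuotesJsSpider(scrapy.Spider):   # hypothetical variant of the spider above
    name = 'quotes_js'
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy_splash.SplashRequest(
                url,
                endpoint='execute',                    # default endpoint is render.html
                args={'lua_source': lua_script, 'wait': 0.5},
                cache_args=['lua_source'],             # upload the script once; later requests reuse Splash's cached copy
            )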
And of course, the Splash container in Docker has to be running while the spider crawls.
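If you prefer launching the crawl from a script rather than the scrapy CLI, a minimal sketch using CrawlerProcess, which picks up the settings.py configured above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())   # loads SPLASH_URL, middlewares, etc.
process.crawl('quotes')                            # spider name from the example above
process.start()                                    # blocks until the crawl finishes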
