Notes on "Python Web Crawling and Information Extraction" (12)
Example 4: Stock Data Scrapy Crawler
1. Example Introduction
Functional description
- Goal: fetch the names and trading data of all stocks listed on the Shanghai and Shenzhen stock exchanges
- Output: save the results to a file
- Technical approach: Scrapy
Choosing the data sites
Getting the stock list:
- Eastmoney: http://quote.eastmoney.com/stocklist.html
Getting per-stock data:
- Baidu Gupiao: https://gupiao.baidu.com/stock/
- A single stock: https://gupiao.baidu.com/stock/sz002439.html
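The per-stock URL is the base URL plus an exchange prefix and the 6-digit code. A minimal sketch of that mapping (the prefix rule here — `sh` for codes starting with 6, `sz` otherwise — is a common convention and an assumption, not something stated in the course):

```python
def stock_url(code):
    """Build the per-stock Baidu Gupiao URL from a 6-digit code.

    Assumption: Shanghai (sh) codes start with '6'; others are Shenzhen (sz).
    """
    prefix = 'sh' if code.startswith('6') else 'sz'
    return 'https://gupiao.baidu.com/stock/' + prefix + code + '.html'

print(stock_url('002439'))  # the sz002439 example above
```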
2. Writing the Example
Steps:
- Step 1: create the project and the Spider template
- Step 2: write the Spider
- Step 3: write the Item Pipelines
3. Optimizing the Example
How can the crawl speed of a Scrapy crawler be improved further?
4. Summary
5. Practice
1. Create the project and the Spider template
(py3MachineLearning) D:\Python网络爬虫与信息提取>scrapy startproject teststock
2. Write the Spider
(py3MachineLearning) D:\Python网络爬虫与信息提取>cd teststock
(py3MachineLearning) D:\Python网络爬虫与信息提取\teststock>scrapy genspider stocks baidu.com
Code:
spiders/stocks.py — the site used in the course has since changed, so I switched to a different source for the stock list
# -*- coding: utf-8 -*-
import scrapy
import re


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    # allowed_domains = ['gupiaozhidao.com']
    start_urls = ['http://www.gupiaozhidao.com/quote/allstock.shtml']

    def parse(self, response):
        for stock_code in response.css('a::text'):
            try:
                # findall returns every parenthesised field of the link's
                # string form; on this page [1] is the 6-digit stock code
                stock = re.findall(r"\((.*?)\)", str(stock_code))[1]
                if stock:
                    # Shanghai codes start with '6'; everything else is Shenzhen
                    prefix = 'sh' if stock.startswith('6') else 'sz'
                    url = 'https://gupiao.baidu.com/stock/' + prefix + stock + '.html'
                    yield scrapy.Request(url, callback=self.parse_stock)
            except Exception as e:
                print("Error: {}".format(e))

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # strip the surrounding <dt>...</dt> / <dd>...</dd> markup
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except Exception:
                val = '--'
            infoDict[key] = val
        infoDict.update(
            {'股票名称': re.findall(r'\s.*\(', name)[0].split()[0] +
                         re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict
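The spider's `parse` relies on the fact that `re.findall` with a non-greedy group returns every parenthesised substring in document order, so an index such as `[1]` selects a specific field. A standalone check of that behaviour (the sample string is made up):

```python
import re

# findall with a non-greedy group returns all parenthesised substrings
# in order; an index then picks out one of them
sample = "a(first)b(second)c"
print(re.findall(r"\((.*?)\)", sample))  # ['first', 'second']
```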
3. Write the pipelines
Configure pipelines.py and define the class that processes each Scraped Item
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class TeststockPipeline(object):
    def process_item(self, item, spider):
        return item


class TeststockInfoPipeline(object):
    def open_spider(self, spider):
        self.f = open('BaiduStockInfo.txt', 'w')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item
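Outside Scrapy, the line that `TeststockInfoPipeline.process_item` writes for each item can be reproduced directly (the item values here are made up for illustration):

```python
# Serialise one scraped item the same way the pipeline does:
# one dict literal per line of BaiduStockInfo.txt
item = {'股票名称': 'DemoStock', '最高': '12.40', '最低': '12.01'}
line = str(dict(item)) + '\n'
with open('BaiduStockInfo.txt', 'w', encoding='utf-8') as f:
    f.write(line)

with open('BaiduStockInfo.txt', encoding='utf-8') as f:
    print(f.read().strip())
```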
Configure the ITEM_PIPELINES option in settings.py
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'teststock.pipelines.TeststockInfoPipeline': 300,
}
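The integer in ITEM_PIPELINES is a priority in the 0–1000 range: items pass through the enabled pipelines in ascending order of that value. A toy illustration of the ordering (the second entry and both priorities are hypothetical):

```python
# Lower number = runs earlier; Scrapy sorts enabled pipelines by this value
ITEM_PIPELINES = {
    'teststock.pipelines.TeststockPipeline': 300,
    'teststock.pipelines.TeststockInfoPipeline': 100,  # hypothetical priorities
}
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order[0])  # the 100-valued pipeline would see items first
```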
4. Run the crawler
(py3MachineLearning) D:\Python网络爬虫与信息提取>scrapy crawl stocks
5. Optional optimizations
(1) Configure headers to avoid the site's anti-crawling measures
Configure in settings.py:
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'teststock.middlewares.TeststockDownloaderMiddleware': 543,
    'teststock.middlewares.RandomUserAgentMiddleware': 400,
}
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
# ---------------------
# Author: 后青春诗ing
# Source: CSDN
# Original: https://blog.csdn.net/weixin_42812527/article/details/81366397
# Copyright notice: this is the blogger's original article; please include a link to the original when reposting.
Configure in middlewares.py
from teststock.settings import USER_AGENT_LIST
import random


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Code to rotate the User-Agent at random:
        # rand_use = random.choice(USER_AGENT_LIST)
        # if rand_use:
        #     request.headers.setdefault('User-Agent', rand_use)
        # In practice, switching the User-Agent on every request made the
        # crawler easier to spot, so a single fixed agent is used instead
        request.headers.setdefault('User-Agent', "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24")
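`setdefault` only fills in the header when it is absent, so an explicitly set User-Agent is never overwritten. The rotation logic commented out above boils down to this (the agent strings are stand-ins for the real list):

```python
import random

agents = ["UA-1", "UA-2", "UA-3"]  # stand-ins for USER_AGENT_LIST
headers = {}  # request.headers behaves dict-like in Scrapy

headers.setdefault('User-Agent', random.choice(agents))
print(headers['User-Agent'] in agents)  # True

headers.setdefault('User-Agent', 'UA-other')  # no effect: key already present
print(headers['User-Agent'] in agents)        # still True
```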
(2) Throttle the crawl rate to avoid anti-crawling measures
Configure automatic throttling in settings.py:
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# Reference: disabling cookies and auto-throttling in Scrapy: https://blog.csdn.net/weixin_42260204/article/details/81096459
AUTOTHROTTLE_ENABLED = True
The crawl rate can also be set manually:
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 1
In practice, the automatic throttling worked well enough on its own.
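Per the Scrapy docs, AutoThrottle adjusts the download delay toward the observed latency divided by AUTOTHROTTLE_TARGET_CONCURRENCY, averaging with the current delay and clamping to the configured bounds. A rough sketch of that update rule (the 5 s start and 60 s max defaults are assumed):

```python
def next_delay(current, latency, target_concurrency=1.0,
               min_delay=5.0, max_delay=60.0):
    """Approximate AutoThrottle update: move halfway toward
    latency / target_concurrency, clamped to [min_delay, max_delay]."""
    target = latency / target_concurrency
    return min(max((current + target) / 2.0, min_delay), max_delay)

print(next_delay(5.0, 0.5))   # fast site: stays at the 5 s floor
print(next_delay(5.0, 30.0))  # slow site: delay grows to 17.5 s
```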
(3) Further optimizations
Possible next steps include configuring proxies, crawling dynamic sites, and so on. References:
6. Execution results: