A distributed crawler example with scrapy-redis


1. Configure an ordinary Scrapy spider
(1). Create the project
scrapy startproject tdxweb
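
The command generates the standard Scrapy project skeleton; the files edited in the steps below all live under tdxweb/tdxweb/:

tdxweb/
    scrapy.cfg
    tdxweb/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py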

(2). Enter the project directory
cd tdxweb

(3). Generate the spider script
scrapy genspider crawl "https://www.tdx.com.cn/article/category/notice.html"

(4). Edit the generated spider crawl.py
import scrapy

class CrawlSpider(scrapy.Spider):
    name = "crawl"
    allowed_domains = ["www.tdx.com.cn"]
    start_urls = ["https://www.tdx.com.cn/article/category/notice.html"]

    def parse(self, response):
        # Each notice is an <li> inside ul.article-right-group
        product_list = response.css('ul.article-right-group')
        for article in product_list.css('li'):
            item_name = article.css('a::text').get()
            item_date = article.css('span::text').get()
            if item_name is not None and item_date is not None:
                print(item_name + '|' + item_date)

(5). Run the spider
scrapy crawl crawl
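
If the run prints nothing, the CSS selectors can be verified interactively before going further. A quick check with scrapy shell, using the same selectors as parse() above (assuming the page is reachable from your machine):

scrapy shell "https://www.tdx.com.cn/article/category/notice.html"
>>> response.css('ul.article-right-group li a::text').getall()
>>> response.css('ul.article-right-group li span::text').getall()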


2. Convert the spider to scrapy-redis
(1). Edit items.py
import scrapy

class TdxwebItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    date = scrapy.Field()

(2). Edit settings.py
BOT_NAME = "tdxweb"
SPIDER_MODULES = ["tdxweb.spiders"]
NEWSPIDER_MODULE = "tdxweb.spiders"

# Core distributed settings
# Use the scrapy-redis scheduler (replaces the default scheduler)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Use the scrapy-redis duplicate filter (replaces the default dupefilter)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Redis connection settings (adjust to your environment)
REDIS_HOST = '192.168.4.171'  # Redis server IP
REDIS_PORT = 6379             # Redis port
REDIS_DB = 0                  # Redis database number
# REDIS_PARAMS = {'password': 'your_redis_password'}  # add if Redis requires a password

# Scheduler persistence (do not clear the request queue in Redis when the crawl ends)
SCHEDULER_PERSIST = True

# Concurrency and delay (anti-scraping courtesy)
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 2  # seconds

# Item pipelines (optional, for storing data)
ITEM_PIPELINES = {
    # Store items in Redis (optional)
    'scrapy_redis.pipelines.RedisPipeline': 300,
    # Custom pipeline (e.g. store to MySQL/MongoDB)
    # 'tdxweb.pipelines.TdxwebPipeline': 400,
}

# Logging
LOG_LEVEL = 'INFO'
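
Before launching any node it is worth confirming that each crawler machine can actually reach the Redis server, for example with redis-cli (adjust the host and port to match the settings above):

redis-cli -h 192.168.4.171 -p 6379 ping
# a healthy server replies: PONG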

(3). Edit crawl.py
from scrapy_redis.spiders import RedisSpider
from ..items import TdxwebItem

class CrawlSpider(RedisSpider):
    name = "crawl"
    # allowed_domains = ["www.tdx.com.cn"]
    # start_urls = ["https://www.tdx.com.cn/article/category/notice.html"]
    # Instead of start_urls, the spider pulls its start URLs from the
    # shared Redis scheduler under this key
    redis_key = 'tdx:start_urls'

    def parse(self, response):
        product_list = response.css('ul.article-right-group')
        for article in product_list.css('li'):
            item_title = article.css('a::text').get()
            item_date = article.css('span::text').get()
            if item_title is not None and item_date is not None:
                item = TdxwebItem()
                item['title'] = item_title
                item['date'] = item_date
                yield item

(4). Run the spider
scrapy crawl crawl
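
Unlike a plain spider, a RedisSpider starts up and then sits idle, waiting for URLs to appear under its redis_key. Run the command above on every crawler node, then seed the queue once from any machine that can reach Redis, for example with redis-cli (the key must match redis_key in crawl.py):

redis-cli -h 192.168.4.171 -p 6379 lpush tdx:start_urls "https://www.tdx.com.cn/article/category/notice.html"

All idle nodes then compete for requests from the shared queue, and the RFPDupeFilter ensures each URL is fetched by only one node. With RedisPipeline enabled, scraped items are serialized to JSON and stored in Redis under the key crawl:items by default (<spider name>:items), where they can be inspected with lrange crawl:items 0 -1. Because SCHEDULER_PERSIST is True, the dupefilter also survives restarts; delete its key (del crawl:dupefilter) if you need to re-crawl the same URLs.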