Python crawler frameworks: a simple Scrapy tutorial



The Scrapy framework

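Installing on Windows

    - Twisted is a dependency; if pip cannot install it, download a matching .whl file and install that instead
    - pip3 install wheel
    - pip3 install ****.whl
    - pip3 install pywin32
    - pip3 install scrapy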

For comparison, the Django commands for creating a project and an app:
        django-admin startproject mysite
        cd mysite
        python manage.py startapp app01
        
    
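The Scrapy equivalents:

    # Create a project
    scrapy startproject sp1

    sp1
        - sp1
            - spiders/            spiders directory
            - middlewares.py      middlewares
            - items.py            data storage templates used to structure scraped data, similar to a Django model
            - pipelines.py        persistence
            - settings.py         project settings
        - scrapy.cfg              main project configuration

    # Create a spider
    cd sp1
    scrapy genspider example example.com

    # Run a spider by name, from inside the project directory
    scrapy crawl example
    scrapy crawl example --nolog    # without log output

    # List all spiders in the project
    scrapy list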
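
To make the roles of items.py and pipelines.py more concrete, here is a minimal sketch. It assumes a reasonably recent Scrapy version, which accepts dataclass objects as items and calls open_spider, process_item and close_spider on each enabled pipeline. The names ArticleItem and JsonLinesPipeline, the item fields and the output path are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, asdict


# items.py: a template describing one scraped record (hypothetical fields)
@dataclass
class ArticleItem:
    title: str
    url: str


# pipelines.py: persists every item the spider yields
class JsonLinesPipeline:
    def __init__(self, path="items.jl"):  # the output path is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item; return the item so later pipelines receive it
        self.file.write(json.dumps(asdict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()
```

A pipeline like this would be enabled in settings.py through the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {"sp1.pipelines.JsonLinesPipeline": 300}.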

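Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:
[Figure: Scrapy architecture diagram (image not available)]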
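
The way data moves between Scrapy's components (engine, scheduler, downloader, spider, item pipeline) can be illustrated with a toy, synchronous model. This is only a sketch of the idea, not how Scrapy is implemented: the real engine is asynchronous and built on Twisted, and every function name and the FAKE_PAGES data below are invented for illustration.

```python
from collections import deque

# Stand-in for the network: URL -> links found on that page (invented data)
FAKE_PAGES = {
    "/": ["/page/1", "/page/2"],
    "/page/1": ["/page/2"],
    "/page/2": [],
}


def download(url):
    # Downloader: turn a request into a "response"
    return {"url": url, "links": FAKE_PAGES.get(url, [])}


def parse(response):
    # Spider: yield one item for the page, then follow-up requests
    yield {"item": response["url"]}
    for link in response["links"]:
        yield {"request": link}


def crawl(start_url):
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    seen = {start_url}              # duplicate filter
    items = []                      # stands in for the item pipeline
    while scheduler:                # Engine: moves data between the components
        url = scheduler.popleft()
        for result in parse(download(url)):
            if "item" in result:
                items.append(result["item"])
            elif result["request"] not in seen:
                seen.add(result["request"])
                scheduler.append(result["request"])
    return items


print(crawl("/"))  # ['/', '/page/1', '/page/2']
```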

# A first try
import hashlib

import scrapy
from scrapy.http.request import Request


class DigSpider(scrapy.Spider):
    name = "dig"  # the spider is started by this name: scrapy crawl dig
    allowed_domains = ["chouti.com"]
    start_urls = ['http://dig.chouti.com/', ]
    has_request_set = {}  # maps md5(url) -> url for pages already requested

    # When a browser is used to render pages:
    #   - in a downloader middleware's process_response method, get the browser object (spider.bro)
    #   - read the rendered page source from bro.page_source
    #   - pass that source as the body argument of HtmlResponse(url, body, request, encoding)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Create the browser object here and keep it on the spider
        self.bro = None

    def closed(self, reason):
        # Close the browser object here
        pass

    def parse(self, response):
        # response.xpath() replaces the deprecated HtmlXPathSelector(response).select()
        page_list = response.xpath(
            r'//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href'
        ).extract()
        for page in page_list:
            page_url = 'http://dig.chouti.com%s' % page
            key = self.md5(page_url)
            if key in self.has_request_set:
                continue  # already requested, skip it
            self.has_request_set[key] = page_url
            yield Request(url=page_url, method='GET', callback=self.parse)

    @staticmethod
    def md5(val):
        # Hex MD5 digest of a string, used as a compact key for a URL
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        return ha.hexdigest()
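
The has_request_set bookkeeping in parse can be isolated into a small helper. The sketch below uses only the standard library; the names url_fingerprint and UrlDeduper are hypothetical. Scrapy's scheduler already filters duplicate requests by default (unless a request is created with dont_filter=True), so manual tracking like this mainly serves to show the idea.

```python
import hashlib


def url_fingerprint(url):
    # Hex MD5 digest of the URL, used as a fixed-length key
    return hashlib.md5(url.encode("utf-8")).hexdigest()


class UrlDeduper:
    def __init__(self):
        self.seen = {}  # fingerprint -> url, like has_request_set in the spider

    def is_new(self, url):
        # Return True the first time a URL is seen, False on repeats
        key = url_fingerprint(url)
        if key in self.seen:
            return False
        self.seen[key] = url
        return True


deduper = UrlDeduper()
urls = [
    "http://dig.chouti.com/all/hot/recent/1",
    "http://dig.chouti.com/all/hot/recent/2",
    "http://dig.chouti.com/all/hot/recent/1",  # repeat
]
to_request = [u for u in urls if deduper.is_new(u)]
print(to_request)  # only the first two URLs remain
```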

Selector, which replaces the deprecated HtmlXPathSelector, provides HTML parsing similar to BeautifulSoup.
Usage:

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="link.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="link1.html">second item</a></div> 
    </body>
</html>
"""
# // selects all descendants; / selects direct children only
# Wrap the string in a response object first
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# Selector objects
# hxs = HtmlXPathSelector(response)  # deprecated, use Selector instead
# hxs = Selector(response).xpath("//a")  # all a tags
# hxs = Selector(response).xpath("//a[@id]")  # all a tags that have an id attribute
# hxs = Selector(response).xpath("//a[@id='i1']")  # all a tags whose id is i1
# hxs = Selector(response).xpath('//a[@href="link.html"][@id="i1"]')  # several attribute conditions combined
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')  # href contains "link"
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')  # href starts with "link"
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]')  # regex match, written with re:test
# Use extract() to turn the selected nodes into strings
# hxs = Selector(response).xpath("//a/text()").extract()  # text of all a tags, as a list
# hxs = Selector(response).xpath('//a/@href').extract_first()  # href of the first a tag
# print(hxs)

# On a real listing page, select the items first, then query relative to each item.
# The sample HTML above has no such div, so this loop matches nothing here.
item_list = response.xpath("//div[@id='content-list']/div[@class='item']")
for item in item_list:
    text = item.xpath('.//a/text()').extract_first()
    href = item.xpath('.//a/@href').extract_first()
    print(href, (text or '').strip())  # extract_first() returns None when nothing matches
