1. The Scrapy framework
      Scrapy is a Python framework for writing crawlers: it combines data parsing, data processing, and data storage into a single framework.

  2. Installing Scrapy

  2.1 Install the dependency packages

yum install gcc libffi-devel python-devel openssl-devel -y
yum install libxslt-devel -y
  2.2 Install scrapy
pip install scrapy
pip install twisted==13.1.0

Note: scrapy and twisted have a compatibility issue. If the installed twisted version is too new, running scrapy startproject project_name will fail with an error; installing twisted==13.1.0 fixes it.

3. Crawling data with Scrapy and storing it in CSV

3.1. Crawl target: collect the data for the hot collections on Jianshu, at https://www.jianshu.com/recommendations/collections. The "Hot" tab is the page we need to crawl. The site uses AJAX asynchronous loading; pressing F12 and going to Network → XHR, then paging through the list, reveals the page URL https://www.jianshu.com/recommendations/collections?page=2&order_by=hot. Changing the number after page= gives access to the other pages, as shown below:

[Figure: DevTools Network → XHR view showing the paginated request URL]
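The pagination pattern above can be sketched in Python. The base URL is the one observed in DevTools; the helper name and the default page range are illustrative assumptions:

```python
# Build the paginated AJAX URLs observed in DevTools (the helper name and
# the default page range are illustrative assumptions).
BASE = "https://www.jianshu.com/recommendations/collections?page={}&order_by=hot"

def hot_topic_urls(last_page=10):
    """Return the hot-collections listing URLs for pages 1..last_page."""
    return [BASE.format(page) for page in range(1, last_page + 1)]

print(hot_topic_urls(3)[-1])
# -> https://www.jianshu.com/recommendations/collections?page=3&order_by=hot
```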

3.2. Content to crawl

The fields to crawl for each collection are: collection name, collection description, number of articles included, and number of followers. Scrapy uses XPath to extract the required data. While writing the spider, the XPath expressions can first be tested by hand with lxml's xpath; once the results are confirmed, they can be moved into the Scrapy code. The main difference is that Scrapy requires the extract() function to pull the data out of its selectors.
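For example, an expression can be checked with lxml against a saved or made-up snippet of the page before it goes into the spider. The HTML below is a minimal, hypothetical stand-in for one collection card:

```python
# Hand-check the XPath expressions with lxml before moving them into the
# Scrapy spider. The HTML snippet is a made-up, minimal stand-in for one
# collection card on the real page.
from lxml import etree

html = '''
<div class="col-xs-8">
  <div>
    <a href="/c/demo"><h4> Programming </h4><p> A demo description </p></a>
    <div><a>120篇文章</a>· 3400人关注</div>
  </div>
</div>
'''
tree = etree.HTML(html)
card = tree.xpath('//div[@class="col-xs-8"]')[0]
name = card.xpath('div/a/h4/text()')[0].strip()
article_count = card.xpath('div/div/a/text()')[0].strip().replace('篇文章', '')
followers = card.xpath('div/div/text()')[0].strip().replace('人关注', '').replace('· ', '')
print(name, article_count, followers)
```

In plain lxml the expressions return strings directly; in Scrapy the same expressions return selectors, and extract() is needed to obtain the strings.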

3.3 Create the crawler project

Create the project with scrapy startproject jianshu_hot_topic (the name matches the import paths used below), then define the fields to crawl in items.py:

import scrapy
from scrapy import Field
class JianshuHotTopicItem(scrapy.Item):
    '''
    Inherits from scrapy.Item; this class defines the fields of the data to crawl.
    '''
    collection_name = Field()
    collection_description = Field()
    collection_article_count = Field()
    collection_attention_count = Field()
3.4 spiders/jianshu_hot_topic_spider.py implements the data-extraction logic, using XPath:
#_*_ coding:utf8 _*_
 
import random
from time import sleep
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from jianshu_hot_topic.items import JianshuHotTopicItem
 
class jianshu_hot_topic(CrawlSpider):
    '''
    Crawl Jianshu hot-collection data; extract the target fields from each URL.
    '''
    name = "jianshu_hot_topic"
    start_urls = ["https://www.jianshu.com/recommendations/collections?page=2&order_by=hot"]
 
    def parse(self, response):
        '''
        @params: response; extract the target field values from the response
        '''
        item = JianshuHotTopicItem()
        selector = Selector(response)
        collections = selector.xpath('//div[@class="col-xs-8"]')   
        for collection in collections:
            collection_name = collection.xpath('div/a/h4/text()').extract()[0].strip()
            collection_description = collection.xpath('div/a/p/text()').extract()[0].strip()
            collection_article_count = collection.xpath('div/div/a/text()').extract()[0].strip().replace('篇文章','')
            collection_attention_count = collection.xpath('div/div/text()').extract()[0].strip().replace("人关注",'').replace("· ",'')
            item['collection_name'] = collection_name
            item['collection_description'] = collection_description
            item['collection_article_count'] = collection_article_count
            item['collection_attention_count'] = collection_attention_count
 
            yield item
         
         
        urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(str(i)) for i in range(3,11)]
        for url in urls:
            sleep(random.randint(2,7))
            yield Request(url,callback=self.parse)
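The string clean-up chained after extract() in parse() above can be verified on its own (sample input strings are illustrative):

```python
# Standalone check of the clean-up applied to each extracted string:
# strip whitespace, then drop the "篇文章" (articles) and "人关注"
# (followers) suffixes, plus the "· " separator.
raw_articles = ' 120篇文章 '
raw_followers = '· 3400人关注'
print(raw_articles.strip().replace('篇文章', ''))                      # 120
print(raw_followers.strip().replace('人关注', '').replace('· ', ''))   # 3400
```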
3.5 pipelines.py defines how the data is stored. The storage logic written here can target MySQL, MongoDB, plain files, CSV, Excel, and other media; the example below stores to CSV:
# -*- coding: utf-8 -*-
 
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
 
 
import csv
 
class JianshuHotTopicPipeline(object):
    def process_item(self, item, spider):
        # Append one CSV row per item; the context manager closes the file
        with open('/root/zhuanti.csv', 'a+') as f:
            writer = csv.writer(f)
            writer.writerow((item['collection_name'], item['collection_description'], item['collection_article_count'], item['collection_attention_count']))
        return item
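The pipeline's CSV append step can be sketched standalone, writing to a temporary file instead of /root/zhuanti.csv (the path and row values are illustrative):

```python
# Standalone sketch of the pipeline's CSV append step; the temp-file path
# and the row values are illustrative stand-ins for real crawled items.
import csv
import os
import tempfile

row = ('Programming', 'A demo description', '120', '3400')
path = os.path.join(tempfile.gettempdir(), 'zhuanti_demo.csv')
with open(path, 'a+', newline='') as f:
    csv.writer(f).writerow(row)

# Read the file back: the last row is the one just appended
with open(path, newline='') as f:
    print(list(csv.reader(f))[-1])
# -> ['Programming', 'A demo description', '120', '3400']
```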
3.6 Finally, register the pipeline in the settings file; the spider can then be run with scrapy crawl jianshu_hot_topic.
ITEM_PIPELINES = {
    'jianshu_hot_topic.pipelines.JianshuHotTopicPipeline': 300,
}

Reposted from: CSDN blog