A Study of Zhihu User Distribution

- Preface
- Project Setup
- Modules
  - Spider
  - Database
  - Scheduler
  - Web Service
- Summary

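Preface

Zhihu is no longer what it was in its early days, but its user base is still broad. My plan was to write a spider that collects each user's place of residence, school, major and similar details, persist that data to a database, and finally write a web service that presents it as charts.

The ECharts map part still needs work, though. Even after some debugging, the result is far from ideal. Embarrassing (⊙﹏⊙)b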

Project Setup

As the preface suggests, quite a few technologies are involved here.
Here is a quick look at the project directory.

C:\Users\biao\Desktop\network\code\zhihu-range>tree . /f
Folder PATH listing
Volume serial number is E0C6-0F15
C:\USERS\BIAO\DESKTOP\NETWORK\CODE\ZHIHU-RANGE
│  dbhelper.py
│  scheduler.py
│  spider.py
│  zhihu.db
│  __init__.py
│
├─web
│  │  service.py
│  │  __init__.py
│  │
│  ├─static
│  │      china.js
│  │      echarts.js
│  │      echarts.min.js
│  │      jquery-2.2.4.min.js
│  │
│  └─templates
│          index.html
│
└─__pycache__
        dbhelper.cpython-36.pyc
        spider.cpython-36.pyc

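Modules

Next, let's implement each small module one at a time.

Spider

A few things need attention in the spider:

The authorization field in the request headers; without it no data comes back.
The request rate: adding a random delay between requests noticeably reduces how often the anti-crawler limits kick in.
Fetching the people who follow me: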

https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20

Fetching the people I follow:

https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20

Fetching my own profile:

https://www.zhihu.com/api/v4/members/zhi-ai-89-18?include=locations%2CemploymentsXXXXXXXXXXXX
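
The follower and followee URLs above differ only in the path segment and in the `offset`/`limit` pair, so the per-page URLs can be generated rather than written by hand. The following is a minimal sketch under that assumption; `follow_page_urls`, `API_ROOT` and `PAGE_SIZE` are names made up for this illustration, and `urlencode` may percent-encode a few more characters than the browser-copied URLs do, which decodes to the same query.

```python
import math
from urllib.parse import urlencode

API_ROOT = "https://www.zhihu.com/api/v4/members"
PAGE_SIZE = 20  # matches limit=20 in the URLs above


def follow_page_urls(username, relation, total, include=None):
    """Build one URL per page for `relation` ('followers' or 'followees').

    `total` is the number of followers/followees reported for the user.
    """
    pages = math.ceil(total / PAGE_SIZE)
    urls = []
    for page in range(pages):
        params = {"offset": page * PAGE_SIZE, "limit": PAGE_SIZE}
        if include:
            # urlencode takes care of the percent-encoding of the include list
            params["include"] = include
        urls.append("{}/{}/{}?{}".format(API_ROOT, username, relation, urlencode(params)))
    return urls


if __name__ == "__main__":
    # 45 followers at 20 per page gives three pages: offsets 0, 20 and 40
    for url in follow_page_urls("zhi-ai-89-18", "followers", total=45,
                                include="data[*].answer_count,articles_count,gender"):
        print(url)
```

Computing the page count with `math.ceil(total / 20)` is the same arithmetic the spider uses to decide how many requests to send.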

Once these points are clear, the spider is mostly straightforward. See the code for details.
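
The random-delay idea mentioned above can be sketched in a few lines. This is only an illustration: `polite_get` is a hypothetical helper that is not part of the project, and the `fetch` and `sleep` parameters exist so the pause can be exercised without touching the network.

```python
import random
import time


def polite_get(fetch, url, min_delay=1.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url), then pause for a random interval before returning.

    `fetch` is any callable that takes a URL, for example
    lambda u: requests.get(u, headers=headers).
    """
    result = fetch(url)
    delay = random.uniform(min_delay, max_delay)
    sleep(delay)
    return result


if __name__ == "__main__":
    # A stand-in fetch function, so the sketch runs without network access
    fake_fetch = lambda u: "response for " + u
    print(polite_get(fake_fetch, "https://www.zhihu.com/api/v4/members/zhi-ai-89-18",
                     min_delay=0.1, max_delay=0.3))
```

Randomizing the interval, rather than sleeping a fixed amount, keeps the request pattern less regular; the 1 to 5 second range is simply the one the scheduler later uses.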

# coding: utf8

# @Author: 郭 璞
# @File: spider.py                                                                 
# @Time: 2017/5/22                                   
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Spider that crawls users' location data

import requests
import json
import re
import math

class Spider(object):

    def __init__(self):
        """
        Set up the request headers. An authorization entry is required; without it no data is returned.
        """
        self.headers = {
            'authorization': 'Bearer Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            'Host': 'www.zhihu.com',
            'x-udid': 'ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=',
        }
        self.cookie = {
            'Cookie': 'q_c1=cbf69b836d4645b29f057b71be86c00e|1493896915000|1493896915000; r_cap_id="NWY3YjIzYzlmOTg0NDVhM2FmMzdjNzA1YzY5NTBlYmU=|1494146108|664527b0598db30d7734ff56ea5ac12b17cbe2d8"; cap_id="MWRhOTIzNGYzZDdjNDA3MjhiNTg1MGQ3ZDJlMjQ5NWE=|1494146108|94fc913a73ce89aeb3b60439fdcc69687baf438d"; d_c0="ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=|1494146110"; _zap=c27db1fb-911e-48bd-babe-3b6e66c3e558; _xsrf=55d8c6a475335b06ee3e848612afdd80; aliyungf_tc=AQAAAJ+R5xghJQIAlnF1b59VTAruEEc9; acw_tc=AQAAAGxlvy3TLgIAlnF1bxgpA2LSD8+W; s-q=%E6%A2%81%E5%8B%87; s-i=1; sid=p74htbkp; z_c0=Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a; __utma=155987696.1489589582.1495414813.1495414813.1495414813.1; __utmb=155987696.0.10.1495414813; __utmc=155987696; __utmz=155987696.1495414813.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'
        }


    def parse_homepage(self, username):
        """
        Return the following/follower counts for `username`.
        :param username: the user's url_token
        :return: (following_count, follower_count), or None if the response status is not 200
        """
        # Approach 1 (kept for reference): scrape the counts from the profile page HTML
        # homeurl = "https://www.zhihu.com/people/{}".format(username)
        # response = requests.get(url=homeurl, headers=self.headers)
        # if response.status_code == 200:
        #     followees_number = int(re.findall(re.compile('followingCount&quot;:(\d+),'), response.text)[0])
        #     followers_number = int(re.findall(re.compile('se,&quot;followerCount&quot;:(\d+),'), response.text)[0])
        #     print("following", followees_number)
        #     print("followers", followers_number)
        #     return (followees_number, followers_number)
        # else:
        #     print(response.status_code)

        # Approach 2: query the members API directly
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(
            username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
            return (data['following_count'], data['follower_count'])
        else:
            print(response.status_code)

    def get_location_edu(self, username):
        """
        Return the place of residence, school name and major name for `username`.
        :param username:
        :return:
        """
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
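            try:
                location = data['locations'][0]['name']
            except (KeyError, IndexError, TypeError):
                # "未填写" means "not filled in"
                location = "未填写"

            # School and major
            try:
                school = data['educations'][0]['school']['name']
                major = data['educations'][0]['major']['name']
            except (KeyError, IndexError, TypeError):
                school = "未填写"
                major = "未填写"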

            return (username, location, school, major)


        else:
            print(response.status_code)

    def get_followees(self, username):
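        """
        Return the list of users that `username` follows.
        :param username:
        :return:
        """
        # Fetch the total number of followees first to work out the page range
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        if homeparsed is None:
            # parse_homepage returns None when the request fails
            return []
        followees_number = homeparsed[0]
        pages = math.ceil(followees_number/20)

        # Collect url_tokens in a list; duplicates are removed on return
        followee_result = []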

        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followees?offset={offset}&limit=20'.format(username=username, offset=offset*20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followees = data['data']
                for followee in followees:
                    # print(counter, ":  ", followee['url_token'])
                    followee_result.append(followee['url_token'])
                    counter += 1
            else:
                print(response.status_code)

        # Return the de-duplicated list of users that username follows
        return list(set(followee_result))

    def get_followers(self, username):
        """
        Return the list of users who follow `username`.
        :param username:
        :return:
        """
        # Fetch the total number of followers first to work out the page range
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        if homeparsed is None:
            # parse_homepage returns None when the request fails
            return []
        followers_number = homeparsed[1]
        pages = math.ceil(followers_number / 20)

        # Collect url_tokens in a list; duplicates are removed on return
        follower_result = []

        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followers?offset={offset}&limit=20'.format(
                username=username, offset=offset * 20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followers = data['data']
                for follower in followers:
                    # print(counter, ":  ", follower['url_token'])
                    follower_result.append(follower['url_token'])
                    counter += 1
            else:
                print(response.status_code)

        # Return the de-duplicated list of users who follow username
        return list(set(follower_result))


if __name__ == '__main__':
    spider = Spider()
    # spider.get_followees(username='tianshansoft')
    # spider.parse_homepage(username='zhi-ai-89-18')
    # location = spider.get_location_edu(username='zhi-ai-89-18')
    # print(location)
    # print(spider.parse_homepage(username='tianshansoft'))
    # followee_result = spider.get_followees(username='tianshansoft')
    # print(followee_result)
    # print(len(followee_result))

    followers_result = spider.get_followers(username='tianshansoft')
    print(len(followers_result))
    print(followers_result[:100])
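
The field extraction in `get_location_edu` can be tried offline against a hand-written payload. The sketch below is an illustration rather than the project's code: `first_name` and `extract_profile` are hypothetical helpers, and the sample dictionary is made up to mirror only the keys the spider reads (`locations`, `educations`, `school`, `major`, `name`). Unlike the spider, it falls back separately for school and major, so a profile with a school but no major keeps the school.

```python
PLACEHOLDER = "未填写"  # the same "not filled in" marker the spider stores


def first_name(items, *path):
    """Follow `path` inside the first element of `items`, or return PLACEHOLDER."""
    try:
        node = items[0]
        for key in path:
            node = node[key]
        return node
    except (IndexError, KeyError, TypeError):
        return PLACEHOLDER


def extract_profile(username, data):
    """Return (username, location, school, major) from a decoded API payload."""
    location = first_name(data.get("locations", []), "name")
    school = first_name(data.get("educations", []), "school", "name")
    major = first_name(data.get("educations", []), "major", "name")
    return (username, location, school, major)


if __name__ == "__main__":
    # Made-up payload: a location and a school, but no major
    sample = {
        "locations": [{"name": "Dalian"}],
        "educations": [{"school": {"name": "Dalian University of Technology"}}],
    }
    print(extract_profile("zhi-ai-89-18", sample))
```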

Database

To keep things simple and convenient, the database is sqlite3. The requirements this time are modest, so a single table is enough.

create table user(
    id INTEGER not null  primary key autoincrement,
    username varchar(36) not null,
    location varchar(255),
    school varchar(255),
    major varchar(255)
);

A database helper class is also needed; otherwise the same boilerplate gets rewritten every time, which is pointless.

# coding: utf8

# @Author: 郭 璞
# @File: dbhelper.py                                                                 
# @Time: 2017/5/22                                   
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Helper class for database operations

import sqlite3

class DbConfig(object):

    DATABASE_FILE_PATH = 'zhihu.db'


class DbHelper(object):

    def __init__(self):
        self.conn = sqlite3.connect(DbConfig.DATABASE_FILE_PATH)

    def create_table(self):
        # AUTOINCREMENT is the keyword for an auto-incrementing column.
        sql = """
        create table user(
        id INTEGER not null  primary key autoincrement,
        username varchar(36) not null,
        location varchar(255),
        school varchar(255),
        major varchar(255)
        );
        """
        cursor = self.conn.cursor()
        cursor.execute(sql)
        cursor.close()

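    def add(self, data=()):
        cursor = self.conn.cursor()
        # The column is `username` (see create_table). Placeholders are used instead of
        # string formatting so that quotes in the data cannot break the statement.
        sql = "insert into user(username, location, school, major) values(?, ?, ?, ?);"
        cursor.execute(sql, (data[0], data[1], data[2], data[3]))
        self.conn.commit()
        cursor.close()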

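    def get_data(self):
        cursor = self.conn.cursor()
        sql = "select location, count(location) as numbers from user group by location"
        cursor.execute(sql)
        resultset = cursor.fetchall()
        cursor.close()
        print(resultset)
        # Return the rows as well, so callers such as the web service can use them
        return resultset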



if __name__ == '__main__':
    dbhelper = DbHelper()
    # dbhelper.create_table()
    # data = {
    #     'username': 'zhi-ai-89-18',
    #     'location': 'Dalian',
    #     'school': 'Dalian University of Technology',
    #     'major':'Software',
    # }
    # data = ('tianshansoft', '上海', 'weizhi', 'software')
    # dbhelper.add(data=data)
    dbhelper.get_data()

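This is simple requirement-driven development: all I need is to store data and query it, so the helper class is very small. Functionally, though, it is enough.

Finally, let's look at how Zhihu users are distributed across regions. The SQL statement used is: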

select location, count(location) as numbers from user   group by location ORDER BY numbers DESC

The result:

[Figure: geographic distribution of Zhihu users]
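
The aggregation query can be tried without crawling anything by running it against an in-memory database. This is a sketch for experimentation only: the table definition is the one from this section, while the rows are invented sample data, not crawl results.

```python
import sqlite3

# An in-memory database, so nothing touches zhihu.db
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table user(
        id INTEGER not null primary key autoincrement,
        username varchar(36) not null,
        location varchar(255),
        school varchar(255),
        major varchar(255)
    )
""")

# Invented rows, for illustration only
rows = [
    ("user_a", "Beijing", "School X", "Software"),
    ("user_b", "Beijing", "School Y", "Physics"),
    ("user_c", "Beijing", "School X", "Math"),
    ("user_d", "Shanghai", "School Z", "Software"),
    ("user_e", "Shanghai", "School Z", "Design"),
    ("user_f", "Dalian", "School W", "Software"),
]
# "?" placeholders let sqlite3 handle quoting of the values
conn.executemany(
    "insert into user(username, location, school, major) values (?, ?, ?, ?)", rows)
conn.commit()

distribution = conn.execute(
    "select location, count(location) as numbers from user "
    "group by location order by numbers desc"
).fetchall()
print(distribution)  # [('Beijing', 3), ('Shanghai', 2), ('Dalian', 1)]
conn.close()
```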

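Scheduler

"Scheduler" is a conceptual name here; its job is to glue the spider to the persistence layer. According to the six degrees of separation theory, a social network is one huge connected graph, so in practice a spider cannot reach every user. I settle for crawling a portion instead. Even though it is only a portion, it still works roughly like a random sample, so the difference between the part and the whole should not be large.

Below is a simple scheduling loop ("simple" because it does not de-duplicate users).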

# coding: utf8

# @Author: 郭 璞
# @File: scheduler.py                                                                 
# @Time: 2017/5/22                                   
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Program scheduler that glues the modules together so they work in concert.

import spider
import dbhelper
import time, random

sp = spider.Spider()


entrance = 'ghostcomputing'
queue = [entrance]
container = []
LEVEL = 3

counter = 0
dbhelper = dbhelper.DbHelper()


while queue:
    if counter>=10000:
        break
    else:
        temp = queue.pop(0)
        followees = sp.get_followees(username=temp)
        queue.extend(followees)
        counter += (len(followees)-1)
        # Sleep for a random interval
        timeseed = random.randint(1, 5)
        print('Sleeping for {} seconds'.format(timeseed))
        time.sleep(timeseed)

        # Fetch the details of each user that `temp` follows
        for index, followee in enumerate(followees):
            # container.append(sp.get_location_edu(username=followee))
            data = sp.get_location_edu(username=followee)
            # get_location_edu returns None on a non-200 response; skip those users
            if data is not None:
                dbhelper.add(data=data)
                print('{} fetched'.format(followee))

            # Sleep for a random interval
            if index%28==0:
                timeseed = random.randint(1, 3)
                print('Sleeping for {} seconds'.format(timeseed))
                time.sleep(timeseed)




print(container)
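
The missing de-duplication could be added with a set of already-seen users. The sketch below shows the idea as a breadth-first walk; it is one possible approach, not the author's implementation. `crawl` is a hypothetical function, and `get_followees` stands in for `Spider.get_followees` so the example runs against a tiny made-up follow graph instead of the network.

```python
from collections import deque


def crawl(entrance, get_followees, max_users=10000):
    """Breadth-first walk of the follow graph that visits each user at most once."""
    seen = {entrance}          # every user that has ever been queued
    queue = deque([entrance])  # users still waiting to be visited
    order = []                 # users in the order they were visited
    while queue and len(order) < max_users:
        user = queue.popleft()
        order.append(user)
        for followee in get_followees(user):
            if followee not in seen:
                seen.add(followee)
                queue.append(followee)
    return order


if __name__ == "__main__":
    # Made-up follow graph with cycles: a -> b, c; b -> a, c; c -> a
    fake_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
    print(crawl("a", lambda u: fake_graph.get(u, [])))  # ['a', 'b', 'c']
```

A `deque` is used because `popleft()` is constant time, whereas `list.pop(0)` shifts every remaining element. Without the `seen` set, the cycle between `a` and `b` would keep re-queueing the same users.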

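Web Service

ECharts works best with the front end and back end separated, so serving the chart data through an API endpoint is a good choice. I previously wrote a version with a PHP back end providing the data, and that would work here too. jQuery also makes the front-end side convenient.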
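
For the data side of such an endpoint, the (location, count) rows from the database have to be turned into JSON the chart can read. The sketch below assumes the front end feeds the result to an ECharts series whose `data` is a list of objects with `name` and `value` keys; `to_echarts_series` is a hypothetical helper and the rows are invented.

```python
import json


def to_echarts_series(rows):
    """Turn (location, count) rows into name/value objects, skipping empty names."""
    return [{"name": name, "value": count} for name, count in rows if name]


if __name__ == "__main__":
    # Invented rows in the shape returned by the group-by query
    rows = [("Beijing", 3), ("Shanghai", 2), (None, 1)]
    # ensure_ascii=False keeps Chinese place names readable in the JSON text
    payload = json.dumps(to_echarts_series(rows), ensure_ascii=False)
    print(payload)
    # [{"name": "Beijing", "value": 3}, {"name": "Shanghai", "value": 2}]
```

One thing to check when wiring this up: a map series matches data to regions by `name`, so the names coming out of the database have to agree with the region names used by the map file.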

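This time, though, I want to try Flask, which is more lightweight. One thing to note before using it: once a template engine is involved, the HTML is no longer plain HTML, and the paths to JavaScript and CSS files have to be handled explicitly, otherwise they will not be found.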

Function: url_for("the endpoint name, normally 'static' for static files", filename="the value to appear in src, usually the file's path inside the static folder")

For example, suppose I want <script src="echarts.js">.
Then:

The template is written as: <script src="{{ echarts_path }}">

And on the back end:
echarts_path = url_for('static', filename='echarts.js')
return render_template('index.html', echarts_path=echarts_path)

With that understood, scripts and styles can be applied to our own templates.

http://echarts.baidu.com/echarts2/doc/example/map15.html

All I managed to draw, however, was a map of China...
Screenshot of the result

To do...

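Summary

In review, this covered the spider's retrieval of data from the API endpoints, working with sqlite3, and serving static resources in the web service. The graphical presentation still needs more work.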
