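Preface
Zhihu is no longer what it was in its early days, but its user base is still very broad. My original plan was to write a crawler that collects each user's place of residence, education, major, and similar information, persist that data to a database, and finally write a web service that displays it as charts.
The ECharts map part, however, still needs work. Even after some debugging, the result is far from ideal. How embarrassing (⊙﹏⊙)b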
Setting up the framework
As the preface suggests, quite a few technologies are involved here.
Here is a brief look at the project directory.
C:\Users\biao\Desktop\network\code\zhihu-range>tree . /f
Folder PATH listing
Volume serial number is E0C6-0F15
C:\USERS\BIAO\DESKTOP\NETWORK\CODE\ZHIHU-RANGE
│  dbhelper.py
│  scheduler.py
│  spider.py
│  zhihu.db
│  __init__.py
│
├─web
│  │  service.py
│  │  __init__.py
│  │
│  ├─static
│  │      china.js
│  │      echarts.js
│  │      echarts.min.js
│  │      jquery-2.2.4.min.js
│  │
│  └─templates
│          index.html
│
└─__pycache__
        dbhelper.cpython-36.pyc
        spider.cpython-36.pyc
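Modules
Next, let's implement each small module one at a time.
Spider
There are a few things to watch out for in the spider.
- The authorization field in the request headers.
- Request frequency: adding a random delay between requests noticeably reduces how often the anti-crawler limits kick in.
- Getting information about the people who follow me: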
https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20
- Getting information about the people I follow:
https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20
- Getting my own information:
https://www.zhihu.com/api/v4/members/zhi-ai-89-18?include=locations%2CemploymentsXXXXXXXXXXXX
Once these points are clear, the spider itself poses few problems. See the code below for details.
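The two precautions above (sending the authorization header and spacing requests out with a random delay) can be sketched as a small helper. This is a minimal sketch rather than the code used in this post: build_headers and polite_get are hypothetical names, the delay bounds are arbitrary, and the fetch function is passed in as a parameter so the helper does not depend on any particular HTTP library.

```python
import random
import time


def build_headers(token):
    """Build request headers that carry the authorization value.

    `token` is whatever value the site issued to the logged-in session.
    """
    return {
        'authorization': 'Bearer {}'.format(token),
        'User-Agent': 'Mozilla/5.0',
    }


def polite_get(fetch, url, headers, min_delay=1.0, max_delay=5.0, sleep=time.sleep):
    """Wait a random interval, then call fetch(url, headers=headers).

    `fetch` is any callable with that shape (an assumption of this sketch);
    `sleep` is injectable so the delay can be observed without waiting.
    """
    delay = random.uniform(min_delay, max_delay)
    sleep(delay)
    return fetch(url, headers=headers)
```

With the requests library, a call would look like polite_get(requests.get, url, build_headers(token)), since requests.get accepts a headers keyword argument.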
# coding: utf8
# @Author: 郭 璞
# @File: spider.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Spider that crawls location data
import requests
import json
import re
import math


class Spider(object):
    def __init__(self):
        """
        Initialize the request headers. An authorization value is required; without it no data is returned.
        """
        self.headers = {
'authorization': 'Bearer Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
'Host': 'www.zhihu.com',
'x-udid': 'ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=',
        }
        self.cookie = {
'Cookie': 'q_c1=cbf69b836d4645b29f057b71be86c00e|1493896915000|1493896915000; r_cap_id="NWY3YjIzYzlmOTg0NDVhM2FmMzdjNzA1YzY5NTBlYmU=|1494146108|664527b0598db30d7734ff56ea5ac12b17cbe2d8"; cap_id="MWRhOTIzNGYzZDdjNDA3MjhiNTg1MGQ3ZDJlMjQ5NWE=|1494146108|94fc913a73ce89aeb3b60439fdcc69687baf438d"; d_c0="ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=|1494146110"; _zap=c27db1fb-911e-48bd-babe-3b6e66c3e558; _xsrf=55d8c6a475335b06ee3e848612afdd80; aliyungf_tc=AQAAAJ+R5xghJQIAlnF1b59VTAruEEc9; acw_tc=AQAAAGxlvy3TLgIAlnF1bxgpA2LSD8+W; s-q=%E6%A2%81%E5%8B%87; s-i=1; sid=p74htbkp; z_c0=Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a; __utma=155987696.1489589582.1495414813.1495414813.1495414813.1; __utmb=155987696.0.10.1495414813; __utmc=155987696; __utmz=155987696.1495414813.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'
        }

    def parse_homepage(self, username):
        """
        Return how many users `username` follows and how many followers `username` has.
        :param username:
        :return: (following_count, follower_count)
        """
        # Option 1: scrape the counts from the profile page HTML
        # homeurl = "https://www.zhihu.com/people/{}".format(username)
        # response = requests.get(url=homeurl, headers=self.headers)
        # if response.status_code == 200:
        #     followees_number = int(re.findall(re.compile('followingCount":(\d+),'), response.text)[0])
        #     followers_number = int(re.findall(re.compile('se,"followerCount":(\d+),'), response.text)[0])
        #     print("following", followees_number)
        #     print("followers", followers_number)
        #     return (followees_number, followers_number)
        # else:
        #     print(response.status_code)
        ###-------------------------------------------------
        # Option 2: read the counts from the member API
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(
            username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
            return (data['following_count'], data['follower_count'])
        else:
            print(response.status_code)
            # Return zero counts so callers that index the result do not crash on None
            return (0, 0)

    def get_location_edu(self, username):
        """
        Return the place of residence, school name, and major name for `username`.
        :param username:
        :return:
        """
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
            # "未填写" means "not filled in"; it is stored when the profile field is missing
            try:
                location = data['locations'][0]['name']
            except (KeyError, IndexError, TypeError):
                location = "未填写"
            # Handle the school and major
            try:
                school = data['educations'][0]['school']['name']
                major = data['educations'][0]['major']['name']
            except (KeyError, IndexError, TypeError):
                school = "未填写"
                major = "未填写"
            return (username, location, school, major)
        else:
            print(response.status_code)

    def get_followees(self, username):
        """
        Get the list of users that `username` follows.
        :param username:
        :return:
        """
        # First get the total number of users followed, to determine the paging range
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        followees_number = homeparsed[0]
        pages = math.ceil(followees_number/20)
        # Collect url_tokens in a list; duplicates are removed on return
        followee_result = []
        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followees?offset={offset}&limit=20'.format(username=username, offset=offset*20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followees = data['data']
                for followee in followees:
                    # print(counter, ": ", followee['url_token'])
                    followee_result.append(followee['url_token'])
                    counter += 1
            else:
                print(response.status_code)
        # Return the deduplicated list of users that username follows
        return list(set(followee_result))

    def get_followers(self, username):
        """
        Get the list of users who follow `username`.
        :param username:
        :return:
        """
        # First get the total number of followers, to determine the paging range
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        followers_number = homeparsed[1]
        pages = math.ceil(followers_number / 20)
        # Collect url_tokens in a list; duplicates are removed on return
        follower_result = []
        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followers?offset={offset}&limit=20'.format(
                username=username, offset=offset * 20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followers = data['data']
                for follower in followers:
                    # print(counter, ": ", follower['url_token'])
                    follower_result.append(follower['url_token'])
                    counter += 1
            else:
                print(response.status_code)
        # Return the deduplicated list of users who follow username
        return list(set(follower_result))


if __name__ == '__main__':
    spider = Spider()
    # spider.get_followees(username='tianshansoft')
    # spider.parse_homepage(username='zhi-ai-89-18')
    # location = spider.get_location_edu(username='zhi-ai-89-18')
    # print(location)
    # print(spider.parse_homepage(username='tianshansoft'))
    # followee_result = spider.get_followees(username='tianshansoft')
    # print(followee_result)
    # print(len(followee_result))
    followers_result = spider.get_followers(username='tianshansoft')
    print(len(followers_result))
    print(followers_result[:100])
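The paging logic shared by get_followees and get_followers (take the total count, divide by the page size of 20, and step the offset by 20) can be isolated into two small functions. This is a minimal sketch for illustration; page_offsets and followees_url are hypothetical helper names that do not appear in spider.py.

```python
import math


def page_offsets(total, page_size=20):
    """Return the offset of every page needed to cover `total` items."""
    pages = math.ceil(total / page_size)
    return [page * page_size for page in range(pages)]


def followees_url(username, offset, limit=20):
    """Build the followees URL for one page, in the same shape the spider uses."""
    return ('https://www.zhihu.com/api/v4/members/{username}/followees'
            '?offset={offset}&limit={limit}').format(
                username=username, offset=offset, limit=limit)


# 45 followees need three pages, starting at offsets 0, 20 and 40.
offsets = page_offsets(45)
urls = [followees_url('tianshansoft', offset) for offset in offsets]
```

Keeping the arithmetic separate from the HTTP calls makes the edge cases (zero followees, an exact multiple of 20) easy to check without touching the network.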
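Database
To keep things simple and convenient, sqlite3 is used here. The requirements this time are modest, so a single table is enough.
create table user(
    id INTEGER not null primary key autoincrement,
    username varchar(36) not null,
    location varchar(255),
    school varchar(255),
    major varchar(255)
);
A database helper class is also needed; otherwise the same boilerplate would have to be rewritten every time.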
# coding: utf8
# @Author: 郭 璞
# @File: dbhelper.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Helper class for database operations
import sqlite3


class DbConfig(object):
    DATABASE_FILE_PATH = 'zhihu.db'


class DbHelper(object):
    def __init__(self):
        self.conn = sqlite3.connect(DbConfig.DATABASE_FILE_PATH)

    def create_table(self):
        # AUTOINCREMENT is the keyword for an auto-incrementing column.
        sql = """
        create table user(
            id INTEGER not null primary key autoincrement,
            username varchar(36) not null,
            location varchar(255),
            school varchar(255),
            major varchar(255)
        );
        """
        cursor = self.conn.cursor()
        cursor.execute(sql)
        cursor.close()

    def add(self, data=()):
        # data is (username, location, school, major).
        # The column is `username`, matching the table definition, and the
        # placeholders keep values containing quotes from breaking the SQL.
        cursor = self.conn.cursor()
        sql = "insert into user(username, location, school, major) values(?, ?, ?, ?);"
        cursor.execute(sql, tuple(data[:4]))
        # execute() opens a transaction, so commit to persist the row
        self.conn.commit()
        cursor.close()

    def get_data(self):
        cursor = self.conn.cursor()
        sql = "select location, count(location) as numbers from user group by location"
        cursor.execute(sql)
        resultset = cursor.fetchall()
        cursor.close()
        print(resultset)
        return resultset


if __name__ == '__main__':
    dbhelper = DbHelper()
    # dbhelper.create_table()
    # data = {
    #     'username': 'zhi-ai-89-18',
    #     'location': '大连',
    #     'school': '大连理工大学',
    #     'major':'软件',
    # }
    # data = ('tianshansoft', '上海', 'weizhi', 'software')
    # dbhelper.add(data=data)
    dbhelper.get_data()
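This is simple requirement-driven development: all I need is to store and query data, so the helper class is kept very simple. Functionally, though, it is enough.
Finally, let's look at how Zhihu users are distributed by region. The SQL statement used is: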
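Because the stored values come from user profiles, they may contain quote characters, which is one reason to pass them to sqlite3 as parameters instead of formatting them into the SQL string. The following is a minimal sketch of that idea against an in-memory database; add_user is a hypothetical helper and the inserted row is made up for illustration.

```python
import sqlite3


def add_user(conn, row):
    """Insert one (username, location, school, major) tuple using placeholders."""
    conn.execute(
        'insert into user(username, location, school, major) values (?, ?, ?, ?)',
        row,
    )
    conn.commit()


# An in-memory database keeps the sketch from touching zhihu.db.
conn = sqlite3.connect(':memory:')
conn.execute("""
    create table user(
        id INTEGER not null primary key autoincrement,
        username varchar(36) not null,
        location varchar(255),
        school varchar(255),
        major varchar(255)
    )
""")

# The apostrophe in this made-up username would break a formatted SQL string.
add_user(conn, ("o'brien", '上海', 'weizhi', 'software'))
rows = conn.execute('select username, location from user').fetchall()
print(rows)
```

The ? placeholders let the sqlite3 module handle quoting, so no manual escaping is needed.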
select location, count(location) as numbers from user group by location ORDER BY numbers DESC
The result is as follows:
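To see the shape of what this query returns, here is a minimal sketch that runs it against an in-memory database holding a few made-up rows. The rows are invented for illustration and are not the crawled data; location_ranking is a hypothetical helper name.

```python
import sqlite3


def location_ranking(conn):
    """Return (location, count) pairs, most common location first."""
    sql = ('select location, count(location) as numbers from user '
           'group by location ORDER BY numbers DESC')
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(':memory:')
conn.execute("""
    create table user(
        id INTEGER not null primary key autoincrement,
        username varchar(36) not null,
        location varchar(255),
        school varchar(255),
        major varchar(255)
    )
""")

# Made-up sample rows: three users in 北京, two in 上海, one in 大连.
sample = [
    ('user1', '北京', None, None),
    ('user2', '北京', None, None),
    ('user3', '北京', None, None),
    ('user4', '上海', None, None),
    ('user5', '上海', None, None),
    ('user6', '大连', None, None),
]
conn.executemany(
    'insert into user(username, location, school, major) values (?, ?, ?, ?)',
    sample,
)
ranking = location_ranking(conn)
print(ranking)
```

Each result row is a (location, count) tuple, which is convenient to hand on to a charting layer.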
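Scheduler
"Scheduler" is a conceptual name here. Its job is to glue the spider and the persistence layer together. According to the six degrees of separation theory, a social network is one huge interconnected graph, so in practice the crawler cannot reach every user. The fallback is to crawl only a portion. Even so, I treat that portion as roughly a random sample, so it should not differ much from the whole.
Below is a brief scheduler (brief because it does no deduplication).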
# coding: utf8
# @Author: 郭 璞
# @File: scheduler.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Program scheduler that glues the modules together so they work in concert.
import spider
import dbhelper
import time, random

sp = spider.Spider()
entrance = 'ghostcomputing'
queue = [entrance]
container = []
LEVEL = 3
counter = 0
# Named `db` so the instance does not shadow the imported dbhelper module
db = dbhelper.DbHelper()
while queue:
    if counter >= 10000:
        break
    else:
        temp = queue.pop(0)
        followees = sp.get_followees(username=temp)
        queue.extend(followees)
        counter += (len(followees) - 1)
        # Random sleep
        timeseed = random.randint(1, 5)
        print('Sleeping for {} seconds!'.format(timeseed))
        time.sleep(timeseed)
        # Fetch the details of each user that `temp` follows
        for index, followee in enumerate(followees):
            # container.append(sp.get_location_edu(username=followee))
            data = sp.get_location_edu(username=followee)
            # get_location_edu returns None on a non-200 response; skip those
            if data is not None:
                db.add(data=data)
                print('Fetched info for {}'.format(followee))
            # Random sleep
            if index % 28 == 0:
                timeseed = random.randint(1, 3)
                print('Sleeping for {} seconds!'.format(timeseed))
                time.sleep(timeseed)
print(container)
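The scheduler above does no deduplication, so a user who is followed by several people is queued and fetched repeatedly. One common way to avoid that is a breadth-first traversal with a visited set. The following is a minimal sketch of that technique, not the scheduler used in this post: crawl is a hypothetical function, and the follow graph is a made-up dictionary standing in for sp.get_followees.

```python
from collections import deque


def crawl(entrance, get_followees, max_users=10000):
    """Breadth-first traversal that visits each username at most once.

    `get_followees` is any callable mapping a username to a list of usernames.
    """
    visited = {entrance}
    queue = deque([entrance])
    order = []
    while queue and len(order) < max_users:
        current = queue.popleft()
        order.append(current)
        for name in get_followees(current):
            # Only enqueue users that have not been seen before
            if name not in visited:
                visited.add(name)
                queue.append(name)
    return order


# Made-up follow graph: 'a' and 'b' follow each other, which would loop forever
# in a traversal without a visited set.
fake_graph = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'd'], 'd': []}
result = crawl('a', lambda user: fake_graph.get(user, []))
print(result)
```

A deque is used because popping from the left of a list is linear in its length, which becomes noticeable once the queue holds thousands of usernames.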
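Web service
ECharts works best with the front end and back end separated, so serving the chart data to the front end through an API is a good choice. I previously wrote a version with a PHP backend providing the data, and that would work here too. jQuery also makes this convenient.
This time, however, I want to try Flask, which is more lightweight. One thing to note before using it: to the template engine, the HTML is no longer plain HTML, so the paths to JavaScript and CSS files have to be handled explicitly; otherwise they will not be found.
Function: url_for("the static endpoint, normally static", filename="the value to appear in src, usually the file's path inside the static folder")
For example, suppose I want <script src="echarts.js">
Then:
In the template, write: <script src="{{ echarts_path }}">
And in the backend, write:
echarts_path = url_for('static', filename='echarts.js')
return render_template('index.html', echarts_path=echarts_path)
Once this is clear, scripts and styles can be applied to our own templates.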
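Put together, the static-path handling described above might look like the following minimal sketch. It assumes Flask is installed and, to stay self-contained, uses an inline template string in place of templates/index.html; the route and the INDEX_TEMPLATE name are assumptions of this sketch, not the contents of service.py.

```python
from flask import Flask, render_template_string, url_for

app = Flask(__name__)

# Inline stand-in for templates/index.html (an assumption of this sketch)
INDEX_TEMPLATE = '<script src="{{ echarts_path }}"></script>'


@app.route('/')
def index():
    # Resolve the URL of static/echarts.js through Flask's static endpoint
    echarts_path = url_for('static', filename='echarts.js')
    return render_template_string(INDEX_TEMPLATE, echarts_path=echarts_path)


if __name__ == '__main__':
    app.run(debug=True)
```

Flask also makes url_for available inside templates, so the template could call {{ url_for('static', filename='echarts.js') }} directly instead of receiving the path as a variable.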
http://echarts.baidu.com/echarts2/doc/example/map15.html
By comparison, all I managed to draw was a plain map of China... ...
To be done… …
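As one possible starting point for that remaining work, the (location, count) rows from the database can be reshaped into the list of name/value objects that an ECharts map series takes as its data. This is a minimal sketch under assumptions: to_map_series is a hypothetical helper, the rows are made up, and it assumes the stored location names match the region names defined in china.js, which may not hold for city-level locations.

```python
import json


def to_map_series(rows):
    """Convert (location, count) rows into ECharts-style name/value objects."""
    return [{'name': location, 'value': count} for location, count in rows]


# Made-up rows in the shape returned by the group-by query
rows = [('北京', 3), ('上海', 2)]
series_data = to_map_series(rows)

# ensure_ascii=False keeps the Chinese region names readable in the JSON output
payload = json.dumps(series_data, ensure_ascii=False)
print(payload)
```

A web endpoint could return this payload as JSON, and the front end could pass the parsed list to the map series as its data.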
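Summary
To recap, this post covered fetching data from the API endpoints in the spider, working with sqlite3, and serving static resources in the web service. The rest of the graphical presentation still needs more work.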