Scraping Zhihu's Wittiest Answers About Changsha with Python, and the Result Turned Out to Be...

The Scraping Process


import sys
import random
import argparse
import time
import json
import requests
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

db = {"answers": [], "saved_topics": {}}
answer_ids = []
maxnum = 100

def set_maxnum(maxnum1):
    global maxnum
    maxnum = maxnum1

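After installing the third-party libraries (requests and beautifulsoup4), we disable the insecure-request warning, because the requests below are sent with verify=False, and define a global variable maxnum that caps how many answers are collected in one run.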

def get_answers(topic_id):
    global db
    page_no = 0
    while True:
        # Keys of db['saved_topics'] are strings, so always look the topic up by str(topic_id).
        saved_pages = db['saved_topics'].get(str(topic_id), [])
        is_saved = page_no in saved_pages
        if is_saved:
            page_no += 1
            continue
        # print(topic_id, page_no)
        print('.', end='')
        sys.stdout.flush()
        is_end = get_answers_by_page(topic_id, page_no)
        page_no += 1
        if is_end:
            break

def get_answers_by_page(topic_id, page_no):
    global db, answer_ids, maxnum
    limit = 10
    offset = page_no * limit
    url = "https://www.zhihu.com/api/v4/topics/" + str(topic_id) + "/feeds/essence?include=data%5B%3F(target.type%3Dtopic_sticky_module)%5D.target.data%5B%3F(target.type%3Danswer)%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F(target.type%3Dtopic_sticky_module)%5D.target.data%5B%3F(target.type%3Danswer)%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Dtopic_sticky_module)%5D.target.data%5B%3F(target.type%3Darticle)%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Dtopic_sticky_module)%5D.target.data%5B%3F(target.type%3Dpeople)%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Danswer)%5D.target.annotation_detail%2Ccontent%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F(target.type%3Danswer)%5D.target.author.badge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Darticle)%5D.target.annotation_detail%2Ccontent%2Cauthor.badge%5B%3F(type%3Dbest_answerer)%5D.topics%3Bdata%5B%3F(target.type%3Dquestion)%5D.target.annotation_detail%2Ccomment_count&limit="+str(limit)+"&offset=" + str(offset)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    }
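    try:
        # verify=False skips TLS certificate checks; the timeout keeps a stalled request from hanging forever.
        r = requests.get(url, verify=False, headers=headers, timeout=10)
    except requests.exceptions.RequestException:
        # Skip this page on a network error; the caller moves on to the next page.
        return False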
    content = r.content.decode("utf-8")
    data = json.loads(content)
    is_end = data["paging"]["is_end"]
    items = data["data"]
    if len(items) <= 0:
        return True

    for item in items:
        if maxnum <= 0:
            return True
        answer_id = item["target"]["id"]
        if answer_id in answer_ids:
            continue
        if item['target']['type'] != "answer":
            continue
        if int(item["target"]["voteup_count"]) < 10000:
            continue
        if len(item['target']['content']) > 300:
            continue
        answer_ids.append(answer_id)
        question = item["target"]["question"]["title"]
        answer = item["target"]["content"]
        if answer.find('<') > -1 and answer.find('>') > -1:
            # The content is an HTML fragment; keep only its visible text.
            answer = BeautifulSoup(answer, 'html.parser').get_text()
        print("\nQ: {}\nA: {}".format(question, answer))
        maxnum -= 1
    return is_end

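The first core function is get_answers_by_page. It takes two arguments: the topic ID and the number of the page to fetch. Each page holds limit = 10 items, so the request offset is page_no * limit. The function returns True once there is nothing left to fetch (the last page was reached or the maxnum quota is used up), and get_answers keeps calling it page by page until then.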
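The paging arithmetic can be sketched on its own, without sending any request. This is a minimal illustration rather than the article's code: page_query and build_feed_url are hypothetical helper names, and the include parameter stands in for the long field list in the real URL.

```python
from urllib.parse import urlencode

API_ROOT = "https://www.zhihu.com/api/v4/topics"  # endpoint prefix taken from the article's URL


def page_query(page_no, limit=10):
    """Translate a zero-based page number into limit/offset query parameters."""
    return {"limit": limit, "offset": page_no * limit}


def build_feed_url(topic_id, page_no, include="", limit=10):
    """Assemble the essence-feed URL; `include` stands in for the long field list."""
    params = page_query(page_no, limit)
    if include:
        # Put include first, mirroring the parameter order of the article's URL.
        params = {"include": include, **params}
    return "{}/{}/feeds/essence?{}".format(API_ROOT, topic_id, urlencode(params))


# Page 3 with 10 items per page starts at offset 30.
print(build_feed_url(19555480, 3))
```

Building the query string with urlencode also takes care of percent-encoding, which is why the long include value in the article's URL is full of %5B and %3D sequences.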

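We restrict the target content to answers of at most 300 characters that have at least 10K upvotes.

In the code, voteup_count is the number of upvotes, and content is the body of the answer, whose length is compared against the 300-character limit. Because content may be an HTML fragment, any tags it contains count toward that length.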
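The three checks (item type, upvote count, content length) can be pulled into a single predicate, which makes the thresholds easy to adjust and test offline. The sketch below is an illustration, not the article's code: is_witty_candidate is a hypothetical name, and the sample dictionaries are made-up data that only imitate the fields the article reads.

```python
def is_witty_candidate(target, min_votes=10000, max_chars=300):
    """Return True if a feed item's target passes the article's three checks."""
    if target.get("type") != "answer":
        return False
    if int(target.get("voteup_count", 0)) < min_votes:
        return False
    return len(target.get("content", "")) <= max_chars


# Made-up sample targets, one per rejection reason plus one that passes.
samples = [
    {"type": "answer", "voteup_count": 12000, "content": "short and popular"},
    {"type": "answer", "voteup_count": 500, "content": "short but unpopular"},
    {"type": "answer", "voteup_count": 20000, "content": "x" * 301},
    {"type": "article", "voteup_count": 50000, "content": "not an answer"},
]
kept = [s["content"] for s in samples if is_witty_candidate(s)]
print(kept)
```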

def query():
    items = db['answers']
    for item in items:
        question = item["target"]["question"]["title"]
        answer = item["target"]["content"]
        print("\nQ: {}\nA: {}".format(question, answer))

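The query function prints the stored questions and answers in the form "Q: ..." followed by "A: ..." on a new line. Note that it reads from db['answers'], which the code shown here never fills, so it prints nothing unless the scraped items are appended to that list first.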
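One way to give query something to print is to record each accepted item in db as it is found. The following is a sketch under assumptions, not the author's method: remember and format_answers are hypothetical helpers, and the sample item is invented. It also shows why the page bookkeeping uses string keys: they survive a JSON save/load cycle unchanged.

```python
import json

db = {"answers": [], "saved_topics": {}}


def remember(item, topic_id, page_no):
    """Store one accepted feed item and mark its page as saved."""
    db["answers"].append(item)
    pages = db["saved_topics"].setdefault(str(topic_id), [])
    if page_no not in pages:
        pages.append(page_no)


def format_answers(items):
    """Render items in the same Q/A layout that query() prints."""
    return ["\nQ: {}\nA: {}".format(i["target"]["question"]["title"],
                                     i["target"]["content"]) for i in items]


# Invented sample item with only the fields that query() reads.
item = {"target": {"question": {"title": "Sample question?"},
                   "content": "Sample answer."}}
remember(item, 19555480, 0)

# Round-trip through JSON, as if db had been saved to disk and loaded again.
restored = json.loads(json.dumps(db, ensure_ascii=False))
print("".join(format_answers(restored["answers"])))
```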

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # type=int so that "--max 20" is parsed as a number; store_true would turn it into a boolean flag.
    parser.add_argument("--max", help="maximum number of answers to print",
                        type=int, dest="max", default=100)
    args = parser.parse_args()
    set_maxnum(args.max)
    topic_ids = [19555480]
    # Visit the topics in random order.
    random.shuffle(topic_ids)
    for topic_id in topic_ids:
        get_answers(topic_id)

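The main block is shown above. The list topic_ids = [ ] holds Zhihu topic IDs; each ID corresponds to one specific topic, so you can choose what to scrape by changing the IDs. The topic ID can be found on the topic's web page, as shown below:
[Image: where the topic ID appears on the web page]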
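If you copy the address of a topic page from the browser, the ID can also be pulled out programmatically. This sketch assumes the address contains a /topic/<digits> segment, which is an assumption about the URL layout and not something the article states; extract_topic_id is a hypothetical helper.

```python
import re


def extract_topic_id(url):
    """Return the numeric ID from a URL containing /topic/<digits>, else None."""
    match = re.search(r"/topic/(\d+)", url)
    return int(match.group(1)) if match else None


# Assumed URL shape for a topic page; a non-topic URL yields None.
print(extract_topic_id("https://www.zhihu.com/topic/19555480/hot"))
print(extract_topic_id("https://www.zhihu.com/question/1"))
```

The returned integer can be placed directly in the topic_ids list.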
That is all of the code. With the steps above understood, you can run the scraper on your own machine.
Let's take a look at the results!

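Q: How did people from Hunan, Guizhou, Sichuan, Chongqing, Jiangxi and similar places learn to eat spicy food as children?

A: It's a factory setting: born to be pepperoud, proud of our peppers from birth.

Q: How do you quickly hit it off with people from Hunan?

A: Insult them, and the hitting will start right away.

Q: What food in Changsha is tasty but not very spicy?

A: Walk into any restaurant and tell the waiter "no chili" or "less chili". A moment later the chef will come out holding a knife and ask you: "Then how about you do the cooking!"

Q: Describe Fenghuang Ancient Town in one sentence?

A: The toilets charge a fee: three yuan for the cheap ones, five for the expensive ones.

Q: What kind of school is the National University of Defense Technology?

A: A school that nobody dares to answer questions about…

Q: What is the character of Hunan people like?

A: What is mine will surely be mine. What is not mine will be mine all the same, as long as I am stubborn enough.

Q: Why is the Mandarin spoken by Changsha people completely different in pronunciation and tone from standard Mandarin?

A: Because speaking standard Mandarin gets you spat at.

Q: What kind of city is Changsha?

A: A great fire burned the city down, so all of its heritage was cast into the bones of its people and hidden beneath a fiery temperament. That is how the people of Changsha protect Changsha.

Q: What do you make of Hunan's 2017 gaokao admission cutoff scores!? And the 2018 cutoff scores!?

A: We told you not to keep reposting those lucky koi for bonus points. See, this time all the points were added to the cutoff line.

Q: How can you avoid eating spicy food in Hunan?

A: Don't overthink it. Even the woks in Hunan are spicy.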
