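Preface
Lately I keep hearing people in chat groups and fellow bloggers ask the same thing: "Why doesn't the CSDN blog have a backup feature?" To some extent this shows that people place more and more value on articles as knowledge products, and that they care more about the safety of their data.
So I tried writing a tool dedicated to backing up the blogs of CSDN users.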
Core
Calling this the core is a bit generous. Strictly speaking it is just a pile of code and hardly deserves the name.
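Login module
Why a login module is needed at all may be the first question you have while reading this article.
The reason is that without logging in, the article API does not return the article content. So, to keep things simple, the tool obtains a logged-in session and uses it to fetch the articles.
There is no need to worry about the safety of your account and password: the tool does not store any information about you, so you can use it with confidence (check the code if you don't believe it).
The code of the login module is simple. It just implements the logic of simulating a CSDN login.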
# coding: utf8
# @Author: 郭 璞
# @File: login.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: CSDN login that returns a logged-in session for backing up blog posts.
import json

import requests
from bs4 import BeautifulSoup


class Login(object):
    """
    Log in to CSDN and keep the session for the backup step.
    Requires the username and password of your own account.
    """

    def __init__(self, username, password):
        if username and password:
            self.username = username
            self.password = password
            # Common headers for the login requests.
            self.headers = {
                'Host': 'passport.csdn.net',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            }
        else:
            raise Exception('Need your username and password!')

    def login(self):
        loginurl = 'https://passport.csdn.net/account/login'
        # Fetch the login page to obtain the webflow token.
        self.session = requests.Session()
        response = self.session.get(url=loginurl, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Assemble the form data that is posted when logging in.
        self.token = soup.find('input', {'name': 'lt'})['value']
        payload = {
            'username': self.username,
            'password': self.password,
            'lt': self.token,
            'execution': soup.find('input', {'name': 'execution'})['value'],
            '_eventId': 'submit'
        }
        response = self.session.post(url=loginurl, data=payload, headers=self.headers)
        # Return the session; a 200 status code is treated as success.
        return self.session if response.status_code == 200 else None

    def getSource(self, url):
        """
        Test helper; can be removed.
        :param url: URL of a blog post
        :return: dict with the title and the Markdown content
        """
        username, article_id = url.split('/')[3], url.split('/')[-1]
        backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(article_id, username)
        # Copy the headers so the login headers are not modified in place.
        tempheaders = dict(self.headers)
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=backupurl, headers=tempheaders)
        soup = json.loads(response.text)
        return {
            'title': soup['data']['title'],
            'markdowncontent': soup['data']['markdowncontent'],
        }
Simulating the login gives us a session in the logged-in state, and that is all we need here. It is used in the next step.
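The token step in `login()` can be tried offline. The following is only a sketch: it uses the standard library's `html.parser` instead of BeautifulSoup, and `HiddenInputParser`, `build_login_payload` and the `SAMPLE_FORM` markup are made up for illustration. The field names `lt`, `execution` and `_eventId` are the ones used in the listing above; the real login page may look different.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and "name" in attributes:
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_login_payload(login_html, username, password):
    """Merge the hidden form fields with the user's credentials."""
    parser = HiddenInputParser()
    parser.feed(login_html)
    payload = dict(parser.fields)
    payload["username"] = username
    payload["password"] = password
    return payload


# Hypothetical page fragment, for illustration only.
SAMPLE_FORM = """
<form id="fm1" method="post">
  <input type="text" name="username" />
  <input type="password" name="password" />
  <input type="hidden" name="lt" value="LT-12345-abc" />
  <input type="hidden" name="execution" value="e1s1" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
"""

payload = build_login_payload(SAMPLE_FORM, "alice", "secret")
print(sorted(payload))
```

Collecting every hidden field, rather than naming each one, means the payload would still be complete if the form gained another hidden input.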
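Backup module
My first idea was to fetch the page source, parse out the article body, and convert the HTML to a Markdown file with some conversion logic. But complex content produces deeply nested HTML, and tables in particular were more than I could handle, so that approach was technically difficult.
Then, quite by accident, I found that the following API returns JSON data for an article, including its title and the original Markdown content.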
'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
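As a rough sketch of how the `id` and `username` in that URL can be derived from a post's address: the helper below parses the path with `urllib.parse` rather than splitting the whole URL on `/`, so a trailing slash or a query string does not end up in the id. The name `build_article_api_url` and the path check are assumptions for illustration, not part of the tool.

```python
from urllib.parse import urlparse

API_TEMPLATE = "http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}"


def build_article_api_url(blog_url):
    """Turn a post URL into the getArticle API URL."""
    path_parts = [part for part in urlparse(blog_url).path.split("/") if part]
    # Expected path shape: <username>/article/details/<id>
    if len(path_parts) != 4 or path_parts[1:3] != ["article", "details"]:
        raise ValueError("Unexpected blog URL: {}".format(blog_url))
    username, article_id = path_parts[0], path_parts[3]
    return API_TEMPLATE.format(article_id, username)


print(build_article_api_url("http://blog.csdn.net/marksinoberg/article/details/70432419"))
# prints http://write.blog.csdn.net/mdeditor/getArticle?id=70432419&username=marksinoberg
```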
That is extremely convenient. The actual backup logic follows.
# coding: utf8
# @Author: 郭 璞
# @File: backup.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Back up a blog post by fetching and storing its Markdown file.
import json
import os
import re


class Backup(object):
    """
    Build the API URL for a blog post and save its Markdown file and pictures.
    """

    def __init__(self, session, backupurl):
        self.headers = {
            'Referer': 'http://write.blog.csdn.net/mdeditor',
            'Host': 'passport.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        # Build the API URL from the article id and the username, e.g.
        # http://blog.csdn.net/marksinoberg/article/details/70432419
        username, article_id = backupurl.split('/')[3], backupurl.split('/')[-1]
        self.backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(article_id, username)
        self.session = session

    def getSource(self):
        # Get the title and the Markdown content for the given URL.
        # Copy the headers so per-request changes do not leak into self.headers.
        tempheaders = dict(self.headers)
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=self.backupurl, headers=tempheaders)
        soup = json.loads(response.text)
        return {
            'title': soup['data']['title'],
            'markdowncontent': soup['data']['markdowncontent'],
        }

    def downloadpic(self, picurl, outputpath):
        tempheaders = dict(self.headers)
        tempheaders['Host'] = 'img.blog.csdn.net'
        tempheaders['Upgrade-Insecure-Requests'] = '1'
        response = self.session.get(url=picurl, headers=tempheaders)
        print(response.status_code)
        # Normalize the OS-specific path separator.
        outputpath = outputpath.replace(os.sep, '/')
        print(outputpath)
        if response.status_code == 200:
            # The with block closes the file; no explicit close() is needed.
            with open(outputpath, 'wb') as f:
                f.write(response.content)
            print("{} saved in {} succeed!".format(picurl, outputpath))
        else:
            raise Exception("Picture Url: {} downloading failed!".format(picurl))

    def getpicurls(self):
        # Non-greedy capture, so two images on one line are matched separately.
        pattern = re.compile(r"!\[.*?\]\((.*?)\)")
        markdowncontent = self.getSource()['markdowncontent']
        return re.findall(pattern=pattern, string=markdowncontent)

    def backup(self, outputpath='./'):
        try:
            source = self.getSource()
            foldername = os.path.join(outputpath, source['title'])
            if not os.path.exists(foldername):
                os.mkdir(foldername)
            # Write the Markdown file.
            filename = os.path.join(foldername, source['title'])
            with open(filename + ".md", 'w', encoding='utf8') as f:
                f.write(source['markdowncontent'])
            # Save the pictures.
            imgfolder = os.path.join(foldername, 'img')
            if not os.path.exists(imgfolder):
                os.mkdir(imgfolder)
            for index, picurl in enumerate(self.getpicurls()):
                imgpath = os.path.join(imgfolder, str(index) + '.png')
                try:
                    self.downloadpic(picurl=picurl, outputpath=imgpath)
                except Exception:
                    # May raise requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
                    pass
        except Exception as e:
            print('Something went wrong again. Details: {}'.format(e))
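Two parts of the backup logic can be tried without any network access: pulling picture URLs out of Markdown text, and turning a post title into a folder name. The sketch below is an illustration under assumptions. `extract_image_urls` and `safe_folder_name` are hypothetical helper names, the capture group is non-greedy so that several pictures on one line are matched separately, and the set of replaced characters is the one Windows rejects in file names.

```python
import re

# Matches ![alt](url) and captures the part between the parentheses.
IMAGE_PATTERN = re.compile(r"!\[.*?\]\((.*?)\)")
# Characters that Windows does not allow in file or folder names.
UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|]')


def extract_image_urls(markdown_text):
    """Return the targets of all Markdown image links, in order."""
    return IMAGE_PATTERN.findall(markdown_text)


def safe_folder_name(title, replacement="_"):
    """Replace characters that cannot appear in a folder name."""
    cleaned = UNSAFE_CHARS.sub(replacement, title).strip()
    return cleaned or "untitled"


sample = ("Intro ![a](http://img.blog.csdn.net/1) and "
          "![b](http://img.blog.csdn.net/2) (note)")
print(extract_image_urls(sample))
# prints ['http://img.blog.csdn.net/1', 'http://img.blog.csdn.net/2']
print(safe_folder_name('C/C++: what is "this"?'))
# prints C_C++_ what is _this__
```

A pattern this simple still captures an optional image title (as in `![a](url "title")`) together with the URL, so a real Markdown parser would be the safer choice for posts that use that syntax.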
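Blog scanning module
In principle the blog scanning module does not need a login: starting from your username, it walks the list pages one by one and collects the links to all of your posts. The links are then saved and fed to the backup logic above in a loop.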
# coding: utf8
# @Author: 郭 璞
# @File: blogscan.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Scan your blog's domain and collect the links to all of your posts.
import re

import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """
    Scan for all blog posts.
    """

    def __init__(self, domain):
        self.username = domain
        self.rooturl = 'http://blog.csdn.net'
        self.bloglinks = []
        self.headers = {
            'Host': 'blog.csdn.net',
            # Header names must not contain spaces.
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def scan(self):
        # Get the page count.
        response = requests.get(url=self.rooturl + "/" + self.username, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        pagecontainer = soup.find('div', {'class': 'pagelist'})
        pages = re.findall(r'(\d+)', pagecontainer.find('span').get_text())[-1]
        # Construct the list page URLs, e.g. http://blog.csdn.net/Marksinoberg/article/list/2
        for index in range(1, int(pages) + 1):
            # Get the post links on each list page.
            listurl = 'http://blog.csdn.net/{}/article/list/{}'.format(self.username, str(index))
            response = requests.get(url=listurl, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                alinks = soup.find_all('span', {'class': 'link_title'})
                for alink in alinks:
                    link = alink.find('a').attrs['href']
                    link = self.rooturl + link
                    self.bloglinks.append(link)
            except Exception as e:
                # Format the exception; concatenating str and Exception raises TypeError.
                print('Something unexpected happened!\n{}'.format(e))
                continue
        return self.bloglinks
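The paging step of `scan()` can be isolated into two small functions, sketched below. The sample pager text is a made-up string and the real page may use a different wording. Returning 1 when no number is found is an assumption of mine, not behaviour of the original code, and `parse_page_count` and `build_list_urls` are hypothetical names.

```python
import re


def parse_page_count(pager_text):
    """Take the last number in the pager text as the page count."""
    numbers = re.findall(r"\d+", pager_text)
    if not numbers:
        # Assumption: no pager means there is only a single list page.
        return 1
    return int(numbers[-1])


def build_list_urls(username, page_count):
    """Build the URL of every list page, starting from page 1."""
    return ["http://blog.csdn.net/{}/article/list/{}".format(username, page)
            for page in range(1, page_count + 1)]


# Hypothetical pager text: 120 posts spread over 8 pages.
pages = parse_page_count(" 120条  共8页")
print(pages)
print(build_list_urls("marksinoberg", 2))
```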
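With that, the three main modules are done.
Demo
Next, a demonstration of how to use the tool.
How to use it
The first step, of course, is to download the source code.
Then adapt the code below.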
# coding: utf8
# @Author: 郭 璞
# @File: Main.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: The entry point of this blog backup tool.
import getpass
import random
import time

from csdnbackup.login import Login
from csdnbackup.backup import Backup
from csdnbackup.blogscan import BlogScanner

username = input('Enter your username: ')
password = getpass.getpass(prompt='Enter your password: ')
loginer = Login(username=username, password=password)
session = loginer.login()
# login() returns None when the login request does not succeed.
if session is None:
    raise SystemExit('Login failed.')
scanner = BlogScanner(username)
links = scanner.scan()
for link in links:
    backupper = Backup(session=session, backupurl=link)
    # Sleep for a random number of seconds between posts.
    timefeed = random.choice([1, 3, 5, 7, 2, 4, 6, 8])
    print('Sleeping for a random {} seconds'.format(timefeed))
    time.sleep(timefeed)
    backupper.backup(outputpath='./')
- Final step
python Main.py
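Results
Here is what a run produces.
First, the "overview" (testing was not finished yet, so only these few posts had been downloaded)
Then a single article
Next, the article's Markdown content
The pictures of a single article
Viewing a picture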
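Summary
Finally, a look back at where this tool still falls short.
Folder creation errors caused by post titles: these are covered by exception handling.
Server countermeasures triggered by overly fast access: a random sleep delay was added, but that does not address the root cause.
There is no logging module yet. Articles that fail to back up should be recorded; once the backup run finishes, the error log should be parsed and the backup retried.
Testing is not thorough enough. It runs on my side, but other people may run into odd problems.
Lastly, here is the link to the source code. Give it a star if you are interested.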
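The missing record-and-retry step could look roughly like the sketch below. It is not part of the tool. `backup_all` and `save_failures` are hypothetical names, and `backup_one` stands for any callable that raises on failure. `Backup.backup` as listed catches and prints its own errors, so it would need to re-raise them before it could be plugged in here.

```python
import json
import random
import time


def backup_all(links, backup_one, max_attempts=3, delays=(1, 2, 3, 5), sleep=time.sleep):
    """Try each link up to max_attempts times; return {link: last error} for the rest."""
    pending = list(links)
    failures = {}
    for _attempt in range(max_attempts):
        still_failing = []
        for link in pending:
            try:
                backup_one(link)
                failures.pop(link, None)  # succeeded on a retry
            except Exception as error:
                failures[link] = str(error)
                still_failing.append(link)
            # Random pause between requests, as in Main.py.
            sleep(random.choice(delays))
        pending = still_failing
        if not pending:
            break
    return failures


def save_failures(failures, path="failed.json"):
    """Write the failed links and their errors to a JSON file."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(failures, f, ensure_ascii=False, indent=2)
```

Passing the sleep function in as a parameter keeps the retry logic testable without real delays, for example `backup_all(links, backup_one, sleep=lambda seconds: None)`.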