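Preface
The text and images in this article come from the internet and are intended for learning and discussion only, with no commercial use. If there is any problem, please contact us promptly so that we can handle it.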
Development tools
- Python 3.6.5
- PyCharm
- requests
- re
- json
requests can be installed with pip; re and json are part of the Python standard library.
Page analysis
https://search.51job.com/list/010000%252c020000%252c030200%252c040000,000000,0000,00,9,99,python,2,1.html
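The last path segment of this URL carries the search conditions as comma-separated fields. As a rough illustration, the sketch below splits that segment and undoes the double percent-encoding of the first field (`%252c` decodes to `%2c`, which decodes to `,`). The names `area_codes`, `keyword`, and `page` are labels assumed here for readability, not names documented by 51job; the page position is inferred from the complete script, which substitutes the page number just before `.html`.

```python
from urllib.parse import unquote, urlsplit

URL = ('https://search.51job.com/list/'
       '010000%252c020000%252c030200%252c040000,'
       '000000,0000,00,9,99,python,2,1.html')


def split_search_path(url):
    """Return the comma-separated fields of the last path segment."""
    last_segment = urlsplit(url).path.rsplit('/', 1)[-1]
    if last_segment.endswith('.html'):
        last_segment = last_segment[:-len('.html')]
    return last_segment.split(',')


fields = split_search_path(URL)
# The first field is percent-encoded twice, so it is decoded twice.
area_codes = unquote(unquote(fields[0])).split(',')
keyword = fields[-3]    # assumed: the search keyword
page = int(fields[-1])  # assumed: the page number

print(area_codes)
print(keyword, page)
```

Changing the last field is what the complete script does to walk through result pages.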
Requesting the page
import requests

url = 'https://search.51job.com/list/010000%252c020000%252c030200%252c040000,000000,0000,00,9,99,python,2,1.html'
# Query-string parameters for the search request.
params = {
    'lang': 'c',
    'postchannel': '0000',
    'workyear': '99',
    'cotype': '99',
    'degreefrom': '99',
    'jobterm': '99',
    'companysize': '99',
    'ord_field': '0',
    'dibiaoid': '0',
    'line': '',
    'welfare': '',
}
# Put your own cookies here as name/value pairs.
cookies = {}
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Host': 'search.51job.com',
    'Referer': 'https://search.51job.com/list/190200,000000,0000,00,9,99,python,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
}
response = requests.get(url=url, params=params, headers=headers, cookies=cookies)
# Let requests guess the encoding from the body so the Chinese text decodes correctly.
response.encoding = response.apparent_encoding
print(response.text)
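The `params` dictionary ends up as the query string of the request. The sketch below uses the standard library's `urlencode` as a stand-in to preview that string without touching the network. requests performs its own encoding internally, so treat this as an approximation of what is sent rather than the library's exact code path.

```python
from urllib.parse import urlencode

params = {
    'lang': 'c',
    'postchannel': '0000',
    'workyear': '99',
    'cotype': '99',
    'degreefrom': '99',
    'jobterm': '99',
    'companysize': '99',
    'ord_field': '0',
    'dibiaoid': '0',
    'line': '',
    'welfare': '',
}

# Keys keep their insertion order; empty values are kept as "line=" and "welfare=".
query_string = urlencode(params)
print(query_string)
```

The printed string is what follows the `?` in the search URL.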
The data we need sits inside a <script> tag:
<script type="text/javascript">
window.__SEARCH_RESULT__ =
'''
the content you want to extract
'''
</script>
<div class="clear"></div>
A regular expression is enough to pull it out.
Parse the matched string as JSON, then read the fields you want with ordinary dictionary lookups.
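import json
import pprint
import re
# Capture everything between the assignment and the closing </script> tag.
matches = re.findall(r'window\.__SEARCH_RESULT__ = (.*?)</script>', response.text, re.S)
info_dict = json.loads(''.join(matches))
pprint.pprint(info_dict)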
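To try the extraction step without hitting the site, the same idea can be exercised on a hand-written HTML fragment. This is a minimal sketch: `SAMPLE_HTML` and its contents are made up for illustration, and `extract_search_result` is a hypothetical helper name. It returns None when the pattern is missing rather than letting json.loads fail on an empty string.

```python
import json
import re

# Invented fragment that mimics the structure described above.
SAMPLE_HTML = '''
<script type="text/javascript">
window.__SEARCH_RESULT__ = {"engine_search_result": [{"job_name": "Python Developer", "company_name": "Example Co."}]}</script>
<div class="clear"></div>
'''

# re.S lets "." match newlines in case the JSON spans several lines.
PATTERN = re.compile(r'window\.__SEARCH_RESULT__ = (.*?)</script>', re.S)


def extract_search_result(html):
    """Return the embedded search result as a dict, or None if it is absent."""
    match = PATTERN.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


result = extract_search_result(SAMPLE_HTML)
print(result['engine_search_result'][0]['job_name'])
print(extract_search_result('<html></html>'))
```

A None result usually means the page layout changed or the request was blocked, so it is worth checking before indexing into the dictionary.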
Complete code
import csv
import json
import re

import requests

# Query-string parameters shared by every page request.
params = {
    'lang': 'c',
    'postchannel': '0000',
    'workyear': '99',
    'cotype': '99',
    'degreefrom': '99',
    'jobterm': '99',
    'companysize': '99',
    'ord_field': '0',
    'dibiaoid': '0',
    'line': '',
    'welfare': '',
}
# Add your own cookies here as name/value pairs if needed.
cookies = {}
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Host': 'search.51job.com',
    'Referer': 'https://search.51job.com/list/190200,000000,0000,00,9,99,python,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
}

for page in range(1, 11):  # pages 1 to 10
    url = 'https://search.51job.com/list/010000%252c020000%252c030200%252c040000,000000,0000,00,9,99,python,2,{}.html'.format(page)
    response = requests.get(url=url, params=params, headers=headers, cookies=cookies)
    response.encoding = response.apparent_encoding
    # The search results are embedded as JSON inside a <script> tag.
    matches = re.findall(r'window\.__SEARCH_RESULT__ = (.*?)</script>', response.text, re.S)
    info_dict = json.loads(''.join(matches))
    # Append this page's rows; csv.writer quotes fields that contain commas.
    with open('python招聘信息.csv', mode='a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for i in info_dict['engine_search_result']:
            # Skip the first attribute entry and join the rest with spaces.
            attribute_text = ' '.join(i['attribute_text'][1:])
            dit = {
                # 'job_href': i['job_href'],
                'job_name': i['job_name'],
                'company_name': i['company_name'],
                'money': i['providesalary_text'],
                'workarea': i['workarea_text'],
                'updatedate': i['updatedate'],
                'companytype': i['companytype_text'],
                'jobwelf': i['jobwelf'],
                'attribute': attribute_text,
                'companysize': i['companysize_text'],
            }
            print(dit)
            writer.writerow([dit['job_name'], dit['company_name'], dit['money'], dit['workarea'], dit['companytype'], dit['jobwelf'], dit['attribute'], dit['companysize']])
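Fields such as job titles and benefit lists can contain commas, which is why quoting matters when the rows are written to CSV. The sketch below shows one way to build a row and write it, with a header line, using the standard csv module. The sample item is invented, `COLUMNS`, `SOURCE_KEYS`, and `item_to_row` are hypothetical names, and the source key names are taken from the script above. It writes to an in-memory buffer so it can be run without network access.

```python
import csv
import io

# Output columns, in the same order as the script above writes them.
COLUMNS = ['job_name', 'company_name', 'money', 'workarea',
           'companytype', 'jobwelf', 'attribute', 'companysize']

# Output column -> key in the scraped item (attribute is built separately).
SOURCE_KEYS = {
    'job_name': 'job_name',
    'company_name': 'company_name',
    'money': 'providesalary_text',
    'workarea': 'workarea_text',
    'companytype': 'companytype_text',
    'jobwelf': 'jobwelf',
    'companysize': 'companysize_text',
}


def item_to_row(item):
    """Flatten one search-result item into a dict keyed by COLUMNS."""
    row = {column: item.get(key, '') for column, key in SOURCE_KEYS.items()}
    # Skip the first attribute entry and join the rest with spaces.
    row['attribute'] = ' '.join(item.get('attribute_text', [])[1:])
    return row


# Invented sample; note the comma inside the job name.
sample_item = {
    'job_name': 'Python Developer, Backend',
    'company_name': 'Example Co.',
    'providesalary_text': '10-15k/month',
    'workarea_text': 'Shanghai',
    'companytype_text': 'Private',
    'jobwelf': 'insurance bonus',
    'attribute_text': ['Shanghai', '3 years', 'Bachelor'],
    'companysize_text': '50-150',
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(item_to_row(sample_item))
print(buffer.getvalue())
```

When writing to a real file instead of a buffer, opening it with `newline=''` avoids blank lines between rows on Windows.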
Result