Scraping a Well-Known Q&A Site (Zhihu) with Scrapy

The session and cookie auto-login mechanism

A cookie is data the server returns to the client when it is requested, so that it can tell different clients apart; it is stored locally on the client as key-value pairs.

On every subsequent request the client sends the cookie along, which lets the server identify which client issued the request.

A session is stored on the server side; it is the identity data the server generates after a user logs in, and typically includes a session ID, session data, and an expiry time.

The flow: after the user logs in, the server creates a session and returns the session ID to the client, which stores it in a cookie; every later request/response exchange then carries that session ID in the cookie.
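As a quick illustration of the mechanism (a minimal sketch, not part of the spider; httpbin.org is just a convenient echo service used as an example), a requests.Session keeps the cookie the server sets and sends it back automatically on later requests:

import requests

session = requests.Session()

# The server sets a cookie on the first response...
session.get("https://httpbin.org/cookies/set/sessionid/abc123")

# ...and the session sends it back automatically on the next request.
resp = session.get("https://httpbin.org/cookies")
print(resp.json())   # {'cookies': {'sessionid': 'abc123'}}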

Simulating a Zhihu login with Selenium

Install Selenium and open the site

Install Selenium: pip install selenium

Create a new zhihu.py module under the spiders folder and define a ZhihuSpider class:

import scrapy
from selenium import webdriver


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    def parse(self, response):
        pass

Scrapy calls start_requests before every crawl, so we override it and use the browser to submit the account and password to Zhihu:

import time

def start_requests(self):
    browser = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    browser.get("https://www.zhihu.com/signin")
    # switch the sign-in form to password login, then fill in the account and password
    browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/div[1]/div[2]').click()
    browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/div[2]/div/label/input').send_keys("shy_kevin@qq.com")
    browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/div[3]/div/label/input').send_keys("zuiaimeiyi520")
    browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/button').click()
    time.sleep(60)

What if chromedriver is detected during the simulated login?

Even though chromedriver can drive the browser through the login, the browser is still under chromedriver's control, and chromedriver leaves traces that page JavaScript can detect. Many sites therefore embed JS that checks whether the current browser is driver-controlled, for example by testing for the special marker $cdc_lasutopfhvcZLmcfl or for window.navigator.webdriver.
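You can check the second flag yourself (a quick sketch, assuming the browser driver instance created above); an undisguised chromedriver session reports it as true:

# In a plain chromedriver session this prints True; a normal, human-driven
# Chrome window reports navigator.webdriver as undefined/None.
print(browser.execute_script("return window.navigator.webdriver"))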

① Fixing the $cdc_lasutopfhvcZLmcfl marker requires editing the chromedriver binary itself; see the details at https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver

② window.navigator.webdriver can be hidden like this:

option = webdriver.ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])  # removes the window.navigator.webdriver flag

def start_requests(self):
    option = webdriver.ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])  # removes the window.navigator.webdriver flag
    browser = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe", options=option)

Select the field contents with Ctrl+A before typing, so repeated runs do not append to whatever is already in the input:

from selenium.webdriver.common.keys import Keys

browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/div[2]/div/label/input').send_keys(Keys.CONTROL+"a")

Install the mouse module to move and click at screen coordinates

Install the mouse module: pip install mouse

from mouse import move, click

# move the OS-level mouse pointer to absolute screen coordinates, then click
move(895, 603)
click()

Dumping the cookies to a file

def start_requests(self):
    option = webdriver.ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])  # removes the window.navigator.webdriver flag
    browser = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe", options=option)
    browser.get("https://www.zhihu.com/signin")

    # ... log in as shown above, then read the cookies out of the browser
    cookies = browser.get_cookies()
    import pickle

    # "wb+" opens the file in binary read/write mode: an existing file is truncated, a missing one is created
    pickle.dump(cookies, open("C:/Users/yjw55/ArticleSpider/ArticleSpider/cookies/zhihu.cookie", "wb+"))
    cookie_dic = {}
    for cookie in cookies:
        cookie_dic[cookie["name"]] = cookie["value"]
    return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dic)]

Some settings.py configuration

① Scrapy's cookie settings explained
COOKIES_ENABLED

  • Default: True
  • Whether the CookiesMiddleware is enabled. If disabled, no cookies are sent to the web server.

COOKIES_DEBUG

  • Default: False
  • If enabled, Scrapy logs every cookie sent in a request (the Cookie header) and every cookie received in a response (the Set-Cookie header). A minimal settings fragment is shown below.
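A sketch of just these two options in settings.py (the values are an example, not taken from the project):

# settings.py
COOKIES_ENABLED = True   # default; keep cookies on so the login cookies are sent back
COOKIES_DEBUG = True     # log Cookie / Set-Cookie headers while debugging the login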

Setting the USER_AGENT request header

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 3,
}

Reading the cookies back for Selenium

import pickle

# load the cookies that were pickled to disk earlier
cookies = pickle.load(open("C:/Users/yjw55/ArticleSpider/ArticleSpider/cookies/zhihu.cookie", "rb"))
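To actually reuse them in a Selenium session (a sketch; Selenium only accepts add_cookie once the browser is already on the target domain), feed each saved cookie back into the driver and reload the page:

browser.get("https://www.zhihu.com/")    # must be on the zhihu.com domain first
for cookie in cookies:
    cookie.pop("expiry", None)           # some driver versions reject the expiry field
    browser.add_cookie(cookie)
browser.get("https://www.zhihu.com/")    # reload: the session cookies are now attached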

Recognizing Zhihu's inverted-Chinese-character captcha

In the project root, run pip install -i https://pypi.doubanio.com/simple/ -r requirements.txt
(this installs the dependencies through the Douban PyPI mirror)

Convert the coordinates of the inverted characters into an easier-to-reason-about order:

from zheye import zheye

z = zheye()
positions = z.Recognize('zhihu_image/captcha.gif')

# zheye returns (y, x) pairs; re-order them as [x, y] and sort left-to-right
last_positions = []
if len(positions) == 2:
    if positions[0][1] > positions[1][1]:
        last_positions.append([positions[1][1], positions[1][0]])
        last_positions.append([positions[0][1], positions[0][0]])
    else:
        last_positions.append([positions[0][1], positions[0][0]])
        last_positions.append([positions[1][1], positions[1][0]])
else:
    last_positions.append([positions[0][1], positions[0][0]])

print(last_positions)

Selenium: recognizing the captcha automatically to complete the login


# Assumes module-level: import scrapy / from selenium.webdriver.common.keys import Keys
def start_requests(self):
    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--disable-extensions")
    chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])
    # attach to a Chrome instance that was started manually with remote debugging enabled
    chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

    browser = webdriver.Chrome(executable_path="E:/chromedriver/chromedriver_win32/chromedriver.exe", chrome_options=chrome_options)
    # browser = webdriver.Chrome(executable_path="E:/chromedriver/chromedriver_win32/chromedriver.exe")
    import time
    try:
        browser.maximize_window()  # maximize the window to avoid mis-locating elements
    except:
        pass
    browser.get("https://www.zhihu.com/signin")
    logo_element = browser.find_element_by_class_name("SignFlowHeader")
    # y_relative_coord = logo_element.location['y']
    # do NOT zoom the browser here, otherwise the height below cannot be obtained correctly!
    browser_navigation_panel_height = browser.execute_script('return window.outerHeight - window.innerHeight;')
    time.sleep(5)
    browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(Keys.CONTROL + "a")
    browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
        "18782902568")

    browser.find_element_by_css_selector(".SignFlow-password input").send_keys(Keys.CONTROL + "a")
    browser.find_element_by_css_selector(".SignFlow-password input").send_keys(
        "admin13")

    browser.find_element_by_css_selector(
        ".Button.SignFlow-submitButton").click()
    time.sleep(15)
    from mouse import move, click
    # move(800, 400 ,True)
    # actions = ActionChains(browser)
    # actions.move_to_element(browser.find_element_by_css_selector(
    #     ".Button.SignFlow-submitButton"))
    # actions.click(browser.find_element_by_css_selector(
    #     ".Button.SignFlow-submitButton"))
    # actions.perform()
    # actions.move_to_element_with_offset(browser.find_element_by_css_selector(
    #     ".Button.SignFlow-submitButton"), 30, 30).perform()
    # Two workarounds for Chrome version / detection problems:
    # 1. start Chrome yourself (recommended): it keeps chromedriver's tell-tale js variables out of the page,
    #    so the server cannot recognize the driver
    # 2. use Chrome 60

    # keep trying until the login succeeds
    login_success = False
    while not login_success:
        try:
            # note: this is a compound class list; if your Selenium version rejects it,
            # a CSS selector such as ".AppHeader-notifications" may be needed instead
            notify_element = browser.find_element_by_class_name("Popover PushNotifications AppHeader-notifications")
            login_success = True
        except:
            pass

        try:
            # is there an English captcha?
            english_captcha_element = browser.find_element_by_class_name("Captcha-englishImg")
        except:
            english_captcha_element = None
        try:
            # is there a Chinese (inverted-character) captcha?
            chinese_captcha_element = browser.find_element_by_class_name("Captcha-chineseImg")
        except:
            chinese_captcha_element = None

        if chinese_captcha_element:
            y_relative_coord = chinese_captcha_element.location['y']
            y_absolute_coord = y_relative_coord + browser_navigation_panel_height
            x_absolute_coord = chinese_captcha_element.location['x']
            # x_absolute_coord = 842
            # y_absolute_coord = 428

            """
            Saving the captcha image:
            1. decode the base64 data in the src attribute
            2. crop it out of a page screenshot
            """
            # 1. via the base64-encoded src attribute
            base64_text = chinese_captcha_element.get_attribute("src")
            import base64
            code = base64_text.replace('data:image/jpg;base64,', '').replace("%0A", "")
            # print(code)
            fh = open("yzm_cn.jpeg", "wb")
            fh.write(base64.b64decode(code))
            fh.close()

            from zheye import zheye
            z = zheye()
            positions = z.Recognize("yzm_cn.jpeg")

            # convert the recognized positions to [x, y] and sort left-to-right
            pos_arr = []
            if len(positions) == 2:
                if positions[0][1] > positions[1][1]:
                    pos_arr.append([positions[1][1], positions[1][0]])
                    pos_arr.append([positions[0][1], positions[0][0]])
                else:
                    pos_arr.append([positions[0][1], positions[0][0]])
                    pos_arr.append([positions[1][1], positions[1][0]])
            else:
                pos_arr.append([positions[0][1], positions[0][0]])

            if len(positions) == 2:
                first_point = [int(pos_arr[0][0] / 2), int(pos_arr[0][1] / 2)]
                second_point = [int(pos_arr[1][0] / 2), int(pos_arr[1][1] / 2)]

                move((x_absolute_coord + first_point[0]), y_absolute_coord + first_point[1])
                click()

                move((x_absolute_coord + second_point[0]), y_absolute_coord + second_point[1])
                click()

            else:
                first_point = [int(pos_arr[0][0] / 2), int(pos_arr[0][1] / 2)]

                move((x_absolute_coord + first_point[0]), y_absolute_coord + first_point[1])
                click()

            browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
                Keys.CONTROL + "a")
            browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
                "18782902568")

            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(Keys.CONTROL + "a")
            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(
                "admin1234")
            browser.find_element_by_css_selector(
                ".Button.SignFlow-submitButton").click()
            browser.find_element_by_css_selector(
                ".Button.SignFlow-submitButton").click()

        if english_captcha_element:
            # 2. via the crop method (alternative, left commented out)
            # from PIL import Image
            # image = Image.open(path)
            # image = image.crop((locations["x"], locations["y"], locations["x"] + image_size["width"],
            #                     locations["y"] + image_size["height"]))  # defines crop points
            #
            # rgb_im = image.convert('RGB')
            # rgb_im.save("D:/ImoocProjects/python_scrapy/coding-92/ArticleSpider/tools/image/yzm.jpeg",
            #             'jpeg')  # saves new cropped image
            # 1. via the base64-encoded src attribute
            base64_text = english_captcha_element.get_attribute("src")
            import base64
            code = base64_text.replace('data:image/jpg;base64,', '').replace("%0A", "")
            # print(code)
            fh = open("yzm_en.jpeg", "wb")
            fh.write(base64.b64decode(code))
            fh.close()

            # hand the English captcha to the yundama online recognition service
            from tools.yundama_requests import YDMHttp
            yundama = YDMHttp("da_ge_da1", "dageda", 3129, "40d5ad41c047179fc797631e3b9c3025")
            code = yundama.decode("yzm_en.jpeg", 5000, 60)
            while True:
                if code == "":
                    code = yundama.decode("yzm_en.jpeg", 5000, 60)
                else:
                    break
            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(Keys.CONTROL + "a")
            browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[2]/div[1]/form/div[3]/div/div/div[1]/input').send_keys(code)

            browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
                Keys.CONTROL + "a")
            browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
                "18782902568")

            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(Keys.CONTROL + "a")
            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(
                "admin1234")
            submit_ele = browser.find_element_by_css_selector(".Button.SignFlow-submitButton")
            browser.find_element_by_css_selector(".Button.SignFlow-submitButton").click()

        time.sleep(10)
        try:
            notify_element = browser.find_element_by_class_name("Popover PushNotifications AppHeader-notifications")
            login_success = True

            Cookies = browser.get_cookies()
            print(Cookies)
            cookie_dict = {}
            import pickle
            for cookie in Cookies:
                # write each cookie to its own file
                # adjust the path to wherever your project keeps its cookies
                f = open('./ArticleSpider/cookies/zhihu/' + cookie['name'] + '.zhihu', 'wb')
                pickle.dump(cookie, f)
                f.close()
                cookie_dict[cookie['name']] = cookie['value']
            browser.close()
            return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dict)]
        except:
            pass

    print("yes")

Simulating a Zhihu login with requests (optional)

# sunlands_login_requests.py

import requests
try:
    import cookielib
except:
    import http.cookiejar as cookielib

import re

session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")
try:
    session.cookies.load(ignore_discard=True)
except:
    print("failed to load cookies")

agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    'User-Agent': agent
}

def is_login():
    # judge the login state from the status code returned for a personal page
    inbox_url = "https://www.zhihu.com/question/56250357/answer/148534773"
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    if response.status_code != 200:
        return False
    else:
        return True

def get_xsrf():
    # fetch the xsrf code
    response = session.get("https://www.zhihu.com", headers=header)
    match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return (match_obj.group(1))
    else:
        return ""


def get_index():
    response = session.get("https://www.zhihu.com", headers=header)
    with open("index_page.html", "wb") as f:
        f.write(response.text.encode("utf-8"))
    print("ok")

def get_captcha():
    import time
    t = str(int(time.time() * 1000))
    captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
    t = session.get(captcha_url, headers=header)
    with open("captcha.jpg", "wb") as f:
        f.write(t.content)
        f.close()

    from PIL import Image
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:
        pass

    captcha = input("Enter the captcha\n>")
    return captcha

def zhihu_login(account, password):
    # log in to Zhihu
    if re.match("^1\d{10}", account):
        print("logging in with a phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
            "captcha": get_captcha()
        }
    else:
        if "@" in account:
            # the account looks like an email address
            print("logging in with an email address")
            post_url = "https://www.zhihu.com/login/email"
            post_data = {
                "_xsrf": get_xsrf(),
                "email": account,
                "password": password
            }

    response_text = session.post(post_url, data=post_data, headers=header)
    session.cookies.save()

zhihu_login("18782902568", "admin123")
# get_index()
is_login()

Simulating the Zhihu login inside Scrapy (optional)

def start_requests(self):
    return [scrapy.Request('https://www.zhihu.com/#signin', headers=self.headers, callback=self.login)]

def login(self, response):
    response_text = response.text
    match_obj = re.match('.*name="_xsrf" value="(.*?)"', response_text, re.DOTALL)
    xsrf = ''
    if match_obj:
        xsrf = (match_obj.group(1))

    if xsrf:
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": xsrf,
            "phone_num": "",
            "password": "",
            "captcha": ""
        }

        import time
        t = str(int(time.time() * 1000))
        captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
        yield scrapy.Request(captcha_url, headers=self.headers, meta={"post_data": post_data}, callback=self.login_after_captcha)


def login_after_captcha(self, response):
    with open("captcha.jpg", "wb") as f:
        f.write(response.body)
        f.close()

    from PIL import Image
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:
        pass

    captcha = input("Enter the captcha\n>")

    post_data = response.meta.get("post_data", {})
    post_url = "https://www.zhihu.com/login/phone_num"
    post_data["captcha"] = captcha
    return [scrapy.FormRequest(
        url=post_url,
        formdata=post_data,
        headers=self.headers,
        callback=self.check_login
    )]

def check_login(self, response):
    # inspect the data the server returns to decide whether the login succeeded
    text_json = json.loads(response.text)
    # "登录成功" is the literal "login successful" message in the server's response
    if "msg" in text_json and text_json["msg"] == "登录成功":
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, headers=self.headers)

Logging in to Zhihu by attaching a cookie to the request

# the crawl starts from here
def start_requests(self):
    # this is the cookie string captured after logging in manually
    cookies = '''
    _zap=e1a0721a-d85f-4636-b109-549fe2920bbd; _xsrf=fJjAzECZacKvHcUuS4slEC0HpoAfCwFB;
    '''
    # turn the string into a dict (strip surrounding whitespace and the trailing semicolon first)
    cookies = {i.split("=")[0]: i.split("=")[1] for i in cookies.strip().strip(';').split("; ")}
    # requests yielded without a callback go to parse by default, so callback=self.parse can be omitted
    yield scrapy.Request('https://www.zhihu.com/', headers=self.headers, cookies=cookies)

settings.py (configuration file; do not disable cookies):

# disabling cookies would break the login (they are enabled by default), so keep this commented out
# COOKIES_ENABLED = False
COOKIES_DEBUG = True  # turn on cookie debug logging

Understanding yield

  • yield item sends the item to the pipelines for processing;
  • yield scrapy.Request sends the request to the downloader to be fetched (see the sketch below).
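A minimal parse method illustrating both paths (a sketch only; SomeItem and the selectors are placeholders, not part of this project):

def parse(self, response):
    item = SomeItem()                       # hypothetical item class
    item["title"] = response.css("h1::text").get()
    yield item                              # -> goes to the item pipelines

    next_page = response.css("a.next::attr(href)").get()
    if next_page:
        # -> goes back to the scheduler/downloader to be fetched
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)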

Implementing the Zhihu crawl logic and extracting with ItemLoader

Utility functions and the required settings.py entries

The utils.common helper module:

# common.py
import re

def extract_num(text):
    # pull the first number out of a string
    match_re = re.match(".*?(\d+).*", text)
    if match_re:
        nums = int(match_re.group(1))
    else:
        nums = 0

    return nums
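For example (the strings here are illustrative inputs, similar to the scraped answer-count text):

print(extract_num("1200 个回答"))   # -> 1200
print(extract_num("暂无回答"))      # -> 0 (no digits found)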

Configure the date output formats and ITEM_PIPELINES in settings.py:

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.MysqlTwistedPipline': 4,
}

SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"

Implementing the Zhihu spider's crawl logic

# zhihu.py
import scrapy
from urllib import parse
import re
import datetime
from items import ZhihuQuestionItem, ZhihuAnswerItem
from scrapy.loader import ItemLoader
import json

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    # request url for the first page of a question's answers
    start_answer_url = "https://www.zhihu.com/api/v4/questions/{0}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit={1}&offset={2}&platform=desktop&sort_by=default"


    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }

    def parse(self, response):
        """
        Extract every url from the html page and follow it for further crawling.
        If an extracted url has the form /question/xxx, download it and go straight to the question parser.
        """
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [parse.urljoin(response.url, url) for url in all_urls]
        # filter() drops elements that do not satisfy the condition; the lambda receives each url as x
        all_urls = filter(lambda x: True if x.startswith("https") else False, all_urls)
        for url in all_urls:
            match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*", url)
            if match_obj:
                request_url = match_obj.group(1)
                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)
                # break
            else:
                # not a question page: keep following it
                yield scrapy.Request(url, headers=self.headers, callback=self.parse)


    def parse_question(self, response):
        # handle the question page and extract a concrete question item from it
        match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*", response.url)
        if match_obj:
            question_id = int(match_obj.group(2))

        item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
        # data extracted via css is passed to the input processor of the named field;
        # the results are collected inside the ItemLoader (not yet assigned to the item)
        item_loader.add_css("title", "h1.QuestionHeader-title::text")
        item_loader.add_css("content", ".QuestionHeader-detail")
        item_loader.add_value("url", response.url)
        item_loader.add_value("zhihu_id", question_id)
        item_loader.add_css("answer_num", ".List-headerText span::text")
        item_loader.add_css("topics", ".QuestionHeader-topics .Popover div::text")

        question_item = item_loader.load_item()

        yield scrapy.Request(self.start_answer_url.format(question_id, 20, 0), headers=self.headers,
                             callback=self.parse_answer)
        # the collected values are assigned to the item, which then goes down the pipelines
        yield question_item


    def parse_answer(self, response):
        # handle a question's answers
        ans_json = json.loads(response.text)
        is_end = ans_json["paging"]["is_end"]
        next_url = ans_json["paging"]["next"]

        # extract the concrete answer fields
        for answer in ans_json["data"]:
            answer_item = ZhihuAnswerItem()

            answer_item["zhihu_id"] = answer["id"]
            answer_item["url"] = answer["url"]
            answer_item["question_id"] = answer["question"]["id"]
            answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
            answer_item["content"] = answer["content"] if "content" in answer else None
            answer_item["parise_num"] = answer["voteup_count"]
            answer_item["comments_num"] = answer["comment_count"]
            answer_item["create_time"] = answer["created_time"]
            answer_item["update_time"] = answer["updated_time"]
            answer_item["crawl_time"] = datetime.datetime.now()

            yield answer_item

        if not is_end:
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse_answer)


    # the crawl starts from here
    def start_requests(self):
        # this is the cookie string captured after logging in manually
        cookies = '''
        _zap=e1a0721a-d85f-4636-b109-549fe2920bbd; _xsrf=fJjAzECZacKvHcUuS4slEC0HpoAfCwFB; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1590331566,1590365636,1590366632,1590372221; '''
        # turn the string into a dict (strip surrounding whitespace and the trailing semicolon first)
        cookies = {i.split("=")[0]: i.split("=")[1] for i in cookies.strip().strip(';').split("; ")}
        # requests yielded without a callback go to parse by default, so callback=self.parse can be omitted
        yield scrapy.Request('https://www.zhihu.com/', headers=self.headers, cookies=cookies)

Extraction with ItemLoader (items.py) and saving the data to MySQL

Write to MySQL asynchronously:

# pipelines.py
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)

        return cls(dbpool)

    def process_item(self, item, spider):
        # use twisted to turn the mysql insert into an asynchronous operation
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # run the actual insert
        # build a different sql statement for each item type and insert it into mysql
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
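from_settings reads the connection parameters from settings.py, so the following keys must be defined there (the values shown are only placeholders for your own database):

# settings.py -- values are examples only
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "root"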

ItemLoader-style extraction (items.py):

# items.py
import datetime
import scrapy

from utils.common import extract_num
from settings import SQL_DATETIME_FORMAT, SQL_DATE_FORMAT

class ZhihuQuestionItem(scrapy.Item):
    # item for a Zhihu question
    zhihu_id = scrapy.Field()
    topics = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    answer_num = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        # sql statement for inserting into the zhihu_question table
        insert_sql = """
            insert into zhihu_question(zhihu_id, topics, url, title, content, answer_num, crawl_time
              )
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE content=VALUES(content), answer_num=VALUES(answer_num)
        """
        # the default ItemLoader collects every field as a list, so index/join before inserting
        zhihu_id = self["zhihu_id"][0]
        topics = ",".join(self["topics"])
        url = self["url"][0]
        title = "".join(self["title"])
        content = "".join(self["content"])
        answer_num = extract_num("".join(self["answer_num"]))

        crawl_time = datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)

        params = (zhihu_id, topics, url, title, content, answer_num, crawl_time)

        return insert_sql, params

class ZhihuAnswerItem(scrapy.Item):
    # item for a Zhihu answer
    zhihu_id = scrapy.Field()
    url = scrapy.Field()
    question_id = scrapy.Field()
    author_id = scrapy.Field()
    content = scrapy.Field()
    parise_num = scrapy.Field()
    comments_num = scrapy.Field()
    create_time = scrapy.Field()
    update_time = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        # sql statement for inserting into the zhihu_answer table
        insert_sql = """
            insert into zhihu_answer(zhihu_id, url, question_id, author_id, content, parise_num, comments_num,
              create_time, update_time, crawl_time
              ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE content=VALUES(content), comments_num=VALUES(comments_num), parise_num=VALUES(parise_num),
              update_time=VALUES(update_time)
        """

        create_time = datetime.datetime.fromtimestamp(self["create_time"]).strftime(SQL_DATETIME_FORMAT)
        update_time = datetime.datetime.fromtimestamp(self["update_time"]).strftime(SQL_DATETIME_FORMAT)
        params = (
            self["zhihu_id"], self["url"], self["question_id"],
            self["author_id"], self["content"], self["parise_num"],
            self["comments_num"], create_time, update_time,
            self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
        )

        return insert_sql, params