The session and cookie automatic login mechanism

A cookie is what the server returns to the client when the client makes a request, so that the server can tell different clients apart; it is stored locally on the client as key-value pairs (a dictionary).

On every subsequent request the client sends the cookie along, which is how the server identifies which client made the request.

A session is stored on the server side; it is the identity data the server generates after a user logs in, and usually consists of a session ID, the session data, and an expiration time.

The flow: after the user logs in, the server creates a session and returns the session ID to the client, which stores it in a cookie; every later request/response exchange then carries that session ID in the cookie.
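To make the flow concrete, here is a minimal, framework-free sketch (every name in it is made up for illustration): the server keeps the session data, and only the session ID travels back and forth in the cookie.

import time
import uuid

SESSIONS = {}  # server side: session ID -> (session data, expiry time)

def server_login(username):
    # after the user authenticates, the server creates a session and tells the client which cookie to store
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = ({"user": username}, time.time() + 3600)  # expires in one hour
    return "sessionid=" + session_id          # value the server would put in Set-Cookie

def server_handle_request(cookie):
    # on every later request the client sends the cookie back and the server looks the session up
    session_id = cookie.split("=", 1)[1]
    session_data, expiry = SESSIONS.get(session_id, ({}, 0))
    return session_data if time.time() < expiry else None

cookie = server_login("kevin")        # the client stores this cookie locally
print(server_handle_request(cookie))  # -> {'user': 'kevin'}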

Simulating Zhihu login with Selenium

Installing Selenium and opening the page

Install Selenium: pip install selenium

Create a new zhihu.py module in the spiders folder and define a ZhihuSpider class:

import scrapy
from selenium import webdriver

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    def parse(self, response):
        pass

The start_requests method runs before every crawl, so we override it to send the account and password to Zhihu:

def start_requests(self):
    import time
    browser = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    browser.get("https://www.zhihu.com/signin")
    # switch to the password-login tab, fill in the account and password, then submit
    browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/div[1]/div[2]').click()
    browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/div[2]/div/label/input').send_keys("shy_kevin@qq.com")
    browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/div[3]/div/label/input').send_keys("zuiaimeiyi520")
    browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/button').click()
    time.sleep(60)

What if chromedriver gets detected during the simulated login?

Although chromedriver can now drive the browser through the login, the browser is still controlled by chromedriver, and chromedriver has traits that JavaScript can detect. Many sites therefore embed JS logic that checks whether the current browser is driver-controlled, for example by looking for the tell-tale markers $cdc_lasutopfhvcZLmcfl and window.navigator.webdriver.

① Getting rid of the $cdc_lasutopfhvcZLmcfl marker requires patching the chromedriver binary itself; see https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver for the details. A rough sketch of the idea follows.
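A sketch of that patch, assuming you have backed up the driver and have write access to it (the path and the replacement string are placeholders): rename the cdc_ marker inside the binary with another string of the same length, so the injected $cdc_... variable no longer matches what detection scripts look for.

# patch_chromedriver.py - sketch only; back up the original driver first
driver_path = "C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe"

with open(driver_path, "rb") as f:
    data = f.read()

# the replacement must be exactly as long as "cdc_" so binary offsets stay intact
patched = data.replace(b"cdc_", b"dog_")
print("patched %d occurrence(s) of cdc_" % data.count(b"cdc_"))

with open(driver_path, "wb") as f:
    f.write(patched)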

② window.navigator.webdriver can be hidden like this:

option = webdriver.ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])  # removes the window.navigator.webdriver flag

def start_requests(self):
    option = webdriver.ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])  # removes the window.navigator.webdriver flag
    browser = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe", options=option)

Press Ctrl+A to select the existing text before typing, so the input is not entered twice:

from selenium.webdriver.common.keys import Keys

browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[1]/div/form/div[2]/div/label/input').send_keys(Keys.CONTROL+"a")

Installing the mouse module to click at screen coordinates

Install the mouse module: pip install mouse

from mouse import move, click

move(895, 603)  # move the system cursor to an on-screen coordinate
click()         # click at the current cursor position

Dumping the cookies to a file

def start_requests(self):
    option = webdriver.ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])  # removes the window.navigator.webdriver flag
    browser = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe", options=option)
    browser.get("https://www.zhihu.com/signin")

    cookies = browser.get_cookies()
    import pickle

    # "wb+" opens the file for binary read/write; an existing file is truncated, a missing one is created
    pickle.dump(cookies, open("C:/Users/yjw55/ArticleSpider/ArticleSpider/cookies/zhihu.cookie", "wb+"))
    cookie_dic = {}
    for cookie in cookies:
        cookie_dic[cookie["name"]] = cookie["value"]
    return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dic)]

Some notes on settings.py configuration

① Scrapy's cookie settings explained
COOKIES_ENABLED

  • Default: True
  • Whether to enable the CookiesMiddleware. If disabled, no cookies are sent to the web server.

COOKIES_DEBUG

  • Default: False
  • If enabled, Scrapy logs all cookies sent in requests (the Cookie request header) and all cookies received in responses (the Set-Cookie response header).

Setting the USER_AGENT request header

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 3,
}

Loading the cookies back into Selenium

import pickle

# the browser must already be on the zhihu.com domain before add_cookie is called
cookies = pickle.load(open("C:/Users/yjw55/ArticleSpider/ArticleSpider/cookies/zhihu.cookie", "rb"))
for cookie in cookies:
    browser.add_cookie(cookie)

Recognizing Zhihu's inverted-character captcha

Run pip install -i https://pypi.doubanio.com/simple/ -r requirements.txt in the project root (this installs from the Douban mirror).

Rearrange the coordinates of the inverted characters into an easier-to-understand order:

from zheye import zheye

z = zheye()
positions = z.Recognize('zhihu_image/captcha.gif')  # zheye returns (row, col) pairs, i.e. (y, x)

# sort the points left-to-right by column and store each one as [x, y]
last_positions = []
if len(positions) == 2:
    if positions[0][1] > positions[1][1]:
        last_positions.append([positions[1][1], positions[1][0]])
        last_positions.append([positions[0][1], positions[0][0]])
    else:
        last_positions.append([positions[0][1], positions[0][0]])
        last_positions.append([positions[1][1], positions[1][0]])
else:
    last_positions.append([positions[0][1], positions[0][0]])

print(last_positions)

Selenium: recognize the captcha automatically and complete the simulated login


def start_requests(self):
    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--disable-extensions")
    chrome_options.add_experimental_option("excludeSwitches",['enable-automation'])
    chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

    browser = webdriver.Chrome(executable_path="E:/chromedriver/chromedriver_win32/chromedriver.exe",  chrome_options=chrome_options)
    # browser = webdriver.Chrome(executable_path="E:/chromedriver/chromedriver_win32/chromedriver.exe")
    import time
    try:
        browser.maximize_window()  # maximize the window to avoid mispositioned clicks
    except:
        pass
    browser.get("https://www.zhihu.com/signin")
    logo_element = browser.find_element_by_class_name("SignFlowHeader")
    # y_relative_coord = logo_element.location['y']
    # do not zoom the browser here, or the height calculation below will fail!!!
    browser_navigation_panel_height = browser.execute_script('return window.outerHeight - window.innerHeight;')
    time.sleep(5)
    browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(Keys.CONTROL + "a")
    browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
        "18782902568")

    browser.find_element_by_css_selector(".SignFlow-password input").send_keys(Keys.CONTROL + "a")
    browser.find_element_by_css_selector(".SignFlow-password input").send_keys(
        "admin13")

    browser.find_element_by_css_selector(
        ".Button.SignFlow-submitButton").click()
    time.sleep(15)
    from mouse import move, click
    # move(800, 400 ,True)
    # actions = ActionChains(browser)
    # actions.move_to_element(browser.find_element_by_css_selector(
    #     ".Button.SignFlow-submitButton"))
    # actions.click(browser.find_element_by_css_selector(
    #     ".Button.SignFlow-submitButton"))
    # actions.perform()
    # actions.move_to_element_with_offset(browser.find_element_by_css_selector(
    #     ".Button.SignFlow-submitButton"), 30, 30).perform()
    # Two ways to work around the Chrome version problem:
    # 1. Start Chrome yourself (recommended): this keeps chromedriver from being detected,
    #    since the JS variables chromedriver injects can be spotted by the server
    # 2. Use Chrome 60

    # first check whether we are already logged in
    login_success = False
    while not login_success:
        try:
            # the class_name locator cannot take multiple classes, so use a CSS selector
            notify_element = browser.find_element_by_css_selector(".Popover.PushNotifications.AppHeader-notifications")
            login_success = True
        except:
            pass

        try:
            # check whether an English captcha is shown
            english_captcha_element = browser.find_element_by_class_name("Captcha-englishImg")
        except:
            english_captcha_element = None
        try:
            # check whether a Chinese (inverted-character) captcha is shown
            chinese_captcha_element = browser.find_element_by_class_name("Captcha-chineseImg")
        except:
            chinese_captcha_element = None

        if chinese_captcha_element:
            y_relative_coord = chinese_captcha_element.location['y']
            y_absolute_coord = y_relative_coord + browser_navigation_panel_height
            x_absolute_coord = chinese_captcha_element.location['x']
            # x_absolute_coord = 842
            # y_absolute_coord = 428

            """
            保存图片
            1. 通过保存base64编码
            2. 通过crop方法
            """
            # 1. 通过保存base64编码
            base64_text = chinese_captcha_element.get_attribute("src")
            import base64
            code = base64_text.replace('data:image/jpg;base64,', '').replace("%0A", "")
            # print code
            fh = open("yzm_cn.jpeg", "wb")
            fh.write(base64.b64decode(code))
            fh.close()

            from zheye import zheye
            z = zheye()
            positions = z.Recognize("yzm_cn.jpeg")

            pos_arr = []
            if len(positions) == 2:
                if positions[0][1] > positions[1][1]:
                    pos_arr.append([positions[1][1], positions[1][0]])
                    pos_arr.append([positions[0][1], positions[0][0]])
                else:
                    pos_arr.append([positions[0][1], positions[0][0]])
                    pos_arr.append([positions[1][1], positions[1][0]])
            else:
                pos_arr.append([positions[0][1], positions[0][0]])

            if len(positions) == 2:
                first_point = [int(pos_arr[0][0] / 2), int(pos_arr[0][1] / 2)]
                second_point = [int(pos_arr[1][0] / 2), int(pos_arr[1][1] / 2)]

                move((x_absolute_coord + first_point[0]), y_absolute_coord + first_point[1])
                click()

                move((x_absolute_coord + second_point[0]), y_absolute_coord + second_point[1])
                click()

            else:
                first_point = [int(pos_arr[0][0] / 2), int(pos_arr[0][1] / 2)]

                move((x_absolute_coord + first_point[0]), y_absolute_coord + first_point[1])
                click()

            browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
                Keys.CONTROL + "a")
            browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
                "18782902568")

            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(Keys.CONTROL + "a")
            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(
                "admin1234")
            browser.find_element_by_css_selector(
                ".Button.SignFlow-submitButton").click()
            browser.find_element_by_css_selector(
                ".Button.SignFlow-submitButton").click()

        if english_captcha_element:
            # 2. crop the captcha out of a screenshot
            # from PIL import Image
            # image = Image.open(path)
            # image = image.crop((locations["x"], locations["y"], locations["x"] + image_size["width"],
            #                     locations["y"] + image_size["height"]))  # defines crop points
            #
            # rgb_im = image.convert('RGB')
            # rgb_im.save("D:/ImoocProjects/python_scrapy/coding-92/ArticleSpider/tools/image/yzm.jpeg",
            #             'jpeg')  # saves new cropped image
            # 1. decode the base64 data in the img src
            base64_text = english_captcha_element.get_attribute("src")
            import base64
            code = base64_text.replace('data:image/jpg;base64,', '').replace("%0A", "")
            # print code
            fh = open("yzm_en.jpeg", "wb")
            fh.write(base64.b64decode(code))
            fh.close()

            from tools.yundama_requests import YDMHttp
            yundama = YDMHttp("da_ge_da1", "dageda", 3129, "40d5ad41c047179fc797631e3b9c3025")
            code = yundama.decode("yzm_en.jpeg", 5000, 60)
            while True:
                if code == "":
                    code = yundama.decode("yzm_en.jpeg", 5000, 60)
                else:
                    break
            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(Keys.CONTROL + "a")
            browser.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[2]/div[1]/form/div[3]/div/div/div[1]/input').send_keys(code)

            browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
                Keys.CONTROL + "a")
            browser.find_element_by_css_selector(".SignFlow-accountInput.Input-wrapper input").send_keys(
                "18782902568")

            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(Keys.CONTROL + "a")
            browser.find_element_by_css_selector(".SignFlow-password input").send_keys(
                "admin1234")
            submit_ele = browser.find_element_by_css_selector(".Button.SignFlow-submitButton")
            browser.find_element_by_css_selector(".Button.SignFlow-submitButton").click()

        time.sleep(10)
        try:
            notify_element = browser.find_element_by_css_selector(".Popover.PushNotifications.AppHeader-notifications")
            login_success = True

            Cookies = browser.get_cookies()
            print(Cookies)
            cookie_dict = {}
            import pickle
            for cookie in Cookies:
                # write each cookie to its own file
                # change this path to match your own project layout
                f = open('./ArticleSpider/cookies/zhihu/' + cookie['name'] + '.zhihu', 'wb')
                pickle.dump(cookie, f)
                f.close()
                cookie_dict[cookie['name']] = cookie['value']
            browser.close()
            return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dict)]
        except:
            pass

    print("yes")

Simulating Zhihu login with requests (optional)

#sunlands_login_requests.py

import requests
try:
    import cookielib
except:
    import http.cookiejar as cookielib

import re

session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")
try:
    session.cookies.load(ignore_discard=True)
except:
    print ("cookie未能加载")

agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
header = {
    "HOST":"www.zhihu.com",
    "Referer": "https://www.zhizhu.com",
    'User-Agent': agent
}

def is_login():
    # use the status code of a page that requires login to decide whether we are logged in
    inbox_url = "https://www.zhihu.com/question/56250357/answer/148534773"
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    if response.status_code != 200:
        return False
    else:
        return True

def get_xsrf():
    # get the xsrf code
    response = session.get("https://www.zhihu.com", headers=header)
    match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return (match_obj.group(1))
    else:
        return ""


def get_index():
    response = session.get("https://www.zhihu.com", headers=header)
    with open("index_page.html", "wb") as f:
        f.write(response.text.encode("utf-8"))
    print ("ok")

def get_captcha():
    import time
    t = str(int(time.time()*1000))
    captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
    t = session.get(captcha_url, headers=header)
    with open("captcha.jpg","wb") as f:
        f.write(t.content)
        f.close()

    from PIL import Image
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:
        pass

    captcha = input("输入验证码\n>")
    return captcha

def zhihu_login(account, password):
    # Zhihu login
    if re.match("^1\d{10}",account):
        print ("手机号码登录")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
            "captcha":get_captcha()
        }
    else:
        if "@" in account:
            # the account looks like an email address
            print("Logging in with an email address")
            post_url = "https://www.zhihu.com/login/email"
            post_data = {
                "_xsrf": get_xsrf(),
                "email": account,
                "password": password
            }

    response_text = session.post(post_url, data=post_data, headers=header)
    session.cookies.save()

zhihu_login("18782902568", "admin123")
# get_index()
is_login()

Simulating Zhihu login with Scrapy (optional)

def start_requests(self):
    return [scrapy.Request('https://www.zhihu.com/#signin', headers=self.headers, callback=self.login)]

def login(self, response):
    response_text = response.text
    match_obj = re.match('.*name="_xsrf" value="(.*?)"', response_text, re.DOTALL)
    xsrf = ''
    if match_obj:
        xsrf = (match_obj.group(1))

    if xsrf:
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": xsrf,
            "phone_num": "",
            "password": "",
            "captcha": ""
        }

        import time
        t = str(int(time.time() * 1000))
        captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
        yield scrapy.Request(captcha_url, headers=self.headers, meta={"post_data":post_data}, callback=self.login_after_captcha)


def login_after_captcha(self, response):
    with open("captcha.jpg", "wb") as f:
        f.write(response.body)
        f.close()

    from PIL import Image
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:
        pass

    captcha = input("输入验证码\n>")

    post_data = response.meta.get("post_data", {})
    post_url = "https://www.zhihu.com/login/phone_num"
    post_data["captcha"] = captcha
    return [scrapy.FormRequest(
        url=post_url,
        formdata=post_data,
        headers=self.headers,
        callback=self.check_login
    )]

def check_login(self, response):
    # check the data returned by the server to decide whether the login succeeded
    text_json = json.loads(response.text)
    if "msg" in text_json and text_json["msg"] == "登录成功":  # "登录成功" means "login succeeded"
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, headers=self.headers)

Logging into Zhihu by sending cookies with the request

    # the crawl starts here
    def start_requests(self):
        # this cookie string was captured after logging in manually
        cookies = '''
        _zap=e1a0721a-d85f-4636-b109-549fe2920bbd; _xsrf=fJjAzECZacKvHcUuS4slEC0HpoAfCwFB; 
        '''
        # convert the string to a dict, skipping empty fragments
        cookies = {i.strip().split("=", 1)[0]: i.strip().split("=", 1)[1] for i in cookies.split(";") if "=" in i}
        # responses go to parse by default, so callback=self.parse can be omitted
        yield scrapy.Request('https://www.zhihu.com/', headers=self.headers, cookies=cookies)

settings.py (configuration file; do not disable cookies):

# disable cookies (cookies are enabled by default)
# COOKIES_ENABLED = False
COOKIES_DEBUG = True  # turn on cookie debug logging

Understanding yield

  • yield item: the item is handed to the pipelines for processing;
  • yield Request: the request is handed to the downloader to be fetched (see the sketch below).
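A minimal spider sketch showing both cases (the spider name, URL, and selectors are placeholders):

import scrapy

class YieldDemoSpider(scrapy.Spider):
    name = "yield_demo"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # yielding an item (or a dict) hands it to the item pipelines
        yield {"title": response.css("title::text").get()}

        # yielding a Request hands it to the scheduler and then the downloader;
        # the downloaded response is passed to the callback
        next_url = response.css("a.next::attr(href)").get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)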

Implementing the Zhihu crawl logic and extracting with the Item Loader

Utility module and required settings.py configuration

The utils.common utility module:

#common.py
import re

def extract_num(text):
    # extract the first run of digits from a string
    match_re = re.match(".*?(\d+).*", text)
    if match_re:
        nums = int(match_re.group(1))
    else:
        nums = 0

    return nums
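A quick check of the helper with made-up inputs:

print(extract_num("1024 个回答"))  # -> 1024
print(extract_num("暂无回答"))      # -> 0 (no digits found)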

Set the date output formats and ITEM_PIPELINES in settings.py:

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.MysqlTwistedPipline': 4,
}

SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"

Implementing the Zhihu spider's crawl logic

#zhihu.py
import scrapy
from urllib import parse
import re
import datetime
from items import ZhihuQuestionItem,ZhihuAnswerItem
from scrapy.loader import ItemLoader
import json

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    # request URL for the first page of answers to a question
    start_answer_url = "https://www.zhihu.com/api/v4/questions/{0}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit={1}&offset={2}&platform=desktop&sort_by=default"


    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhizhu.com",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }

    def parse(self, response):
        """
        Extract all URLs from the HTML page and follow them for further crawling.
        If an extracted URL matches /question/xxx, download it and go straight to the parsing function.
        """
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [parse.urljoin(response.url,url) for url in all_urls]
        # filter() keeps only the elements that satisfy the condition; the lambda receives each url as x
        all_urls = filter(lambda x:True if x.startswith("https") else False, all_urls)
        for url in all_urls:
            match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*", url)
            if match_obj:
                request_url = match_obj.group(1)
                yield scrapy.Request(request_url,headers=self.headers,callback=self.parse_question)
                # break
            else:
                # not a question page, so just keep following its links
                yield scrapy.Request(url,headers=self.headers,callback=self.parse)

        pass


    def parse_question(self,response):
        # handle a question page and extract a concrete question item from it
        match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*", response.url)
        if match_obj:
            question_id = int(match_obj.group(2))

        item_loader = ItemLoader(item=ZhihuQuestionItem(),response=response)
        # data extracted by the CSS selector is fed to the input processor of the title field; the results are collected inside the ItemLoader but not yet assigned to the item
        item_loader.add_css("title", "h1.QuestionHeader-title::text")
        item_loader.add_css("content", ".QuestionHeader-detail")
        item_loader.add_value("url", response.url)
        item_loader.add_value("zhihu_id", question_id)
        item_loader.add_css("answer_num", ".List-headerText span::text")
        item_loader.add_css("topics", ".QuestionHeader-topics .Popover div::text")

        question_item = item_loader.load_item()

        yield scrapy.Request(self.start_answer_url.format(question_id, 20, 0), headers=self.headers,
                             callback=self.parse_answer)
        # assign the collected values to the item and send it through the pipelines
        yield question_item


    def parse_answer(self,response):
        # handle the answers of a question
        ans_json = json.loads(response.text)
        is_end = ans_json["paging"]["is_end"]
        next_url = ans_json["paging"]["next"]

        # extract the concrete answer fields
        for answer in ans_json["data"]:
            answer_item = ZhihuAnswerItem()

            answer_item["zhihu_id"] = answer["id"]
            answer_item["url"] = answer["url"]
            answer_item["question_id"] = answer["question"]["id"]
            answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
            answer_item["content"] = answer["content"] if "content" in answer else None
            answer_item["parise_num"] = answer["voteup_count"]
            answer_item["comments_num"] = answer["comment_count"]
            answer_item["create_time"] = answer["created_time"]
            answer_item["update_time"] = answer["updated_time"]
            answer_item["crawl_time"] = datetime.datetime.now()

            yield answer_item

        pass
        if not is_end:
            yield scrapy.Request(next_url,headers=self.headers,callback=self.parse_answer)


    # the crawl starts here
    def start_requests(self):
        # this cookie string was captured after logging in manually
        cookies = '''
        _zap=e1a0721a-d85f-4636-b109-549fe2920bbd; _xsrf=fJjAzECZacKvHcUuS4slEC0HpoAfCwFB; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1590331566,1590365636,1590366632,1590372221;  '''
        # convert the string to a dict, skipping empty fragments
        cookies = {i.strip().split("=", 1)[0]: i.strip().split("=", 1)[1] for i in cookies.split(";") if "=" in i}
        # responses go to parse by default, so callback=self.parse can be omitted
        yield scrapy.Request('https://www.zhihu.com/', headers=self.headers, cookies=cookies)

Item Loader extraction (items.py) and saving the data to MySQL

Write to MySQL with an asynchronous mechanism:

#pipelines.py
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host = settings["MYSQL_HOST"],
            db = settings["MYSQL_DBNAME"],
            user = settings["MYSQL_USER"],
            passwd = settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)

        return cls(dbpool)

    def process_item(self, item, spider):
        # use twisted to run the MySQL insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle failures
        return item

    def handle_error(self, failure, item, spider):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # run the actual insert
        # build a different SQL statement for each item type and insert it into MySQL
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

Item Loader extraction (items.py):

# items.py
import datetime

import scrapy

from utils.common import extract_num
from settings import SQL_DATETIME_FORMAT, SQL_DATE_FORMAT

class ZhihuQuestionItem(scrapy.Item):
    # Zhihu question item
    zhihu_id = scrapy.Field()
    topics = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    answer_num = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        # SQL for inserting into the zhihu_question table
        insert_sql = """
            insert into zhihu_question(zhihu_id, topics, url, title, content, answer_num,crawl_time
              )
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE content=VALUES(content), answer_num=VALUES(answer_num)
            """
        zhihu_id = self["zhihu_id"][0]
        topics = ",".join(self["topics"])
        url = self["url"][0]
        title = "".join(self["title"])
        content = "".join(self["content"])
        answer_num = extract_num("".join(self["answer_num"]))

        crawl_time = datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)

        params = (zhihu_id,topics,url,title,content,answer_num,crawl_time)

        return insert_sql, params

class ZhihuAnswerItem(scrapy.Item):
    # Zhihu answer item
    zhihu_id = scrapy.Field()
    url = scrapy.Field()
    question_id = scrapy.Field()
    author_id = scrapy.Field()
    content = scrapy.Field()
    parise_num = scrapy.Field()
    comments_num = scrapy.Field()
    create_time = scrapy.Field()
    update_time = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        # SQL for inserting into the zhihu_answer table
        insert_sql = """
            insert into zhihu_answer(zhihu_id, url, question_id, author_id, content, parise_num, comments_num,
              create_time, update_time, crawl_time
              ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
              ON DUPLICATE KEY UPDATE content=VALUES(content), comments_num=VALUES(comments_num), parise_num=VALUES(parise_num),
              update_time=VALUES(update_time)
        """

        create_time = datetime.datetime.fromtimestamp(self["create_time"]).strftime(SQL_DATETIME_FORMAT)
        update_time = datetime.datetime.fromtimestamp(self["update_time"]).strftime(SQL_DATETIME_FORMAT)
        params = (
            self["zhihu_id"], self["url"], self["question_id"],
            self["author_id"], self["content"], self["parise_num"],
            self["comments_num"], create_time, update_time,
            self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
        )

        return insert_sql, params
