Logging in with requests and BeautifulSoup to scrape a page

Asked by: 小点点

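I need to scrape a page that can only be accessed after logging in.

I tried requests and BeautifulSoup, logging in with saved login details that I converted from a cURL command, but it does not work.

I need to log in at 'https://www.seoprofiler.com/account/login' and then scrape pages like this one: 'https://www.seoprofiler.com/lp/links?q=test.com'

Here is my code: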

from bs4 import BeautifulSoup 
import requests



cookies = {
    'csrftoken': 'token123',
    'seoprofilersession': 'session123',
}

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'sec-ch-ua': '^\\^',
    'sec-ch-ua-mobile': '?0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'https://www.seoprofiler.com',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.seoprofiler.com/account/login',
    'Accept-Language': 'en,en-US;q=0.9,it;q=0.8',
}

data = {
    'csrfmiddlewaretoken': 'token123',
    'username': 'email123@gmail.com',
    'password': 'pass123!',
    'button': ''
}



response = requests.post('https://www.seoprofiler.com/account/login',
                             headers=headers, cookies=cookies, data=data)


url = 'https://www.seoprofiler.com/lp/links?q=test.com'
response = requests.get(url, headers= headers, cookies=cookies)
soup = BeautifulSoup(response.content, 'html.parser')
soup.encode('utf-8')
print(soup.title)

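I do not want to use Selenium, because I have to scrape a large amount of data and that would take far too long with Selenium.

How can I log in so that I can scrape pages that require a login? Thanks a lot.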

2 answers

Anonymous user

You can use requests.Session.

After some trial and error, I was able to log in and fetch a project page with the following script:

import requests

session = requests.Session() # Create new session
session.get(
    "https://www.seoprofiler.com/account/login"
)  # set seoprofilersession and csrftoken cookies

session.post(
    "https://www.seoprofiler.com/account/login",
    data={
        "csrfmiddlewaretoken": session.cookies.get_dict()["csrftoken"],
        "username": "your_email",
        "password": "your_password",
    },
)  # login, sets needed cookies

# Now use this session to get all data you need!
resp = session.get(
    "https://www.seoprofiler.com/project/google.com-fa1b9c855721f3d5"
)  # get main page content

print(resp.status_code) # my output: 200
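
A 200 status on its own may not prove that the login succeeded, because a site can also answer with its login form. The sketch below builds the links URL from the question and checks where a response finally landed. It assumes (not verified against this site) that an unauthenticated request ends up back on /account/login; the helper names build_links_url and landed_on_login_page are hypothetical.

```python
from urllib.parse import urlencode, urlparse

BASE_URL = "https://www.seoprofiler.com"
LOGIN_PATH = "/account/login"  # assumption: unauthenticated requests end up here


def build_links_url(domain):
    """Build the links-page URL from the question for one domain."""
    return f"{BASE_URL}/lp/links?{urlencode({'q': domain})}"


def landed_on_login_page(final_url):
    """Return True if the final URL of a response is the login page."""
    return urlparse(final_url).path.rstrip("/") == LOGIN_PATH
```

With the session from the script above, resp = session.get(build_links_url("test.com")) fetches the page, and landed_on_login_page(resp.url) gives a rough hint whether the session is still authenticated. The HTML in resp.text can then be passed to BeautifulSoup(resp.text, 'html.parser') as in the question.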

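Edit:

I just checked one more thing: it seems that retrieving the seoprofilersession and csrftoken cookies first is not mandatory. You can simply call the login POST with your credentials (without csrfmiddlewaretoken) and then keep using your session.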
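
Both variants can be covered by one helper that sends the CSRF token only when the server actually set one. This is a minimal sketch, not a tested client for this site: build_login_payload and login are hypothetical names, and session is assumed to be a requests.Session.

```python
LOGIN_URL = "https://www.seoprofiler.com/account/login"


def build_login_payload(cookies, username, password):
    """Return the login form data; add the CSRF token only if a cookie exists."""
    payload = {"username": username, "password": password}
    token = cookies.get("csrftoken")
    if token:
        payload["csrfmiddlewaretoken"] = token
    return payload


def login(session, username, password):
    """Log in with a requests.Session and return the login response."""
    session.get(LOGIN_URL)  # lets the server set its cookies, if it sets any
    payload = build_login_payload(session.cookies.get_dict(), username, password)
    response = session.post(LOGIN_URL, data=payload)
    response.raise_for_status()  # fail early on 4xx/5xx answers
    return response
```

After login(session, "your_email", "your_password"), every further session.get(...) call reuses the cookies stored in the session, which is what the separate requests.post and requests.get calls in the question do not do.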

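Anonymous user

How do you know the structure of the data that has to be passed to the login page? A more reliable solution is to use Selenium to fill in the username and password fields on the login page and then click the login button. After that, go to the page you need and scrape it.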
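
If Selenium is too slow for the bulk of the scraping, one possible middle ground is to log in once with Selenium and then hand the cookies to requests for the remaining pages. This is only a sketch under the assumption that the site accepts the same cookies outside the browser; selenium_cookies_to_dict is a hypothetical helper that works on the list of dicts returned by driver.get_cookies().

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Convert Selenium's cookie list into a plain name -> value dict.

    Domain and path information is dropped, so this is only suitable
    when all requests go to the site the cookies came from.
    """
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}
```

The result can be loaded into a requests session with session.cookies.update(selenium_cookies_to_dict(driver.get_cookies())), after which the pages are fetched with session.get(...) instead of the browser.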


		      