Asked by: 小点点

How do I download multiple PDF files with Python?


I am trying to download the PDFs from https://occ.ca/our-publications

My end goal is to parse the text in the PDF files and locate certain keywords.

So far, I have been able to scrape the links to the PDF files on all the pages and save them to a list. Now I want to loop over the list and download every PDF with Python. Once the files are downloaded, I want to parse them.

This is the code I have used so far:

import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called
# "publications".

publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    publications.append(links)

import urllib.request
for x in publications:
    urllib.request.urlretrieve(x, 'Publication_{}'.format(range(213)))

This is the error I get when I run the code:

Traceback (most recent call last):
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\m.py", line 23, in <module>
    urllib.request.urlretrieve(x, 'Publication_{}.pdf'.format(range(213)))
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
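
The 403 here is the server rejecting the default 'Python-urllib' User-Agent: the scraping loop above sends a Mozilla header through requests, but urlretrieve sends no such header. A minimal sketch of a workaround using urllib.request.Request with an explicit header; the function name, file name argument, and header value are illustrative assumptions:

import urllib.request

def download_pdf(url, fname):
    # A browser-like User-Agent usually avoids the 403 that
    # urlretrieve's default 'Python-urllib' agent triggers.
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla'})
    with urllib.request.urlopen(req) as resp, open(fname, 'wb') as out:
        out.write(resp.read())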


2 Answers

Anonymous user

Try this:

import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called
# "publications".

publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
        # extend (not append) keeps publications a flat list of URLs
        # instead of a list of lists.
        publications.extend(links)

for cntr, link in enumerate(publications):
    print("try to get link", link)
    # Send the same browser-like User-Agent as above, otherwise the
    # server may answer 403 Forbidden for the PDF downloads as well.
    rslt = requests.get(link, headers={'User-Agent': 'Mozilla'})
    print("Got", rslt)
    fname = "temporarypdf_%d.pdf" % cntr
    # rslt.content holds the downloaded bytes; rslt.raw.read() would
    # return nothing here because the request was not streamed.
    with open(fname, "wb") as fout:
        fout.write(rslt.content)
    print("saved pdf data into ", fname)
    # Call here the code that reads and parses the pdf.
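
The last comment is where the question's actual goal, keyword search, would go. A minimal sketch of that step, assuming the PyPDF2 package is installed; the function name and the keyword list are illustrative assumptions, not part of the answer:

from PyPDF2 import PdfReader

def find_keywords(pdf_path, keywords):
    # Return {keyword: [page numbers]} for every keyword that occurs.
    hits = {}
    reader = PdfReader(pdf_path)
    for page_no, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").lower()
        for kw in keywords:
            if kw.lower() in text:
                hits.setdefault(kw, []).append(page_no)
    return hits

# For example, inside the download loop above:
# print(find_keywords(fname, ["keyword one", "keyword two"]))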

Anonymous user

Could you tell me the line number where the error occurs?