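I'm fairly new to this and have been working on web scraping for a few days now. I've been actively trying to avoid asking this question, but I'm really stuck.
My problem
I know I have a lot of unused imports. They are left over from the various approaches I tried, and I haven't removed them yet.
The end goal is to push the data to a JSON or CSV file. That part isn't written yet, but I have a good idea of how to handle it once I have the data.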
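For the JSON/CSV goal, something along these lines might work once the data is collected. This is only a minimal sketch using the standard library: `save_records`, the file paths, and the dictionary keys are hypothetical names, and the sample record simply reuses values that appear in the output further down.

```python
import csv
import json

# Hypothetical sample record; the keys are assumptions that mirror the
# variables in the scraping loop (titleText, manuText, imgurl).
records = [
    {
        "title": "12G Beretta DT11 Trap Adjustable",
        "manufacturer": "beretta",
        "image": "https://www.mcavoyguns.co.uk/contents/media/l_dt11_02.jpg",
    },
]


def save_records(records, csv_path, json_path):
    """Write the same list of dicts to a CSV file and to a JSON file."""
    # Use the keys of the first record as the CSV header.
    fieldnames = list(records[0].keys()) if records else []
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    # ensure_ascii=False keeps characters such as the pound sign readable.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

Calling `save_records(records, "products.csv", "products.json")` would then write both files in one go; appending one dict per product inside the loop is what would fill `records` in the real script.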
from bs4 import BeautifulSoup
import requests
import shutil
import csv
import pandas
from pandas import DataFrame
import re
import os
import urllib
import locale
import json
from selenium import webdriver

os.environ["PYTHONIOENCODING"] = "utf-8"

browser = webdriver.Chrome(executable_path='C:/Users/andrew.glass/chromedriver.exe')
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")

URL = 'https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

products = soup.find_all("div", "GC62 Product")
for product in products:
    # title
    title = product.find("h3")
    titleText = title.text if title else ''
    # manufacturer name
    manufacturer = product.find("div", "GC5 ProductManufacturer")
    manuText = manufacturer.text if manufacturer else ''
    # image location
    img = product.find("div", "ProductImage")
    imglinks = img.find("a") if img else ''
    imglinkhref = imglinks.get('href') if imglinks else ''
    imgurl = 'https://www.mcavoyguns.co.uk/contents'+imglinkhref
    #print(imgurl.replace('..', ''))
    # description
    description = product.find("div", "GC12 ProductDescription")
    descText = description.text if description else ''
    # more description
    more = product.find("div", "GC12 ProductDetailedDescription")
    moreText = more.text if more else ''

# price - not right
spans = browser.find_elements_by_css_selector("div.GC20.ProductPrice span")
for i in range(0,len(spans),2):
    span = spans[i].text
    print(span)
    i+=1

print(titleText)
print(manuText)
print(descText)
print(moreText)
print(imgurl.replace('..', ''))
print("\n")
Output:
£1,695.00
£1,885.00
£1,885.00
£2,015.00
£2,175.00
£2,175.00
£2,385.00
£2,115.00
£3,025.00
£3,315.00
£3,635.00
£3,925.00
£2,765.00
£3,045.00
£3,325.00
£3,615.00
£3,455.00
£2,815.00
£3,300.00
£3,000.00
£7,225.00
£7,555.00
£7,635.00
£7,015.00
£7,355.00
12G Beretta DT11 Trap Adjustable
beretta
Click on more for full details.
You may order this gun online, however due to UK Legislation we are not permitted to ship directly to you, we can however ship to a registered firearms dealer local to you. Once we receive your order, we will contact you to arrange which registered firearms dealer you would like the gun to be shipped to.
DT 11 Trap (Steelium Pro)
12
2 3/4"
10x10 rib
3/4&F
30"/32" weight; 4k
https://www.mcavoyguns.co.uk/contents/media/l_dt11_02.jpg
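As a side note on the image URL in the last output line: instead of concatenating strings and stripping `..` afterwards, the relative link could be resolved with `urllib.parse.urljoin`. The sketch below assumes the `href` on the page has the form `../media/l_dt11_02.jpg`, which is what the `replace('..', '')` call suggests; `absolute_url` and `PAGE_URL` are hypothetical names.

```python
from urllib.parse import urljoin

# Page the links were scraped from (same URL as in the code above).
PAGE_URL = (
    "https://www.mcavoyguns.co.uk/contents/en-uk/"
    "d130_Beretta_Over___Under_Competeition_shotguns.html"
)


def absolute_url(href, base=PAGE_URL):
    """Resolve a relative href against the page URL; an empty href stays empty."""
    return urljoin(base, href) if href else ""


# Assumed href value; the real attribute on the page may differ.
print(absolute_url("../media/l_dt11_02.jpg"))
```

Unlike the string concatenation, this returns an empty string when a product has no image link, rather than the bare `.../contents` prefix.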
@baduker's comment pointed me in the right direction: indentation. Thanks!
from bs4 import BeautifulSoup
import requests
import shutil
import csv
import pandas
from pandas import DataFrame
import re
import os
import urllib
import locale
import json
from selenium import webdriver

os.environ["PYTHONIOENCODING"] = "utf-8"

browser = webdriver.Chrome(executable_path='C:/Users/andrew.glass/chromedriver.exe')
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")

URL = 'https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

products = soup.find_all("div", "GC62 Product")
for product in products:
    # title
    title = product.find("h3")
    titleText = title.text if title else ''
    # manufacturer name
    manufacturer = product.find("div", "GC5 ProductManufacturer")
    manuText = manufacturer.text if manufacturer else ''
    # image location
    img = product.find("div", "ProductImage")
    imglinks = img.find("a") if img else ''
    imglinkhref = imglinks.get('href') if imglinks else ''
    imgurl = 'https://www.mcavoyguns.co.uk/contents'+imglinkhref
    #print(imgurl.replace('..', ''))
    # description
    description = product.find("div", "GC12 ProductDescription")
    descText = description.text if description else ''
    # more description
    more = product.find("div", "GC12 ProductDetailedDescription")
    moreText = more.text if more else ''
    # price - not right
    spans = browser.find_elements_by_css_selector("div.GC20.ProductPrice span")
    for i in range(0,len(spans),2):
        span = spans[i].text
        print(span)
        i+=1
    print(titleText)
    print(manuText)
    print(descText)
    print(moreText)
    print(imgurl.replace('..', ''))
    print("\n")
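The price step is still commented as "not right" above. One possible way to line each price up with its product is to pair them by position. This is only a sketch under two assumptions: the price spans appear on the page in the same order as the products, and every second span holds the displayed price (which is what the `range(0, len(spans), 2)` step implies). `pair_prices` is a hypothetical helper, and the product names and empty filler spans below are made up; only the two prices come from the output.

```python
def pair_prices(titles, span_texts, step=2):
    """Pair each title with every `step`-th span text, in page order."""
    prices = span_texts[::step]
    # Fail loudly rather than silently mis-pairing products and prices.
    if len(prices) != len(titles):
        raise ValueError(f"{len(titles)} products but {len(prices)} prices")
    return list(zip(titles, prices))


# Hypothetical sample data standing in for the scraped values.
titles = ["Gun A", "Gun B"]
span_texts = ["£1,695.00", "", "£1,885.00", ""]

for title, price in pair_prices(titles, span_texts):
    print(title, price)
```

In the real script, `titles` would be collected from the `h3` of each product and `span_texts` from `[s.text for s in spans]`, with `spans` fetched once before the loop instead of once per product.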