BeautifulSoup scraping: problem with the .text attribute

Asked by: 小点点

I have the following code to scrape a page, https://www.hotukdeals.com:

from bs4 import BeautifulSoup
import requests

url="https://www.hotukdeals.com/hot"
r = requests.get(url)
soup = BeautifulSoup(r.text,"html.parser")
deals = soup.find_all("article")
for deal in deals:
    priceElement = deal.find("span",{"class":"thread-price"})
    try:
        print(priceElement,priceElement.text)
    except AttributeError:
        pass

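For some reason this only works partway: inside the loop it scrapes the deal prices for a certain number of iterations and then stops working.

Program output: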

<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£9.09</span> £9.09
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£39.95</span> £39.95
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£424.98</span> £424.98
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£8.10</span> £8.10
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£14.59</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£2.50</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl text--color-greyShade">£20</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£19</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£29</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl text--color-greyShade">£49.97</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">FREE</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£2.49</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£1.99</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£54.99</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£12.85</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£1.99</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£21.03</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£5.29</span>

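As the output shows, after the first four lines the .text attribute is empty, even though the element contains text.

Does anyone know what is going on? Any ideas or workarounds?

2 answers

Anonymous user

I'm not sure exactly what causes the problem, but I found a workaround: convert the element to a string, locate the start and end of its text with str.find(), and slice between those two indices. Here is an implementation: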

from bs4 import BeautifulSoup
import requests

url = "https://www.hotukdeals.com/hot"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
deals = soup.find_all("article")
for deal in deals:
    priceElement = deal.find("span", {"class": "thread-price"})

    if priceElement is not None:
        price = str(priceElement)
        start_price = price.find('">') + len('">')  # index just after the opening tag
        end_price = price.find("</span")  # index where the closing tag starts
        price = price[start_price:end_price]
    else:
        price = None
    # print() cannot raise AttributeError here, so no try/except is needed
    print(priceElement, price)
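
The slicing step above can be tried offline on a fixed string. The following is a minimal sketch, not part of the original answer: the helper name slice_inner_text and the sample markup are hypothetical, and it assumes that no attribute value contains a ">" character and that the element holds plain text only.

```python
from typing import Optional


def slice_inner_text(markup: str, open_end: str = ">", close_start: str = "</span") -> Optional[str]:
    """Return the text between the end of the opening tag and the closing tag.

    Hypothetical helper: assumes a single, non-nested element whose
    attribute values contain no '>' character.
    """
    start = markup.find(open_end)      # end of the opening tag
    end = markup.find(close_start)     # start of the closing tag
    if start == -1 or end == -1 or end < start:
        return None                    # markup does not look like <span ...>text</span>
    return markup[start + len(open_end):end].strip()


# Made-up sample modelled on the output shown in the question
sample = '<span class="thread-price text--b">£14.59</span>'
print(slice_inner_text(sample))
```

Returning None when either marker is missing avoids the silent off-by-one slices that a bare find() result of -1 would otherwise produce.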

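Anonymous user

BeautifulSoup needs the html5lib parser to parse this site correctly. html5lib is a separate package (installed with pip install html5lib). For example: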

import requests
from bs4 import BeautifulSoup

url = "https://www.hotukdeals.com/"

soup = BeautifulSoup(requests.get(url).content, "html5lib")  # <-- use html5lib

for price in soup.select(".thread-price"):
    print(price.text)

Prints:

£149.99
£7
£21.03
£31.79
£359.10
£19.99
£60
£0.60
£168
£4.99
£20
£119
Free P&P
Free
£5
£89.99
FREE
£10.96
£1.79
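
If html5lib may not be installed everywhere the script runs, one possible arrangement is to prefer it and fall back to the built-in parser. This is only a sketch under stated assumptions: make_soup and sample_html are hypothetical names, the sample markup is invented and much simpler than the real page, and the fallback parser may still show the empty .text problem on the live site.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str) -> BeautifulSoup:
    """Parse with html5lib when available, otherwise fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # html5lib is not installed; the built-in parser may treat real pages differently
        return BeautifulSoup(markup, "html.parser")


# Invented sample, not the actual hotukdeals markup
sample_html = """
<article><span class="thread-price text--b">£9.09</span></article>
<article><span class="thread-price text--b">FREE</span></article>
<article><p>No price here</p></article>
"""

# select() returns only matching elements, so articles without a price are skipped
prices = [tag.get_text(strip=True) for tag in make_soup(sample_html).select(".thread-price")]
print(prices)
```

Using select(".thread-price") also removes the need for a None check, since deals without a price simply produce no match.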
