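I'm building a Python web scraper that goes through eBay search results pages (in this case for "gaming laptops") and grabs the title of each item for sale. I'm using BeautifulSoup to first get the h1 tag that holds each title, and then print it out as text: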
for item_name in soup.findAll('h1', {'class': 'it-ttl'}):
    print(item_name.text)
However, inside each h1 tag with the class "it-ttl" there is also a span tag that contains some text:
<h1 class="it-ttl" itemprop="name" id="itemTitle">
<span class="g-hdn">Details about </span>
Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
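My current program prints the contents of the span tag along with the item title.
Could someone explain how to grab only the item title while ignoring the span tag that contains the "Details about" text? Thanks!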
You can do this by simply removing the offending span before reading the text:
item = """
<h1 class="it-ttl" itemprop="name" id="itemTitle">
<span class="g-hdn">Details about </span>
Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
"""
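from bs4 import BeautifulSoup as bs
soup = bs(item, 'lxml')  # needs the lxml package; 'html.parser' also works
target = soup.select_one('h1')
# Remove the <span> and its text from the tree, then read what is left
target.select_one('span').decompose()
print(target.text.strip())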
Output:
Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
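If you would rather leave the parsed tree unmodified, a variation on the above is to read only the text nodes that are direct children of the h1. This is a minimal sketch, not something from the original answer: the helper name title_without_span is hypothetical, and it assumes the title text sits directly inside the h1 rather than in a nested tag. It uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (the title is truncated in the source).
html = """
<h1 class="it-ttl" itemprop="name" id="itemTitle">
<span class="g-hdn">Details about </span>
Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
"""

soup = BeautifulSoup(html, "html.parser")


def title_without_span(h1):
    """Return the h1's own text, skipping text inside child tags (hypothetical helper)."""
    # recursive=False limits the search to direct children, so the text
    # inside the <span> is never collected and the tree is left untouched.
    direct_text = h1.find_all(string=True, recursive=False)
    return "".join(direct_text).strip()


titles = [title_without_span(h1) for h1 in soup.find_all("h1", class_="it-ttl")]
print(titles)
```

The same helper can be called inside the loop from the question in place of item_name.text.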
Another solution, using the simplified_scrapy library:
from simplified_scrapy import SimplifiedDoc
html = '''
<h1 class="it-ttl" itemprop="name" id="itemTitle">
<span class="g-hdn">Details about </span>
Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
'''
doc = SimplifiedDoc(html)
item_names = doc.selects('h1.it-ttl').span.nextText()
print(item_names)
Result:
['Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…']
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples