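Get all text inside a tag in lxml
Question:
I'd like to write a code snippet that grabs all of the text inside the <content> tag in lxml, in all three instances below, including the markup of any nested tags. I've tried tostring(getchildren()), but that misses the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?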
<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"
<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"
<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"
Answer:
Try:
from lxml.etree import tostring

def stringify_children(node):
    # node.text is the text before the first child; tostring() on each child
    # already includes that child's own text, nested markup, and tail.
    parts = [node.text] + [tostring(c, encoding="unicode") for c in node]
    # filter removes a possible None (node.text is None when there is no leading text)
    return ''.join(filter(None, parts))
Example:
from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)
Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'