Get all text inside a tag in lxml


Question

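I'd like to write a code snippet that grabs everything inside the <content> tag with lxml, in all three cases below, including the nested tags themselves. I tried tostring(getchildren()), but that misses the text between the tags. I haven't had much luck searching the API for a relevant function. Can you help?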

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Answer:

Try:

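from lxml import etree

def stringify_children(node):
    # node.text is the text before the first child element (may be None)
    parts = [node.text]
    for child in node:
        # Serialize the child once; tostring() already includes the child's
        # own text, and encoding="unicode" returns str instead of bytes
        parts.append(etree.tostring(child, encoding="unicode", with_tail=False))
        # child.tail is the text between this child and the next sibling
        parts.append(child.tail)
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))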

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'