Asked by: 小点点

Convert HTML tables to JSON (Python)


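I have a lot of HTML tables that I'm trying to convert to JSON, but my code only works for the first, horizontal table (first image) and not for the second, vertical table (second image)...

I've attached my code and sample tables here.

The code I've tried so far: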

import itertools
from pathlib import Path

from bs4 import BeautifulSoup

html_data = Path("Table2.html").read_text()
# One list of cell texts per <tr>
table_data = [[cell.text for cell in row("td")]
              for row in BeautifulSoup(html_data, features="lxml")("tr")]
json_data = []
for list1 in table_data:
    list1 = [i.replace('\n', '') for i in list1]
    # Pair consecutive cells in the row as key/value
    dict1 = dict(itertools.zip_longest(*[iter(list1)] * 2, fillvalue=""))
    json_data.append(dict1)
print(json_data)

Output for the two HTML tables (table 1 first, then table 2):

[{'Address': '41 B Market street'}, {'City': 'Gujarat'}, {'Postal/Zip Code': '123456'}, {'Product Details': ''}, {'Pallet Dimension': '10" x 10" x 10"'}, {'Total Weight': '1375 LBS'}]

[{'Pickup Location': 'Description', '': ''}, {'Some Address': 'Rubics cube', '': ''}, {}, {'PLTS': 'total weight', 'L': 'W', 'H': ''}, {'1': '20', '40': ''}, {'2': '60', '40': ''}]

HTML code for table 2:

<table>
<tbody>
<tr style="height:15.0pt">
<td colspan="2" style="width:130.9pt; border-top:solid windowtext 1.0pt; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:15.0pt" width="175">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">Pickup Location</span></b></p>
</td>
<td colspan="3" style="width:130.1pt; border-top:solid windowtext 1.0pt; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:15.0pt" width="173">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">Description</span></b></p>
</td>
<td style="width:1.5pt; padding:0in 0in 0in 0in; height:15.0pt" width="2">
<p class="MsoNormal"></p>
</td>
<td style="width:.3pt; padding:0in 0in 0in 0in; height:15.0pt" width="0"></td>
</tr>
<tr style="height:13.15pt">
<td colspan="2" rowspan="2" style="width:130.9pt; border-top:none; border-left:none; border-bottom:solid black 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="175">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">Some Address</span></b></p>
</td>
<td colspan="3" rowspan="2" style="width:130.1pt; border-top:none; border-left:none; border-bottom:solid black 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="173">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">Rubics cube</span></b></p>
</td><td style="width:1.5pt; padding:0in 0in 0in 0in; height:13.15pt" width="2">
<p class="MsoNormal"></p>
</td>
<td style="width:.3pt; padding:0in 0in 0in 0in; height:13.15pt" width="0"></td>
</tr>
<tr style="height:15.75pt">
</tr>
<tr style="height:.3in">
<td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="56" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">PLTS</span></b><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
</td>
<td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="118" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">total weight</span></b><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">L</span></b><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">W</span></b><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
</td>
<td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="23" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">H</span></b><b><span style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
</td>
</tr>
<tr style="height:13.9pt">
<td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="56" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">1</span></p>
</td>
<td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="118" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">20</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
</td>
<td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="23" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
</td>
<td style="width:.3pt; padding:0in 0in 0in 0in; height:13.9pt" width="0"></td>
</tr>
<tr style="height:13.15pt">
<td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="56" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">2</span></p>
</td>
<td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="118" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">60</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
</td>
<td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="23" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
</td>
</tr>
</tbody>
</table>


If the table is horizontal (table 1), the existing output is good enough:

[{'Address': '41 B Market street'}, {'City': 'Gujarat'}, {'Postal/Zip Code': '123456'}, {'Product Details': ''}, {'Pallet Dimension': '10" x 10" x 10"'}, {'Total Weight': '1375 LBS'}]

If the table is vertical (table 2), the output should look like this:

[{'Pickup address': 'some address'}, {'Description': 'Rubicks cube'}, {'PLTS': ['1','2']}, {'Total weight': ['20','60']}, {'L':['40','40']}, {'W':['40','40']},{'H':['40','40']}]

I've tried modifying the code, but nothing has worked for me. Any suggestions?


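1 answer

Anonymous user

I tried to make this solution as generic as possible while still handling the specific case presented. It only works for the second case: since the markup has no th elements and no class that appears only on the header cells, there is no way to determine programmatically whether the headers run vertically or horizontally.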

from bs4 import BeautifulSoup
from pathlib import Path
import itertools

html_data = Path("table.html").read_text()
# Keep only non-empty cell texts; a <tr> without any text becomes an empty list
table_data = [[td.text.strip() for td in tr("td") if td.text.strip()]
              for tr in BeautifulSoup(html_data, features="lxml")("tr")]

# Split on empty rows, then transpose each block so its first row supplies the keys
out = [dict([(t, rest if len(rest) > 1 else rest[0]) for t, *rest in zip(*g)]) for k, g in
       itertools.groupby(table_data, key=bool) if k]
print(out)

Output:

[{'Pickup Location': 'Some Address', 'Description': 'Rubics cube'}, {'PLTS': ['1', '2'], 'total weight': ['20', '60'], 'L': ['40', '40'], 'W': ['40', '40'], 'H': ['40', '40']}]
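
The one-liner is dense, so here is a rough unrolled sketch of the same grouping and transposing step. It works on rows that have already been extracted and cleaned, so it needs no HTML parser. The function name `columns_to_dicts` is made up for illustration, and the handling of a block that has headers but no data rows (mapped to an empty string) is my own assumption; the one-liner would raise an IndexError in that case.

```python
from itertools import groupby


def columns_to_dicts(table_data):
    """Split rows into blocks at empty rows; the first row of each block holds the keys."""
    result = []
    for non_empty, rows in groupby(table_data, key=bool):
        if not non_empty:
            continue  # empty rows only separate the logical sub-tables
        block = {}
        # zip(*rows) transposes the block: each tuple is (header, value, value, ...)
        for header, *values in zip(*rows):
            if not values:
                block[header] = ""  # assumption: header-only block, no data rows
            elif len(values) == 1:
                block[header] = values[0]  # a single value stays a plain string
            else:
                block[header] = values  # several values stay a list
        result.append(block)
    return result


# Cleaned rows as produced from table.html by the extraction step
table_data = [
    ["Pickup Location", "Description"],
    ["Some Address", "Rubics cube"],
    [],
    ["PLTS", "total weight", "L", "W", "H"],
    ["1", "20", "40", "40", "40"],
    ["2", "60", "40", "40", "40"],
]
print(columns_to_dicts(table_data))
```

Note that zip stops at the shortest row, so a block whose rows have different lengths would silently lose its trailing cells.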

table.html

<table>
    <tbody>
    <tr style="height:15.0pt">
        <td colspan="2"
            style="width:130.9pt; border-top:solid windowtext 1.0pt; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:15.0pt"
            width="175">
            <p class="MsoNormal" style="text-align:center" align="center"><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">Pickup Location</span></b>
            </p>
        </td>
        <td colspan="3"
            style="width:130.1pt; border-top:solid windowtext 1.0pt; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:15.0pt"
            width="173">
            <p class="MsoNormal" style="text-align:center" align="center"><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">Description</span></b>
            </p>
        </td>
        <td style="width:1.5pt; padding:0in 0in 0in 0in; height:15.0pt" width="2">
            <p class="MsoNormal"></p>
        </td>
        <td style="width:.3pt; padding:0in 0in 0in 0in; height:15.0pt" width="0"></td>
    </tr>
    <tr style="height:13.15pt">
        <td colspan="2" rowspan="2"
            style="width:130.9pt; border-top:none; border-left:none; border-bottom:solid black 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
            width="175">
            <p class="MsoNormal" style="text-align:center" align="center"><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">Some Address</span></b>
            </p>
        </td>
        <td colspan="3" rowspan="2"
            style="width:130.1pt; border-top:none; border-left:none; border-bottom:solid black 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
            width="173">
            <p class="MsoNormal" style="text-align:center" align="center"><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">Rubics cube</span></b>
            </p>
        </td>
        <td style="width:1.5pt; padding:0in 0in 0in 0in; height:13.15pt" width="2">
            <p class="MsoNormal"></p>
        </td>
        <td style="width:.3pt; padding:0in 0in 0in 0in; height:13.15pt" width="0"></td>
    </tr>
    <tr style="height:15.75pt">
    </tr>
    <tr style="height:.3in">
        <td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
            width="56" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">PLTS</span></b><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
        </td>
        <td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
            width="118" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">total weight</span></b><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
        </td>
        <td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
            width="27" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">L</span></b><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
        </td>
        <td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
            width="27" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">W</span></b><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
        </td>
        <td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
            width="23" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">H</span></b><b><span
                    style="font-size:10.0pt; font-family:&quot;Arial&quot;,sans-serif"></span></b></p>
        </td>
    </tr>
    <tr style="height:13.9pt">
        <td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
            width="56" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">1</span></p>
        </td>
        <td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
            width="118" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">20</span></p>
        </td>
        <td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
            width="27" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
        </td>
        <td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
            width="27" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
        </td>
        <td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
            width="23" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
        </td>
        <td style="width:.3pt; padding:0in 0in 0in 0in; height:13.9pt" width="0"></td>
    </tr>
    <tr style="height:13.15pt">
        <td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
            width="56" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif; color:black">2</span></p>
        </td>
        <td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
            width="118" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">60</span></p>
        </td>
        <td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
            width="27" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
        </td>
        <td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
            width="27" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
        </td>
        <td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
            width="23" valign="bottom" nowrap="nowrap">
            <p class="MsoNormal" style="text-align:center" align="center"><span
                    style="font-size:12.0pt; font-family:&quot;Arial&quot;,sans-serif">40</span></p>
        </td>
    </tr>
    </tbody>
</table>
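
Since the layout cannot be detected from the markup, one possible workaround is to let the caller state it. The sketch below is only an illustration under that assumption: `table_to_records` and its `layout` parameter are hypothetical names, the input is the list of cleaned rows (empty cells already dropped), and the result is one single-key dict per header, which is the shape the question asks for rather than the grouped shape shown in the output above.

```python
from itertools import groupby, zip_longest


def table_to_records(table_data, layout):
    """Turn cleaned rows into a list of single-key dicts.

    layout="horizontal": each row reads key, value, key, value, ...
    layout="vertical":   the first row of each block holds the headers.
    """
    records = []
    if layout == "horizontal":
        for row in table_data:
            # Pair consecutive cells; a trailing key without a value gets ""
            for key, value in zip_longest(*[iter(row)] * 2, fillvalue=""):
                records.append({key: value})
    elif layout == "vertical":
        for non_empty, rows in groupby(table_data, key=bool):
            if not non_empty:
                continue  # empty rows separate the blocks
            # Transpose the block so each tuple is (header, value, value, ...)
            for header, *values in zip(*rows):
                records.append({header: values[0] if len(values) == 1 else values})
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    return records


horizontal_rows = [
    ["Address", "41 B Market street"],
    ["City", "Gujarat"],
    ["Product Details"],
]
vertical_rows = [
    ["Pickup Location", "Description"],
    ["Some Address", "Rubics cube"],
    [],
    ["PLTS", "total weight", "L", "W", "H"],
    ["1", "20", "40", "40", "40"],
    ["2", "60", "40", "40", "40"],
]
print(table_to_records(horizontal_rows, "horizontal"))
print(table_to_records(vertical_rows, "vertical"))
```

The caller still has to know which layout each file uses, for example from the file name or from where the table came from.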
