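I have a lot of HTML tables that I am trying to convert to JSON, but my code only works for the first, horizontal table (first image), not for the second, vertical table (second image)...
I have attached my code and a sample table here.
Code I have tried so far: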
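import itertools
from pathlib import Path
from bs4 import BeautifulSoup

html_data = Path("Table2.html").read_text()
# One list of cell texts per <tr>
table_data = [[cell.text for cell in row("td")]
              for row in BeautifulSoup(html_data, features="lxml")("tr")]
json_data = []
for list1 in table_data:
    list1 = [i.replace('\n', '') for i in list1]
    # Pair consecutive cells as (key, value); pad an odd cell out with ""
    dict1 = dict(itertools.zip_longest(*[iter(list1)] * 2, fillvalue=""))
    json_data.append(dict1)
print(json_data)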
Output for the HTML tables above:
[{'Address': '41 B Market street'}, {'City': 'Gujarat'}, {'Postal/Zip Code': '123456'}, {'Product Details': ''}, {'Pallet Dimension': '10" x 10" x 10"'}, {'Total Weight': '1375 LBS'}]
[{'Pickup Location': 'Description', '': ''}, {'Some Address': 'Rubics cube', '': ''}, {}, {'PLTS': 'total weight', 'L': 'W', 'H': ''}, {'1': '20', '40': ''}, {'2': '60', '40': ''}]
HTML code for table 2:
<table>
<tbody>
<tr style="height:15.0pt">
<td colspan="2" style="width:130.9pt; border-top:solid windowtext 1.0pt; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:15.0pt" width="175">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">Pickup Location</span></b></p>
</td>
<td colspan="3" style="width:130.1pt; border-top:solid windowtext 1.0pt; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:15.0pt" width="173">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">Description</span></b></p>
</td>
<td style="width:1.5pt; padding:0in 0in 0in 0in; height:15.0pt" width="2">
<p class="MsoNormal"></p>
</td>
<td style="width:.3pt; padding:0in 0in 0in 0in; height:15.0pt" width="0"></td>
</tr>
<tr style="height:13.15pt">
<td colspan="2" rowspan="2" style="width:130.9pt; border-top:none; border-left:none; border-bottom:solid black 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="175">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">Some Address</span></b></p>
</td>
<td colspan="3" rowspan="2" style="width:130.1pt; border-top:none; border-left:none; border-bottom:solid black 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="173">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">Rubics cube</span></b></p>
</td><td style="width:1.5pt; padding:0in 0in 0in 0in; height:13.15pt" width="2">
<p class="MsoNormal"></p>
</td>
<td style="width:.3pt; padding:0in 0in 0in 0in; height:13.15pt" width="0"></td>
</tr>
<tr style="height:15.75pt">
</tr>
<tr style="height:.3in">
<td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="56" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">PLTS</span></b><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
<td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="118" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">total weight</span></b><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">L</span></b><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">W</span></b><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
<td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in" width="23" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">H</span></b><b><span style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
</tr>
<tr style="height:13.9pt">
<td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="56" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif; color:black">1</span></p>
</td>
<td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="118" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif">20</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt" width="23" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:.3pt; padding:0in 0in 0in 0in; height:13.9pt" width="0"></td>
</tr>
<tr style="height:13.15pt">
<td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="56" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif; color:black">2</span></p>
</td>
<td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="118" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif">60</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt" width="23" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
</tr>
</tbody>
</table>
If the table is horizontal (table 1), the existing output is sufficient:
[{'Address': '41 B Market street'}, {'City': 'Gujarat'}, {'Postal/Zip Code': '123456'}, {'Product Details': ''}, {'Pallet Dimension': '10" x 10" x 10"'}, {'Total Weight': '1375 LBS'}]
If the table is vertical (table 2), the output should look like this:
[{'Pickup address': 'some address'}, {'Description': 'Rubicks cube'}, {'PLTS': ['1','2']}, {'Total weight': ['20','60']}, {'L':['40','40']}, {'W':['40','40']},{'H':['40','40']}]
I have tried modifying the code, but nothing has worked. Any suggestions?
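I have tried to make this solution as generic as possible while still handling the specific case presented. It only works for the second case (the vertical table): because the markup has no th elements, and no class that appears only on the header cells, there is no way to determine programmatically whether the headers run vertically or horizontally.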
from bs4 import BeautifulSoup
from pathlib import Path
import itertools

html_data = Path("table.html").read_text()
# Keep only non-blank cells; the empty <tr> becomes an empty list
table_data = [[td.text.strip() for td in tr("td") if td.text.strip()]
              for tr in BeautifulSoup(html_data, features="lxml")("tr")]
# Split rows into blocks at empty rows, then transpose each block so that
# every column becomes a header followed by its values
out = [dict([(t, rest if len(rest) > 1 else rest[0]) for t, *rest in zip(*g)])
       for k, g in itertools.groupby(table_data, key=bool) if k]
print(out)
Output:
[{'Pickup Location': 'Some Address', 'Description': 'Rubics cube'}, {'PLTS': ['1', '2'], 'total weight': ['20', '60'], 'L': ['40', '40'], 'W': ['40', '40'], 'H': ['40', '40']}]
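The one-liner is dense, so here is the same logic unrolled into a loop as a rough sketch. The rows are written out by hand to match what the list comprehension produces for table.html, so the example runs without the file or BeautifulSoup; the function name blocks_to_dicts is made up for illustration and is not part of any library.

```python
import itertools

# Rows as the list comprehension yields them for table.html:
# blank cells are dropped, so the empty <tr> becomes an empty list.
table_data = [
    ["Pickup Location", "Description"],
    ["Some Address", "Rubics cube"],
    [],
    ["PLTS", "total weight", "L", "W", "H"],
    ["1", "20", "40", "40", "40"],
    ["2", "60", "40", "40", "40"],
]


def blocks_to_dicts(rows):
    """Split rows into blocks at empty rows, then transpose each block."""
    result = []
    # groupby(key=bool) yields alternating runs of non-empty and empty rows
    for non_empty, block in itertools.groupby(rows, key=bool):
        if not non_empty:
            continue  # an empty row only separates two blocks
        record = {}
        # zip(*block) turns rows into columns: (header, value1, value2, ...)
        for header, *values in zip(*block):
            # A single value is stored bare, several values as a list
            record[header] = values if len(values) > 1 else values[0]
        result.append(record)
    return result


out = blocks_to_dicts(table_data)
print(out)
```

Like the one-liner, this assumes every block has a header row plus at least one value row; a block with only a header row would raise an IndexError at values[0].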
table.html
<table>
<tbody>
<tr style="height:15.0pt">
<td colspan="2"
style="width:130.9pt; border-top:solid windowtext 1.0pt; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:15.0pt"
width="175">
<p class="MsoNormal" style="text-align:center" align="center"><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">Pickup Location</span></b>
</p>
</td>
<td colspan="3"
style="width:130.1pt; border-top:solid windowtext 1.0pt; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:15.0pt"
width="173">
<p class="MsoNormal" style="text-align:center" align="center"><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">Description</span></b>
</p>
</td>
<td style="width:1.5pt; padding:0in 0in 0in 0in; height:15.0pt" width="2">
<p class="MsoNormal"></p>
</td>
<td style="width:.3pt; padding:0in 0in 0in 0in; height:15.0pt" width="0"></td>
</tr>
<tr style="height:13.15pt">
<td colspan="2" rowspan="2"
style="width:130.9pt; border-top:none; border-left:none; border-bottom:solid black 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
width="175">
<p class="MsoNormal" style="text-align:center" align="center"><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">Some Address</span></b>
</p>
</td>
<td colspan="3" rowspan="2"
style="width:130.1pt; border-top:none; border-left:none; border-bottom:solid black 1.0pt; border-right:solid black 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
width="173">
<p class="MsoNormal" style="text-align:center" align="center"><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">Rubics cube</span></b>
</p>
</td>
<td style="width:1.5pt; padding:0in 0in 0in 0in; height:13.15pt" width="2">
<p class="MsoNormal"></p>
</td>
<td style="width:.3pt; padding:0in 0in 0in 0in; height:13.15pt" width="0"></td>
</tr>
<tr style="height:15.75pt">
</tr>
<tr style="height:.3in">
<td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
width="56" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">PLTS</span></b><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
<td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
width="118" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">total weight</span></b><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">L</span></b><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">W</span></b><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
<td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; background:#D9E1F2; padding:0in 5.4pt 0in 5.4pt; height:.3in"
width="23" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif; color:black">H</span></b><b><span
style="font-size:10.0pt; font-family:"Arial",sans-serif"></span></b></p>
</td>
</tr>
<tr style="height:13.9pt">
<td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
width="56" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif; color:black">1</span></p>
</td>
<td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
width="118" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif">20</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.9pt"
width="23" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:.3pt; padding:0in 0in 0in 0in; height:13.9pt" width="0"></td>
</tr>
<tr style="height:13.15pt">
<td style="width:42.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
width="56" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif; color:black">2</span></p>
</td>
<td style="width:88.75pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
width="118" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif">60</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:20.15pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
width="27" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
<td style="width:17.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt; height:13.15pt"
width="23" valign="bottom" nowrap="nowrap">
<p class="MsoNormal" style="text-align:center" align="center"><span
style="font-size:12.0pt; font-family:"Arial",sans-serif">40</span></p>
</td>
</tr>
</tbody>
</table>
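The answer's output groups each block of the table into one dict, whereas the question asks for a list of single-key dicts. If that exact shape is needed, a small post-processing step along these lines should get there. This is only a sketch: the helper name split_into_single_key_dicts is hypothetical, and the header text stays as it appears in the table (for example 'Pickup Location', not 'Pickup address').

```python
# The answer's output for table.html, copied in so the example runs on its own.
out = [
    {"Pickup Location": "Some Address", "Description": "Rubics cube"},
    {"PLTS": ["1", "2"], "total weight": ["20", "60"],
     "L": ["40", "40"], "W": ["40", "40"], "H": ["40", "40"]},
]


def split_into_single_key_dicts(records):
    """Return one {header: value} dict per column, in the original order."""
    return [{header: value}
            for record in records
            for header, value in record.items()]


flat = split_into_single_key_dicts(out)
print(flat)
```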