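Asked by: 小点点

Unable to read a CSV file with pandas read_csv [duplicate]


I am trying to read a CSV file in pandas. The file's format seems odd; I downloaded it from LinkedIn Campaign Manager. Can you help me read this file properly? Here is the code: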

import glob
import os
import pandas as pd

path = r'C:\Users\FilePath' # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))
dfAllDataLI = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

Here is the error:

UnicodeDecodeError                        Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11340/2382686370.py in <module>
      3 path = r'C:\Users\n' # use your path
      4 all_files = glob.glob(os.path.join(path, "*.csv"))
----> 5 dfAllDataLI = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
      6 dfAllDataLI = dfAllDataLI.fillna('')
      7 

c:\Userspackages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

c:\Usersshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    292     ValueError: Indexes have overlapping values: ['a']
    293     """
--> 294     op = _Concatenator(
    295         objs,
    296         axis=axis,

c:\Useronda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    346             objs = [objs[k] for k in keys]
...
c:\Useda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

c:\Users\ackages\pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

This is what the file looks like:

Campaign Performance Report (in UTC)                                                                
Report Start: April 1, 2022, 12:00 AM                                                               
Report End: April 19, 2022, 11:59 PM                                                                
Date Generated: September 7, 2022, 1:12 PM                                                              
                                                                
Start Date (in UTC) Account Name    Campaign Group Name Campaign Group ID   Campaign Name   Campaign ID Campaign Type   Campaign Start Date Campaign Group Start Date   Campaign End Date   Total Budget    Clicks  Impressions Average CPM Average CPC Avg. Last Day Reach Video Completions
4/19/2022   Wiener Stadtwerke GmbH_iprospect    WST_Content_Promotion_2022  622214964   14.04. | Spendeaktion UKR | reach   194421704   Sponsored Update    4/19/2022   3/8/2022    4/30/2022   600 23  3109    17.22   2.33    3096    58

1 answer

Anonymous user

The file has 5 non-CSV lines before the column headers.
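If you want to check this yourself, here is a rough sketch that looks at the byte-order mark and counts the lines before the header. The byte 0xff at position 0 in the error is consistent with a UTF-16 little-endian BOM (FF FE). The helper names `sniff_bom` and `first_lines` are made up for illustration, the sample file is a shortened stand-in for the real export, and "first line containing a tab is the header" is an assumption about this report layout.

```python
import codecs
import os
import tempfile
from itertools import islice


def sniff_bom(path):
    """Guess a codec name from the file's byte-order mark; None if there is no BOM."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Python's "utf-16" codec reads the BOM and picks the byte order itself.
        return "utf-16"
    return None


def first_lines(path, encoding, n=10):
    """Return up to n decoded lines from the start of the file."""
    with open(path, encoding=encoding) as fh:
        return [line.rstrip("\n") for line in islice(fh, n)]


# Shortened stand-in for the export: 4 report lines, 1 blank line,
# then a tab-separated header and one data row (only 3 of the columns).
sample_text = "\n".join([
    "Campaign Performance Report (in UTC)",
    "Report Start: April 1, 2022, 12:00 AM",
    "Report End: April 19, 2022, 11:59 PM",
    "Date Generated: September 7, 2022, 1:12 PM",
    "",
    "Start Date (in UTC)\tAccount Name\tClicks",
    "4/19/2022\tWiener Stadtwerke GmbH_iprospect\t23",
]) + "\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_report.csv")
    with open(sample_path, "wb") as fh:
        # Write an explicit little-endian BOM followed by UTF-16LE text.
        fh.write(codecs.BOM_UTF16_LE + sample_text.encode("utf-16-le"))

    encoding = sniff_bom(sample_path)
    lines = first_lines(sample_path, encoding)

# Assumption: the first line containing a tab is the header row.
header_row = next(i for i, line in enumerate(lines) if "\t" in line)
print(encoding, header_row)  # utf-16 5
```

On your own file, call `sniff_bom` and `first_lines` with its path instead of the temporary sample.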

Fortunately, read_csv lets you skip those lines. You also need to specify the file's text encoding (UTF-16LE, not UTF-8) and the delimiter (the file is tab-separated):

import pandas as pd

df = pd.read_csv('csv file.csv', skiprows=5, encoding='utf-16le', sep='\t')
print(df.columns)

Output:

Index(['Start Date (in UTC)', 'Account Name', 'Campaign Group Name',
       'Campaign Group ID', 'Campaign Name', 'Campaign ID', 'Campaign Type',
       'Campaign Start Date', 'Campaign Group Start Date', 'Campaign End Date',
       'Total Budget', 'Clicks', 'Impressions', 'Average CPM', 'Average CPC',
       'Avg. Last Day Reach', 'Video Completions'],
      dtype='object')
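
To read every file in a folder, as in the question, the same arguments can be passed inside the glob loop. This is only a sketch: it assumes all exports share the same 5-line preamble, encoding, and delimiter. `read_report` and `read_all_reports` are hypothetical helper names, and the two sample files are shortened stand-ins written to a temporary folder so the snippet runs on its own.

```python
import codecs
import glob
import os
import tempfile

import pandas as pd


def read_report(path):
    # Arguments taken from the answer above: skip the 5 preamble lines,
    # decode as UTF-16LE, and split on tabs.
    return pd.read_csv(path, skiprows=5, encoding="utf-16le", sep="\t")


def read_all_reports(folder):
    """Read every *.csv export in folder and stack them into one DataFrame."""
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not files:
        raise FileNotFoundError(f"no .csv files found in {folder!r}")
    return pd.concat((read_report(f) for f in files), ignore_index=True)


# Shortened stand-in for one export (only 3 of the columns).
sample_text = "\n".join([
    "Campaign Performance Report (in UTC)",
    "Report Start: April 1, 2022, 12:00 AM",
    "Report End: April 19, 2022, 11:59 PM",
    "Date Generated: September 7, 2022, 1:12 PM",
    "",
    "Start Date (in UTC)\tAccount Name\tClicks",
    "4/19/2022\tWiener Stadtwerke GmbH_iprospect\t23",
]) + "\n"

with tempfile.TemporaryDirectory() as tmp:
    # Write the same sample twice to imitate a folder with several exports.
    for name in ("report_a.csv", "report_b.csv"):
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(codecs.BOM_UTF16_LE + sample_text.encode("utf-16-le"))

    combined = read_all_reports(tmp)

print(combined.shape)  # (2, 3)
```

With real data, replace the temporary folder with your own path, for example `read_all_reports(path)`.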