I'm looking for a robust way to pull rows whose name column matches certain criteria into a new CSV file. My file has a few thousand rows, with new rows added every day, and I want to pull out rows like the following:
My.csv
id name url phone
1 ACME http://www.someurl.com +44123456789
2 http://www.someurl.com http://www.someurl.com +44123456789
3 longsinglewordname http://www.someurl.com +44123456789
4 long multiple words name http://www.someurl.com +44123456789
5 ACME http://www.someurl.com +44123456789
6 acme http://www.someurl.com +44123456789
7 CMEA LLC http://www.someurl.com +44123456789
8 CMEA lLC http://www.someurl.com +44123456789
9 Correct Name http://www.someurl.com +44123456789
10 12345 http://www.someurl.com +44123456789
11 Correct Name2 http://www.someurl.com +44123456789
In my output, I'm aiming to list every row that meets my criteria, with an appended 'issue' column highlighting which criteria were met. A single row can meet several criteria; see row 2 (url, longWord). At the moment I can't figure out how to implement this in my code. Can anyone help?
Desired new.csv:
id name url phone issue
1 ACME http://www.someurl.com +44123456789 dupe
2 http://www.someurl.com http://www.someurl.com +44123456789 url
longWord
3 longsinglewordname http://www.someurl.com +44123456789 longWord
4 long multiple words name http://www.someurl.com +44123456789 longMultiple
5 ACME http://www.someurl.com +44123456789 dupe
6 acme http://www.someurl.com +44123456789 dupe
7 CMEA LLC http://www.someurl.com +44123456789 regTitle
8 CMEA lLC http://www.someurl.com +44123456789 regTitle
10 12345 http://www.someurl.com +44123456789 numericVal
My code so far:
import pandas as pd
df = pd.read_csv("path/to/my.csv")
df1 = pd.concat(g for _, g in df.groupby("name") if len(g) > 1)
# below I'm trying to get only numeric values within the name column, but it
# fails with 'Invalid syntax' pointing at the exclamation mark in !str_detect
df2 = df[!str_detect(df$name,("([0-9])")),]
# not sure of the best way to tell long single words apart from long multi-word names
df3 = df[df['name'].apply(lambda x: len(str(x)) > 15)]
df4 = # not sure how to recognize that the string contains something like '.xxx' or '.xx'
# not sure if this is the best way to save/append my results
df1.to_csv('/path/to/my/new.csv', index = False)
df2.to_csv('/path/to/my/new.csv', mode='a', index = False)
df3.to_csv('/path/to/my/new.csv', mode='a', index = False)
df4.to_csv('/path/to/my/new.csv', mode='a', index = False)
Any help is welcome. Thanks in advance!
Your conditions are straightforward; just do:
import re

duplicates = df['name'].str.lower().duplicated(keep=False)
longtitles = df['name'].str.len() > 15  # this includes single words > 15 chars already
contains = df['name'].str.contains('|'.join(['LLC', 'inc']), flags=re.IGNORECASE)
numerics = df['name'].str.match(r'^\d+$')
urls = df['name'].str.match(r'https?://')
# output
df[duplicates|longtitles|contains|numerics|urls]
Output:
id name url phone
0 1 ACME http://www.someurl.com +44123456789
1 2 http://www.someurl.com http://www.someurl.com +44123456789
2 3 longsinglewordname http://www.someurl.com +44123456789
3 4 long multiple words name http://www.someurl.com +44123456789
4 5 ACME http://www.someurl.com +44123456789
5 6 acme http://www.someurl.com +44123456789
6 7 CMEA LLC http://www.someurl.com +44123456789
7 8 CMEA lLC http://www.someurl.com +44123456789
9 10 12345 http://www.someurl.com +44123456789
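To also build the 'issue' column the question asks for, one way (a sketch, using a hypothetical inline DataFrame in place of reading my.csv, and splitting the length check into single-word vs. multi-word labels to match the desired output) is to keep one labelled boolean mask per condition and join the labels of every mask that fires for a row:

```python
import re

import pandas as pd

# Hypothetical sample standing in for pd.read_csv("path/to/my.csv")
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    "name": ["ACME", "http://www.someurl.com", "longsinglewordname",
             "long multiple words name", "ACME", "acme", "CMEA LLC",
             "CMEA lLC", "Correct Name", "12345", "Correct Name2"],
})

# One boolean mask per issue label, mirroring the conditions above
masks = {
    "dupe": df["name"].str.lower().duplicated(keep=False),
    "longWord": (df["name"].str.len() > 15) & ~df["name"].str.contains(" "),
    "longMultiple": (df["name"].str.len() > 15) & df["name"].str.contains(" "),
    "regTitle": df["name"].str.contains("LLC|inc", flags=re.IGNORECASE),
    "numericVal": df["name"].str.match(r"^\d+$"),
    "url": df["name"].str.match(r"https?://"),
}

# Join every matching label into a single 'issue' string per row
df["issue"] = [
    ", ".join(label for label, m in masks.items() if m.iloc[i])
    for i in range(len(df))
]

# Keep only flagged rows and write them out in one go
out = df[df["issue"] != ""]
out.to_csv("new.csv", index=False)
```

Writing the combined frame once also avoids the repeated header rows you get from appending several `to_csv(..., mode='a')` calls. Row 2 ends up with `"longWord, url"` in its issue column, since both masks fire for it.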