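I have a DataFrame and need to count, for each group, how many rows contain the word yes. The word has to match as a whole word (not as a substring), and a row should still count when the word has punctuation next to it. For example: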
id group text
1 a hey there
2 c no you can
3 a yes yes yes
4 b yes or no
5 b you need to say yes.
6 a yes you can
7 d yes!
8 c no&
9 b ok
Expected result:
group count
a 2
b 2
c 0
d 1
I tried:
sql_q = spark.sql("select group, count(*) as count from my_table where text LIKE ' yes' or text LIKE 'yes ' or text LIKE ' yes ' group by group")
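Try this. The pattern uses `\b` word boundaries so that yes only matches as a whole word (case-insensitively), including when punctuation sits next to it. The backslashes are doubled because Spark's SQL parser unescapes string literals by default:
val sql_q = spark.sql(
"""
|select group, sum(
| case when (text rlike '(?i)\\byes\\b') then 1 else 0 end
| ) as count
|from my_table group by group
""".stripMargin)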
sql_q.show(false)
/**
* +-----+-----+
* |group|count|
* +-----+-----+
* |a    |2    |
* |c    |0    |
* |d    |1    |
* |b    |2    |
* +-----+-----+
*/
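To sanity-check a whole-word pattern outside Spark, here is a minimal sketch in plain Python that runs the regex `\byes\b` over the sample rows from the question. It assumes Python's `re` module treats `\b` the same way as the Java regex engine behind Spark's `rlike` for these inputs. The names `rows`, `pattern`, and `count_by_group` are illustrative and not part of any Spark API.

```python
import re

# Sample rows from the question: (id, group, text)
rows = [
    (1, "a", "hey there"),
    (2, "c", "no you can"),
    (3, "a", "yes yes yes"),
    (4, "b", "yes or no"),
    (5, "b", "you need to say yes."),
    (6, "a", "yes you can"),
    (7, "d", "yes!"),
    (8, "c", "no&"),
    (9, "b", "ok"),
]

# \b matches at a word boundary, so "yes." and "yes!" match
# while a longer word such as "yesterday" does not.
pattern = re.compile(r"\byes\b", re.IGNORECASE)


def count_by_group(rows):
    """Count, per group, the rows whose text contains 'yes' as a whole word."""
    counts = {}
    for _, group, text in rows:
        # Register the group first so groups with no match report 0.
        counts.setdefault(group, 0)
        if pattern.search(text):
            counts[group] += 1
    return counts


if __name__ == "__main__":
    for group, count in sorted(count_by_group(rows).items()):
        print(group, count)
```

Each row counts at most once, which is why group a is 2 even though the row "yes yes yes" contains the word three times. This matches the expected result in the question.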