Using a while loop in Python to convert phrases into vectors


Question

I want to convert the following phrases into vectors with sklearn:

Article 1. It is not good to eat pizza after midnight
Article 2. I wouldn't survive a day withouth stackexchange
Article 3. All of these are just random phrases
Article 4. To prove if my experiment works.
Article 5. The red dog jumps over the lazy fox

I have the following code:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

n=0
while n < 5:
   n = n + 1
   a = ('Article %(number)s' % {'number': n})
   print(a)
   with open("LISR2.txt") as openfile:
     for line in openfile:
       if a in line:
           X=line
           print(vectorizer.fit_transform(X))

This gives me the following error:

ValueError: Iterable over raw text documents expected, string object received.

Why does this happen? I know this should work, because if I enter the phrases directly:

X=("It is not good to eat pizza","I wouldn't survive a day", "All of these")

print(vectorizer.fit_transform(X))

it gives me the vectors I want:

(0, 8)  1
(0, 2)  1
(0, 11) 1
(0, 3)  1
(0, 6)  1
(0, 4)  1
(0, 5)  1
(1, 1)  1
(1, 9)  1
(1, 12) 1
(2, 10) 1
(2, 7)  1
(2, 0)  1

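Answer:

Take a look at the documentation. It says CountVectorizer.fit_transform expects an iterable of strings (for example, a list of strings). You are passing a single string.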
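
To make the difference concrete, here is a minimal sketch that reproduces the error and then passes the same text wrapped in a list. The variable names line and raised are only for illustration, and the sample sentence is copied from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
line = "Article 1. It is not good to eat pizza after midnight"

# A bare string is rejected: fit_transform wants an iterable of documents.
try:
    vectorizer.fit_transform(line)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# The same text wrapped in a list is treated as a corpus of one document.
X = vectorizer.fit_transform([line])
print(X.shape)  # one row, one column per distinct token found
```

This removes the error, but it is probably still not what you want if you call it once per line, for the reason explained next.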

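This makes sense, because fit_transform in scikit-learn does two things: 1) it learns the model (fit), and 2) it applies the model to the data (transform). You want to build a matrix whose columns are all the words in the vocabulary and whose rows correspond to the documents. To do that, the vectorizer needs to see the entire vocabulary of the corpus (all the columns) at once. Fitting on one line at a time cannot give you that, because each call only sees the words in that line.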
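
One way to restructure the loop, sketched under a few assumptions: collect the matching lines into a list first, then call fit_transform once on the whole list. The in-memory list lines stands in for the contents of LISR2.txt so the example runs on its own, and stripping the "Article N." label before vectorizing is a choice made here, not something the question requires.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the lines of LISR2.txt (assumed to look like the question's sample).
lines = [
    "Article 1. It is not good to eat pizza after midnight",
    "Article 2. I wouldn't survive a day withouth stackexchange",
    "Article 3. All of these are just random phrases",
    "Article 4. To prove if my experiment works.",
    "Article 5. The red dog jumps over the lazy fox",
]

# Gather every article's text before fitting anything.
documents = []
for n in range(1, 6):
    label = "Article %d." % n
    for line in lines:
        if line.startswith(label):
            # Drop the label so it does not end up in the vocabulary.
            documents.append(line[len(label):].strip())

# Fit once on the whole corpus: the vocabulary then covers all documents.
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(documents)

print(X.shape)  # (number of documents, size of the shared vocabulary)
print(sorted(vectorizer.vocabulary_))
```

With a real file, the lines list would be replaced by reading LISR2.txt once, for example with a with open(...) block, before the loop.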