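I am building a classification model using the following dataframe of 120,000 records (a sample of 5 records is shown):
With this data, I have built the following model: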
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import VarianceThreshold

model = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(
    df2['descrp_clean'], df2['group_name'],
    random_state=0, test_size=0.25, stratify=df2['group_name'])

# For each record, calculate tf-idf
tfidf = TfidfVectorizer(min_df=3, ngram_range=(1, 3))

# X_train: (1) tfidf, (2) reduce dimensionality
x_train_tfidf = tfidf.fit_transform(X_train)
VT_reduce = VarianceThreshold(threshold=0.000005)
x_train_tfidf_reduced = VT_reduce.fit_transform(x_train_tfidf)

# Estimate the Naive Bayes model
clf = model.fit(x_train_tfidf_reduced, y_train)

# X_test: apply the fitted tf-idf and variance threshold
x_test_tfidf = tfidf.transform(X_test)
x_test_tfidf_reduced = VT_reduce.transform(x_test_tfidf)

# Predict using the model
y_pred = model.predict(x_test_tfidf_reduced)

# Compare actual to predicted results (accuracy, in percent)
model.score(x_test_tfidf_reduced, y_test) * 100
I can create a dataframe showing the word tokens before the variance threshold is applied:
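import pandas as pd

# On scikit-learn 1.0+, use tfidf.get_feature_names_out(); get_feature_names() was removed in 1.2.
X_train_tokens = tfidf.get_feature_names()
x_train_df = pd.DataFrame(X_train_tokens)
x_train_df.tail(5)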
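After the variance reduction, the number of features drops to 21,758:
Question: how do I create a dataframe like x_train_df, built after the variance reduction is applied, that shows my 21,758 features?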
This is more of a scikit-learn question than a Databricks question. But I think VT_reduce.get_support() may be what you are looking for:
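A minimal sketch of that idea, run on a toy corpus rather than the asker's data. The documents, the `0.01` threshold, and the names `all_tokens`, `kept_tokens` and `x_train_reduced_df` are assumptions made for illustration. `get_support()` returns a boolean mask over the original tf-idf columns, so indexing the token array with it should leave only the tokens that survived the threshold:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold

# Toy corpus standing in for df2['descrp_clean'] (hypothetical data).
docs = [
    "fruit red apple pie",
    "fruit green apple tart",
    "fruit red cherry pie",
    "fruit green pear tart",
]

tfidf = TfidfVectorizer()
x_train_tfidf = tfidf.fit_transform(docs)

# Threshold picked for the toy data only (the question uses 0.000005).
# "fruit" appears in every document, so its tf-idf is nearly constant
# and is expected to fall below this threshold.
VT_reduce = VarianceThreshold(threshold=0.01)
x_train_tfidf_reduced = VT_reduce.fit_transform(x_train_tfidf)

# get_feature_names_out() exists from scikit-learn 1.0 on; fall back to
# the older get_feature_names() when it is not available.
get_names = getattr(tfidf, "get_feature_names_out", None) or tfidf.get_feature_names
all_tokens = np.asarray(get_names())

# Boolean mask: True for each tf-idf column kept by the variance threshold.
mask = VT_reduce.get_support()
kept_tokens = all_tokens[mask]

# One row per surviving token, in the same order as the reduced matrix columns.
x_train_reduced_df = pd.DataFrame(kept_tokens, columns=["token"])
print(x_train_reduced_df)
```

Applied to the code in the question, the same indexing on the fitted `tfidf` and `VT_reduce` objects should give a dataframe with one row per retained feature, matching the column count of `x_train_tfidf_reduced`.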