feature-selection

Feature/Variable importance after a PCA analysis

Question: I have performed a PCA analysis on my original dataset, and from the compressed dataset transformed by the PCA I have also selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling with the identification of the …

Total answers: 3
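
One common way to recover per-feature importance after PCA is to weight the absolute loadings in pca.components_ by each component's explained-variance ratio. A minimal sketch of that approach; the iris data and the 94% threshold here are illustrative, not the asker's dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=0.94)  # keep enough PCs to explain ~94% of the variance
pca.fit(X)

# Sum each original feature's absolute loadings across the retained
# components, weighted by explained-variance ratio, as a rough score.
importance = np.abs(pca.components_.T) @ pca.explained_variance_ratio_
for i, score in enumerate(importance):
    print(f"feature {i}: {score:.3f}")
```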

scikit-learn – feature importance calculation in decision trees

Question: I’m trying to understand how feature importance is calculated for decision trees in scikit-learn. This question has been asked before, but I am unable to reproduce the results the algorithm is providing. For example: from StringIO import StringIO from sklearn.datasets import load_iris from sklearn.tree …

Total answers: 2
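
For reference, scikit-learn's impurity-based importances can be reproduced by hand from the fitted tree_ structure: each split contributes its weighted impurity decrease, summed per feature and normalized to sum to one. A hedged sketch on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

t = clf.tree_
importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, no contribution
        continue
    n = t.weighted_n_node_samples
    # impurity decrease = weighted parent impurity minus weighted children impurity
    gain = (n[node] * t.impurity[node]
            - n[left] * t.impurity[left]
            - n[right] * t.impurity[right])
    importances[t.feature[node]] += gain

importances /= importances.sum()  # scikit-learn normalizes to sum to 1
print(np.allclose(importances, clf.feature_importances_))  # expect True
```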

All intermediate steps should be transformers and implement fit and transform

Question: I am implementing a pipeline using important-feature selection and then using the same features to train my random forest classifier. Following is my code. m = ExtraTreesClassifier(n_estimators = 10) m.fit(train_cv_x, train_cv_y) sel = SelectFromModel(m, prefit=True) X_new = sel.transform(train_cv_x) clf = RandomForestClassifier(5000) model = …

Total answers: 3
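
This error usually means a bare estimator was placed where the pipeline expects a transformer. A minimal sketch of the usual fix: make SelectFromModel itself the intermediate step, since it implements fit and transform while a classifier does not. The hyperparameters mirror the question's setup but are otherwise illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline([
    # intermediate step: a transformer (has fit and transform)
    ("select", SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=0))),
    # final step: an estimator (only needs fit/predict)
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```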

Information Gain calculation with Scikit-learn

Question: I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy. In Weka, this would be calculated …

Total answers: 3
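
For discrete term counts, H(Class) - H(Class | Attribute) is exactly the mutual information between attribute and class, so sklearn's mutual_info_classif is one way to approximate the Weka-style score. A hedged sketch; note that scikit-learn uses the natural log where Weka uses log2, and the tiny corpus is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

docs = ["the cat sat", "the dog ran", "cats and dogs", "the stock market fell"]
labels = [0, 0, 0, 1]

cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse document-term matrix

# discrete_features=True is required for sparse input and matches count data
ig = mutual_info_classif(X, labels, discrete_features=True, random_state=0)
for term, score in sorted(zip(cv.get_feature_names_out(), ig),
                          key=lambda pair: -pair[1]):
    print(term, round(score, 3))
```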

Random Forest Feature Importance Chart using Python

Question: I am working with RandomForestRegressor in Python and I want to create a chart that will illustrate the ranking of feature importance. This is the code I used: from sklearn.ensemble import RandomForestRegressor MT = pd.read_csv("MT_reduced.csv") df = MT.reset_index(drop = False) columns2 = df.columns.tolist() # Filter the columns to …

Total answers: 8
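
A minimal sketch of such a ranking chart, built from feature_importances_ with matplotlib's barh; the question's CSV and column handling are replaced with synthetic data for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, random_state=0)
names = np.array([f"feature_{i}" for i in range(X.shape[1])])

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Sort ascending so the most important feature lands at the top of the chart
order = np.argsort(rf.feature_importances_)
plt.barh(names[order], rf.feature_importances_[order])
plt.xlabel("Importance")
plt.title("Random forest feature importance")
plt.tight_layout()
plt.show()
```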

Logistic Regression: How to find the top three features that have the highest weights?

Question: I am working on the UCI breast cancer dataset and trying to find the top 3 features that have the highest weights. I was able to find the weights of all features using logmodel.coef_, but how can I get the feature names? Below is …

Total answers: 2
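
A hedged sketch of pairing logmodel.coef_ with the dataset's feature names and ranking by absolute weight (for binary problems coef_ has shape (1, n_features); the variable name logmodel mirrors the question):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)  # scale so weights are comparable

logmodel = LogisticRegression(max_iter=1000).fit(X, data.target)

# Flatten the (1, n_features) coefficient matrix and take the 3 largest |weights|
weights = logmodel.coef_.ravel()
top3 = np.argsort(np.abs(weights))[::-1][:3]
for idx in top3:
    print(data.feature_names[idx], weights[idx])
```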

The easiest way to get feature names after running SelectKBest in scikit-learn

Question: I would like to do supervised learning. So far I know how to do supervised learning with all features. However, I would also like to run an experiment with the K best features. I read the documentation and found that in scikit-learn …

Total answers: 9
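
One straightforward route is the fitted selector's get_support() method, which returns a boolean mask that indexes directly into the original feature names. A minimal sketch on the iris data, since the question's own dataset is not shown:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
selector = SelectKBest(f_classif, k=2).fit(data.data, data.target)

mask = selector.get_support()  # boolean mask, True for kept features
selected = [name for name, keep in zip(data.feature_names, mask) if keep]
print(selected)
```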

How is the feature score (importance) in the XGBoost package calculated?

Question: The command xgb.importance returns a graph of feature importance measured by an F score. What does this F score represent, and how is it calculated? Output: Graph of feature importance Answers: This is a metric that simply sums up …

Total answers: 2
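
For context, the default F score in the xgboost Python package is the 'weight' importance type, i.e. the number of times a feature is used as a split across all trees; Booster.get_score also exposes the 'gain' and 'cover' definitions. A hedged sketch on synthetic data:

```python
import numpy as np
import xgboost as xgb

X = np.random.RandomState(0).rand(100, 5)
y = (X[:, 0] + X[:, 3] > 1).astype(int)

dtrain = xgb.DMatrix(X, label=y, feature_names=[f"f{i}" for i in range(5)])
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=20)

# 'weight' counts splits per feature (the F score); 'gain' averages the
# loss reduction each feature's splits achieved. Unused features are omitted.
print(booster.get_score(importance_type="weight"))
print(booster.get_score(importance_type="gain"))
```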

Linear regression analysis with string/categorical features (variables)?

Question: Regression algorithms seem to work on features represented as numbers. For example: This data set doesn’t contain categorical features/variables. It’s quite clear how to do regression on this data and predict price. But now I want to do a regression analysis on data that contain categorical …

Total answers: 4
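
The standard approach is to one-hot encode the categorical columns before fitting, for example with OneHotEncoder inside a ColumnTransformer. A minimal sketch; the column names and values are invented for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "sqft": [1400, 1600, 1700, 1100],
    "neighborhood": ["east", "west", "west", "north"],  # categorical feature
    "price": [245000, 312000, 279000, 199000],
})

model = Pipeline([
    # expand the string column into 0/1 indicator columns, pass sqft through
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"])],
        remainder="passthrough")),
    ("reg", LinearRegression()),
])
model.fit(df[["sqft", "neighborhood"]], df["price"])
print(model.predict(pd.DataFrame({"sqft": [1500], "neighborhood": ["west"]})))
```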

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

Question: I’m a little confused about how to use ngrams in the scikit-learn library in Python, specifically, how the ngram_range argument works in a CountVectorizer. Running this code: from sklearn.feature_extraction.text import CountVectorizer vocabulary = ['hi ', 'bye', 'run away'] cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2)) print cv.vocabulary_ …

Total answers: 1
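
A hedged sketch of the behavior being asked about: ngram_range=(1, 2) extracts both unigrams and bigrams, but with a fixed vocabulary only entries in that vocabulary are counted, and an entry like 'hi ' (with a trailing space) can never match because the default tokenizer discards whitespace:

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))

X = cv.transform(['run away fast then say bye'])
print(cv.vocabulary_)  # {'hi ': 0, 'bye': 1, 'run away': 2}
print(X.toarray())     # [[0, 1, 1]]: 'bye' and the bigram 'run away' each match once
```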