SkLearnDataFlow
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
1. Fit & Predict
- The samples matrix (or design matrix) X (https://scikit-learn.org/stable/glossary.html#term-X). The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.
- The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of X.
- Both X and y are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices; see the sketch after the code example below.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1] # classes of each sample
clf.fit(X, y)
clf.predict(X) # predict classes of the training data
clf.predict([[4, 5, 6], [14, 15, 16]]) # predict classes of new data
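As noted above, many estimators also accept sparse input. A minimal sketch, assuming SciPy is available, converting the same data to a CSR matrix (added here for illustration, not part of the original example):
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)   # same 2x3 data, stored in compressed sparse row format
clf.fit(X_sparse, y)       # RandomForestClassifier accepts sparse matrices
clf.predict(X_sparse)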
2. Pipelines
- Construction
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer
make_pipeline(Binarizer(), MultinomialNB())
# or, equivalently:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)  # steps can be accessed like a list, by index or by name
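Once constructed, the pipeline behaves like a single estimator, and its steps can be retrieved by index or by name. A minimal sketch, using the iris dataset purely for illustration:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
pipe.fit(X, y)            # fit PCA, then fit SVC on the reduced features
pipe.predict(X)           # transform with PCA, then predict with SVC
pipe[0]                   # the 'reduce_dim' step, accessed by index
pipe.named_steps['clf']   # the SVC step, accessed by name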
- Nested parameters
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
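Step parameters are addressed with the <step name>__<parameter> syntax, so clf__C above sets the C parameter of whichever estimator is plugged into the 'clf' step. A minimal sketch of running the search, with a synthetic dataset assumed only for illustration:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
grid_search.fit(X, y)        # cross-validates every parameter combination
grid_search.best_params_     # best combination found on this data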
- FeatureUnion
A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)
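A minimal sketch of applying the union to a small toy matrix (the numbers below are made up for illustration); the output concatenates the PCA components and the KernelPCA components column-wise:
import numpy as np
X = np.array([[0., 1., 2.],
              [2., 4., 6.],
              [1., 0., 3.]])
combined.fit_transform(X)  # PCA columns followed by KernelPCA columns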
- ColumnTransformer for Heterogeneous data
Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:
- Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
- You may want to include the parameters of the preprocessors in a parameter search.
import pandas as pd
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')
column_trans.fit(X)
column_trans.get_feature_names_out()
column_trans.transform(X).toarray()
# make_column_selector: select columns based on data type or column name
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_selector
ct = ColumnTransformer([
    ('scale', StandardScaler(),
     make_column_selector(dtype_include=np.number)),
    ('onehot',
     OneHotEncoder(),
     make_column_selector(pattern='city', dtype_include=object))])
ct.fit_transform(X)
3. Model Evaluation
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()
result = cross_validate(lr, X, y) # defaults to 5-fold CV
result['test_score'] # r_squared score is high because dataset is easy
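cross_validate also accepts a cv argument and one or more scoring metrics. A minimal sketch, with the metric names taken from scikit-learn's scoring API:
result = cross_validate(lr, X, y, cv=10,
                        scoring=['r2', 'neg_mean_absolute_error'])
result['test_r2']                       # one R^2 score per fold
result['test_neg_mean_absolute_error']  # one (negated) MAE per fold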
4. Parameter Search
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}
# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)
search.fit(X_train, y_train)
search.best_params_
# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
search.score(X_test, y_test)