Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion, which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transforming y). In contrast, Pipelines only transform the observed data (X).
Pipeline can be used to chain multiple estimators into one. This is useful because there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:
Convenience and encapsulation
You only have to call fit and predict once on your data to fit a whole sequence of estimators.
Joint parameter selection
You can grid search over parameters of all estimators in the pipeline at once.
Safety
Pipelines help avoid leaking statistics from your test data into the model trained in cross-validation, by ensuring that the same samples are used to train the transformers and predictors (a small sketch of this point follows this list).
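A minimal sketch of that safety behaviour, assuming the iris data as a stand-in dataset: with the scaler and classifier wrapped in a single Pipeline, cross-validation fits the scaler on the training folds only:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> X_iris, y_iris = load_iris(return_X_y=True)
>>> leak_free = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
>>> # The scaler is re-fitted on each training fold, so statistics of the
>>> # held-out fold never leak into the fitted model.
>>> scores = cross_val_score(leak_free, X_iris, y_iris)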
All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be of any type (transformer, classifier, etc.).
The Pipeline is built using a list of (key, value) pairs, where key is a string containing the name you want to give this step and value is an estimator object:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])
The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB())
Pipeline(steps=[('binarizer', Binarizer()), ('multinomialnb', MultinomialNB())])
The estimators of a pipeline are stored as a list in the steps attribute, but a pipeline can also be indexed, and its steps accessed by index or by name (via [idx]):
>>> pipe.steps[0]
('reduce_dim', PCA())
>>> pipe[0]
PCA()
>>> pipe['reduce_dim']
PCA()
The pipeline's named_steps attribute allows accessing steps by name, with tab completion in interactive environments:
>>> pipe.named_steps.reduce_dim is pipe['reduce_dim']
True
A sub-pipeline can also be extracted using the slicing notation commonly used for Python sequences such as lists or strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations (or their inverse):
>>> pipe[:1]
Pipeline(steps=[('reduce_dim', PCA())])
>>> pipe[-1:]
Pipeline(steps=[('clf', SVC())])
Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:
>>> pipe.set_params(clf__C=10)
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC(C=10))])
This is particularly important for doing grid searches:
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = dict(reduce_dim__n_components=[2, 5, 10],
... clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to 'passthrough':
>>> from sklearn.linear_model import LogisticRegression
>>> param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
... clf=[SVC(), LogisticRegression()],
... clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
The estimators of the pipeline can be retrieved by index:
>>> pipe[0]
PCA()
or by name:
>>> pipe['reduce_dim']
PCA()
Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has: if the last estimator is a classifier, the Pipeline can be used as a classifier; if the last estimator is a transformer, so is the pipeline.
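As an illustration, a minimal sketch (using the iris data as a stand-in dataset and the pipe object built above) of the pipeline behaving like a classifier:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X_iris, y_iris = load_iris(return_X_y=True)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0)
>>> _ = pipe.fit(X_tr, y_tr)   # fits PCA, transforms, then fits the SVC
>>> pipe.predict(X_te[:3])     # predict is delegated to the final SVC step
array([...])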
Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This feature is used to avoid recomputing the fitted transformers within a pipeline if the parameters and the input data are identical. A typical example is the case of a grid search, in which the transformers can be fitted only once and reused for each configuration.
The memory parameter is needed in order to cache the transformers. It can be either a string containing the directory in which to cache the transformers, or a joblib.Memory object:
>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> cachedir = mkdtemp()
>>> pipe = Pipeline(estimators, memory=cachedir)
>>> pipe
Pipeline(memory=...,
steps=[('reduce_dim', PCA()), ('clf', SVC())])
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)
Warning: Side effect of caching transformers
Using a Pipeline without cache enabled, it is possible to inspect the original instances, for example:
>>> from sklearn.datasets import load_digits
>>> X_digits, y_digits = load_digits(return_X_y=True)
>>> pca1 = PCA()
>>> svm1 = SVC()
>>> pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])
>>> pipe.fit(X_digits, y_digits)
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])
>>> # The pca instance can be inspected directly
>>> print(pca1.components_)
[[-1.77484909e-19 ... 4.07058917e-18]]
Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. In the following example, accessing the PCA instance pca2 will raise an AttributeError, since pca2 remains an unfitted transformer. Instead, use the attribute named_steps to inspect the estimators within the pipeline:
>>> cachedir = mkdtemp()
>>> pca2 = PCA()
>>> svm2 = SVC()
>>> cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
... memory=cachedir)
>>> cached_pipe.fit(X_digits, y_digits)
Pipeline(memory=...,
steps=[('reduce_dim', PCA()), ('clf', SVC())])
>>> print(cached_pipe.named_steps['reduce_dim'].components_)
[[-1.77484909e-19 ... 4.07058917e-18]]
>>> # Remove the cache directory
>>> rmtree(cachedir)
TransformedTargetRegressor transforms the target y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as arguments the regressor that will be used for prediction and the transformer that will be applied to the target variable:
>>> import numpy as np
>>> from sklearn.datasets import load_boston
>>> from sklearn.compose import TransformedTargetRegressor
>>> from sklearn.preprocessing import QuantileTransformer
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_boston(return_X_y=True)
>>> transformer = QuantileTransformer(output_distribution='normal')
>>> regressor = LinearRegression()
>>> regr = TransformedTargetRegressor(regressor=regressor,
... transformer=transformer)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))
R2 score: 0.67
>>> raw_target_regr = LinearRegression().fit(X_train, y_train)
>>> print('R2 score: {0:.2f}'.format(raw_target_regr.score(X_test, y_test)))
R2 score: 0.64
For simple transformations, instead of a Transformer object, a pair of functions can be passed, defining the transformation and its inverse mapping:
>>> def func(x):
... return np.log(x)
>>> def inverse_func(x):
... return np.exp(x)
Subsequently, the object is created as:
>>> regr = TransformedTargetRegressor(regressor=regressor,
... func=func,
... inverse_func=inverse_func)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))
R2 score: 0.65
By default, the provided functions are checked at each fit to be the inverse of each other. However, it is possible to bypass this checking by setting check_inverse to False:
>>> def inverse_func(x):
... return x
>>> regr = TransformedTargetRegressor(regressor=regressor,
... func=func,
... inverse_func=inverse_func,
... check_inverse=False)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))
R2 score: -4.50
**Note:** The transformation can be triggered by setting either transformer or the pair of functions func and inverse_func. However, setting both options at the same time will raise an error.
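For instance, reusing the objects defined above, a minimal sketch of this error case (the exact error message is elided):

>>> regr_both = TransformedTargetRegressor(
...     regressor=LinearRegression(),
...     transformer=QuantileTransformer(output_distribution='normal'),
...     func=np.log, inverse_func=np.exp)
>>> regr_both.fit(X_train, y_train)
Traceback (most recent call last):
    ...
ValueError: ...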
FeatureUnion combines several transformer objects into a new transformer that merges their output.
A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.
If you want to apply different transformations to each field of the data, see the related class sklearn.compose.ColumnTransformer (see the user guide).
FeatureUnion serves the same purposes as Pipeline: convenience as well as joint parameter estimation and validation.
FeatureUnion and Pipeline can be combined to create complex models (a small sketch of such nesting follows the construction example below).
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller's responsibility.)
A FeatureUnion is built using a list of (key, value) pairs, where key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:
>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(transformer_list=[('linear_pca', PCA()),
('kernel_pca', KernelPCA())])
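As a minimal sketch (with hypothetical step names) of nesting a FeatureUnion inside a Pipeline:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> union = FeatureUnion([('linear_pca', PCA(n_components=2)),
...                       ('kernel_pca', KernelPCA(n_components=2))])
>>> complex_model = Pipeline([('features', union), ('clf', SVC())])
>>> # Fitting complex_model would fit both PCAs in parallel, concatenate
>>> # their outputs, and train the SVC on the combined feature space.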
Like pipelines, feature unions have a shorthand constructor, make_union, that does not require explicit naming of the components.
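For example, a minimal sketch of the shorthand, equivalent to the union above apart from the automatically generated names:

>>> from sklearn.pipeline import make_union
>>> make_union(PCA(), KernelPCA())
FeatureUnion(transformer_list=[('pca', PCA()),
                               ('kernelpca', KernelPCA())])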
Like Pipeline, individual steps may be replaced using set_params, and ignored by setting them to 'drop':
>>> combined.set_params(kernel_pca='drop')
FeatureUnion(transformer_list=[('linear_pca', PCA()),
('kernel_pca', 'drop')])
Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:
1. Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
2. You may want to include the parameters of the preprocessors in a parameter search.
ColumnTransformer helps performing different transformations for different columns of the data within a Pipeline that is safe from data leakage and that can be parametrized. ColumnTransformer works on arrays, sparse matrices, and pandas DataFrames.
To each column, a different transformation can be applied, such as preprocessing or a specific feature extraction method:
>>> import pandas as pd
>>> X = pd.DataFrame(
... {'city': ['London', 'London', 'Paris', 'Sallisaw'],
... 'title': ["His Last Bow", "How Watson Learned the Trick",
... "A Moveable Feast", "The Grapes of Wrath"],
... 'expert_rating': [5, 3, 4, 5],
... 'user_rating': [4, 5, 4, 3]})
For this data, we might want to encode the city column as a categorical variable using preprocessing.OneHotEncoder, but apply a feature_extraction.text.CountVectorizer to the title column. As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say 'city_category' and 'title_bow'. By default, the remaining rating columns are ignored (remainder='drop'):
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
... [('city_category', OneHotEncoder(dtype='int'),['city']),
... ('title_bow', CountVectorizer(), 'title')],
... remainder='drop')
>>> column_trans.fit(X)
ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
['city']),
('title_bow', CountVectorizer(), 'title')])
>>> column_trans.get_feature_names()
['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']
>>> column_trans.transform(X).toarray()
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]...)
In the above example, the CountVectorizer expects a 1D array as input and therefore the column was specified as a string ('title'). However, preprocessing.OneHotEncoder, like most other transformers, expects 2D data; in that case you therefore need to specify the column as a list of strings (['city']).
Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, a boolean mask, or with a make_column_selector. The make_column_selector is used to select columns based on data type or column name:
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.compose import make_column_selector
>>> ct = ColumnTransformer([
... ('scale', StandardScaler(),
... make_column_selector(dtype_include=np.number)),
... ('onehot',
... OneHotEncoder(),
... make_column_selector(pattern='city', dtype_include=object))])
>>> ct.fit_transform(X)
array([[ 0.904..., 0. , 1. , 0. , 0. ],
[-1.507..., 1.414..., 1. , 0. , 0. ],
[-0.301..., 0. , 0. , 1. , 0. ],
[ 0.904..., -1.414..., 0. , 0. , 1. ]])
Strings can reference columns if the input is a DataFrame, while integers are always interpreted as positional columns.
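For instance, a minimal sketch (with a hypothetical numeric array) of positional column selection by integer index:

>>> import numpy as np
>>> X_num = np.array([[0., 1., 10.],
...                   [1., 0., 20.]])
>>> ct_pos = ColumnTransformer([('scaled', StandardScaler(), [2])],
...                            remainder='passthrough')
>>> # Column 2 is standardized; columns 0 and 1 are passed through and
>>> # appended after the transformed output.
>>> ct_pos.fit_transform(X_num)
array([[-1.,  0.,  1.],
       [ 1.,  1.,  0.]])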
We can keep the remaining rating columns by setting remainder='passthrough'. The values are appended to the end of the transformation:
>>> column_trans = ColumnTransformer(
... [('city_category', OneHotEncoder(dtype='int'),['city']),
... ('title_bow', CountVectorizer(), 'title')],
... remainder='passthrough')
>>> column_trans.fit_transform(X)
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]]...)
The remainder parameter can also be set to an estimator to transform the remaining rating columns. The transformed values are appended to the end of the transformation:
>>> from sklearn.preprocessing import MinMaxScaler
>>> column_trans = ColumnTransformer(
... [('city_category', OneHotEncoder(), ['city']),
... ('title_bow', CountVectorizer(), 'title')],
... remainder=MinMaxScaler())
>>> column_trans.fit_transform(X)[:, -2:]
array([[1. , 0.5],
[0. , 1. ],
[0.5, 0.5],
[1. , 0. ]])
The make_column_transformer function is available to more easily create a ColumnTransformer object. Specifically, the names are given automatically. The equivalent of the above example would be:
>>> from sklearn.compose import make_column_transformer
>>> column_trans = make_column_transformer(
... (OneHotEncoder(), ['city']),
... (CountVectorizer(), 'title'),
... remainder=MinMaxScaler())
>>> column_trans
ColumnTransformer(remainder=MinMaxScaler(),
transformers=[('onehotencoder', OneHotEncoder(), ['city']),
('countvectorizer', CountVectorizer(),
'title')])
When estimators are displayed in a Jupyter notebook, they can be shown as an HTML representation. This is useful to diagnose or visualize a Pipeline with many estimators. This visualization is activated by setting the display option in sklearn.set_config:
>>> from sklearn import set_config
>>> set_config(display='diagram')
>>> # displays the HTML representation in a jupyter context
>>> column_trans
An example of the HTML output can be seen in the HTML representation of Pipeline section of the Column Transformer with mixed types example. As an alternative, the HTML can be written to a file using estimator_html_repr (here clf stands for any estimator, for example a pipeline defined earlier):
>>> from sklearn.utils import estimator_html_repr
>>> with open('my_estimator.html', 'w') as f:
... f.write(estimator_html_repr(clf))