In this tutorial, you will learn how to use pipelines to clean up your modeling code.
Introduction
Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.
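To make the idea concrete, here is a minimal sketch using scikit-learn's Pipeline class, which this tutorial builds up in detail below; the particular imputer and model are just placeholders:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
# One object that imputes missing values and then fits/predicts with the model
bundle = Pipeline(steps=[('imputer', SimpleImputer()),
                         ('model', RandomForestRegressor())])
# bundle.fit(X_train, y_train) and bundle.predict(X_valid) now behave
# like calls to a single estimator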
Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:
- Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
- Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
- Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
- More Options for Model Validation: You will see an example in the next tutorial, which covers cross-validation. (A brief preview follows this list.)
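As a quick preview of that last benefit, a whole pipeline can be passed to scikit-learn's cross_val_score, which re-fits the preprocessing on each training fold automatically. This is only a sketch here, since my_pipeline is not built until later in this tutorial:
from sklearn.model_selection import cross_val_score
# Multiply by -1 since sklearn reports a negative MAE score
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5,
                              scoring='neg_mean_absolute_error')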
Example
As in the previous tutorial, we will work with the Melbourne Housing dataset.
We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in X_train, X_valid, y_train, and y_valid.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('../00 datasets/dansbecker/melbourne-housing-snapshot/melb_data.csv')
# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
random_state=0)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
We take a peek at the training data with the head() method below. Notice that the data contains both categorical data and columns with missing values. With a pipeline, it's easy to deal with both!
X_train.head()
| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 1940.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 1.0 | 193.0 | NaN | NaN | -37.85800 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 1.0 | 555.0 | NaN | NaN | -37.79880 | 144.8220 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 1.0 | 265.0 | NaN | 1995.0 | -37.70830 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 2.0 | 673.0 | 673.0 | 1970.0 | -37.76230 | 144.8272 | 4217.0 |
We construct the full pipeline in three steps.
Step 1: Define Preprocessing Steps
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:
- imputes missing values in numerical data, and
- imputes missing values and applies a one-hot encoding to categorical data.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_cols),
('cat', categorical_transformer, categorical_cols)
])
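Note that a ColumnTransformer can also be fit on its own, which makes for a handy sanity check. For instance (an optional check, not part of the original tutorial flow):
# Fit the preprocessor alone and inspect the transformed training data;
# one-hot encoding adds one column per category, so expect more columns
preprocessed = preprocessor.fit_transform(X_train)
print(preprocessed.shape)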
Step 2: Define the Model
Next, we define a random forest model with the familiar RandomForestRegressor class.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)
Step 3: Create and Evaluate the Pipeline
Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:
- With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)
- With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)
from sklearn.metrics import mean_absolute_error
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('model', model)
])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
MAE: 160679.18917034855
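A fitted pipeline is also easy to inspect: its named_steps attribute maps each step's name to the corresponding fitted estimator, as shown below.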
my_pipeline.named_steps
{'preprocessor': ColumnTransformer(transformers=[('num', SimpleImputer(strategy='constant'),
['Rooms', 'Distance', 'Postcode', 'Bedroom2',
'Bathroom', 'Car', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude',
'Propertycount']),
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='most_frequent')),
('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
['Type', 'Method', 'Regionname'])]),
'model': RandomForestRegressor(random_state=0)}
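One payoff of the "Easier to Productionize" benefit above is that the fitted pipeline is a single object, so the preprocessing and the model can be saved and reloaded together. A minimal sketch, assuming the joblib package is available (the filename is arbitrary):
import joblib
# Persist the fitted pipeline (preprocessing + model) as one artifact
joblib.dump(my_pipeline, 'melbourne_pipeline.joblib')
# Later: reload it and predict directly on raw, unprocessed features
loaded_pipeline = joblib.load('melbourne_pipeline.joblib')
loaded_preds = loaded_pipeline.predict(X_valid)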
Conclusion
Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.
Your Turn
Use a pipeline in the next exercise to apply advanced data preprocessing techniques and improve your predictions!