Flashield's Blog

Just For My Daily Diary

04.course-pipelines (Pipelines)

In this tutorial, you will learn how to use pipelines to clean up your modeling code.

Introduction

Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.
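
To make "use the whole bundle as if it were a single step" concrete, here is a minimal sketch on made-up toy data. The imputation strategy and the linear model are stand-ins chosen for brevity, not the setup used later in this tutorial.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data (made up for illustration): one feature with a missing value
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

# Two steps bundled into one object: fill missing values, then fit a model
toy_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', LinearRegression()),
])

# fit() runs the imputer and then the model on the training data
toy_pipeline.fit(X_toy, y_toy)

# predict() reuses the fitted imputer before calling the model
toy_preds = toy_pipeline.predict(np.array([[3.0], [np.nan]]))
print(toy_preds)
```

The missing value passed to predict() is filled with the mean learned from the training data, so no separate preprocessing call is needed.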

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

  1. Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
  2. Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
  3. Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
  4. More Options for Model Validation: You will see an example in the next tutorial, which covers cross-validation.
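
As a rough preview of the last point, the sketch below passes a pipeline to cross_val_score. Because the pipeline is a single estimator, it is refit (imputer included) on each training fold. The synthetic data and the linear model are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): y depends linearly on two features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=60)
X_demo[::10, 0] = np.nan  # knock out a few values so imputation has work to do

demo_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', LinearRegression()),
])

# scikit-learn reports negated MAE so that higher is always better; flip the sign back
cv_scores = -1 * cross_val_score(demo_pipeline, X_demo, y_demo,
                                 cv=5, scoring='neg_mean_absolute_error')
print('MAE per fold:', cv_scores)
```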

Example

As in the previous tutorial, we will work with the Melbourne Housing dataset.

We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in X_train, X_valid, y_train, and y_valid.


import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('../00 datasets/dansbecker/melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
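
The cardinality filter above can be tried out on a small made-up frame. The column values and the threshold of 4 are hypothetical and chosen only so the toy example has something to filter out; select_dtypes is used here as one way to pick non-numeric columns.

```python
import pandas as pd

# Toy frame (values made up): 'Suburb' has many distinct values, 'Type' only a few
toy_cols = pd.DataFrame({
    'Type': ['u', 'h', 'h', 'u', 't', 'h'],
    'Suburb': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Rooms': [1, 2, 3, 3, 2, 4],
})

max_cardinality = 4  # hypothetical threshold, chosen only for this toy example

# Cardinality = number of unique values per column
cardinality = toy_cols.nunique().to_dict()

# Keep non-numeric columns whose cardinality is below the threshold
non_numeric = toy_cols.select_dtypes(exclude='number').columns
low_cardinality_cols = [c for c in non_numeric
                        if toy_cols[c].nunique() < max_cardinality]

print(cardinality)
print(low_cardinality_cols)
```

Here 'Suburb' is dropped because one-hot encoding it would add one column per distinct value.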

We take a peek at the training data with the head() method below. Notice that the data contains both categorical data and columns with missing values. With a pipeline, it's easy to deal with both!

X_train.head()
       Type  Method             Regionname  Rooms  Distance  Postcode  Bedroom2  Bathroom  Car  Landsize  BuildingArea  YearBuilt  Lattitude  Longtitude  Propertycount
12167     u       S  Southern Metropolitan      1       5.0    3182.0       1.0       1.0  1.0       0.0           NaN     1940.0  -37.85984    144.9867        13240.0
 6524     h      SA   Western Metropolitan      2       8.0    3016.0       2.0       2.0  1.0     193.0           NaN        NaN  -37.85800    144.9005         6380.0
 8413     h       S   Western Metropolitan      3      12.6    3020.0       3.0       1.0  1.0     555.0           NaN        NaN  -37.79880    144.8220         3755.0
 2919     u      SP  Northern Metropolitan      3      13.0    3046.0       3.0       1.0  1.0     265.0           NaN     1995.0  -37.70830    144.9158         8870.0
 6043     h       S   Western Metropolitan      3      13.3    3020.0       3.0       1.0  2.0     673.0         673.0     1970.0  -37.76230    144.8272         4217.0
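
Rather than spotting missing values by eye, one could count them per column. The sketch below does this on a small made-up frame that borrows a few column names from the housing data; the values themselves are invented.

```python
import numpy as np
import pandas as pd

# Toy frame (values made up) mimicking a few columns of the housing data
toy = pd.DataFrame({
    'Type': ['u', 'h', 'h', 'u'],
    'Rooms': [1, 2, 3, 3],
    'BuildingArea': [np.nan, np.nan, 120.0, np.nan],
    'YearBuilt': [1940.0, np.nan, 1970.0, 1995.0],
})

# Columns that are not numeric, i.e. candidates for categorical handling
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()

# Number of missing entries per column, keeping only columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

print(non_numeric_cols)
print(missing_counts)
```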

We construct the full pipeline in three steps.

Step 1: Define Preprocessing Steps

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:

  • imputes missing values in numerical data, and
  • imputes missing values and applies a one-hot encoding to categorical data.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
# (strategy='constant' fills missing entries with fill_value, which defaults to 0 for numerical data)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
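
To see what such a bundled preprocessor produces, the sketch below rebuilds the same structure on a tiny made-up frame. The data, the variable names, and the sparse_threshold=0 setting (used only to force a dense array that prints readably) are assumptions for illustration and are not part of the tutorial's preprocessor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): one numerical and one categorical column, each with a gap
toy_train = pd.DataFrame({'Rooms': [1.0, np.nan, 3.0, 2.0],
                          'Type': ['u', 'h', 'h', np.nan]})
# Validation row whose category 't' never appears in the training data
toy_valid = pd.DataFrame({'Rooms': [2.0], 'Type': ['t']})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='constant'), ['Rooms']),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ]), ['Type']),
    ],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)

# Output columns: imputed 'Rooms', then one indicator column per category seen in training
train_out = toy_preprocessor.fit_transform(toy_train)
valid_out = toy_preprocessor.transform(toy_valid)

print(train_out)
print(valid_out)
```

With handle_unknown='ignore', the unseen category 't' is encoded as all zeros in the indicator columns instead of raising an error.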

Step 2: Define the Model

Next, we define a random forest model with the familiar RandomForestRegressor class.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

Step 3: Create and Evaluate the Pipeline

Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:

  • With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)
  • With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocess the training data and fit the model
my_pipeline.fit(X_train, y_train)

# Preprocess the validation data and get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
MAE: 160679.18917034855
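
For comparison, the manual bookkeeping that the pipeline replaces can be sketched as follows. The toy arrays are made up, and a mean imputer with a linear model stands in for the tutorial's preprocessor and random forest to keep the example short.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy numerical data (made up) with missing values in both splits
X_tr = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [4.0, 40.0]])
y_tr = np.array([1.0, 2.0, 3.0, 4.0])
X_va = np.array([[np.nan, 20.0], [3.0, np.nan]])

# Manual workflow: fit the imputer on the training data, then remember to
# apply the same fitted imputer to the validation data before predicting
manual_imputer = SimpleImputer(strategy='mean')
X_tr_imputed = manual_imputer.fit_transform(X_tr)
X_va_imputed = manual_imputer.transform(X_va)
manual_model = LinearRegression().fit(X_tr_imputed, y_tr)
manual_preds = manual_model.predict(X_va_imputed)

# Pipeline workflow: the same two steps, with the bookkeeping handled for us
pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                       ('model', LinearRegression())])
pipe_preds = pipe.fit(X_tr, y_tr).predict(X_va)

# Both routes perform the same computations, so the predictions agree
print(np.allclose(manual_preds, pipe_preds))
```

The manual route needs four intermediate variables for a single preprocessing step; each additional step or column type adds more.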

The steps of the fitted pipeline can be inspected by name with named_steps:

my_pipeline.named_steps
{'preprocessor': ColumnTransformer(transformers=[('num', SimpleImputer(strategy='constant'),
                                  ['Rooms', 'Distance', 'Postcode', 'Bedroom2',
                                   'Bathroom', 'Car', 'Landsize', 'BuildingArea',
                                   'YearBuilt', 'Lattitude', 'Longtitude',
                                   'Propertycount']),
                                 ('cat',
                                  Pipeline(steps=[('imputer',
                                                   SimpleImputer(strategy='most_frequent')),
                                                  ('onehot',
                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                  ['Type', 'Method', 'Regionname'])]),
 'model': RandomForestRegressor(random_state=0)}
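
As a small sketch of what named_steps is useful for, the example below pulls the fitted model out of a pipeline and changes one of its parameters through the pipeline. The synthetic data, the variable names, and the small forest sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): the target follows the first feature
rng = np.random.RandomState(0)
X_small = rng.normal(size=(40, 3))
y_small = X_small[:, 0] + rng.normal(scale=0.1, size=40)

small_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])
small_pipeline.fit(X_small, y_small)

# named_steps maps each step name to its fitted estimator
fitted_model = small_pipeline.named_steps['model']
importances = fitted_model.feature_importances_
print(importances)

# Parameters of a step can be set through the pipeline as <step name>__<parameter>
small_pipeline.set_params(model__n_estimators=25)
print(small_pipeline.named_steps['model'].n_estimators)
```

Note that set_params only changes the setting; the pipeline has to be fit again before the new value takes effect.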

Conclusion

Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.

Your Turn

In the next exercise, use a pipeline to apply advanced data preprocessing techniques and improve your predictions!



Have questions or comments? Visit the course discussion forum to chat with other learners.
