This notebook is an exercise in the Intermediate Machine Learning course. You can reference the tutorial at this link.
In this exercise, you will use pipelines to improve the efficiency of your machine learning code.
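As a rough illustration of what a pipeline buys you, the sketch below compares doing imputation and modeling by hand with bundling them in a Pipeline. The toy arrays and the choice of LinearRegression are assumptions made for brevity; they are not part of the exercise.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: one feature with a missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
X_new = np.array([[np.nan], [3.0]])

# Manual approach: every step has to be fit and applied in the right order
imputer = SimpleImputer(strategy="mean")
model = LinearRegression()
model.fit(imputer.fit_transform(X), y)
manual_preds = model.predict(imputer.transform(X_new))

# Pipeline approach: one object handles both steps for fit and predict
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
pipeline_preds = pipe.predict(X_new)

# Both routes should give the same predictions
print(np.allclose(manual_preds, pipeline_preds))
```

The pipeline version leaves less room for mistakes such as forgetting to apply the imputer to new data.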
Setup
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex4 import *
print("Setup Complete")
Setup Complete
You will work with data from the Housing Prices Competition for Kaggle Learn Users.
Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(
    X_full, y, train_size=0.8, test_size=0.2, random_state=0)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns
                    if X_train_full[cname].nunique() < 10
                    and X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns
                  if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
X_train.head()
Id | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Condition1 | Condition2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
619 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Norm | Norm | ... | 774 | 0 | 108 | 0 | 0 | 260 | 0 | 0 | 7 | 2007 |
871 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | PosN | Norm | ... | 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 |
93 | RL | Pave | Grvl | IR1 | HLS | AllPub | Inside | Gtl | Norm | Norm | ... | 432 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 8 | 2009 |
818 | RL | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | Norm | Norm | ... | 857 | 150 | 59 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 |
303 | RL | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Norm | Norm | ... | 843 | 468 | 81 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 |
5 rows × 76 columns
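The column selection in the cell above can be tried on a small frame to see which columns survive. This is only a sketch: the toy values and the threshold of 3 are made up (the notebook uses 10), and it swaps the dtype == "object" comparison for select_dtypes so that it does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Hypothetical toy frame with two text columns and two numeric columns
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],   # 2 unique values
    "Neighborhood": ["A", "B", "C", "D"],         # 4 unique values
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, 80.0, float("nan"), 60.0],
})

max_cardinality = 3  # illustrative threshold, not the notebook's value

# Non-numeric columns whose number of unique values is below the threshold
text_cols = toy.select_dtypes(exclude="number").columns
low_card_cols = [c for c in text_cols if toy[c].nunique() < max_cardinality]

# All numeric columns are kept regardless of cardinality
num_cols = list(toy.select_dtypes(include="number").columns)

print(low_card_cols)
print(num_cols)
```

Here Neighborhood is dropped because one-hot encoding it would add one column per distinct value, which is the cost the cardinality filter is meant to limit.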
The next code cell uses code from the tutorial to preprocess the data and train a model. Run this code without changes.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)
# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])
# Preprocessing of training data, fit model
clf.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
print('MAE:', mean_absolute_error(y_valid, preds))
MAE: 17614.81993150685
The code yields a mean absolute error (MAE) of about 17615 in the run shown above. In the next step, you will amend the code to do better.
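One place to look for improvement is the numerical imputer. With strategy='constant' and no fill_value, SimpleImputer fills missing numeric entries with 0, which may be a poor stand-in for a quantity such as lot frontage. The sketch below contrasts that with the median strategy on a made-up column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing entry
col = np.array([[60.0], [80.0], [np.nan], [70.0]])

# strategy='constant' with the default fill_value replaces NaN with 0
constant_filled = SimpleImputer(strategy="constant").fit_transform(col)

# strategy='median' replaces NaN with the median of the observed values
median_filled = SimpleImputer(strategy="median").fit_transform(col)

print(constant_filled.ravel())  # missing entry becomes 0.0
print(median_filled.ravel())    # missing entry becomes 70.0
```

Whether the median actually helps on the housing data is an empirical question that the validation MAE answers.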
Step 1: Improve the performance
Part A
Now, it's your turn! In the code cell below, define your own preprocessing steps and random forest model. Fill in values for the following variables:
numerical_transformer
categorical_transformer
model
To pass this part of the exercise, you need only define valid preprocessing steps and a random forest model.
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='median') # Your code here
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
]) # Your code here
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0) # Your code here
# Check your answer
step_1.a.check()
Correct
# Lines below will give you a hint or solution code
# step_1.a.hint()
step_1.a.solution()
Solution:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)
Part B
Run the code cell below without changes.
To pass this step, you need to have defined a pipeline in Part A that achieves lower MAE than the code above. You're encouraged to take your time here and try out many different approaches, to see how low you can get the MAE! (If your code does not pass, please amend the preprocessing steps and model in Part A.)
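One way to try out several approaches systematically is to loop over candidate settings and record the validation MAE for each. The sketch below does this for three imputation strategies on synthetic data; the data, the 10% missing rate, and n_estimators=50 are assumptions for illustration, and in the notebook you would use X_train, X_valid, y_train, and y_valid with your own preprocessor instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic regression data (hypothetical), with about 10% of entries removed
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.8, random_state=0)

# Fit one pipeline per imputation strategy and record its validation MAE
scores = {}
for strategy in ["constant", "mean", "median"]:
    pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy=strategy)),
        ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
    ])
    pipe.fit(X_tr, y_tr)
    scores[strategy] = mean_absolute_error(y_va, pipe.predict(X_va))

best = min(scores, key=scores.get)
for strategy, mae in scores.items():
    print(strategy, round(mae, 4))
print("Lowest MAE:", best)
```

The same loop shape works for other knobs, such as n_estimators or the categorical imputation strategy.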
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
# Check your answer
step_1.b.check()
MAE: 17514.986232876712
Correct
# Line below will give you a hint
step_1.b.hint()
Hint: Please see the hint from Part A to get some ideas for how to change the preprocessing steps and model to get better performance.
Step 2: Generate test predictions
Now, you'll use your trained model to generate predictions with the test data.
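Calling predict on a fitted pipeline reuses the preprocessing learned from the training data; nothing is refit on the test set. The sketch below shows this on a made-up frame where the test data contains a missing number and a category ("Dirt") that never appeared in training. All values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test frames
train = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Grvl"],
    "LotArea": [8000.0, 9000.0, 10000.0, 11000.0],
})
prices = [100.0, 110.0, 120.0, 130.0]
test = pd.DataFrame({
    "Street": ["Pave", "Dirt"],            # "Dirt" was never seen in training
    "LotArea": [9500.0, float("nan")],     # missing value to be imputed
})

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["LotArea"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Street"]),
])
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipe.fit(train, prices)

# predict() only transforms the test data with the already fitted steps
test_preds = pipe.predict(test)

# With handle_unknown='ignore', the unseen category encodes as all zeros
encoder = pipe.named_steps["preprocessor"].named_transformers_["cat"]
encoded_street = encoder.transform(test[["Street"]]).toarray()
print(encoded_street[1])
```

Without handle_unknown='ignore', the encoder would raise an error on the unseen category instead of producing a row of zeros.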
# Preprocess the test data and get predictions
preds_test = my_pipeline.predict(X_test) # Your code here
# Check your answer
step_2.check()
Correct
# Lines below will give you a hint or solution code
# step_2.hint()
step_2.solution()
Solution:
# Preprocessing of test data, fit model
preds_test = my_pipeline.predict(X_test)
Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
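Before submitting, it can be worth reading the file back to confirm its shape. The sketch below writes a small stand-in submission to a temporary directory and checks the columns, row count, and absence of missing predictions; the Id values and prices are invented for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for X_test.index and preds_test
ids = pd.Index([1461, 1462, 1463], name="Id")
preds = [120000.0, 155000.5, 180250.0]

output = pd.DataFrame({"Id": ids, "SalePrice": preds})

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    output.to_csv(path, index=False)
    check = pd.read_csv(path)

# Basic sanity checks on the saved file
columns_ok = list(check.columns) == ["Id", "SalePrice"]
row_count_ok = len(check) == len(ids)
no_missing = not check["SalePrice"].isna().any()
print(columns_ok, row_count_ok, no_missing)
```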
Submit your results
Once you have successfully completed Step 2, you're ready to submit your results to the leaderboard! If you choose to do so, make sure that you have already joined the competition by clicking on the Join Competition button at this link.
- Begin by clicking on the Save Version button in the top right corner of the window. This will generate a pop-up window.
- Ensure that the Save and Run All option is selected, and then click on the Save button.
- This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (...) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
- Click on the Output tab on the right of the screen. Then, click on the file you would like to submit, and click on the Submit button to submit your results to the leaderboard.
You have now successfully submitted to the competition!
If you want to keep working to improve your performance, select the Edit button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.
Keep going
Move on to learn about cross-validation, a technique you can use to obtain more accurate estimates of model performance!
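As a brief preview, a pipeline can be passed directly to scikit-learn's cross_val_score, which refits it on each fold. The sketch below uses synthetic data and 5 folds; both are assumptions for illustration rather than settings taken from the course.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (hypothetical) with some missing entries
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negative MAE so that higher is always better
neg_mae = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae

print("MAE per fold:", mae_per_fold)
print("Mean MAE:", mae_per_fold.mean())
```

Because the imputer sits inside the pipeline, it is fit only on the training part of each fold, so the validation part does not leak into preprocessing.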